Published April 1, 2026 — Open Dataset

Expert-Domain Evaluation
Benchmark v1.0

100 questions. 4 expert domains. 3 independent judges. Zero degradation.

100% Non-Inferiority
44.9% Win Rate
0% Loss Rate

Primary Scorecard

78 decisive evaluations across 100 expert-domain questions

  • Non-Inferiority Rate: 100% (consensus never underperforms)
  • Win Rate: 44.9% (consensus outperforms the best individual model)
  • Tie Rate: 55.1% (consensus matches the best individual model)
  • Loss Rate: 0.0% (zero observed degradation)
  • Win/Loss Ratio: 35 wins, 0 losses
  • Quality Regression: 0% (no quality loss in any question)

Results by Domain

25 questions per domain, scored on factual accuracy + quality

Domain                 | ID Range            | Decisive | Wins     | Ties     | Losses | Win Rate
Financial Regulation   | FIN-01 to FIN-25    | 16       | 8 (50%)  | 8 (50%)  | 0      | 50%
Legal Cross-Regulatory | LEG-01 to LEG-25    | 25       | 11 (44%) | 14 (56%) | 0      | 44%
Clinical Medicine      | MED-01 to MED-25    | 17       | 10 (59%) | 7 (41%)  | 0      | 59%
Technical Architecture | TECH-01 to TECH-25  | 20       | 6 (30%)  | 14 (70%) | 0      | 30%

Evaluation completeness: 78 of 100 questions were decisive (judges reached sufficient agreement). 22 questions were excluded because the 3-judge panel did not reach sufficient agreement (inter-judge Pearson r < 0.60). These are not counted as wins, ties, or losses. Full raw scores for all 100 questions, including inconclusive, are available in the public dataset.


What This Means

For decision-makers (non-technical)

The core finding is simple: LLM Consensus never gives you a worse answer than the best AI model available, and nearly half the time it gives you a better one.

When you ask a complex question to GPT-5.4, Claude Opus, or Gemini individually, you get one expert perspective. Some models are better at certain topics than others, and you don't always know which one to trust. LLM Consensus solves this: it consults all of them, cross-checks their answers, and delivers a single response that captures the best of each.

In our evaluation across 78 expert questions in finance, law, medicine, and technology:

  • In 45% of cases, the combined answer was measurably better than what any single model produced alone. It caught details that individual models missed, resolved contradictions between them, and produced more complete responses.
  • In 55% of cases, the combined answer matched the best individual model. No improvement, but no degradation either — you got at least the best available answer.
  • In 0% of cases was the combined answer worse. Zero degradation. With LLM Consensus you never get a worse result than if you had picked the right model on your own, and you don't have to guess which model is "right" for each question.

This matters most in high-stakes domains where being wrong is expensive: regulatory compliance, clinical decisions, legal analysis, and critical architecture choices.

Where it helps most:

  • Clinical medicine (59% win rate): Drug interactions, multi-comorbidity management, and guideline interpretation, areas where combining 3 clinical perspectives catches what any single model might miss.
  • Financial regulation (50% win rate): Cross-framework questions involving DORA, PSD2, SFDR, and Basel, where no single model consistently covers all applicable regulations.
  • Legal analysis (44% win rate): Multi-jurisdictional and cross-regulatory questions where synthesizing different legal frameworks produces more complete answers.
  • Technical architecture (30% win rate, 70% tie rate): System design decisions where the consensus consistently matches the best expert recommendation.

System Under Test

LLM Consensus Deep Mode orchestrates 5 frontier models with specialist roles

The consensus response is compared against the single best-performing individual model for each question — not a synthetic composite.

Models (orchestrated)

  • GPT-5.4
  • Claude Opus 4.6
  • Gemini 3.1 Pro
  • Mistral Large 2
  • Llama 3.3 70B
  • LLM Consensus (all five)

Judges (independent, blind)

  • Claude Sonnet 4.6
  • GPT-4.1
  • Gemini 2.5 Pro

Methodology

Transparent evaluation protocol with blind scoring and multi-vendor judges

Three frontier models from three different providers serve as independent judges:

Judge             | Provider
Claude Sonnet 4.6 | Anthropic
GPT-4.1           | OpenAI
Gemini 2.5 Pro    | Google

Multi-vendor panel with one judge per provider, same capability tier. Leave-one-judge-out analysis is available in the dataset to verify that results are robust to the removal of any single judge.

All 4 responses (consensus + 3 individuals) are labeled A/B/C/D with randomized order, deterministic per question (MD5 hash of question ID as seed). Judges do not know which response is the consensus.
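For illustration, the deterministic blinding step could be implemented like the following Python sketch; the function name and the response keys are assumptions for this example, not the actual evaluation harness.

    import hashlib
    import random

    def blind_labels(question_id: str, responses: dict) -> dict:
        """Assign blinded labels A-D in an order that is random across
        questions but reproducible for any given question ID."""
        seed = int(hashlib.md5(question_id.encode()).hexdigest(), 16)
        systems = list(responses)  # e.g. ["consensus", "gpt", "claude", "gemini"]
        random.Random(seed).shuffle(systems)
        # Judges only ever see the label-to-text mapping, never the system names.
        return {label: responses[system]
                for label, system in zip("ABCD", systems)}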

If consensus returns a response text-identical to the best individual (after whitespace normalization), the question is automatically scored as TIE to prevent evaluation noise from scoring the same text differently under two labels.
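A minimal sketch of that identity check, assuming "whitespace normalization" means collapsing runs of whitespace (the harness may normalize differently):

    import re

    def is_text_identical(consensus: str, best_individual: str) -> bool:
        """True when the two responses differ only in whitespace."""
        normalize = lambda text: re.sub(r"\s+", " ", text).strip()
        return normalize(consensus) == normalize(best_individual)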

Each response is scored on two independent dimensions via separate judge calls (6 judge calls per question: 3 factual + 3 quality):

  • Factual score (0-1): Evaluated against a binary checklist of verifiable facts. Each checklist item is scored as present or absent. Score = fraction of facts correctly covered (see the sketch after this list).
  • Quality score (0-1): Holistic evaluation of clarity, structure, depth of reasoning, actionability, and practical usefulness. Scored independently from factual accuracy.
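The factual dimension therefore reduces to a simple fraction over the checklist. A minimal sketch, assuming each judge returns one boolean per checklist item:

    def factual_score(checklist_hits: list) -> float:
        """Fraction of verifiable checklist facts the response covered."""
        return sum(bool(hit) for hit in checklist_hits) / len(checklist_hits)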

Combined score formula:

combined = 0.6 × factual + 0.4 × quality

Outcome rules with a 2.5% non-inferiority threshold (derived from minimum detectable effect size analysis given observed inter-judge variance); a code sketch of these rules follows the list:

  • WIN: consensus combined > best individual combined + 2.5%
  • TIE: difference within ±2.5%
  • LOSS: consensus combined < best individual combined − 2.5%
  • INCONCLUSIVE: inter-judge agreement (Pearson r) < 0.60
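Putting the combined formula, the 2.5% threshold, and the agreement gate together, the outcome logic can be sketched as below. Using the minimum pairwise Pearson r across the three judges is an assumption; the published protocol only states the r < 0.60 cut-off.

    from itertools import combinations
    from statistics import correlation  # Pearson r (Python 3.10+)

    THRESHOLD = 0.025      # 2.5% non-inferiority margin
    MIN_AGREEMENT = 0.60   # minimum inter-judge Pearson r

    def combined(factual: float, quality: float) -> float:
        return 0.6 * factual + 0.4 * quality

    def outcome(consensus, best_individual, judge_scores) -> str:
        """Classify one question.

        consensus / best_individual: (factual, quality) pairs averaged across judges.
        judge_scores: one score vector per judge over the blinded responses,
        used to decide whether the panel agreed enough to count the question.
        """
        agreement = min(correlation(a, b) for a, b in combinations(judge_scores, 2))
        if agreement < MIN_AGREEMENT:
            return "INCONCLUSIVE"
        delta = combined(*consensus) - combined(*best_individual)
        if delta > THRESHOLD:
            return "WIN"
        if delta < -THRESHOLD:
            return "LOSS"
        return "TIE"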

22 questions (22%) were excluded from scoring because the 3-judge panel did not reach sufficient agreement (inter-judge Pearson r < 0.60). These are not counted as wins, ties, or losses.

Multi-judge agreement is challenging for some expert questions, particularly in financial and medical domains where clinical/regulatory nuance creates legitimate disagreement between judges.

Full raw scores for all 100 questions, including inconclusive, are available in the public dataset for independent verification.

100 expert-domain questions across 4 verticals (25 per domain):

  • Financial regulation (FIN-01 to FIN-25): EU regulatory frameworks including DORA, PSD2, SFDR, MiFID II, EMIR, Basel III/IV, BRRD, MiCA, AIFMD, CRR/CRD
  • Legal cross-regulatory (LEG-01 to LEG-25): Multi-jurisdictional and cross-framework analysis including GDPR, AI Act, DSA, NIS2, MDR, EHDS, ePrivacy Directive, Rome I
  • Clinical medicine (MED-01 to MED-25): Multi-comorbidity management, drug interactions, pharmacokinetics, clinical guidelines (Maastricht VI, SSC 2021, ESC 2023, GOLD 2023, STOPP/START)
  • Technical architecture (TECH-01 to TECH-25): System design with hard constraints, trade-off resolution, security architecture, database optimization, distributed systems

All questions require synthesis across 2+ regulatory frameworks, clinical guidelines, or technical constraints. Each question includes a binary checklist of 8-12 verifiable facts with primary source references.
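For orientation, a single question record in the public dataset might look roughly like the Python literal below. Every field name here is an illustrative assumption; dataset-v1.json defines the actual schema.

    question_record = {
        "id": "FIN-07",                      # hypothetical example
        "domain": "financial_regulation",
        "question": "...",                   # full prompt text
        "checklist": [                       # 8-12 verifiable facts with sources
            {"fact": "...", "source": "..."},
        ],
        "judge_scores": {                    # per judge, per blinded response
            "claude-sonnet-4.6": {"A": {"factual": 0.9, "quality": 0.8}},
        },
        "outcome": "WIN",                    # WIN / TIE / LOSS / INCONCLUSIVE
    }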


Known Limitations

We believe in transparent reporting

1. Product-aligned benchmark. Questions are designed to evaluate synthesis-heavy expert tasks, the type of task where multi-model consensus has a theoretical advantage. This is a system benchmark, not a general-purpose LLM comparison.
2. 22% inconclusive rate. Multi-judge agreement is challenging for some expert questions, particularly in financial and medical domains where clinical/regulatory nuance creates legitimate disagreement between judges.
3. Point-in-time evaluation. Conducted during March 2026 using specific model versions. Results may vary with model updates.
4. Combined metric weighting. The 0.6 factual + 0.4 quality weighting reflects a design choice. Alternative weightings are available in the raw data for independent analysis.

Download Dataset

Open data for independent verification and analysis


Benchmark Dataset

dataset-v1.json
100 questions with per-judge raw scores, combined metrics, outcomes, and inter-judge agreement for all 4 domains. Includes both decisive and inconclusive questions.
Download JSON

Full Report

report-v1.html
Professional branded report with methodology, results, domain breakdown, and technical analysis. View in browser or print to PDF.
View Report

Spreadsheet

dataset-v1.csv
100 questions with scores, deltas, and outcomes in CSV format. Ready for Excel, Google Sheets, or any data analysis tool.
Download CSV

Press Release

press-release-v1.html
Official announcement with key findings, domain performance, methodology summary, and data availability. Ready for media distribution.
View press release

Try It Yourself

Test multi-model consensus on your own questions, or explore the API documentation.