100 questions. 4 expert domains. 3 independent judges. Zero degradation.
78 decisive evaluations across 100 expert-domain questions
25 questions per domain, scored on factual accuracy + quality
Evaluation completeness: 78 of 100 questions were decisive (the 3-judge panel reached sufficient agreement); the remaining 22 were excluded because inter-judge agreement fell below the threshold (Pearson r < 0.60) and are not counted as wins, ties, or losses. Full raw scores for all 100 questions, including the inconclusive ones, are available in the public dataset.
For decision-makers (non-technical)
The core finding is simple: LLM Consensus never gives you a worse answer than the best AI model available, and nearly half the time it gives you a better one.
When you ask a complex question to GPT-5.4, Claude Opus, or Gemini individually, you get one expert perspective. Some models are better at certain topics than others, and you don't always know which one to trust. LLM Consensus solves this: it consults all of them, cross-checks their answers, and delivers a single response that captures the best of each.
In our evaluation across 78 decisive expert questions in finance, law, medicine, and technology, the consensus response never scored below the best individual model and outperformed it in nearly half of the comparisons.
This matters most in high-stakes domains where being wrong is expensive: regulatory compliance, clinical decisions, legal analysis, and critical architecture choices.
Where it helps most:
LLM Consensus Deep Mode orchestrates 5 frontier models with specialist roles
The consensus response is compared against the single best-performing individual model for each question — not a synthetic composite.
Models (orchestrated)
Judges (independent, blind)
Transparent evaluation protocol with blind scoring and multi-vendor judges
Three frontier models from three different providers serve as independent judges:
| Judge | Provider |
|---|---|
| Claude Sonnet 4.6 | Anthropic |
| GPT-4.1 | OpenAI |
| Gemini 2.5 Pro | Google |
Multi-vendor panel with one judge per provider, same capability tier. Leave-one-judge-out analysis is available in the dataset to verify that results are robust to the removal of any single judge.
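A sketch of what such a robustness check could look like. The judge names and the score layout below are placeholders; the published dataset's actual schema may differ:

```python
from statistics import mean

JUDGES = ["claude-sonnet-4.6", "gpt-4.1", "gemini-2.5-pro"]

def mean_gap(scores: dict[str, dict[str, float]], panel: list[str]) -> float:
    """Average (consensus - best individual) combined-score gap over the given judge panel.
    `scores` maps judge -> {"consensus": float, "best_individual": float} for one question."""
    return mean(scores[j]["consensus"] - scores[j]["best_individual"] for j in panel)

def leave_one_judge_out(scores: dict[str, dict[str, float]]) -> dict[str, float]:
    """Recompute the gap with each judge removed, to confirm no single judge drives the result."""
    gaps = {"full_panel": mean_gap(scores, JUDGES)}
    for held_out in JUDGES:
        panel = [j for j in JUDGES if j != held_out]
        gaps[f"without_{held_out}"] = mean_gap(scores, panel)
    return gaps
```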
All 4 responses (consensus + 3 individuals) are labeled A/B/C/D with randomized order, deterministic per question (MD5 hash of question ID as seed). Judges do not know which response is the consensus.
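A minimal sketch of that labeling scheme, assuming Python's `random.Random` seeded with the MD5 digest of the question ID; the exact seeding convention is our reading of the description above:

```python
import hashlib
import random

def blind_labels(question_id: str, responses: dict[str, str]) -> dict[str, str]:
    """Map labels A-D to response sources in a per-question deterministic random order.
    `responses` maps source name (e.g. "consensus", "model_1", ...) to response text.
    The MD5-of-question-ID seeding convention is an assumption."""
    seed = int(hashlib.md5(question_id.encode("utf-8")).hexdigest(), 16)
    order = list(responses)
    random.Random(seed).shuffle(order)
    # Judges see only label -> text; the label -> source key stays with the evaluator.
    return {label: source for label, source in zip("ABCD", order)}
```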
If consensus returns a response text-identical to the best individual (after whitespace normalization), the question is automatically scored as TIE to prevent evaluation noise from scoring the same text differently under two labels.
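The identity check itself can be as simple as the sketch below, where "whitespace normalization" is read as collapsing runs of whitespace:

```python
import re

def whitespace_normalized(text: str) -> str:
    """Collapse runs of whitespace and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()

def is_auto_tie(consensus: str, best_individual: str) -> bool:
    """True when the two responses differ only in whitespace; such questions are scored as a tie."""
    return whitespace_normalized(consensus) == whitespace_normalized(best_individual)
```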
Each response is scored on two independent dimensions via separate judge calls (6 judge calls per question: 3 factual + 3 quality):
Combined score formula:
combined = 0.6 × factual + 0.4 × quality
Outcome rules apply a 2.5% non-inferiority threshold, derived from a minimum detectable effect size analysis given the observed inter-judge variance; the sketch below shows one concrete reading of the rule.
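In this sketch the margin is applied additively to combined scores normalized to [0, 1]; whether the published protocol uses an additive or relative margin is an assumption on our part:

```python
def combined_score(factual: float, quality: float) -> float:
    """Combined score as defined above: 60% factual accuracy, 40% quality."""
    return 0.6 * factual + 0.4 * quality

def decide_outcome(consensus: float, best_individual: float, margin: float = 0.025) -> str:
    """Assumed outcome rule: consensus within the 2.5% margin of the best individual
    model's combined score counts as a tie; clearly above is a win, clearly below a loss."""
    if consensus > best_individual + margin:
        return "win"
    if consensus < best_individual - margin:
        return "loss"
    return "tie"
```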
22 questions (22%) were excluded from scoring because the 3-judge panel did not reach sufficient agreement (inter-judge Pearson r < 0.60). These are not counted as wins, ties, or losses.
Multi-judge agreement is challenging for some expert questions, particularly in financial and medical domains where clinical/regulatory nuance creates legitimate disagreement between judges.
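For reference, a minimal sketch of the agreement filter, assuming Pearson r is computed pairwise over each judge pair's combined scores for the four responses and that every pair must reach the threshold; the exact aggregation across pairs is an assumption:

```python
from itertools import combinations
from statistics import correlation  # Python 3.10+

def is_decisive(judge_scores: dict[str, list[float]], threshold: float = 0.60) -> bool:
    """`judge_scores` maps each judge to its combined scores for the four responses (A-D)
    on one question. The question counts as decisive only if every pair of judges
    correlates at r >= threshold; otherwise it is excluded from win/tie/loss counts."""
    for a, b in combinations(judge_scores, 2):
        if correlation(judge_scores[a], judge_scores[b]) < threshold:
            return False
    return True
```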
Full raw scores for all 100 questions, including inconclusive, are available in the public dataset for independent verification.
100 expert-domain questions across 4 verticals (25 per domain): finance, law, medicine, and technology.
All questions require synthesis across 2+ regulatory frameworks, clinical guidelines, or technical constraints. Each question includes a binary checklist of 8-12 verifiable facts with primary source references.
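To make the question format concrete, here is a sketch of a possible per-question record. The field names are illustrative, and the mapping shown from the binary checklist to a factual-accuracy score (fraction of items satisfied) is our assumption, not the published scoring rubric:

```python
from dataclasses import dataclass, field

@dataclass
class ChecklistItem:
    claim: str   # a single verifiable fact the answer should state correctly
    source: str  # primary source reference (statute, guideline, spec, ...)

@dataclass
class ExpertQuestion:
    question_id: str
    domain: str  # "finance" | "law" | "medicine" | "technology"
    prompt: str
    checklist: list[ChecklistItem] = field(default_factory=list)  # 8-12 items per question

def factual_score(satisfied: list[bool]) -> float:
    """Assumed mapping from the binary checklist to a factual-accuracy score in [0, 1]."""
    return sum(satisfied) / len(satisfied) if satisfied else 0.0
```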
We believe in transparent reporting
Open data for independent verification and analysis
Test multi-model consensus on your own questions, or explore the API documentation.