100 questions. 4 expert domains. 3 independent judges. Zero degradation.
78 decisive evaluations across 100 expert-domain questions
25 questions per domain, scored on factual accuracy + quality
Evaluation completeness: 78 of 100 questions were decisive (the 3-judge panel reached sufficient agreement); the remaining 22 were excluded because inter-judge agreement fell below the threshold (Pearson r < 0.60) and are not counted as wins, ties, or losses. Full raw scores for all 100 questions, including the inconclusive ones, are available in the public dataset.
For decision-makers (non-technical)
The core finding is simple: LLM Consensus never gives you a worse answer than the best AI model available, and nearly half the time it gives you a better one.
When you ask a complex question to GPT-5.4, Claude Opus, or Gemini individually, you get one expert perspective. Some models are better at certain topics than others, and you don't always know which one to trust. LLM Consensus solves this: it consults all of them, cross-checks their answers, and delivers a single response that captures the best of each.
In our evaluation across 78 decisive expert questions in finance, law, medicine, and technology, the consensus response never scored below the best individual model and outperformed it in nearly half of the comparisons.
This matters most in high-stakes domains where being wrong is expensive: regulatory compliance, clinical decisions, legal analysis, and critical architecture choices.
Where it helps most:
LLM Consensus Deep Mode orchestrates 5 frontier models with specialist roles
The consensus response is compared against the single best-performing individual model for each question — not a synthetic composite.
Models (orchestrated)
Judges (independent, blind)
Transparent evaluation protocol with blind scoring and multi-vendor judges
Three frontier models from three different providers serve as independent judges:
| Judge | Provider |
|---|---|
| Claude Sonnet 4.6 | Anthropic |
| GPT-4.1 | OpenAI |
| Gemini 2.5 Pro | Google |
Multi-vendor panel with one judge per provider, same capability tier. Leave-one-judge-out analysis is available in the dataset to verify that results are robust to the removal of any single judge.
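A sketch of what such a robustness check could look like. The judge names and the score layout below are placeholders; the published dataset's actual schema may differ:

```python
from statistics import mean

JUDGES = ["claude-sonnet-4.6", "gpt-4.1", "gemini-2.5-pro"]

def mean_gap(scores: dict[str, dict[str, float]], panel: list[str]) -> float:
    """Average (consensus - best individual) combined-score gap over the given judge panel.
    `scores` maps judge -> {"consensus": float, "best_individual": float} for one question."""
    return mean(scores[j]["consensus"] - scores[j]["best_individual"] for j in panel)

def leave_one_judge_out(scores: dict[str, dict[str, float]]) -> dict[str, float]:
    """Recompute the gap with each judge removed, to confirm no single judge drives the result."""
    gaps = {"full_panel": mean_gap(scores, JUDGES)}
    for held_out in JUDGES:
        panel = [j for j in JUDGES if j != held_out]
        gaps[f"without_{held_out}"] = mean_gap(scores, panel)
    return gaps
```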
All 4 responses (consensus + 3 individuals) are labeled A/B/C/D with randomized order, deterministic per question (MD5 hash of question ID as seed). Judges do not know which response is the consensus.
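A minimal sketch of that labeling scheme, assuming Python's `random.Random` seeded with the MD5 digest of the question ID; the exact seeding convention is our reading of the description above:

```python
import hashlib
import random

def blind_labels(question_id: str, responses: dict[str, str]) -> dict[str, str]:
    """Map labels A-D to response sources in a per-question deterministic random order.
    `responses` maps source name (e.g. "consensus", "model_1", ...) to response text.
    The MD5-of-question-ID seeding convention is an assumption."""
    seed = int(hashlib.md5(question_id.encode("utf-8")).hexdigest(), 16)
    order = list(responses)
    random.Random(seed).shuffle(order)
    # Judges see only label -> text; the label -> source key stays with the evaluator.
    return {label: source for label, source in zip("ABCD", order)}
```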
If consensus returns a response text-identical to the best individual (after whitespace normalization), the question is automatically scored as TIE to prevent evaluation noise from scoring the same text differently under two labels.
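The identity check itself can be as simple as the sketch below, where "whitespace normalization" is read as collapsing runs of whitespace:

```python
import re

def whitespace_normalized(text: str) -> str:
    """Collapse runs of whitespace and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()

def is_auto_tie(consensus: str, best_individual: str) -> bool:
    """True when the two responses differ only in whitespace; such questions are scored as a tie."""
    return whitespace_normalized(consensus) == whitespace_normalized(best_individual)
```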
Each response is scored on two independent dimensions via separate judge calls (6 judge calls per question: 3 factual + 3 quality):
Combined score formula:
combined = 0.6 × factual + 0.4 × quality
Outcome rules apply a 2.5% non-inferiority threshold, derived from a minimum detectable effect size analysis given the observed inter-judge variance; the sketch below shows one concrete reading of the rule.
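In this sketch the margin is applied additively to combined scores normalized to [0, 1]; whether the published protocol uses an additive or relative margin is an assumption on our part:

```python
def combined_score(factual: float, quality: float) -> float:
    """Combined score as defined above: 60% factual accuracy, 40% quality."""
    return 0.6 * factual + 0.4 * quality

def decide_outcome(consensus: float, best_individual: float, margin: float = 0.025) -> str:
    """Assumed outcome rule: consensus within the 2.5% margin of the best individual
    model's combined score counts as a tie; clearly above is a win, clearly below a loss."""
    if consensus > best_individual + margin:
        return "win"
    if consensus < best_individual - margin:
        return "loss"
    return "tie"
```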
22 questions (22%) were excluded from scoring because the 3-judge panel did not reach sufficient agreement (inter-judge Pearson r < 0.60). These are not counted as wins, ties, or losses.
Multi-judge agreement is challenging for some expert questions, particularly in financial and medical domains where clinical/regulatory nuance creates legitimate disagreement between judges.
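For reference, a minimal sketch of the agreement filter, assuming Pearson r is computed pairwise over each judge pair's combined scores for the four responses and that every pair must reach the threshold; the exact aggregation across pairs is an assumption:

```python
from itertools import combinations
from statistics import correlation  # Python 3.10+

def is_decisive(judge_scores: dict[str, list[float]], threshold: float = 0.60) -> bool:
    """`judge_scores` maps each judge to its combined scores for the four responses (A-D)
    on one question. The question counts as decisive only if every pair of judges
    correlates at r >= threshold; otherwise it is excluded from win/tie/loss counts."""
    for a, b in combinations(judge_scores, 2):
        if correlation(judge_scores[a], judge_scores[b]) < threshold:
            return False
    return True
```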
Full raw scores for all 100 questions, including inconclusive, are available in the public dataset for independent verification.
100 expert-domain questions across 4 verticals (25 per domain): finance, law, medicine, and technology.
All questions require synthesis across 2+ regulatory frameworks, clinical guidelines, or technical constraints. Each question includes a binary checklist of 8-12 verifiable facts with primary source references.
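To make the question format concrete, here is a sketch of a possible per-question record. The field names are illustrative, and the mapping shown from the binary checklist to a factual-accuracy score (fraction of items satisfied) is our assumption, not the published scoring rubric:

```python
from dataclasses import dataclass, field

@dataclass
class ChecklistItem:
    claim: str   # a single verifiable fact the answer should state correctly
    source: str  # primary source reference (statute, guideline, spec, ...)

@dataclass
class ExpertQuestion:
    question_id: str
    domain: str  # "finance" | "law" | "medicine" | "technology"
    prompt: str
    checklist: list[ChecklistItem] = field(default_factory=list)  # 8-12 items per question

def factual_score(satisfied: list[bool]) -> float:
    """Assumed mapping from the binary checklist to a factual-accuracy score in [0, 1]."""
    return sum(satisfied) / len(satisfied) if satisfied else 0.0
```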
We believe in transparent reporting
Open data for independent verification and analysis
Test multi-model consensus on your own questions, or explore the API documentation.