LLM Consensus was evaluated across 100 expert-domain questions in financial regulation, legal cross-regulatory analysis, clinical medicine, and technical architecture. Using a 3-judge blind evaluation panel, the system achieved 100% non-inferiority against the best individual frontier model, with a 44.9% win rate, a 55.1% tie rate, and a 0% loss rate. No quality degradation was observed in any reliably adjudicated evaluation.
Primary scorecard based on 78 decisive evaluations (inter-judge agreement ≥ 0.60).
Breakdown across 4 expert verticals. Losses are 0% in every domain.
The core finding is simple: LLM Consensus never gives you a worse answer than the best AI model available, and nearly half the time it gives you a better one.
When you put a complex question to GPT-5.4, Claude Opus, or Gemini individually, you get one expert perspective. Some models are better at certain topics than others, and you don't always know which one to trust. LLM Consensus solves this: it consults all of them, cross-checks their answers, and delivers a single response that captures the best of each.
In our evaluation of the 78 decisively adjudicated expert questions in finance, law, medicine, and technology:

- In 45% of cases, the combined answer was measurably better than what any single model produced alone. It caught details that individual models missed, resolved contradictions between them, and produced more complete responses.
- In 55% of cases, the combined answer matched the best individual model. No improvement, but no degradation either: you got at least the best available answer.
- In 0% of cases, the combined answer made things worse. Zero degradation. With LLM Consensus you never get a worse result than if you had picked the right model on your own, except you don't have to guess which model is "right" for each question.
This matters most in high-stakes domains where being wrong is expensive: regulatory compliance, clinical decisions, legal analysis, and critical architecture choices.
- Clinical medicine (59% improvement): Drug interactions, multi-comorbidity management, and guideline interpretation, areas where combining 3 clinical perspectives catches what any single model might miss.
- Financial regulation (50% improvement): Cross-framework questions involving DORA, PSD2, SFDR, and Basel, where no single model consistently covers all applicable regulations.
- Legal analysis (44% improvement): Multi-jurisdictional and cross-regulatory questions where synthesizing different legal frameworks produces more complete answers.
- Technical architecture (30% improvement, 70% match): System design decisions where the consensus consistently matches the best expert recommendation.
LLM Consensus Deep Mode orchestrates 5 frontier models with specialist roles, iterative synthesis, cross-verification, and independent validator review.
| Role | Model | Provider |
|---|---|---|
| Core Model 1 | GPT-5.4 | OpenAI |
| Core Model 2 | Claude Opus 4.6 | Anthropic |
| Core Model 3 | Gemini 3.1 Pro | Google |
| Validator 1 | Mistral Large 2 | Mistral AI |
| Validator 2 | Llama 3.3 70B | Meta / Together AI |
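The orchestration API itself is not part of this report, but the role split can be summarized in a short sketch. The model identifiers and the `run_deep_mode` entry point below are illustrative assumptions, not the production interface.

```python
# Minimal sketch of the Deep Mode panel (identifiers are illustrative assumptions).
from dataclasses import dataclass

@dataclass(frozen=True)
class PanelMember:
    role: str      # "core" models draft and synthesize; "validator" models review
    model: str
    provider: str

DEEP_MODE_PANEL = [
    PanelMember("core",      "gpt-5.4",         "OpenAI"),
    PanelMember("core",      "claude-opus-4.6", "Anthropic"),
    PanelMember("core",      "gemini-3.1-pro",  "Google"),
    PanelMember("validator", "mistral-large-2", "Mistral AI"),
    PanelMember("validator", "llama-3.3-70b",   "Meta / Together AI"),
]

def run_deep_mode(question: str) -> str:
    """Pipeline outline only: core models draft independently, the drafts are
    iteratively synthesized and cross-verified, and the validators review the
    candidate answer before it is returned. Implementation omitted."""
    raise NotImplementedError
```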
A 3-judge, multi-vendor panel performed the blind evaluation. All 4 responses (the consensus plus the 3 individual core models) were labeled A/B/C/D in randomized order, deterministic per question (MD5 hash seed).
| Judge | Provider |
|---|---|
| Claude Sonnet 4.6 | Anthropic |
| GPT-4.1 | OpenAI |
| Gemini 2.5 Pro | Google |
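The blinding step is small enough to show in full. The sketch below assumes the MD5 seed is derived from the question text; the report specifies an MD5-based seed but not the exact input, so treat that detail as an assumption.

```python
import hashlib
import random

def blind_labels(question: str, responses: dict[str, str]) -> dict[str, str]:
    """Assign A/B/C/D labels in an order that is random but deterministic per
    question, so every judge sees the same anonymized ordering."""
    seed = int(hashlib.md5(question.encode("utf-8")).hexdigest(), 16)
    order = list(responses)                 # consensus + 3 individual models
    random.Random(seed).shuffle(order)
    return {label: responses[system] for label, system in zip("ABCD", order)}
```

Deterministic seeding keeps the labeling reproducible and guarantees that the factual and quality judge calls for a given question score the same blinded ordering.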
Each response is scored on two independent dimensions via separate judge calls (6 calls per question: 3 factual + 3 quality):
- Factual score (0–1): Evaluated against a binary checklist of 8–12 verifiable facts per question. Score = fraction of facts correctly covered (see the sketch after this list).
- Quality score (0–1): Holistic evaluation of clarity, structure, depth of reasoning, actionability, and practical usefulness.
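For the factual dimension, the arithmetic reduces to a fraction of checklist items. The sketch below assumes the judge returns one binary verdict per checklist fact.

```python
def factual_score(verdicts: list[bool]) -> float:
    """Fraction of the 8-12 verifiable checklist facts the response covers,
    given one binary judge verdict per fact."""
    if not verdicts:
        raise ValueError("checklist must contain at least one item")
    return sum(verdicts) / len(verdicts)

# Example: 9 of 10 checklist facts covered -> 0.9
assert factual_score([True] * 9 + [False]) == 0.9
```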
- WIN: consensus combined > best individual + 2.5%
- TIE: difference within ±2.5%
- LOSS: consensus combined < best individual − 2.5%
- INCONCLUSIVE: inter-judge agreement (Pearson r) < 0.60
The 2.5% threshold is derived from minimum detectable effect size analysis given observed inter-judge variance.
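Putting the thresholds together, per-question adjudication looks roughly like the sketch below. How per-judge scores are aggregated into a single combined score, and whether inter-judge agreement is the mean pairwise Pearson r, are assumptions; the report only states the 2.5% margin and the 0.60 cutoff.

```python
from itertools import combinations
from statistics import mean

from scipy.stats import pearsonr

MARGIN = 0.025       # 2.5% minimum detectable effect
MIN_AGREEMENT = 0.60  # inter-judge agreement cutoff

def adjudicate(consensus: float, best_individual: float,
               judge_scores: list[list[float]]) -> str:
    """Classify one question as WIN / TIE / LOSS / INCONCLUSIVE.

    `judge_scores` holds each judge's scores over the four blinded responses;
    agreement is taken here as the mean pairwise Pearson r (an assumption).
    """
    agreement = mean(pearsonr(a, b)[0] for a, b in combinations(judge_scores, 2))
    if agreement < MIN_AGREEMENT:
        return "INCONCLUSIVE"
    diff = consensus - best_individual
    if diff > MARGIN:
        return "WIN"
    if diff < -MARGIN:
        return "LOSS"
    return "TIE"
```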
| Category | Count | Treatment |
|---|---|---|
| Decisive (win / tie / loss) | 78 | Scored — reported in Key Metrics |
| Inconclusive (IJA < 0.60) | 22 | Excluded — judges did not reach sufficient agreement |
| Total evaluated | 100 | 25 per domain |
22 questions were excluded from scoring because the 3-judge panel did not reach sufficient agreement (inter-judge Pearson r < 0.60). These are not counted as wins, ties, or losses.
The system achieves weak dominance over the best individual model: consensus ≥ max(individual) in 100% of decisive evaluations, with strict improvement in 44.9%.
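The weak-dominance claim can be re-checked mechanically against the released per-question scores. The record format below (a consensus score plus a list of individual-model scores) is illustrative, not the published schema.

```python
def dominance_rates(decisive: list[dict], margin: float = 0.025) -> tuple[float, float]:
    """Return (non-inferiority rate, strict-improvement rate) over the decisive
    questions, using the same ±2.5% margin as the win/tie/loss thresholds.
    Record keys are illustrative, e.g. {"consensus": 0.91, "individuals": [0.88, 0.84, 0.80]}.
    """
    n = len(decisive)
    non_inferior = sum(r["consensus"] >= max(r["individuals"]) - margin for r in decisive)
    strict_wins = sum(r["consensus"] > max(r["individuals"]) + margin for r in decisive)
    return non_inferior / n, strict_wins / n
```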
Full raw scores for all 100 questions — including inconclusive evaluations — are available for independent verification.
Includes: per-judge scores, randomization seeds, checklist items, model versions, and leave-one-judge-out recalculations.