Expert-Domain Evaluation Report

v1.0 — March 2026
Healthtech Capital LLC
30 N Gould St Ste R, Sheridan, WY 82801
Patent Pending: US 19/215,933 | EU EP25176020.3
Section 1

Executive Summary

LLM Consensus was evaluated across 100 expert-domain questions in financial regulation, legal cross-regulatory analysis, clinical medicine, and technical architecture. Using a 3-judge blind evaluation panel, the system achieved 100% non-inferiority against the best individual frontier model on the 78 decisively-adjudicated questions, with a 44.9% win rate, a 55.1% tie rate, and a 0% loss rate. No quality degradation was observed in any reliably-adjudicated evaluation.

Section 2

Key Metrics

Primary scorecard based on 78 decisive evaluations (inter-judge agreement ≥ 0.60).

Metric               Value
Non-Inferiority      100%
Win Rate             44.9%
Tie Rate             55.1%
Loss Rate            0%
Quality Regression   0%
Win / Loss Ratio     35 / 0 (no losses)
Section 3

Domain Performance

Breakdown across 4 expert verticals. Losses are 0% in every domain.

Domain                   Wins   Ties   Decisive   Win rate   Tie rate
Clinical Medicine        10     7      17         59%        41%
Financial Regulation     8      8      16         50%        50%
Legal Cross-Regulatory   11     14     25         44%        56%
Technical Architecture   6      14     20         30%        70%

Win rate = consensus outperforms the best individual; tie rate = consensus matches the best individual.
Section 4

What This Means

For decision-makers

The core finding is simple: LLM Consensus never gives you a worse answer than the best AI model available, and nearly half the time it gives you a better one.

When you ask a complex question to GPT-5.4, Claude Opus, or Gemini individually, you get one expert perspective. Some models are better at certain topics than others, and you don't always know which one to trust. LLM Consensus solves this: it consults all of them, cross-checks their answers, and delivers a single response that captures the best of each.

In our evaluation of the 78 decisively-adjudicated expert questions in finance, law, medicine, and technology:

In 45% of cases, the combined answer was measurably better than what any single model produced alone. It caught details that individual models missed, resolved contradictions between them, and produced more complete responses.

In 55% of cases, the combined answer matched the best individual model. No improvement, but no degradation either — you got at least the best available answer.

In 0% of cases did the combined answer make things worse. Zero degradation. When you use LLM Consensus, you never get a worse result than if you had picked the right model on your own — except you don't have to guess which model is "right" for each question.

This matters most in high-stakes domains where being wrong is expensive: regulatory compliance, clinical decisions, legal analysis, and critical architecture choices.

Where it helps most

Clinical medicine (59% win rate): Drug interactions, multi-comorbidity management, guideline interpretation, where combining 3 clinical perspectives catches what any single model might miss.

Financial regulation (50% win rate): Cross-framework questions involving DORA, PSD2, SFDR, Basel, where no single model consistently covers all applicable regulations.

Legal analysis (44% win rate): Multi-jurisdictional and cross-regulatory questions where synthesizing different legal frameworks produces more complete answers.

Technical architecture (30% win rate, 70% tie rate): System design decisions where the consensus consistently matches the best expert recommendation.

Section 5

Methodology Overview

System under test

LLM Consensus Deep Mode orchestrates 5 frontier models with specialist roles, iterative synthesis, cross-verification, and independent validator review.

Role           Model             Provider
Core Model 1   GPT-5.4           OpenAI
Core Model 2   Claude Opus 4.6   Anthropic
Core Model 3   Gemini 3.1 Pro    Google
Validator 1    Mistral Large 2   Mistral AI
Validator 2    Llama 3.3 70B     Meta / Together AI
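
For illustration, a minimal sketch of how such a pipeline could be wired. The role table above is the source; everything else (the query_model wrapper, the prompt packaging, and the single-pass synthesis loop) is an assumption, not the production implementation:

    # Illustrative Deep Mode-style orchestration. query_model() is a
    # hypothetical stand-in for the real provider SDK calls.
    from dataclasses import dataclass

    @dataclass
    class ModelRole:
        role: str          # e.g. "Core Model 1"
        model: str         # provider model identifier
        is_validator: bool

    ROLES = [
        ModelRole("Core Model 1", "gpt-5.4", False),
        ModelRole("Core Model 2", "claude-opus-4.6", False),
        ModelRole("Core Model 3", "gemini-3.1-pro", False),
        ModelRole("Validator 1", "mistral-large-2", True),
        ModelRole("Validator 2", "llama-3.3-70b", True),
    ]

    def query_model(model: str, prompt: str) -> str:
        """Hypothetical wrapper around a provider completion API."""
        raise NotImplementedError  # wire to the real SDKs in practice

    def deep_mode(question: str) -> str:
        # 1. Independent drafts from the three core models.
        drafts = [query_model(r.model, question)
                  for r in ROLES if not r.is_validator]
        # 2. Synthesis: merge and cross-check the candidate answers.
        merged = query_model(
            ROLES[0].model,
            question + "\n\nCandidate answers:\n" + "\n---\n".join(drafts))
        # 3. Independent validator review of the synthesized answer.
        for r in ROLES:
            if r.is_validator:
                merged = query_model(r.model, "Review and correct:\n" + merged)
        return merged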

Judge panel

3-judge multi-vendor panel with blind evaluation. All 4 responses (consensus + 3 individuals) are labeled A/B/C/D in a randomized order that is deterministic per question (MD5 hash seed).

Judge               Provider
Claude Sonnet 4.6   Anthropic
GPT-4.1             OpenAI
Gemini 2.5 Pro      Google
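
A minimal sketch of the deterministic blinding, assuming the seed is derived from the MD5 hash of the question text; the base ordering and label assignment below are our assumptions:

    import hashlib
    import random

    def blind_labels(question: str, responses: dict[str, str]) -> dict[str, str]:
        """Map system responses to A/B/C/D labels in an order that is
        random but reproducible for a given question."""
        seed = int(hashlib.md5(question.encode("utf-8")).hexdigest(), 16)
        systems = sorted(responses)            # fixed base order
        random.Random(seed).shuffle(systems)   # deterministic shuffle
        return {label: responses[s] for label, s in zip("ABCD", systems)}

Seeding the shuffle with the question itself makes each labeling reproducible for audit while still varying across questions.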

Scoring protocol

Each response is scored on two independent dimensions via separate judge calls (6 calls per question: 3 factual + 3 quality):

Factual score (0–1): Evaluated against a binary checklist of 8–12 verifiable facts per question. Score = fraction of facts correctly covered.

Quality score (0–1): Holistic evaluation of clarity, structure, depth of reasoning, actionability, and practical usefulness.
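
A sketch of the two dimensions as scoring functions. The checklist fraction follows the protocol above; averaging the three judges per dimension is our assumption (the raw data preserves per-judge scores):

    from statistics import mean

    def factual_score(checklist_hits: list[bool]) -> float:
        """Fraction of the 8-12 verifiable facts the response covers."""
        return sum(checklist_hits) / len(checklist_hits)

    def panel_score(judge_scores: list[float]) -> float:
        """Aggregate one dimension (factual or quality) across 3 judges."""
        return mean(judge_scores)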

Combined metric & outcome determination

combined_score = 0.6 × factual_score + 0.4 × quality_score

WIN: consensus combined > best individual + 2.5%

TIE: difference within ±2.5%

LOSS: consensus combined < best individual − 2.5%

INCONCLUSIVE: inter-judge agreement (Pearson r) < 0.60

The 2.5% threshold is derived from minimum detectable effect size analysis given observed inter-judge variance.
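
Putting the metric, thresholds, and agreement gate together, a minimal sketch of outcome determination. statistics.correlation computes Pearson r (Python 3.10+); reducing the 3-judge agreement to a single pairwise r is a simplification of ours:

    from statistics import correlation  # Pearson r

    def combined(factual: float, quality: float) -> float:
        return 0.6 * factual + 0.4 * quality

    def outcome(consensus: float, best_individual: float,
                judge_x: list[float], judge_y: list[float]) -> str:
        """Classify one question from combined scores plus two judges'
        score vectors over the same responses."""
        if correlation(judge_x, judge_y) < 0.60:
            return "INCONCLUSIVE"
        if consensus > best_individual + 0.025:
            return "WIN"
        if consensus < best_individual - 0.025:
            return "LOSS"
        return "TIE"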

Section 6

Evaluation Completeness

Category                      Count   Treatment
Decisive (win / tie / loss)   78      Scored; reported in Key Metrics
Inconclusive (IJA < 0.60)     22      Excluded; judges did not reach sufficient agreement
Total evaluated               100     25 per domain

22 questions were excluded from scoring because the 3-judge panel did not reach sufficient agreement (inter-judge Pearson r < 0.60). These are not counted as wins, ties, or losses. Full raw scores for all 100 questions, including inconclusive, are available in the public dataset for independent verification.

Section 7

Technical Observations

The system achieves weak dominance over the best individual model: consensus ≥ max(individual) in 100% of decisive evaluations, with strict improvement in 44.9%.

  1. Non-inferiority is robust across domains. The zero loss rate holds independently in all 4 evaluation domains (financial, legal, medical, technical), not just in aggregate. This suggests a systematic property of the orchestration architecture, not a statistical artifact.
  2. Wins concentrate in synthesis-heavy tasks. The 59% win rate in medical and 50% in financial correlate with questions requiring integration of 3+ knowledge sources (clinical guidelines, regulatory frameworks). In technical architecture, where questions often have a single defensible answer, consensus matches the best individual (70% tie) but adds less incremental value (30% win).
  3. All measured degradation is zero, but with caveats. The 22% inconclusive rate (questions excluded due to low inter-judge agreement) means the evaluation coverage is 78/100. The zero loss result holds on the evaluable subset; the inconclusive questions are published in full for independent analysis. We do not claim zero degradation on the inconclusive set.
  4. The evaluation measures combined factual + quality signal. The combined metric (0.6 factual + 0.4 quality) reflects that users care about both accuracy and usefulness. Under a factual-only metric, the win rate is lower and a small number of ties would shift to marginal losses. Under a quality-only metric, the win rate is higher. Raw scores for both dimensions are published for reanalysis under alternative weightings (see the sketch after this list).
  5. The "best individual" baseline is conservative. For each question, the baseline is the single real model with the highest combined score — not a synthetic "oracle" combining the best factual score from one model with the best quality score from another. This makes every win a win against an actual, achievable individual model response.
  6. Provider reliability is a separate concern. This evaluation measures answer quality on successfully completed orchestrations. Operational metrics (timeout rates, provider availability) are tracked separately and do not affect the quality scorecard.
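As an illustration of observation 4, reweighting is a one-line change once the per-dimension raw scores are loaded (the field handling here is ours, not the published schema):

    def reweighted(factual: float, quality: float, w_factual: float) -> float:
        """Combined score under an arbitrary factual weighting."""
        return w_factual * factual + (1.0 - w_factual) * quality

    # w_factual = 1.0 reproduces the factual-only metric under which some
    # ties shift to marginal wins or losses; w_factual = 0.0 is the
    # quality-only metric with a higher win rate.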
Section 8

Known Limitations

  1. Product-aligned benchmark. Questions are designed to evaluate synthesis-heavy expert tasks — the type of task where multi-model consensus has a theoretical advantage. This is a system benchmark, not a general-purpose LLM comparison.
  2. 22% inconclusive rate. Multi-judge agreement is challenging for some expert questions, particularly in financial and medical domains where clinical/regulatory nuance creates legitimate disagreement between judges.
  3. Point-in-time evaluation. Conducted during March 2026 using specific model versions. Results may vary with model updates.
  4. Combined metric weighting. The 0.6 factual + 0.4 quality weighting reflects a design choice. Alternative weightings are available in the raw data for independent analysis.
Section 9

Data Availability

Full raw scores for all 100 questions — including inconclusive evaluations — are available for independent verification.

Includes: per-judge scores, randomization seeds, checklist items, model versions, and leave-one-judge-out recalculations.
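
For instance, a leave-one-judge-out recalculation of the win rate might look like the sketch below; the record layout is an assumption, and the inconclusive filter is omitted for brevity:

    from statistics import mean

    # scores[question][judge] = (consensus_combined, best_individual_combined)
    Scores = dict[str, dict[str, tuple[float, float]]]

    def win_rate_without(scores: Scores, dropped_judge: str) -> float:
        """Recompute the win rate with one judge's scores removed,
        re-averaging the remaining judges per question."""
        wins = 0
        for per_judge in scores.values():
            kept = [v for j, v in per_judge.items() if j != dropped_judge]
            consensus = mean(c for c, _ in kept)
            best = mean(b for _, b in kept)
            wins += consensus > best + 0.025
        return wins / len(scores)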