# LLM Consensus Expert-Domain Evaluation v1.0

*A production-grounded evaluation suite for legal, financial, medical, and technical expert synthesis*

**Date:** March 31, 2026
**System:** LLM Consensus Deep Mode v10
**Entity:** Healthtech Capital LLC

---

## Headline

**LLM Consensus equals or outperforms the best of the evaluated individual models (GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro) on 100% of reliably adjudicated expert questions, with a 44.9% win rate and zero observed degradation.**

---

## Results

### Primary Scorecard (78 decisive evaluations)

| Metric | Value |
|---|---|
| Non-inferiority rate | **100%** |
| Win rate (consensus outperforms best individual) | **44.9%** |
| Tie rate (consensus matches best individual) | **55.1%** |
| Loss rate (consensus underperforms) | **0.0%** |
| Quality regression rate | **0.0%** |
| Win/loss ratio | **∞ (35 wins, 0 losses)** |

### By Domain

| Domain | ID range | Decisive | Wins | Ties | Losses | Win rate |
|---|---|---|---|---|---|---|
| Financial regulation | FIN-01 to FIN-25 | 16 | 8 (50%) | 8 (50%) | 0 | 50% |
| Legal cross-regulatory | LEG-01 to LEG-25 | 25 | 11 (44%) | 14 (56%) | 0 | 44% |
| Clinical medicine | MED-01 to MED-25 | 17 | 10 (59%) | 7 (41%) | 0 | 59% |
| Technical architecture | TECH-01 to TECH-25 | 20 | 6 (30%) | 14 (70%) | 0 | 30% |

### Evaluation Completeness

| Category | Count | Treatment |
|---|---|---|
| Decisive (win/tie/loss) | 78 | Scored — reported above |
| Inconclusive (inter-judge agreement r < 0.60) | 22 | Excluded — judges did not reach sufficient agreement |
| Total evaluated | 100 | 25 per domain |

22 questions were excluded from scoring because the 3-judge panel did not reach sufficient agreement (inter-judge Pearson r < 0.60). These are not counted as wins, ties, or losses. Full raw scores for all 100 questions, including the inconclusive ones, are available in the public dataset for independent verification.

---

## What Was Evaluated

### System Under Test

LLM Consensus Deep Mode orchestrates 5 frontier AI models with specialist roles, iterative synthesis, cross-verification, and independent validator review:

| Role | Model | Provider |
|---|---|---|
| Core model 1 | GPT-5.4 | OpenAI |
| Core model 2 | Claude Opus 4.6 | Anthropic |
| Core model 3 | Gemini 3.1 Pro | Google |
| Validator 1 | Mistral Large 2 | Mistral AI |
| Validator 2 | Llama 3.3 70B | Meta / Together AI |

### Baseline

Each of the 3 core models run individually (GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro). The baseline for each question is the **single best-performing individual model** by combined score — not an artificial composite of the best factual score from one model and the best quality score from another.

### Question Bank

100 expert-domain questions across 4 verticals (25 per domain):

- **Financial regulation** (FIN-01 to FIN-25): EU regulatory frameworks including DORA, PSD2, SFDR, MiFID II, EMIR, Basel III/IV, BRRD, MiCA, AIFMD, CRR/CRD
- **Legal cross-regulatory** (LEG-01 to LEG-25): Multi-jurisdictional and cross-framework analysis including GDPR, AI Act, DSA, NIS2, MDR, EHDS, ePrivacy Directive, Rome I
- **Clinical medicine** (MED-01 to MED-25): Multi-comorbidity management, drug interactions, pharmacokinetics, clinical guidelines (Maastricht VI, SSC 2021, ESC 2023, GOLD 2023, STOPP/START)
- **Technical architecture** (TECH-01 to TECH-25): System design with hard constraints, trade-off resolution, security architecture, database optimization, distributed systems

All questions require synthesis across 2+ regulatory frameworks, clinical guidelines, or technical constraints. Each question includes a binary checklist of 8-12 verifiable facts with primary source references.
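
For concreteness, a single question record might look like the sketch below. The field names and layout are illustrative assumptions, not the published dataset schema.

```python
# Hypothetical record layout for one benchmark question; field names are
# illustrative assumptions, not the published dataset schema.
from dataclasses import dataclass, field

@dataclass
class ChecklistItem:
    fact: str    # one verifiable fact the response should cover
    source: str  # primary source reference (regulation, guideline, or spec)

@dataclass
class BenchmarkQuestion:
    question_id: str                   # e.g. "FIN-07", "MED-12"
    domain: str                        # financial | legal | medical | technical
    prompt: str                        # the expert question posed to each system
    checklist: list[ChecklistItem] = field(default_factory=list)  # 8-12 items
```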

---

## Evaluation Methodology

### 3-Judge Multi-Vendor Panel (Blind Evaluation)

| Judge | Provider | Role |
|---|---|---|
| Claude Sonnet 4.6 | Anthropic | Judge 1 |
| GPT-4.1 | OpenAI | Judge 2 |
| Gemini 2.5 Pro | Google | Judge 3 |

The judge panel is multi-vendor, with one judge per provider at the same capability tier. Leave-one-judge-out analysis is available in the dataset to verify that results are robust to the removal of any single judge.
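
As a rough illustration of what that robustness check involves, the sketch below recomputes the win rate with one judge's scores dropped. The nested data layout and the omission of the inconclusive filter are simplifying assumptions; the published recalculations are authoritative.

```python
# Simplified leave-one-judge-out recomputation. The dict layout
# scores[question_id][judge][system] -> combined score is an assumption,
# and the low-agreement (inconclusive) filter is omitted for brevity.
from statistics import mean

INDIVIDUALS = ("gpt-5.4", "claude-opus-4.6", "gemini-3.1-pro")

def win_rate_without(scores: dict, excluded_judge: str, threshold: float = 0.025) -> float:
    wins = total = 0
    for by_judge in scores.values():
        kept = [s for judge, s in by_judge.items() if judge != excluded_judge]
        consensus = mean(s["consensus"] for s in kept)
        best_individual = max(mean(s[m] for s in kept) for m in INDIVIDUALS)
        total += 1
        if consensus - best_individual > threshold:
            wins += 1
    return wins / total if total else 0.0
```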

### Blind Protocol

All 4 responses (consensus + 3 individual models) are presented in randomized order under labels A/B/C/D; the ordering is deterministic per question (the MD5 hash of the question ID seeds the shuffle). Judges do not know which response is the consensus.
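
A minimal sketch of that blinding step, assuming the shuffle is seeded directly with the MD5 digest of the question ID (the published randomization code is authoritative):

```python
# Deterministic blind labeling: the same question ID always yields the same
# A/B/C/D assignment, but judges cannot tell which label is the consensus.
import hashlib
import random

def blind_labels(question_id: str, responses: dict[str, str]) -> dict[str, str]:
    """Map the 4 responses (consensus + 3 individual models) to labels A-D."""
    seed = int(hashlib.md5(question_id.encode()).hexdigest(), 16)
    order = list(responses)                 # system names, e.g. ["consensus", ...]
    random.Random(seed).shuffle(order)
    return {label: responses[system] for label, system in zip("ABCD", order)}
```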

### Dual-Dimension Scoring

Each response is scored on two independent dimensions via separate judge calls (6 judge calls per question: 3 factual + 3 quality):

- **Factual score** (0-1): Evaluated against a binary checklist of verifiable facts. Each checklist item is scored as present or absent. Score = fraction of facts correctly covered.
- **Quality score** (0-1): Holistic evaluation of clarity, structure, depth of reasoning, actionability, and practical usefulness. Scored independently from factual accuracy.
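
A sketch of how those two per-response scores could be derived and pooled across the panel; averaging across judges is an assumed pooling rule, not a restatement of the published scoring code.

```python
# Illustrative aggregation of the two dimensions across the 3-judge panel.
from statistics import mean

def factual_score(checklist_hits: list[bool]) -> float:
    """Fraction of checklist facts a judge marked as present (0-1)."""
    return sum(checklist_hits) / len(checklist_hits)

def panel_average(factual_by_judge: list[float], quality_by_judge: list[float]) -> tuple[float, float]:
    """Pool each dimension independently over the three judges."""
    return mean(factual_by_judge), mean(quality_by_judge)
```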

### Combined Metric

`combined_score = 0.6 × factual_score + 0.4 × quality_score`

### Outcome Determination

- **WIN**: consensus combined > best individual combined + 2.5%
- **TIE**: difference within ±2.5%
- **LOSS**: consensus combined < best individual combined − 2.5%
- **INCONCLUSIVE**: inter-judge agreement (Pearson r) < 0.60

The 2.5% threshold is derived from minimum detectable effect size analysis given observed inter-judge variance.
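
Putting the combined metric and the decision rule together, a minimal sketch follows. Pooling inter-judge agreement as the mean pairwise Pearson r is an assumption about how the panel's r is computed; the published scoring code is authoritative.

```python
# Combined metric and outcome rule as stated above.
from itertools import combinations
from statistics import correlation  # Pearson r, available in Python 3.10+

def combined(factual: float, quality: float) -> float:
    return 0.6 * factual + 0.4 * quality

def inter_judge_agreement(per_judge: list[list[float]]) -> float:
    """Mean pairwise Pearson r over the judges' per-response score vectors."""
    pairs = list(combinations(per_judge, 2))
    return sum(correlation(a, b) for a, b in pairs) / len(pairs)

def outcome(consensus: float, best_individual: float, agreement: float,
            threshold: float = 0.025) -> str:
    """WIN / TIE / LOSS / INCONCLUSIVE per the rules above."""
    if agreement < 0.60:
        return "INCONCLUSIVE"
    delta = consensus - best_individual
    if delta > threshold:
        return "WIN"
    if delta < -threshold:
        return "LOSS"
    return "TIE"
```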

### Duplicate Detection

If consensus returns a response text-identical to the best individual (after whitespace normalization), the question is automatically scored as TIE to prevent evaluation noise from scoring the same text differently under two labels.
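
A minimal sketch of that check, assuming whitespace normalization means collapsing runs of whitespace; the published code may normalize differently.

```python
# Score the question as a TIE when the consensus text is identical to the
# best individual response after collapsing whitespace.
import re

def _normalize(text: str) -> str:
    return re.sub(r"\s+", " ", text).strip()

def is_duplicate(consensus_text: str, best_individual_text: str) -> bool:
    return _normalize(consensus_text) == _normalize(best_individual_text)
```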

---

## Transparency & Reproducibility

### Published with Results

- Full question prompts and binary checklists (public split)
- Raw per-judge scores for all 100 questions, including inconclusive
- Judge prompts, randomization seeds, and scoring code
- Model versions and execution dates
- Exclusion rules with justification
- Leave-one-judge-out recalculations

### Known Limitations

1. **Product-aligned benchmark.** Questions are designed to evaluate synthesis-heavy expert tasks — the type of task where multi-model consensus has a theoretical advantage. This is a system benchmark, not a general-purpose LLM comparison.

2. **22% inconclusive rate.** Multi-judge agreement is challenging for some expert questions, particularly in financial and medical domains where clinical/regulatory nuance creates legitimate disagreement between judges.

3. **Point-in-time evaluation.** Conducted during March 2026 using specific model versions. Results may vary with model updates.

4. **Combined metric weighting.** The 0.6 factual + 0.4 quality weighting reflects a design choice. Alternative weightings are available in the raw data for independent analysis.

---

## What This Means

### For decision-makers (non-technical)

**The core finding is simple: LLM Consensus never gives you a worse answer than the best AI model available, and nearly half the time it gives you a better one.**

When you ask a complex question to GPT-5.4, Claude Opus, or Gemini individually, you get one expert perspective. Some models are better at certain topics than others, and you don't always know which one to trust. LLM Consensus solves this: it consults all of them, cross-checks their answers, and delivers a single response that captures the best of each.

In our evaluation across 78 expert questions in finance, law, medicine, and technology:

- **In 45% of cases, the combined answer was measurably better** than what any single model produced alone. It caught details that individual models missed, resolved contradictions between them, and produced more complete responses.
- **In 55% of cases, the combined answer matched the best individual model.** No improvement, but no degradation either — you got at least the best available answer.
- **In 0% of cases did the combined answer make things worse.** Zero degradation: when you use LLM Consensus, you never get a worse result than if you had picked the right model on your own, and you don't have to guess which model is "right" for each question.

This matters most in high-stakes domains where being wrong is expensive: regulatory compliance, clinical decisions, legal analysis, and critical architecture choices.

**Where it helps most:**
- **Clinical medicine (59% improvement):** Drug interactions, multi-comorbidity management, guideline interpretation — areas where combining 3 clinical perspectives catches what any single model might miss.
- **Financial regulation (50% improvement):** Cross-framework questions involving DORA, PSD2, SFDR, Basel — where no single model consistently covers all applicable regulations.
- **Legal analysis (44% improvement):** Multi-jurisdictional and cross-regulatory questions where synthesizing different legal frameworks produces more complete answers.
- **Technical architecture (30% improvement, 70% match):** System design decisions where the consensus consistently matches the best expert recommendation.

### For technical audiences

**The system achieves weak dominance over the best individual model: `consensus >= max(individual)` in 100% of decisive evaluations, with strict improvement in 44.9%.**

Key technical observations:

1. **Non-inferiority is robust across domains.** The zero loss rate holds independently in all 4 evaluation domains (financial, legal, medical, technical), not just in aggregate. This suggests a systematic property of the orchestration architecture, not a statistical artifact.

2. **Wins concentrate in synthesis-heavy tasks.** The 59% win rate in medical and 50% in financial correlate with questions requiring integration of 3+ knowledge sources (clinical guidelines, regulatory frameworks). In technical architecture, where questions often have a single defensible answer, consensus matches the best individual (70% tie) but adds less incremental value (30% win).

3. **All measured degradation is zero, but with caveats.** The 22% inconclusive rate (questions excluded due to low inter-judge agreement) means the evaluation coverage is 78/100. The zero loss result holds on the evaluable subset; the inconclusive questions are published in full for independent analysis. We do not claim zero degradation on the inconclusive set.

4. **The evaluation measures combined factual + quality signal.** The combined metric (0.6 factual + 0.4 quality) reflects that users care about both accuracy and usefulness. Under a factual-only metric, the win rate is lower and a small number of ties would shift to marginal losses. Under a quality-only metric, the win rate is higher. Raw scores for both dimensions are published for reanalysis under alternative weightings (a sketch of such a reanalysis follows this list).

5. **The "best individual" baseline is conservative.** For each question, the baseline is the single real model with the highest combined score — not a synthetic "oracle" combining the best factual score from one model with the best quality score from another. This makes every win a win against an actual, achievable individual model response.

6. **Provider reliability is a separate concern.** This evaluation measures answer quality on successfully completed orchestrations. Operational metrics (timeout rates, provider availability) are tracked separately and do not affect the quality scorecard.
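
As a sketch of the reanalysis mentioned in point 4, the snippet below recomputes a single question's outcome under an arbitrary factual weight `w`. The flat row format with per-system factual/quality averages is an assumption about the raw data layout, not the published column names.

```python
# Recompute one decisive question's outcome under
# combined = w * factual + (1 - w) * quality.
INDIVIDUALS = ("gpt", "claude", "gemini")

def reweighted_outcome(row: dict[str, float], w: float, threshold: float = 0.025) -> str:
    def score(system: str) -> float:
        # Row keys such as "consensus_factual" are assumed for illustration.
        return w * row[f"{system}_factual"] + (1 - w) * row[f"{system}_quality"]

    delta = score("consensus") - max(score(m) for m in INDIVIDUALS)
    if delta > threshold:
        return "WIN"
    if delta < -threshold:
        return "LOSS"
    return "TIE"
```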

---

## Naming & Scope

**LLM Consensus Expert-Domain Evaluation Benchmark v1.0**

This benchmark evaluates a multi-model consensus system on domain-specific expert synthesis tasks. It is not a general-purpose LLM benchmark and should not be compared directly to MMLU, HELM, or Chatbot Arena, which evaluate individual models on different task distributions.

---

## Contact

- General: hello@llmconsensus.io
- Technical: Benchmark dataset and methodology available at llmconsensus.io/benchmark
- Patents: US 19/215,933, EU EP25176020.3 (pending)
- Entity: Healthtech Capital LLC
