A multi-model consensus system matches or outperforms GPT-5.4, Claude Opus 4.6 and Gemini 3.1 Pro across 100 expert-level questions in finance, law, medicine and technology, with no performance degradation.
LLM Consensus has released the results of its Expert-Domain Evaluation Benchmark v1.0, an independent study analyzing the performance of its multi-model consensus technology across 100 high-complexity questions in areas such as financial regulation, law, clinical medicine and technical architecture.
According to the results, the system matches or outperforms the best individual AI model across all evaluated questions, achieving measurable improvement in 45% of cases with no instances of performance loss.
In nearly half of the questions (45%), responses generated by the consensus system clearly outperformed those of the best individual model. The system was able to identify regulatory details that other models missed, resolve contradictions across sources, and deliver more complete answers.
In the remaining 55%, performance matched that of the best available model, ensuring a consistent baseline of quality without requiring users to choose between different models.
Additionally, in none of the 100 questions analyzed did the system produce a worse result than the best individual model.
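For readers who want to see how such a comparison can be tallied, the Python sketch below counts per-question wins, ties and losses against the best-scoring individual model. The scoring scale, score values and model names are illustrative assumptions, not the published evaluation code.

```python
# Hypothetical sketch: tallying consensus wins/ties/losses against the
# best individual model per question. Scores and names are illustrative,
# not taken from the published benchmark dataset.

def tally_outcomes(per_question_scores):
    """per_question_scores: list of dicts like
    {"consensus": 0.9, "models": {"model_a": 0.8, "model_b": 0.85}}."""
    wins = ties = losses = 0
    for q in per_question_scores:
        best_individual = max(q["models"].values())
        if q["consensus"] > best_individual:
            wins += 1
        elif q["consensus"] == best_individual:
            ties += 1
        else:
            losses += 1
    return wins, ties, losses

example = [
    {"consensus": 0.92, "models": {"model_a": 0.88, "model_b": 0.90}},
    {"consensus": 0.85, "models": {"model_a": 0.85, "model_b": 0.80}},
]
print(tally_outcomes(example))  # -> (1, 1, 0)
```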
The analysis focused on complex questions typical of regulated industries, spanning financial regulation, law, clinical medicine and technical architecture.
The use of artificial intelligence in regulated industries continues to grow, yet no single model consistently excels across all domains. A system may perform well in financial regulation but fall short in clinical medicine, or vice versa.
LLM Consensus addresses this challenge by combining multiple leading models into a single response. It integrates technologies from OpenAI, Anthropic, Google, Mistral, and Meta, applying a synthesis process with cross-verification that leverages each model's strengths while compensating for their weaknesses.
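The article does not disclose the implementation, but a fan-out-and-synthesize design of the kind described could look roughly like the following sketch. The `ModelFn` interface, the stub provider clients and the synthesis prompt are all assumptions for illustration, not LLM Consensus's actual system.

```python
# Minimal sketch of a multi-model consensus pipeline, assuming a generic
# "ask each provider, then synthesize with cross-verification" design.
# Provider clients are hypothetical stubs.
from typing import Callable, Dict

ModelFn = Callable[[str], str]  # prompt in, answer out

def consensus_answer(question: str, models: Dict[str, ModelFn],
                     synthesizer: ModelFn) -> str:
    # 1. Fan out: collect an independent draft from every model.
    drafts = {name: ask(question) for name, ask in models.items()}

    # 2. Cross-verification: ask the synthesizer to reconcile the drafts,
    #    keeping claims supported by several drafts and flagging conflicts.
    prompt = "Question: " + question + "\n\n"
    for name, draft in drafts.items():
        prompt += f"--- Draft from {name} ---\n{draft}\n\n"
    prompt += ("Synthesize a single answer. Keep claims supported by "
               "multiple drafts; flag or drop contradicted ones.")
    return synthesizer(prompt)

# Usage with stub models standing in for real provider clients:
models = {
    "provider_a": lambda q: f"[A's answer to: {q}]",
    "provider_b": lambda q: f"[B's answer to: {q}]",
}
print(consensus_answer("What does MiFID II require?", models,
                       synthesizer=lambda p: f"[synthesis of]\n{p}"))
```

Running the synthesis step as a model call in its own right is one plausible design choice; a production system would presumably add retries, timeouts and structured tracking of where the drafts disagree.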
"Reliability is the core value proposition," the company said. "Users no longer have to decide which model to use. They get a single answer that consistently matches or outperforms the best available model for each case."
The benchmark was specifically designed to assess tasks that require combining multiple sources of knowledge. Each question was evaluated by three independent reviewers from different AI providers, who scored responses blindly based on accuracy and quality.
Responses from both the consensus system and the individual models were presented anonymously and in random order. Cases where sufficient agreement was not reached were classified as inconclusive and excluded from the final results.
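As an illustration of that exclusion rule, the sketch below aggregates three reviewer scores per response and drops cases where the spread exceeds an agreement threshold. The 1-10 scale and the threshold value are assumptions, since the article does not specify them.

```python
# Hedged sketch of the blind-review aggregation described above: three
# independent reviewer scores per response; if the reviewers' spread
# exceeds a threshold, the case is marked inconclusive and excluded.
from statistics import mean

AGREEMENT_THRESHOLD = 2  # hypothetical: max allowed spread on a 1-10 scale

def aggregate_review(scores):
    """scores: the three independent reviewers' scores for one response."""
    if max(scores) - min(scores) > AGREEMENT_THRESHOLD:
        return None  # inconclusive: excluded from the final results
    return mean(scores)

print(aggregate_review([8, 9, 8]))  # 8.33... -> included
print(aggregate_review([3, 9, 8]))  # None    -> inconclusive, excluded
```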
The full dataset has been published to enable independent verification.