The Reliability Mirage in AI Judges

March 6, 2026

#ai #evaluation #reliability #research

LLM judges give inconsistent scores for identical inputs, even at temperature=0. What this means for production AI systems.


A new paper by Fiona Lau examined something that should be obvious but isn't: when you ask the same LLM to score the same input multiple times, do you get the same score?

The answer is no. Even at temperature=0, the setting that's supposed to make models deterministic, there's substantial variability. GPT-4o, Gemini, and Claude models all show inconsistent scoring for identical inputs, and some tasks, like completeness scoring, show the largest fluctuations.
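The check itself is cheap to run against your own judge. The sketch below tallies scores from repeated calls on one input; `judge_fn` is a placeholder for a real LLM-judge call (the simulated judge here just wobbles between two scores to illustrate the kind of result the paper reports).

```python
from collections import Counter
from itertools import cycle

def measure_consistency(judge_fn, item, n_runs=10):
    """Call a judge repeatedly on the same input and tally the scores.

    judge_fn is a placeholder; in a real check it would wrap an LLM call.
    A deterministic judge would produce a single distinct score.
    """
    scores = [judge_fn(item) for _ in range(n_runs)]
    return Counter(scores)

# Simulated judge that drifts between two scores on identical input,
# standing in for the temperature=0 variability the paper observed.
wobbly = cycle([4, 4, 5, 4, 5])
tally = measure_consistency(lambda item: next(wobbly), "same input", n_runs=10)
print(tally)  # more than one distinct score for an identical input
```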

This isn't just an academic curiosity. It's a fundamental problem for any production system using LLMs as evaluators. If your AI judge gives different scores for the same answer depending on when you ask, what does that score actually mean?

Consider the typical enterprise RAG pipeline: you retrieve documents, generate an answer, then use an LLM judge to score relevance, completeness, and accuracy. Based on those scores, you might route the query differently, show confidence indicators to users, or flag answers for human review. But if the judge is inconsistent, all of those downstream decisions inherit that unreliability.
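To make the fragility concrete, here is a minimal sketch of that routing step with hypothetical names; `judge_score` is stubbed to return a fixed value, where a real system would call a model. A threshold-based router like this flips its decision whenever the judge's score drifts across a boundary.

```python
def judge_score(answer: str) -> float:
    """Stub for an LLM-judge call; hypothetical, returns a fixed score."""
    return 0.72

def route(answer: str, review_threshold: float = 0.6,
          confident_threshold: float = 0.85) -> str:
    """Route based on a single judge score.

    Fragile by construction: if the judge returns 0.58 on one call and
    0.63 on the next, the same answer lands in different branches.
    """
    score = judge_score(answer)
    if score < review_threshold:
        return "human_review"
    if score < confident_threshold:
        return "show_with_caveat"
    return "show_confident"

print(route("Paris is the capital of France."))  # → show_with_caveat
```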

The paper found systematic differences between model families, too. It's not just random noise: different models have different "interpretive styles" and levels of strictness. So your choice of judge model isn't just about accuracy; it's about what kind of systematic bias you're willing to accept.

The practical implication: if you're building systems that depend on LLM scoring, you need to account for this variability. That might mean averaging across multiple runs, using multiple judge models, or having humans spot-check the scores you're acting on.
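The averaging approach can be sketched in a few lines. This is a minimal illustration with assumed names and thresholds, not a recipe from the paper: score each answer several times, take the median, and flag wide spreads for human review. The simulated judge's drifting scores are illustrative values.

```python
import statistics
from itertools import cycle

def robust_judge(answer, judge_fn, n_runs=5, disagreement_threshold=0.15):
    """Score an answer n_runs times and aggregate.

    judge_fn is any callable returning a score in [0, 1]; in production
    it would wrap an LLM call. Returns the median score, the spread
    across runs, and a flag when the spread is wide enough that a
    human should look at the answer.
    """
    scores = [judge_fn(answer) for _ in range(n_runs)]
    spread = max(scores) - min(scores)
    return {
        "score": statistics.median(scores),
        "spread": spread,
        "needs_review": spread > disagreement_threshold,
    }

# Simulated judge whose score drifts across calls, echoing the
# temperature=0 variability the paper reports (values are made up).
noisy = cycle([0.70, 0.85, 0.62, 0.78, 0.74])
result = robust_judge("some answer", lambda a: next(noisy))
print(result)  # median 0.74, spread ≈ 0.23 → needs_review True
```

The median is deliberately used instead of the mean so a single outlier run doesn't drag the aggregate score; the spread check is what turns "the judge disagreed with itself" into an actionable signal rather than silent noise.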

The deeper implication: we're treating these models as if they provide objective measurements, when they're actually giving us samples from a distribution. The score isn't "the answer"; it's "an answer." Understanding that difference might save you from building unreliable systems on shaky foundations.


Paper: Same Input, Different Scores: A Multi Model Study on the Inconsistency of LLM Judge