Eval-Driven Development

LLM-as-judge evals: when and how (and when not)

An LLM judge is a model you point at another model's output and ask "is this good?" It is the only practical way to grade qualities no assertion can express — tone, helpfulness, faithfulness. It is also fallible, biased, and gameable. Used in the right place and validated, it is a real guardrail. Used as a default, it quietly lies to you.

When a judge is the right tool — and when it isn't

Reach for an LLM judge only when the thing you care about is subjective quality you cannot grade with code. If a deterministic check exists — the JSON parses, the total matches, the test suite passes — use it. A code grader is faster, cheaper, reproducible, and can't be talked into a higher score.
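As an illustration of the deterministic checks named above ("the JSON parses, the total matches"), here is a minimal sketch of a code grader. The invoice shape (`line_items`, `amount`, `total`) and the function name are hypothetical, chosen only for the example.

```python
import json
from decimal import Decimal


def grade_invoice_json(raw: str) -> tuple[bool, str]:
    """Return (passed, reason). Hypothetical schema: line_items[].amount and total."""
    # Check 1: the output must parse as a JSON object.
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return False, "expected a JSON object"

    # Check 2: the required fields exist and are numeric.
    try:
        # Decimal(str(...)) avoids float rounding noise when summing money.
        amounts = [Decimal(str(item["amount"])) for item in data["line_items"]]
        total = Decimal(str(data["total"]))
    except (KeyError, TypeError, ArithmeticError):
        return False, "missing or non-numeric field"

    # Check 3: the line items must add up to the stated total.
    if sum(amounts, Decimal("0")) != total:
        return False, "total mismatch"
    return True, "ok"
```

The same input always produces the same result and the same reason string, which is the reproducibility a model-based grader cannot offer.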

The boundary is sharp, and the codex draws it with hard evidence. On tasks with verifiable ground truth (knowledge, reasoning, math, code), LLM judges perform close to random when asked to pick the correct answer. They are not good at deciding which of two answers is right. They are good at approximating which of two answers a human would prefer. Those are different jobs, so use a judge for preference and a test or code grader for correctness.

Even when a judge fits, it is the last resort in the stack, not the first. Vendor guidance consistently ranks grading methods code-based first, human review second, and LLM-based last: flexible, but something to test for reliability before you scale it. The judge earns its place only where rules genuinely can't capture the nuance.

How to make the judge trustworthy

A judge is a prompt, and most of its quality lives in the rubric. The patterns that move it from coin-flip to credible recur across vendor playbooks and the research:

  1. Write a rubric with concrete anchors, not "rate 1–5." Show, don't tell: describe exactly what passes and what fails. Vague scales produce vague, drifting scores.
  2. Reason before you score (CoT). Make the judge lay out its reasoning first, then emit a verdict. This is the single most repeated lever for lifting agreement with humans; the G-Eval pattern — reason from the rubric, then fill in the score — is the reference design.
  3. Give a reference answer when you have one. Reference-based grading dramatically outperforms reference-free; judge reliability drops noticeably without a reference in the prompt. A good answer to compare against turns a vibe check into a comparison.
  4. Prefer binary pass/fail over a Likert scale. A 1–5 number is hard to anchor and easy to fudge; pass/fail with a written critique is more stable and forces a crisp criterion.
  5. Pin the temperature and version the prompt. The judge is itself non-deterministic. Hold its settings fixed so a "pass" is reproducible and a score change means the output changed, not the weather.
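One way to make item 5 concrete is to treat the judge's settings and rubric as a single versioned object, and to store its version next to every verdict. The sketch below is an assumption about how that could look; `JudgeConfig` and its fields are hypothetical names, and the model identifier is a placeholder.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeConfig:
    """Everything that can change a verdict besides the output being graded."""
    model: str          # placeholder identifier for the judge model
    temperature: float  # held fixed so reruns are comparable
    rubric: str         # the full rubric text sent to the judge

    @property
    def version(self) -> str:
        # Short content hash: any edit to model, temperature, or rubric
        # yields a different version string.
        payload = f"{self.model}\n{self.temperature}\n{self.rubric}"
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
```

If two eval runs carry the same `version`, a changed score is more likely to reflect a changed output than a changed judge.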

A small worked rubric for a faithfulness judge — anchors, reason-before-score, one binary verdict:

FAITHFULNESS — does the answer stay grounded in the provided source?

First, reason step by step: list each claim in the answer, then check
whether the source supports it. Only after reasoning, output a verdict.

PASS  — every claim is supported by the source; no invented facts,
        numbers, or entities; "I don't know" when the source is silent.
FAIL  — any claim contradicts the source, or adds a fact the source
        does not contain (a hallucination), however fluent it sounds.

Ignore tone, length, and writing style. Judge only grounding.
Output: reasoning (3-6 sentences), then VERDICT: PASS or FAIL.
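A rubric like this still needs code around it: something to assemble the prompt and something to read the verdict back out. The following is a minimal sketch under stated assumptions. The section labels (`SOURCE`, `REFERENCE ANSWER`, `ANSWER TO GRADE`) and both function names are hypothetical, and the actual call to the judge model is left out.

```python
import re
from typing import Optional


def build_judge_prompt(rubric: str, source: str, answer: str,
                       reference: Optional[str] = None) -> str:
    """Assemble the judge prompt; the reference answer is included when available."""
    sections = [rubric.strip(), f"SOURCE:\n{source.strip()}"]
    if reference is not None:
        sections.append(f"REFERENCE ANSWER:\n{reference.strip()}")
    sections.append(f"ANSWER TO GRADE:\n{answer.strip()}")
    return "\n\n".join(sections)


_VERDICT = re.compile(r"VERDICT:\s*(PASS|FAIL)\b", re.IGNORECASE)


def parse_verdict(judge_output: str) -> tuple[str, Optional[bool]]:
    """Return (reasoning, passed). passed is None when no verdict can be found."""
    matches = list(_VERDICT.finditer(judge_output))
    if not matches:
        # Treat an unparseable reply as an error to investigate, not as a FAIL.
        return judge_output.strip(), None
    last = matches[-1]  # the final verdict wins if the word appears in the reasoning
    reasoning = judge_output[: last.start()].strip()
    return reasoning, last.group(1).upper() == "PASS"
```

Returning `None` for a missing verdict keeps malformed judge output from being silently counted as either outcome.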

The biases, and how to blunt them

A strong judge can reach roughly the level of agreement humans have with each other on open-ended preference — the headline that makes judges viable. That headline hides systematic, exploitable biases. Three are first-class, and each has a mitigation:

| Bias | What it does | Mitigation |
| --- | --- | --- |
| Position | Favors an answer by where it sits in the prompt. Severe enough to flip verdicts purely by reordering: the same two answers can swap winners. | Score in both orders and average (swap-and-average). Never trust a single-order pairwise verdict. |
| Verbosity / length | Rewards longer answers for being longer, so a generator can inflate its score by padding. | Control for length: instruct against it, cap it, or statistically debias the preference so length stops paying off. |
| Self-preference | Over-rewards text that "sounds like" the judge: familiar, low-perplexity output, including from its own model family. | Don't grade your own family blindly; where you can, judge with a different model than the one that generated. |
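The swap-and-average mitigation for position bias is easy to sketch. The code below assumes a hypothetical pairwise judge, passed in as a callable, that returns a score in [0, 1] for "the first-shown answer is better". The tie margin of 0.1 is an arbitrary illustrative value, not a recommendation from the source.

```python
from typing import Callable

# Hypothetical judge signature: (question, first_answer, second_answer) -> score in [0, 1]
# where the score is the judge's preference for the FIRST-shown answer.
PairwiseJudge = Callable[[str, str, str], float]


def swap_and_average(judge: PairwiseJudge, question: str, a: str, b: str) -> float:
    """Preference for answer `a`, averaged over both presentation orders."""
    a_first = judge(question, a, b)          # a shown first
    b_first = 1.0 - judge(question, b, a)    # b shown first, converted to "a is better"
    return (a_first + b_first) / 2.0


def pick_winner(score_a: float, margin: float = 0.1) -> str:
    """Call a tie unless the averaged preference clears the margin."""
    if score_a > 0.5 + margin:
        return "A"
    if score_a < 0.5 - margin:
        return "B"
    return "tie"
```

A judge that always favors whichever answer comes first averages out to 0.5 and is reported as a tie, while a judge that tracks content gives the same preference in both orders and survives the averaging.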

A related design choice: pairwise (A-vs-B) grading is more stable than absolute scoring but amplifies bias and is more easily gamed by spurious features; absolute scoring is noisier but more robust to manipulation. Pick the protocol to fit the task rather than defaulting to one.

Validate the judge before you trust it

A judge is a measuring instrument, and an uncalibrated instrument is worse than none — it gives you confident numbers that are wrong. Who validates the validators? You do, against human labels, before the judge gates anything.
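Validation against human labels usually comes down to a few numbers: precision and recall for the class you care about, and Cohen's kappa for chance-corrected agreement. The sketch below computes them for binary pass/fail labels (`True` means pass). Treating FAIL as the class to detect by default is an assumption made for this example, on the reasoning that a gate's costly error is a bad output judged as passing.

```python
import math
from typing import Sequence


def agreement_report(human: Sequence[bool], judge: Sequence[bool],
                     positive: bool = False) -> dict:
    """Compare judge verdicts with human labels on the same items."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    pairs = list(zip(human, judge))

    # Precision and recall for the `positive` class (FAIL by default).
    tp = sum(h == positive and j == positive for h, j in pairs)
    fp = sum(h != positive and j == positive for h, j in pairs)
    fn = sum(h == positive and j != positive for h, j in pairs)
    precision = tp / (tp + fp) if tp + fp else math.nan
    recall = tp / (tp + fn) if tp + fn else math.nan

    # Cohen's kappa: observed agreement corrected for chance agreement.
    observed = sum(h == j for h, j in pairs) / n
    p_h = sum(human) / n   # share of items humans marked as pass
    p_j = sum(judge) / n   # share of items the judge marked as pass
    expected = p_h * p_j + (1 - p_h) * (1 - p_j)
    kappa = (observed - expected) / (1 - expected) if expected < 1 else math.nan

    return {"precision": precision, "recall": recall,
            "agreement": observed, "kappa": kappa}
```

Raw agreement alone can flatter a judge when most items pass; kappa discounts the agreement two raters would reach by chance given their pass rates.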

It's not the judge that creates the value

The most-cited practitioner thesis is blunt: the payoff isn't the automated grader, it's the process of forcing a human to look at real data. The judge encodes what you learned by reviewing outputs. Skip the review and an off-the-shelf multi-metric 1–5 judge will mostly lead you astray.

A judge is an attack surface

Anything that gates production is a target. Short universal adversarial suffixes can be appended to an output to push a judge toward maximum scores regardless of actual quality, and such attacks can transfer across judge models. Against these suffix attacks specifically, absolute scoring is more vulnerable than comparative scoring. The practical takeaways: don't let a single judge be the only gate in an adversarial or high-stakes setting, watch for outputs that game the grader rather than the task, and keep a human or a deterministic check in the loop where the stakes justify it. A judge is a useful guardrail, not a tamper-proof one.
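One possible shape for "not the only gate" is a layered decision: a deterministic check runs first, at least two judges vote, and any disagreement goes to a human. This is a sketch of that idea, not a hardened defense. The names `layered_gate` and `GateResult` are hypothetical, and the judges are passed in as plain callables that return `True` for pass.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class GateResult:
    decision: str  # "accept", "reject", or "escalate"
    reason: str


def layered_gate(output: str,
                 deterministic_check: Callable[[str], bool],
                 judges: Sequence[Callable[[str], bool]]) -> GateResult:
    """Combine a code check with several judges; disagreement goes to a human."""
    if len(judges) < 2:
        raise ValueError("use at least two judges, ideally from different model families")

    # The deterministic check cannot be talked into passing, so it runs first.
    if not deterministic_check(output):
        return GateResult("reject", "deterministic check failed")

    verdicts = [judge(output) for judge in judges]
    if all(verdicts):
        return GateResult("accept", "all judges passed")
    if not any(verdicts):
        return GateResult("reject", "all judges failed")
    # A split vote may mean one judge was gamed; let a human decide.
    return GateResult("escalate", "judges disagree")
```

An output crafted to fool one judge now also has to pass the code check and fool the second judge in the same way, which raises the cost of gaming the gate without making it tamper-proof.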

When to use a judge — and when not

| Verdict | Situation |
| --- | --- |
| Use a judge | Grading subjective quality with no deterministic check: faithfulness, tone, helpfulness, completeness. |
| Use a judge | You have a rubric with anchors, reason-before-score, and a reference answer to compare against. |
| Use a judge | You've validated it against human labels and re-validate periodically for drift. |
| Don't use a judge | The answer is objectively verifiable (math, code, format, exact match); use a test or code grader instead. |
| Don't use a judge | You'd grade an output with a model from its own family and call it neutral. |
| Don't use a judge | It's a generic, unvalidated 1–5 metric you wired up without looking at any data. |
| Don't use a judge | It is the sole gate in an adversarial setting where outputs can be crafted to fool it. |

Bottom line

The LLM judge is the part of the eval stack you reach for last and trust least without proof. Use it only where quality is genuinely subjective; build it with a clear anchored rubric, reason-before-score, a reference answer, and a pinned temperature; blunt position, verbosity, and self-preference bias; validate it against human labels with precision/recall and kappa before it gates anything; and remember it can be gamed. Done that way, a judge grades the things tests can't — and stays honest enough to be worth grading with.

Next: how to write evals for an AI coding agent, how to build an eval harness for an LLM app, or the underlying evidence in the codex.

Grounded in the EDD codex — esp. Part II (LLM-as-judge: bias, rubric design, CoT grading, validation, robustness) and the cross-cutting synthesis on when a judge is the right grader.