Scorecard
An eval-driven development maturity model
Most teams building with LLMs are further down this ladder than they think. The honest version of "how good are your evals?" is a maturity question, not a yes/no, and the path from a vibe check to a calibrated, statistically-read suite spans five distinct rungs. Here is the scorecard: what each rung catches and misses, and the one trigger that tells you it is time to climb.
Why a ladder, not a switch
The codex traces a consistent maturity progression in the practitioner and vendor literature: vibe check → deterministic code checks → LLM-as-judge → CI-gated suite → calibrated and statistical. Nobody arrives at the top rung on day one, and trying to is its own failure mode — the same authors who popularized evals warn against writing evaluators for errors you imagine rather than errors you discover. You climb by looking at real outputs and letting the suite grow from what actually breaks. Each rung is cheap relative to the one above it and catches a class of failure the one below cannot. Skip a rung and you pay for the expensive machinery without the cheap coverage underneath it.
The five levels at a glance
| Level | What it looks like | What it catches | What it misses | Trigger to level up |
|---|---|---|---|---|
| 0 — Vibe check | A human eyeballs a few outputs and decides "looks good." No dataset, no recorded criteria. | Egregious, obvious breakage on the handful of cases you happen to look at. | Everything you didn't look at. Hallucinations slip to prod; "good" is undefined and drifts per reviewer. | You're re-checking the same cases by hand on every change, or a regression reached users. |
| 1 — Deterministic code checks | Assertions and unit tests on verifiable properties: JSON parses, totals match, a regex blocks leaked UUIDs, the migration runs. | The obvious ~80% — format errors, schema violations, broken outputs, anything executable. | Anything subjective: tone, helpfulness, faithfulness, "is this answer actually right?" | You keep finding real failures that no assertion can express. |
| 2 — LLM-as-judge (validated) | A model grades the unverifiable dimensions, with binary pass/fail plus a written critique — and the judge is checked against human labels. | Subjective quality at scale: was it grounded, on-tone, complete, on-task. | Judge bias (position, verbosity, self-preference); drift; style rewarded over substance if uncalibrated. | You're grading more than ~100 outputs by hand, or you need the suite to gate merges automatically. |
| 3 — CI-gated suite | The layered stack runs on every change as a build gate. Regression evals near 100%; capability evals deliberately start low. Built from error analysis on real failures. | Regressions before users do — model upgrades, prompt edits, and agent changes that backslide. | That a single green run is real. No error bars, no reliability vs. capability split, no production signal. | Your deltas are inside the noise, or prod surprises you despite a green suite. |
| 4 — Calibrated & statistical + online | Error bars and N on every score; pass^k reliability; judge calibration (Cohen's κ) on a cadence; online evals on sampled production traffic. | Noise masquerading as progress; unreliable-but-capable agents; judge drift; novel and drifting prod failures. | Nothing a closed, finite spec can't — contamination, saturation, and Goodharting still need active defense. | This is the top rung. The work shifts from climbing to defending the suite. |
Level 0 — Vibe check
Someone runs the feature, reads a few responses, and ships on "looks right." There is no dataset and no written-down definition of done, so "good" means whatever the reviewer felt that afternoon. The codex is blunt about this rung: subjective spot-checks don't scale and let hallucinations slip to production. It is not worthless — a vibe check is a real, fast signal of day-to-day usefulness, and even rigorous benchmark teams keep a vibe-checking tier alongside hard capability probing. But as your only guardrail it fails silently, because the failures live in the cases you never looked at.
Trigger to level up: You catch yourself manually re-checking the same handful of cases on every change, or a regression you "would have noticed" reached users anyway.
Level 1 — Deterministic code checks
The cheapest real eval is a code assertion, and it should be your first and largest layer. Wherever the thing you care about is verifiable — the total matches, the JSON parses, the migration runs, a regex catches a leaked UUID — write a deterministic check and run it constantly. The Rechat case study in the codex got past a prompt-engineering plateau on the back of hundreds of cheap unit-test-style assertions. This rung catches the obvious 80% of failures before you ever pay for a model to grade anything, which is exactly why skipping it to jump straight to an LLM judge is a waste. See why unit tests aren't enough for what this layer can and can't reach.
Trigger to level up: You keep finding real, recurring failures (wrong tone, ungrounded claims, subtly incorrect answers) that no code assertion can express. That is the boundary of determinism.
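As a rough illustration of this rung, here is a minimal sketch of three deterministic checks from the examples above: the JSON parses, the total matches, and a regex catches a leaked UUID. The output schema (`line_items`, `amount_cents`, `total_cents`, `summary`) and the function name `check_invoice_output` are hypothetical, invented for this example.

```python
import json
import re

# Matches canonical 8-4-4-4-12 hex UUIDs anywhere in a string.
UUID_RE = re.compile(
    r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b",
    re.IGNORECASE,
)


def check_invoice_output(raw: str) -> list:
    """Return failure messages for one model output; an empty list means pass.

    The schema checked here is a hypothetical example, not a real API.
    """
    # Check 1: the output must parse as a JSON object.
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]

    failures = []

    # Check 2: the stated total must equal the sum of the line items.
    # Integer cents avoid floating-point rounding surprises.
    items = data.get("line_items", [])
    expected = sum(item.get("amount_cents", 0) for item in items)
    if data.get("total_cents") != expected:
        failures.append(
            f"total_cents is {data.get('total_cents')}, line items sum to {expected}"
        )

    # Check 3: no internal identifier may leak into user-facing text.
    if UUID_RE.search(str(data.get("summary", ""))):
        failures.append("summary leaks a UUID")

    return failures
```

Each check is a few lines and costs nothing to run, so the whole set can run on every change. Returning a list of messages rather than a single boolean keeps the failure reasons readable when several checks trip at once.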
Level 2 — LLM-as-judge for the unverifiable
Where no deterministic check exists, you grade with a model. A strong judge reaches roughly human-level agreement (~80% on the right tasks) — but it carries position, verbosity, and self-preference bias, lands near random on objectively-verifiable correctness, and can reward style over substance. So the rung is not "add an LLM judge"; it is "add a judge and validate it against human labels." Prefer binary pass/fail with a written critique over a 1–5 Likert dashboard — the codex names arbitrary uncalibrated scales and generic "helpfulness" judges as anti-patterns that manufacture false confidence. Reaching this rung without the validation step means you've traded a known-fallible human for an unknown-fallible model and called it rigor.
Trigger to level up: You're hand-grading more than ~100 outputs per cycle, or you need the suite to block a merge automatically rather than waiting for someone to run it.
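The validation step can be as small as comparing the judge's binary verdicts with human labels on the same outputs. The sketch below computes raw agreement and Cohen's κ for two binary raters. The function names are illustrative, and returning 0.0 when κ is undefined is a conservative choice made for this example, not a standard.

```python
def raw_agreement(human: list, judge: list) -> float:
    """Fraction of items where the judge's pass/fail matches the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohen_kappa(human: list, judge: list) -> float:
    """Cohen's kappa for two binary raters: agreement corrected for chance."""
    observed = raw_agreement(human, judge)
    n = len(human)
    p_human = sum(human) / n  # share of items the human passed
    p_judge = sum(judge) / n  # share of items the judge passed
    # Agreement expected if both raters labelled independently at these rates.
    expected = p_human * p_judge + (1 - p_human) * (1 - p_judge)
    if expected == 1.0:
        # Both raters gave one constant label; kappa is undefined.
        # Assumption for this sketch: report 0.0 (no evidence of real agreement).
        return 0.0
    return (observed - expected) / (1 - expected)
```

Raw agreement alone flatters a judge when most outputs pass, because two raters who both say "pass" nearly every time agree often by chance. κ discounts that, which is why it is the better number to track when the pass rate is lopsided.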
Level 3 — CI-gated suite
Now the layered stack — deterministic checks, validated judge, human sampling — runs on every change as a build gate, blocking a deploy when scores fall below threshold. The defining move at this rung is the split the codex draws from Anthropic's practice: regression evals should sit near 100% pass and catch backsliding on model upgrades and prompt edits, while capability evals deliberately start at low pass rates as bets on what the system should soon be able to do. Two eval types, opposite target scores. And the suite is grown from error analysis — 20–50 tasks drawn from real failures (bug trackers, support queues, production traces), not an imagined matrix of query types. The mechanics of building one are in how to write evals for an AI coding agent.
Trigger to level up: Your before/after deltas are small enough to be noise, you can't tell capability from reliability, or production surprises you despite an all-green run.
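One way to encode the two-eval-types split in a build gate is sketched below. Regression evals block the build when their pass rate drops under a floor; capability evals are reported but never gate. The `EvalResult` type, the `ci_gate` function, and the 0.98 floor (a stand-in for "near 100%") are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalResult:
    name: str
    kind: str  # "regression" or "capability"
    passed: bool


# Assumed threshold for this sketch; pick a floor that fits your own suite.
REGRESSION_FLOOR = 0.98


def pass_rate(results: list, kind: str) -> Optional[float]:
    """Pass rate over results of one kind, or None if there are none."""
    subset = [r for r in results if r.kind == kind]
    if not subset:
        return None
    return sum(r.passed for r in subset) / len(subset)


def ci_gate(results: list) -> tuple:
    """Return (ok, report). Only regression evals can fail the build."""
    regression = pass_rate(results, "regression")
    capability = pass_rate(results, "capability")
    if regression is None:
        # An empty gate should not read as a green build.
        return False, "no regression evals ran; refusing to pass an empty gate"
    ok = regression >= REGRESSION_FLOOR
    cap_text = "n/a" if capability is None else f"{capability:.0%}"
    report = (
        f"regression {regression:.0%} (floor {REGRESSION_FLOOR:.0%}), "
        f"capability {cap_text} (tracked, not gating)"
    )
    return ok, report
```

In a CI script the boolean would typically become the process exit code, for example `sys.exit(0 if ok else 1)`, with the report printed to the build log. Keeping capability evals out of the gate is what lets them start at low pass rates without blocking every merge.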
Level 4 — Calibrated, statistical, and online
The top rung reads results like a statistician and watches production. Four things change:
- Error bars and N. An eval is an experiment sampling from a super-population. Report standard errors (clustered when questions come in groups — they can run more than 3× the naïve SE), run multiple samples per question, and use paired, question-level inference for model comparisons. A score without an interval is a point estimate pretending to be a measurement.
- pass^k reliability, not a flattering peak. Distinguish whether the system can pass (capability, pass@k) from whether it will every time (reliability, pass^k). A 70%-reliable agent reads as ~97% at pass@3 but ~34% at pass^3. Ship on reliability.
- Judge calibration on a cadence. Re-check the LLM judge against fresh human labels and track agreement (e.g., Cohen's κ ≥ 0.60) so you catch drift before it quietly moves your pass rate.
- Online evals on real traffic. Offline/CI evals catch known regressions; online evals on sampled production traffic catch drift, novel inputs, and silent provider model changes. Close the loop — today's production failure becomes tomorrow's golden-set case.
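The arithmetic behind the first two bullets fits in a few lines. This sketch assumes independent attempts with a fixed per-attempt success rate `p`, which is a simplification, and it uses the plain standard error of the mean; the clustered standard errors mentioned above need a per-group adjustment that is not shown. All function names are illustrative.

```python
import math
from statistics import mean, stdev


def pass_at_k(p: float, k: int) -> float:
    """Capability: chance that at least one of k independent attempts passes."""
    return 1 - (1 - p) ** k


def pass_hat_k(p: float, k: int) -> float:
    """Reliability (pass^k): chance that all k independent attempts pass."""
    return p ** k


def mean_and_se(scores: list) -> tuple:
    """Mean score and its naive standard error (sample stdev / sqrt(n))."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores to estimate a standard error")
    return mean(scores), stdev(scores) / math.sqrt(n)


def paired_delta(before: list, after: list) -> tuple:
    """Mean per-question difference (after - before) and its standard error.

    Both lists must score the same questions in the same order, so that
    question difficulty cancels out of the comparison.
    """
    if len(before) != len(after):
        raise ValueError("paired comparison needs the same questions in both runs")
    diffs = [a - b for b, a in zip(before, after)]
    return mean_and_se(diffs)
```

With `p = 0.7` and `k = 3`, `pass_at_k` gives about 0.973 and `pass_hat_k` gives about 0.343, the ~97% and ~34% quoted above. A rough 95% interval is the mean plus or minus 1.96 standard errors under a normal approximation; if a paired delta's interval includes zero, the change is inside the noise.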
Reaching this rung does not make the suite trustworthy by default — it makes it trustworthy enough to be worth attacking. Which is the catch: the higher you climb, the more the failure modes shift from "we have no signal" to "our signal is being gamed."
The failure modes you inherit at the top
Every eval is a Goodhart target — "when a measure becomes a target, it ceases to be a good measure." The codex catalogs the specific ways a sophisticated suite still misleads, and they get worse as you mature, because a more powerful suite is a more attractive thing to optimize against:
- Contamination: Public eval items leak into training data, so a green run can measure memorization, not skill; de-contaminating has moved scores by double digits. Prefer freshly-authored, private, or time-gated cases.
- Saturation: When every candidate goes all-green, that's a signal your suite stopped discriminating, not that everything is equally good. Escalate difficulty and add held-out tasks.
- Goodharting / gaming: An optimizing agent will satisfy the literal pass condition while missing the intent; reward-hacking a gameable eval has been shown to generalize into broader misalignment. Harden graders, isolate environments, read the transcripts.
- Style-over-substance judges: LLM-judge preferences can fail to correlate with factuality, safety, or instruction-following. Calibration is not optional at this rung; it's the load-bearing defense.
Score yourself
Pick the highest level where you can answer yes to its own line and to the line for every level beneath it. Most teams overestimate by a rung, because a few good assertions feel like a "suite."
- L1: Do verifiable properties (parsing, totals, schema) have automated assertions that run on every change?
- L2: Is anything subjective graded by a judge that has been validated against human labels (binary, with critiques, not a Likert dashboard)?
- L3: Does the suite gate CI, with regression evals near 100% and capability evals tracked separately, all grown from real failures?
- L4: Do your scores carry error bars and N, report pass^k, recalibrate the judge on a cadence, and run on sampled production traffic?
If you can't say yes to the L1 line, you're at Level 0 — and that's the cheapest gap to close.
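The scoring rule is cumulative: a yes at a higher level does not count if any lower level is a no. A minimal sketch of that rule, with a hypothetical function name and a dict of yes/no answers keyed by level number:

```python
def maturity_level(answers: dict) -> int:
    """Return the highest level whose line, and every line beneath it, is a yes.

    `answers` maps level numbers 1..4 to True/False; missing levels count as no.
    """
    level = 0
    for rung in (1, 2, 3, 4):
        if not answers.get(rung, False):
            break  # the first "no" caps the score, whatever comes after it
        level = rung
    return level
```

For example, a team that answers yes to L1, L2, and L4 but no to L3 scores Level 2: the statistical machinery does not count until the suite actually gates CI.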
How to move up one level this quarter
- Look at your data first. Spend ~30 minutes reading 20–50 real outputs and write a note on the first failure in each. This is the highest-leverage activity at every rung, and it tells you which level you actually need next.
- Convert each recurring failure into the cheapest check that catches it. Code assertion if you can execute it; a validated judge prompt only if you can't.
- Add the one missing capability for your target rung — a CI gate (L3) or error bars plus a pass^k report and a first online eval on sampled traffic (L4) — and nothing else. One rung per quarter.
- Write down what each eval claims to measure and check the gap between the name and what the score rewards. This is what stops a mature suite from being confidently wrong.
- Feed production failures back into the golden set so the spec keeps co-evolving with reality instead of ossifying.
Start at the foundations — what an eval even is in eval-driven development — or see how the rhythm maps to testing in eval-driven development vs. TDD.
Grounded in the EDD codex — esp. Part VI (the practice and the vibe-check → eval-suite maturity progression), the cross-cutting synthesis, Part I (eval statistics, pass@k vs pass^k), and Part VIII (contamination, saturation, Goodharting, and the failure modes that intensify at higher rungs).