Eval-Driven Development

Eval-driven development vs. test-driven development

Eval-driven development borrows TDD's rhythm — write the check first, then build until it passes — and applies it to systems where the output is non-deterministic. The rhythm carries over. What changes is what a "check" can be, and how you read its result.

The shared shape

Test-driven development, as Kent Beck framed it, is a tight loop: write a failing test, make it pass with the simplest change, refactor. The test is an executable specification written before the code. EDD keeps that spine. You define an eval — a dataset, a success criterion, and a grader — and the AI iterates until the eval passes. "Evals are the new unit tests" is the slogan, and it is half right.
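The three parts of an eval named above (dataset, success criterion, grader) can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the names `Case`, `Eval`, `contains_expected`, and `stub_system` are hypothetical, the grader is a simple substring check, and the "system" is a stub standing in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Case:
    """One dataset row: an input and what a good answer must contain."""
    prompt: str
    expected: str


@dataclass(frozen=True)
class Eval:
    dataset: Sequence[Case]
    grader: Callable[[str, Case], bool]  # True when an output is acceptable
    threshold: float                     # success criterion: minimum pass rate

    def pass_rate(self, system: Callable[[str], str]) -> float:
        if not self.dataset:
            raise ValueError("an eval needs at least one case")
        results = [self.grader(system(case.prompt), case) for case in self.dataset]
        return sum(results) / len(results)

    def passes(self, system: Callable[[str], str]) -> bool:
        return self.pass_rate(system) >= self.threshold


def contains_expected(output: str, case: Case) -> bool:
    # Hypothetical deterministic grader: case-insensitive substring match.
    return case.expected.lower() in output.lower()


def stub_system(prompt: str) -> str:
    # Stand-in for a model call; it gets the second question wrong on purpose.
    answers = {"Capital of France?": "The capital is Paris.", "2 + 2?": "It is 5."}
    return answers.get(prompt, "")


capitals_eval = Eval(
    dataset=[Case("Capital of France?", "Paris"), Case("2 + 2?", "4")],
    grader=contains_expected,
    threshold=0.9,
)

print(capitals_eval.pass_rate(stub_system))
print(capitals_eval.passes(stub_system))
```

The eval is written before the system is good enough to pass it, which is the TDD move. Here the stub fails one of two cases, so the eval stays red until the system improves.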

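Several things genuinely carry over from TDD: the check is written first, the system is built until it passes, the suite gates changes in CI, and it guards against regressions.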

Where the analogy breaks

TDD was designed for deterministic software graded by exact match: the function returns 4 or it doesn't. LLM output is probabilistic, and a single response can be simultaneously accurate but too long, or well-formatted but incomplete. That difference cascades:

| Aspect | Test-driven development | Eval-driven development |
| System under test | Deterministic code | Non-deterministic model / agent behavior |
| Result | Binary pass/fail, exact match | Often graded across multiple dimensions; a statistical estimate |
| Graders | Code assertions | Code plus LLM-as-judge plus human review |
| Determinism of the check | The test itself is deterministic | An LLM-judge grader is itself non-deterministic and biased, so it must be validated |
| Reading the result | Green = done | Run many times; report reliability (pass^k) and error bars, not one run |
| When the spec is written | Mostly up front | Discovered by grading real outputs ("criteria drift") |
| Flakiness | A bug to eliminate | Inherent; managed with sampling and thresholds, not eliminated |
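To make "error bars, not one run" concrete, here is a minimal sketch that turns a count of passing runs into an interval. The Wilson score interval is one common choice for a binomial proportion and is an assumption here, not something the source prescribes. The function name `wilson_interval` and the 14-of-20 example are illustrative only.

```python
import math


def wilson_interval(passes: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 is roughly 95% confidence)."""
    if runs <= 0 or not 0 <= passes <= runs:
        raise ValueError("need runs > 0 and 0 <= passes <= runs")
    p = passes / runs
    denom = 1 + z * z / runs
    center = (p + z * z / (2 * runs)) / denom
    half_width = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return center - half_width, center + half_width


low, high = wilson_interval(passes=14, runs=20)
print(f"pass rate 0.70, interval roughly {low:.2f} to {high:.2f}")
```

With only 20 runs the interval is wide (roughly 0.48 to 0.85), which is the practical reason a single green run says little.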

Three differences that matter most

1. The grader can be a model — and that model is fallible. Where no deterministic check exists (tone, helpfulness, faithfulness), you grade with an LLM judge. A strong judge can reach roughly human-level agreement, but it carries position, verbosity, and self-preference bias, and lands near random on objectively-verifiable correctness. In TDD the test is the ground truth; in EDD you often have to validate the grader itself against human labels before you can trust a green run.
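One way to validate a judge against human labels is to measure chance-corrected agreement on a labelled sample. The sketch below uses Cohen's kappa; that metric is an assumption for illustration, and the labels are made-up example data, not results from any study.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(judge: Sequence[Hashable], human: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between judge verdicts and human labels."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    judge_freq, human_freq = Counter(judge), Counter(human)
    # Agreement expected if both raters labelled at random with their own base rates.
    expected = sum(judge_freq[label] * human_freq[label] for label in judge_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used the same single label throughout
    return (observed - expected) / (1 - expected)


# Illustrative labels only.
human_labels = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass"]
judge_labels = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "fail"]

kappa = cohens_kappa(judge_labels, human_labels)
print(f"kappa = {kappa:.2f}")
```

Raw agreement here is 6 of 8, but kappa is noticeably lower because some of that agreement would occur by chance. What counts as "good enough" agreement is a project decision, not something this sketch settles.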

2. A pass is statistical, not absolute. The same input can pass on one run and fail on the next. So an eval result is an estimate: run it multiple times, report a range, and distinguish capability (can it pass — pass@k) from reliability (does it pass every time — pass^k). A 70%-reliable agent reads as ~97% at pass@3 but ~34% at pass^3. TDD never had to make that distinction.
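The arithmetic behind those two numbers is short enough to check directly. The sketch assumes each attempt succeeds independently with the same probability `p`; the function names are illustrative.

```python
def pass_at_k(p: float, k: int) -> float:
    """Capability: probability that at least one of k independent attempts passes."""
    return 1 - (1 - p) ** k


def pass_hat_k(p: float, k: int) -> float:
    """Reliability (pass^k): probability that all k independent attempts pass."""
    return p ** k


p, k = 0.7, 3
print(f"pass@{k} = {pass_at_k(p, k):.3f}")   # 0.973
print(f"pass^{k} = {pass_hat_k(p, k):.3f}")  # 0.343
```

The same per-run success rate produces both figures, so which one gets reported changes the story a reader takes away.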

3. You can't write all the evals first. The strict "test-first" move doesn't fully transfer. The practitioners who popularized evals are explicit about it: write evaluators for the errors you discover, not the errors you imagine. You need to grade real outputs to learn what your criteria even are — so the eval suite is grown from error analysis, not authored up front. The spec and the evals co-evolve.

So do evals replace unit tests? No.

Deterministic tests are not obsolete under EDD — they are the first and cheapest layer of the eval stack. Use a code assertion wherever the thing you care about is verifiable (the total matches, the JSON parses, the migration runs), and catch the obvious 80% before you ever pay for a judge. The pattern that grades AI-written code at scale — running real test suites where the bug-fix tests must pass and the regression tests must stay green — is just unit testing applied to a patch. Evals extend testing to the things tests can't express: behavior, quality, and grounding.
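The layering described above can be sketched as a grader that runs cheap deterministic checks first and only calls a judge when they pass. Everything here is hypothetical: the `grade` function, the `total` field, and the `judge` callable, which stands in for an LLM-as-judge call.

```python
import json
from typing import Callable, Optional


def grade(
    output: str,
    expected_total: float,
    judge: Optional[Callable[[str], bool]] = None,
) -> tuple[bool, str]:
    """Deterministic checks first; the (costly) judge runs only if they pass."""
    try:
        payload = json.loads(output)  # layer 1: the JSON parses
    except json.JSONDecodeError:
        return False, "invalid JSON"
    if not isinstance(payload, dict) or "total" not in payload:
        return False, "missing total"
    total = payload["total"]
    if not isinstance(total, (int, float)) or abs(total - expected_total) > 1e-9:
        return False, "total mismatch"  # layer 2: the total matches
    if judge is not None and not judge(output):
        return False, "judge rejected"  # layer 3: quality, graded by a model
    return True, "ok"


judge_calls = []


def stub_judge(text: str) -> bool:
    # Stand-in for an LLM judge; records calls so the short-circuit is visible.
    judge_calls.append(text)
    return True


print(grade("not json", 10.0, stub_judge))
print(grade('{"total": 10.0, "summary": "Order total is ten."}', 10.0, stub_judge))
print(len(judge_calls))
```

The malformed output is rejected without the judge ever being called, which is the cost saving the paragraph describes.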

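The rule of thumb: if a code assertion can verify it, write a test; if it takes judgment about behavior, quality, or grounding, write an eval.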

The same caveat as TDD, only sharper

A passing suite is necessary, not sufficient. TDD could give false confidence if the tests were shallow; evals add new ways to be fooled: contamination, a judge that rewards style over substance, an agent that games the grader. Treat green as "no known regressions," not "correct," and pair evals with held-out tests and real-world feedback.

Bottom line

Eval-driven development is TDD's successor for the AI era, not a replacement for testing. Keep the discipline — check first, build to pass, gate in CI, guard against regressions — and add three things the probabilistic world demands: graders that can be models, results read as statistics, and a spec you discover by looking at real failures. Tests tell you the code is right; evals tell you the behavior is good enough to keep.

Grounded in the EDD codex — esp. Part VI (the practice and the TDD analogy), Part I (eval statistics, pass@k), Part II (LLM-as-judge bias), and Part III (execution-based grading of code).