Eval-driven development vs. TDD and BDD
Eval-driven development is the third entry in a family of practices that all make the same move: write the check first, then build until it passes. Test-driven development put that check in code. Behavior-driven development made it readable and shared. Eval-driven development extends it to the one thing the first two assumed away — output that isn't deterministic.
The move all three share
TDD, BDD, and EDD are all forms of executable specification. In each, you state what "done" means as something you can run, before or as you build, and you let that artifact drive the work and guard against regressions. The differences are about who writes the check, in what language, and what kind of system it grades.
Test-driven development (TDD)
Kent Beck's TDD is the tight red-green-refactor loop: write a failing unit test, make it pass with the simplest change, refactor. The test is an executable spec written by a developer, at the level of a unit, and graded by exact match: the function returns 4 or it doesn't. It is fast, precise, and unambiguous, and it is the bedrock the other two build on.
Behavior-driven development (BDD)
Dan North's BDD grew out of TDD to answer two questions TDD left open: what should you test, and how do you keep the spec aligned with what the business actually wants. BDD shifts the vocabulary from "tests" to behavior, written in a shared, near-natural language so that developers, QA, and business stakeholders — the "three amigos" — agree on it together. Its signature form is specification by example: Given-When-Then scenarios (Gherkin, Cucumber) that are both human-readable and executable. BDD is outside-in (start from the behavior a user wants) where TDD is inside-out (start from the unit). Under the hood the assertions are still deterministic; what BDD adds is readability, a behavioral frame, and collaboration.
Eval-driven development (EDD)
EDD keeps the family's spine and extends it to AI-assisted and agent software, where the output is non-deterministic. You define evals — a dataset of real inputs, a success criterion, and a grader — and the AI iterates until they pass. The leap is that the thing under test can be probabilistic (the same input can pass once and fail the next run), and the criterion is often not exact-matchable. So EDD adds two things neither predecessor needed: graders that can be code, an LLM-as-judge, or a human, and results read as statistics rather than a single green tick.
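A minimal sketch of that three-part shape, in JavaScript to match the examples in this article. Every name here (`dataset`, `containsOrderId`, `fakeAgent`, `runEval`) is hypothetical, the two inputs are made up, and the agent is a stub standing in for a real model call. The point is only that the result is a pass rate over a dataset, not a single boolean.

```javascript
// Hypothetical eval: dataset + criterion + grader. Not a real framework API.

// Dataset: inputs you collected (these two are invented for illustration).
const dataset = [
  { id: "case-1", input: "Where is my order 1234?" },
  { id: "case-2", input: "Cancel order 5678 please." },
];

// Criterion: the reply must mention the order number from the input.
// Grader: plain code here; an LLM-judge or a human could fill the same slot.
function containsOrderId(input, output) {
  const match = input.match(/\d+/);
  return match !== null && output.includes(match[0]);
}

// Stub agent (assumption): stands in for a non-deterministic model call.
// It drops the order number on every third call to imitate flakiness.
let calls = 0;
function fakeAgent(input) {
  calls += 1;
  const id = input.match(/\d+/)[0];
  return calls % 3 === 0 ? "We are looking into it." : `Order ${id} is on its way.`;
}

// Run every case `trials` times and report a pass rate per case.
function runEval(agent, cases, grader, trials) {
  return cases.map(({ id, input }) => {
    let passes = 0;
    for (let t = 0; t < trials; t += 1) {
      if (grader(input, agent(input))) passes += 1;
    }
    return { id, passes, trials, passRate: passes / trials };
  });
}

const report = runEval(fakeAgent, dataset, containsOrderId, 3);
console.log(report);
```

With this stub, each case passes two of its three trials. A single run could have shown either a green or a red result for the same input, which is why the report keeps the counts.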
Side by side
| | TDD | BDD | EDD |
|---|---|---|---|
| Originated | Kent Beck (~2003) | Dan North (~2006) | Emerging (2024–2026) |
| Drives | Deterministic code | Behavior + shared understanding | AI-assisted and agent behavior |
| The "spec" is | A unit test | A Given-When-Then scenario | An eval (dataset + criterion + grader) |
| Written by | Developers | Devs + business + QA | Devs + domain experts (from real failures) |
| Expressed in | Code | Near-natural language (Gherkin) | Examples + a rubric (code or judge) |
| System under test | Deterministic units | Deterministic behavior, end-to-end | Non-deterministic model / agent |
| Grader | Code assertion | Step definitions (code) | Code or LLM-judge or human |
| Result | Binary, exact match | Binary, scenario pass | Graded; statistical (pass^k, error bars) |
| Spec is set | Up front (test-first) | Up front (with stakeholders) | Discovered via error analysis (criteria drift) |
| Typical failure | Brittle, over-specified tests | Scenario bloat / "Cucumber theater" | Contamination, judge bias, Goodharting |
The same feature, three ways
The family resemblance is clearest when you write the same intent in each style.
TDD, a developer-level assertion graded by exact match:

    // TDD: a deterministic unit test, written first.
    test("adds two numbers", () => {
      expect(add(2, 2)).toBe(4);
    });

BDD, the behavior in language everyone can read:

    # BDD: a behavior scenario in shared, near-natural language.
    Feature: Shopping cart
      Scenario: Adding a second item
        Given a cart containing 1 item
        When I add another item
        Then the cart shows 2 items

EDD, the same Given-When-Then shape, but the "Then" is graded, and a single run isn't enough:

    # EDD: an eval case. Same Given/When/Then shape,
    # but the "Then" is graded, and you read it statistically.
    Given: a customer email asking for a refund (case #214)
    When:  the support agent drafts a reply
    Then:  - reply is on-topic and polite   [grader: LLM-judge + rubric]
           - reply never promises a refund  [grader: code / regex]
           - no personal data is leaked     [grader: code]
    Run it 5 times; require pass^5 (passes every time), not a lucky run.

EDD is closest to BDD, with one big twist
It's tempting to frame EDD as "TDD for AI," but it is spiritually nearer to BDD. Both specify behavior by example; both are outside-in (start from what a user should get); both depend on a shared understanding built with domain experts — BDD's ubiquitous language has a direct echo in EDD's rubric, hammered out with a "principal domain expert" who decides what good output looks like. An eval case even fits the Given-When-Then mould: Given an input, When the model or agent acts, Then the result satisfies a criterion.
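The Given-When-Then mapping can be sketched in code using the refund case from the earlier example. This is an illustration under assumptions, not a real eval framework: `refundCase`, `runTrial`, `passesEveryTime`, and both stub agents are hypothetical names, the two regexes are deliberately crude, and the judge-graded check ("on-topic and polite") is left out because it would need a model call.

```javascript
// Hypothetical eval case in Given/When/Then form. All names are illustrative.
const refundCase = {
  given: "Customer email asking for a refund",
  // When: the agent drafts a reply (the agent function is passed in below).
  then: [
    {
      name: "never promises a refund",
      // Code grader: a crude regex, for illustration only.
      grade: (reply) => !/\b(we will|we'll|you will|you'll)\b.*\brefund\b/i.test(reply),
    },
    {
      name: "no email address leaked",
      // Code grader: a rough email pattern, not a full personal-data detector.
      grade: (reply) => !/[\w.+-]+@[\w-]+\.[\w.]+/.test(reply),
    },
  ],
};

// One trial passes only if every "Then" check passes.
function runTrial(evalCase, draftReply) {
  const reply = draftReply(evalCase.given);
  return evalCase.then.every((check) => check.grade(reply));
}

// Require all k trials to pass (the pass^k bar), not just one of them.
function passesEveryTime(evalCase, draftReply, k) {
  for (let i = 0; i < k; i += 1) {
    if (!runTrial(evalCase, draftReply)) return false;
  }
  return true;
}

// Stub agents (assumption): stand-ins for a real, non-deterministic model.
const SAFE_REPLY = "Thanks for writing in. I have passed your request to our billing team.";
const steadyAgent = () => SAFE_REPLY;
function makeFlakyAgent() {
  let n = 0;
  // Every fourth reply promises a refund, which the first grader rejects.
  return () => {
    n += 1;
    return n % 4 === 0 ? "Good news: we'll refund you today." : SAFE_REPLY;
  };
}

console.log(passesEveryTime(refundCase, steadyAgent, 5));      // true
console.log(passesEveryTime(refundCase, makeFlakyAgent(), 5)); // false
```

The flaky stub would look fine on any of its first three runs. Only the repeated run exposes the fourth reply, which is the case for requiring pass^5 rather than a single pass.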
The twist is everything that follows from non-determinism:
- The "Then" is graded, not asserted. Where BDD checks an exact outcome, EDD often grades quality, behavior, or faithfulness — sometimes with an LLM-judge, which is itself fallible and must be validated against human labels.
- A pass is statistical. The same case can pass once and fail the next time, so you run it repeatedly and report reliability (pass^k, the probability that all k runs pass) with error bars, not a single green run.
- The spec is discovered. BDD writes scenarios up front with stakeholders; EDD leans on "criteria drift" — you learn your real criteria by grading actual outputs, so the eval set grows from error analysis, not an imagined list.
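The first two points can be made concrete with a little arithmetic. The sketch below assumes trials are independent and share one pass probability, which real agents may not satisfy. The function names are hypothetical, the Wilson score interval is one common choice of error bar rather than anything this article prescribes, and raw agreement is only the simplest way to compare a judge against human labels.

```javascript
// Sketch: reading repeated eval runs as statistics. Names are illustrative.

// Estimated per-trial pass rate from n trials.
const passRate = (passes, n) => passes / n;

// pass^k: chance that all k independent trials pass (reliability).
const passHatK = (p, k) => p ** k;

// pass@k: chance that at least one of k independent trials passes.
const passAtK = (p, k) => 1 - (1 - p) ** k;

// Wilson score interval for a pass rate (z = 1.96 gives roughly 95%).
function wilsonInterval(passes, n, z = 1.96) {
  const p = passes / n;
  const denom = 1 + (z * z) / n;
  const centre = (p + (z * z) / (2 * n)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / n + (z * z) / (4 * n * n))) / denom;
  return { low: centre - half, high: centre + half };
}

// Fraction of cases where the LLM-judge verdict matches the human label.
function judgeAgreement(judgeLabels, humanLabels) {
  if (judgeLabels.length !== humanLabels.length) {
    throw new Error("label count mismatch");
  }
  const same = judgeLabels.filter((label, i) => label === humanLabels[i]).length;
  return same / judgeLabels.length;
}

const p = passRate(18, 20); // 18 passes out of 20 trials
console.log(passAtK(p, 5).toFixed(3));  // "1.000"
console.log(passHatK(p, 5).toFixed(3)); // "0.590"
console.log(wilsonInterval(18, 20));
console.log(judgeAgreement([true, true, false, true], [true, false, false, true])); // 0.75
```

A 90% per-trial pass rate looks nearly perfect under pass@5 but drops to about 59% under pass^5, and with only 20 trials the interval on that 90% runs from roughly 0.70 to 0.97.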
Do they replace each other? No — they layer
EDD does not retire TDD or BDD; it extends the family to the parts of a system that classical tests can't express. In a modern AI product all three coexist:
- TDD for the deterministic logic around the model — the plumbing, the parsing, the business rules.
- BDD for end-to-end behavior you can pin to an exact outcome, in language the whole team shares.
- EDD for the model and agent behavior you can't pin down deterministically — graded, sampled, and read like an experiment.
Reach for the cheapest one that fits: a deterministic test if the answer is verifiable, a behavior scenario if a stakeholder needs to read it, and an eval only when the output is probabilistic enough that exact-match would lie.
Bottom line
Eval-driven development is the AI-era heir to a lineage that goes back more than two decades, not a break from it. If you've done TDD or BDD, you already know the rhythm: write the check first, build to pass it, gate it in CI, guard against regressions. EDD asks you to add three things the probabilistic world demands: graders that can be models, results read as statistics, and a spec you discover by looking at real failures.
For the focused head-to-head, see eval-driven development vs. test-driven development. Then start with the definition, learn how to write evals for a coding agent, or dig into the evidence in the codex.
Grounded in the EDD codex — Part VI (the practice and the TDD analogy), Part I (evals as experiments, pass@k vs pass^k), Part II (LLM-as-judge and its biases). TDD (Kent Beck) and BDD (Dan North; Gherkin/Cucumber; specification by example) are the analogy anchors.