Eval-Driven Development

How to write evals for an AI coding agent

You handed an agent your repo and it opened a pull request. How do you know the change is good — not just that the agent said it fixed the issue? Evals are how. For coding agents the answer is unusually clean: the tests are the spec, and an eval is a task the agent must make pass.

Step 1 — Start from real failures, not imagined ones

The single highest-leverage move is error analysis: look at what your agent actually gets wrong. Pull 20–50 real tasks from your bug tracker, recent pull requests, and the agent's own transcripts, then categorize the failures. Those failures are your first eval set: the spec is discovered by reading outputs, not invented at a whiteboard.
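As a rough illustration of the categorizing step, the sketch below tallies hand-assigned failure labels so the most common categories surface first. The task IDs and category names are made up for the example; real labels would come from your own reading of the failures.

```python
from collections import Counter

# Hypothetical (task_id, category) pairs, labeled by hand while reading failures.
labeled_failures = [
    ("task-a", "edited the wrong file"),
    ("task-b", "off-by-one in boundary logic"),
    ("task-c", "edited the wrong file"),
    ("task-d", "broke an unrelated test"),
]


def rank_failure_categories(failures):
    """Return (category, count) pairs, most frequent category first."""
    counts = Counter(category for _task_id, category in failures)
    return counts.most_common()


for category, count in rank_failure_categories(labeled_failures):
    print(f"{count:3d}  {category}")
```

The categories at the top of the list are reasonable candidates for the first eval tasks.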

Step 2 — Make each task executable

An agent eval is a task with three parts: a starting repo state, an instruction, and a grader made of tests. Borrow the pattern that the SWE-bench family standardized: tests that fail before the change must pass after it (fail_to_pass), and tests that already pass must stay green (pass_to_pass). Run it in a container so the result is reproducible.

task: "fix-pagination-off-by-one"
repo_state: "your-repo at the commit before the fix"
instruction: |
  Users report the last item on each page is missing.
  Fix the pagination bug in src/list.py.
grade:
  fail_to_pass:        # must PASS after the change (proves the fix)
    - tests/test_list.py::test_last_item_visible
  pass_to_pass:        # must STILL pass (no regressions)
    - tests/test_list.py::test_first_page
    - tests/test_list.py::test_empty_list
  trials: 5            # run 5x; report pass^5, not a lucky single run

That is the whole idea: apply the agent's diff, run the prescribed tests, mark the task resolved only if the fix tests pass and nothing regressed. It is unit testing pointed at a patch instead of a function.
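The grading rule can be sketched in a few lines of Python. This is a minimal sketch, not a full harness: it assumes the agent's diff has already been applied to a checkout at repo_dir, that pytest is installed there, and that the test IDs are pytest node IDs like the ones in the task above. The function names run_test, is_resolved, and grade are illustrative.

```python
import subprocess
import sys


def run_test(repo_dir: str, test_id: str) -> bool:
    """Run one pytest node ID inside repo_dir; True if pytest exits with code 0."""
    completed = subprocess.run(
        [sys.executable, "-m", "pytest", test_id, "-q"],
        cwd=repo_dir,
        capture_output=True,
    )
    return completed.returncode == 0


def is_resolved(fail_to_pass: dict[str, bool], pass_to_pass: dict[str, bool]) -> bool:
    """Resolved only if every fix test passes and no existing test regressed."""
    # An empty fail_to_pass list proves nothing, so it never counts as resolved.
    fix_proven = bool(fail_to_pass) and all(fail_to_pass.values())
    no_regressions = all(pass_to_pass.values())
    return fix_proven and no_regressions


def grade(repo_dir: str, fail_to_pass_ids: list[str], pass_to_pass_ids: list[str]) -> bool:
    """Run the prescribed tests against the patched repo and apply the rule."""
    fail_to_pass = {t: run_test(repo_dir, t) for t in fail_to_pass_ids}
    pass_to_pass = {t: run_test(repo_dir, t) for t in pass_to_pass_ids}
    return is_resolved(fail_to_pass, pass_to_pass)
```

Keeping is_resolved separate from the subprocess call makes the rule itself easy to unit test without a repo on disk.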

Step 3 — Grade the outcome, not the path

Resist the urge to assert an exact sequence of tool calls. Agents routinely find valid approaches you didn't anticipate, so step-by-step matching produces brittle evals that fail on good work. Grade what the agent produced — the tests pass, the final state is correct — and reserve trajectory checks for cases where the process genuinely matters (e.g. "must not touch the payments module"). For long tasks, allow partial credit: an agent that localizes the bug but botches the fix is further along than one that flails immediately, and your eval should be able to see that.
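One way to combine outcome grading, a single trajectory constraint, and partial credit is sketched below. Everything specific here is an assumption made for illustration: the src/payments/ prefix, the 0.5 weight for localizing the bug, and the idea of detecting localization by checking whether the agent touched any file the reference fix touched.

```python
def touched_forbidden_path(changed_files, forbidden_prefixes):
    """True if any changed file lives under a path the agent must not edit."""
    return any(
        path.startswith(prefix)
        for path in changed_files
        for prefix in forbidden_prefixes
    )


def score_task(changed_files, reference_files, tests_passed,
               forbidden_prefixes=("src/payments/",)):
    """Score one attempt in [0.0, 1.0] from its outcome, not its tool calls.

    changed_files:   paths the agent's diff modified
    reference_files: paths the known-good fix modified (used for partial credit)
    tests_passed:    True if the task's prescribed tests all passed
    """
    # The one process rule that genuinely matters: stay out of forbidden code.
    if touched_forbidden_path(changed_files, forbidden_prefixes):
        return 0.0
    if tests_passed:
        return 1.0
    # Partial credit (assumed weight): right place, wrong fix.
    if set(changed_files) & set(reference_files):
        return 0.5
    return 0.0
```

Note that nothing in the score depends on which tools the agent called or in what order.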

Step 4 — Measure reliability, not a lucky run

Agents are non-deterministic, so a single green run is weak evidence. Run each task several times and report pass^k — the probability it passes every time — not just pass@k, the probability it passes at least once. A 70%-reliable agent looks like ~97% at pass@3 and ~34% at pass^3; for anything you'd let run unattended, the second number is the one that predicts your review burden.
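The arithmetic behind those two numbers is short enough to check directly. The sketch assumes each trial is independent and succeeds with the same probability p, which is a simplification; the function names are illustrative.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent trials passes."""
    return 1 - (1 - p) ** k


def pass_hat_k(p: float, k: int) -> float:
    """pass^k: probability that all k independent trials pass."""
    return p ** k


def estimate_success_rate(outcomes: list[bool]) -> float:
    """Estimate the per-trial success rate p from observed trial outcomes."""
    if not outcomes:
        raise ValueError("need at least one trial")
    return sum(outcomes) / len(outcomes)


p = 0.70
print(f"pass@3 = {pass_at_k(p, 3):.3f}")   # 1 - 0.3**3
print(f"pass^3 = {pass_hat_k(p, 3):.3f}")  # 0.7**3
```

With p = 0.70 this gives 0.973 and 0.343, matching the ~97% and ~34% figures above.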

Step 5 — Add a judge only for what code can't grade

Most of what matters about a code change is verifiable, so keep it on code-based graders. Where code genuinely can't decide ("is the PR description accurate?", "is this refactor readable?"), an LLM-as-judge can help, but treat it as code you have to test: pin its temperature, give it a concrete rubric, randomize the order in which it sees candidates, and validate it against your own labels until the two agree before you let it gate anything. Never use a judge for something a test could decide.
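Validating a judge can start as simply as measuring how often its verdicts match yours on a labeled sample. The sketch below assumes verdicts are plain strings such as "pass" and "fail"; the 0.9 agreement threshold is an arbitrary placeholder, not a recommended value.

```python
def agreement_rate(judge_verdicts: list[str], human_labels: list[str]) -> float:
    """Fraction of items where the judge's verdict equals the human label."""
    if len(judge_verdicts) != len(human_labels):
        raise ValueError("verdicts and labels must cover the same items")
    if not human_labels:
        raise ValueError("need at least one labeled item")
    matches = sum(j == h for j, h in zip(judge_verdicts, human_labels))
    return matches / len(human_labels)


def judge_may_gate(judge_verdicts, human_labels, min_agreement=0.9) -> bool:
    """Allow the judge to gate only once it agrees with human labels often enough."""
    return agreement_rate(judge_verdicts, human_labels) >= min_agreement


judge = ["pass", "fail", "pass", "pass"]
human = ["pass", "fail", "fail", "pass"]
print(agreement_rate(judge, human))   # 3 of 4 match
print(judge_may_gate(judge, human))
```

Raw agreement is the crudest possible measure; on a sample dominated by one label it can look high even for a judge that always gives the same answer, so inspect the disagreements as well as the rate.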

Step 6 — Gate in CI, and split regression from capability

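Wire the suite into the pipeline so it runs on every agent change and model upgrade. Keep two kinds of evals with opposite targets: regression evals, which should hold at roughly 100% and catch drift, and capability evals, which start low and show when the agent is ready for more autonomy.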
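A minimal sketch of a gate that treats the two suites differently follows. It assumes each suite's results arrive as a list of booleans, one per task. The 0.98 regression floor is an illustrative stand-in for "roughly 100%", and treating the capability suite as report-only is a design choice of this sketch rather than a rule.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of tasks that passed."""
    if not results:
        raise ValueError("suite has no results")
    return sum(results) / len(results)


def ci_gate(regression_results, capability_results, regression_floor=0.98):
    """Block the merge on regression drops; only report capability progress."""
    regression = pass_rate(regression_results)
    capability = pass_rate(capability_results)
    return {
        "regression_pass_rate": regression,
        "capability_pass_rate": capability,   # tracked over time, never blocks
        "block_merge": regression < regression_floor,
    }


report = ci_gate([True] * 9 + [False], [True, False, False, False])
print(report)
```

In a pipeline, the calling script would exit non-zero when block_merge is true so the CI job fails.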

Step 7 — Read the transcripts

The eval harness is the most common failure point: graders are wrong more often than you'd think. In one real example from the field, a capable model scored 42% on a benchmark until the grading bugs were fixed, then jumped to 95%. Read the transcripts of passes and failures alike, and watch specifically for leakage, weak tests, and reward hacking.
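One harness check that can be automated is a sanity pass over each task at its starting commit, before any agent runs. The sketch assumes you have already run the task's tests on the unmodified repo state and collected pass/fail results per test ID; audit_task is an illustrative name.

```python
def audit_task(base_fail_to_pass: dict[str, bool],
               base_pass_to_pass: dict[str, bool]) -> list[str]:
    """Return problems found when the task's tests run on the starting commit.

    Both arguments map test ID -> whether it passed BEFORE any change.
    """
    problems = []
    for test_id, passed in base_fail_to_pass.items():
        if passed:
            # A fix test that is already green cannot prove the fix.
            problems.append(f"{test_id}: passes before any change")
    for test_id, passed in base_pass_to_pass.items():
        if not passed:
            # A regression test that is already red will fail every attempt.
            problems.append(f"{test_id}: already failing at the starting commit")
    return problems


print(audit_task(
    {"tests/test_list.py::test_last_item_visible": False},
    {"tests/test_list.py::test_first_page": True},
))
```

An empty list means the task's tests behave as the fail_to_pass and pass_to_pass labels claim. It does not replace reading transcripts, which is still how leakage and reward hacking get caught.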

The one-page checklist
  • 20–50 tasks from real failures, containerized and reproducible.
  • Each task = repo state + instruction + fail_to_pass & pass_to_pass tests.
  • Grade the outcome; allow partial credit; avoid exact-path matching.
  • Run multiple trials; report pass^k reliability.
  • LLM-judge only for the unverifiable, and validate it first.
  • CI gate; regression evals ~100%, capability evals start low.
  • Read transcripts; check for leakage, weak tests, and reward hacking.

Where this goes next

Once the harness exists, it becomes the substrate for everything else: regression evals that catch agent drift after a model upgrade, capability evals that tell you when to raise the agent's autonomy, and the safety property that lets an agent change your codebase without breaking it. That last one is the real payoff — evals are what make a codebase safely modifiable by AI.

See also: EDD vs TDD for how this relates to testing, and the codex (Parts III, IV, and VI) for the benchmarks, agent-eval methods, and evidence behind each step above.

Grounded in the EDD codex — Part III (execution-based grading, fail_to_pass/pass_to_pass, contamination), Part IV (agent trajectories, harness pitfalls, reliability), and Part VI (error analysis, CI gates, reading transcripts).