Eval-driven development with Claude Code, Cursor, and Copilot

The coding assistant in your editor is a generator. It produces a diff and tells you, confidently, that it works. What turns that into something you can actually ship is a verifier — an eval suite the change has to pass. Get that asymmetry right and the choice of assistant becomes a detail. The eval suite is the spec and the guardrail; the tool is just the thing that types.

The workflow comes first; the tool is interchangeable

Before naming any product, fix the loop, because it is the same regardless of which assistant you point at the repo:

  1. Write the eval as the spec. For coding work this is unusually clean: the tests are the spec. An eval is a task — a starting repo state, an instruction, and a grader made of tests that must pass (fail_to_pass) while the existing tests stay green (pass_to_pass). The agent's job is to make that grader go green.
  2. Point the agent at the repo and let it iterate. Modern agent modes run your test suite, read the failures, and try again on their own. Feeding test output back into the loop is what makes the agent converge instead of guess.
  3. Gate the PR on the suite. The agent's word that it "fixed the issue" is not evidence. The eval suite passing in CI is. Block the merge on it.
  4. Re-run regression evals on every model upgrade. A new model version is a silent dependency change. Keep regression evals near 100% and run them whenever the underlying model moves, so drift surfaces in CI rather than in production.
  5. Measure reliability, not a lucky run. Agents are non-deterministic, so run each task several times and report pass^k — the chance it passes every time — not pass@k, the chance it passes at least once.

That is eval-driven development with a coding assistant in five steps: the assistant proposes, the suite disposes. Everything below is how to wire that loop into each tool.
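As a rough illustration of step 1, a task and its grader could be modeled as below. This is a minimal sketch, not a reference implementation: the names `EvalTask` and `is_resolved`, the example repo ref, the instruction, and the test ids are all hypothetical. Only the `fail_to_pass` / `pass_to_pass` split comes from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalTask:
    """One eval task: a repo state, an instruction, and the tests that grade it."""
    repo_ref: str                  # starting repo state (hypothetical commit id)
    instruction: str               # what the agent is asked to do
    fail_to_pass: tuple[str, ...]  # tests that fail before the fix and must pass after
    pass_to_pass: tuple[str, ...]  # tests that already pass and must stay green


def is_resolved(task: EvalTask, results: dict[str, bool]) -> bool:
    """Resolved only if every fix test passes and nothing regressed.

    `results` maps a test id to whether it passed after the agent's diff
    was applied. A test missing from `results` is treated as a failure.
    """
    fixed = all(results.get(t, False) for t in task.fail_to_pass)
    no_regressions = all(results.get(t, False) for t in task.pass_to_pass)
    return fixed and no_regressions


# Hypothetical task, for illustration only.
task = EvalTask(
    repo_ref="abc1234",
    instruction="Make parse_date accept ISO week dates.",
    fail_to_pass=("tests/test_dates.py::test_iso_week",),
    pass_to_pass=("tests/test_dates.py::test_plain_date",),
)

all_green = {
    "tests/test_dates.py::test_iso_week": True,
    "tests/test_dates.py::test_plain_date": True,
}
regressed = {
    "tests/test_dates.py::test_iso_week": True,
    "tests/test_dates.py::test_plain_date": False,  # the fix broke an existing test
}

print(is_resolved(task, all_green))
print(is_resolved(task, regressed))
```

The grader looks only at test outcomes, never at how the agent got there, which is what lets the same check run unchanged in the agent's local loop and in CI.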

Why this is the safety property

A 70%-reliable agent reads as ~97% at pass@3 (1 - 0.3^3) but only ~34% at pass^3 (0.7^3); across three consecutive runs it fails the spec at least once about two-thirds of the time. If your gate merges on the first green run out of a few attempts, you are shipping that 34% as if it were 97%. What makes the codebase safely modifiable by an agent is the gate, not the agent's confidence.
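The arithmetic behind those two numbers can be checked in a few lines. This sketch assumes each run is independent and has the same per-run success probability `p`, which is a simplification. The function names are illustrative.

```python
def pass_at_k(p: float, k: int) -> float:
    """Chance that at least one of k independent runs passes."""
    return 1.0 - (1.0 - p) ** k


def pass_hat_k(p: float, k: int) -> float:
    """Chance that all k independent runs pass (written pass^k in the text)."""
    return p ** k


p, k = 0.70, 3
print(f"pass@{k} = {pass_at_k(p, k):.3f}")   # 0.973
print(f"pass^{k} = {pass_hat_k(p, k):.3f}")  # 0.343
```

The gap between the two grows with `k`: pass@k rises toward 1 while pass^k falls toward 0, so the same agent looks better or worse depending on which one the gate reports.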

Tests-as-spec is the move that makes agents reliable

The cleanest way to use any of these tools is to hand the agent a failing test and ask it to make the test pass, then keep the test as the regression guard forever. This borrows the pattern the SWE-bench family standardized: apply the diff, run the prescribed tests, mark the task resolved only if the fix tests pass and nothing regressed. Two cautions the codex is blunt about: weak tests, which let a wrong fix go green, and contamination, where the task or its fix may already be in what the model has seen.

Per-tool notes (as of 2026, qualitative on purpose)

Capabilities and exact menu names move fast, so what follows is deliberately about the capability rather than a specific button. All three of these tools converge on the same shape: an agent mode that can run your test/eval suite and iterate, plus a CI path that gates the resulting PR. Pick on ergonomics; the eval discipline is what carries the quality.

| Tool | Agent/auto capability | How EDD attaches |
| --- | --- | --- |
| Claude Code | A terminal-native agent that edits files, runs commands, and works multi-step against your repo. | Let it run your test/eval command in the loop so it iterates on real failures; gate the PR it opens with the same suite in CI. Capability and regression evals are the unit of work. |
| Cursor | An agent mode in the editor that plans, edits across files, and can execute the test suite, feeding failures back to itself. | Give it the failing test as the spec; require the suite green before you accept the diff; re-run the suite in CI on the branch, not just locally. |
| GitHub Copilot | Agent/auto modes plus a coding agent that can take an issue and open a PR autonomously. | Because it produces a PR, the eval gate is a natural fit: a required status check that runs the suite on the agent's branch and blocks the merge below threshold. |

Note the common denominator in the right-hand column: in every case the eval suite running in CI is what does the deciding. The differences between the tools are real but live at the level of where you type the instruction, not at the level of what makes the change trustworthy.

Wire the eval gate into the loop and CI

Concretely, the gate is a CI job that runs on the PR the agent opens. It runs the regression suite (must stay ~100%), runs the task's fix tests several times and reports pass^k, and records the capability suite without blocking. The same script the agent runs locally in its loop is the one CI runs on the branch — there is one source of truth for "did it pass."

# .github/workflows/agent-eval-gate.yml
# The agent opened a PR. This is what decides whether it merges.
name: agent-eval-gate
on: [pull_request]

jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make setup            # build the per-task sandbox

      # 1. Regression evals: things that already work.
      #    Target ~100% pass. Any failure blocks the merge.
      - run: ./run-evals --suite regression --trials 5 --min-pass 1.00

      # 2. The fail_to_pass tests the agent was told to satisfy,
      #    run 5x so a lucky single green run can't sneak through.
      - run: ./run-evals --suite task --trials 5 --report pass^k

      # 3. Capability evals: hard tasks you can't pass yet.
      #    Recorded, never blocking. Watch the number climb.
      - run: ./run-evals --suite capability --report-only

Keep the two suites with opposite targets. Regression evals are things that already work; hold them near 100% and block any change that breaks them — this is exactly the check that catches a model upgrade quietly degrading behavior. Capability evals are harder tasks you can't pass yet; let them start low as bets on what's becoming possible, record them, and watch the number climb before you raise the agent's autonomy.
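The merge decision implied by the three suites could look something like the sketch below. It is not the `run-evals` script from the workflow, which this article does not show. The names `Outcomes`, `GateReport`, and `gate`, and the example task ids, are hypothetical. The rule it encodes follows the text: regression must hold its threshold, the task's fix tests must pass on every trial, and capability results are recorded but never block.

```python
from dataclasses import dataclass, field

Outcomes = dict[str, list[bool]]  # task id -> pass/fail for each trial


def pass_rate(outcomes: Outcomes) -> float:
    """Fraction of all trials that passed, across every task in the suite."""
    trials = [ok for runs in outcomes.values() for ok in runs]
    return sum(trials) / len(trials) if trials else 0.0


def passes_every_trial(outcomes: Outcomes) -> bool:
    """Empirical pass^k: every task passed on every one of its trials."""
    if not outcomes:
        return False
    return all(len(runs) > 0 and all(runs) for runs in outcomes.values())


@dataclass
class GateReport:
    ok: bool
    capability_pass_rate: float           # reported, never blocking
    reasons: list[str] = field(default_factory=list)


def gate(regression: Outcomes, task: Outcomes, capability: Outcomes,
         min_regression: float = 1.0) -> GateReport:
    """Decide whether the agent's PR may merge."""
    reasons = []
    if pass_rate(regression) < min_regression:
        reasons.append("regression suite below threshold")
    if not passes_every_trial(task):
        reasons.append("task tests did not pass on every trial")
    return GateReport(ok=not reasons,
                      capability_pass_rate=pass_rate(capability),
                      reasons=reasons)


# Hypothetical results from five trials per task.
regression = {"login": [True] * 5, "checkout": [True] * 5}
flaky_task = {"fix-iso-week": [True, True, False, True, True]}
capability = {"hard-refactor": [False, False, True, False, False]}

report = gate(regression, flaky_task, capability)
print(report.ok, report.reasons, report.capability_pass_rate)
```

In this example the flaky task blocks the merge even though four of its five runs were green, while the low capability score appears only in the report.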

Where an LLM judge fits (and where it doesn't)

Most of what matters about a code change is verifiable by running code, so keep the gate on code-based graders. Reserve an LLM-as-judge for the genuinely unverifiable — "is this PR description accurate?", "is this refactor readable?" — and treat that judge as code you have to test: pin a rubric, validate it against your own labels until it agrees, and never let it decide something a test could decide. The point of the assistant-as-generator / suite-as-verifier split is that the verifier is trustworthy; a sloppy judge quietly breaks it.
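One way to make "validate it against your own labels" concrete is to measure how often the judge agrees with you, and how much of that agreement exceeds chance. The sketch below computes raw agreement and Cohen's kappa for paired labels. The label data is made up for illustration, and what score counts as "agrees" is left open here because the text does not give a threshold.

```python
from collections import Counter


def agreement(human: list[bool], judge: list[bool]) -> float:
    """Fraction of items where the judge's label matches the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: list[bool], judge: list[bool]) -> float:
    """Agreement corrected for what two raters would reach by chance."""
    n = len(human)
    observed = agreement(human, judge)
    h_counts, j_counts = Counter(human), Counter(judge)
    labels = set(h_counts) | set(j_counts)
    expected = sum(h_counts[c] * j_counts[c] for c in labels) / (n * n)
    if expected == 1.0:  # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Made-up labels: "is this PR description accurate?" for eight PRs.
human_labels = [True, True, False, False, True, False, True, True]
judge_labels = [True, True, False, True, True, False, False, True]

raw = agreement(human_labels, judge_labels)
kappa = cohens_kappa(human_labels, judge_labels)
print(f"agreement = {raw:.2f}, kappa = {kappa:.2f}")
```

Raw agreement alone can look high when one label dominates, which is why a chance-corrected figure is worth reporting next to it.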

Checklist — EDD with any coding assistant
  • Hand the agent a failing test as the spec; keep it as the regression guard afterward.
  • Let the agent run the test/eval suite in its loop so it iterates on real failures.
  • Run each task multiple trials; gate on pass^k, not a single green run.
  • Make the eval suite a required CI status check on the agent's PR — the suite decides, not the agent.
  • Same script local and in CI; one source of truth for "did it pass."
  • Isolate the sandbox; tests read-only to the agent so it can't hack the grader.
  • Re-run regression evals on every model upgrade; keep them ~100%, capability evals start low.
  • LLM-judge only for the unverifiable, and validate it before it gates anything.

The tool you pick will change again next quarter. The eval suite is the part that lasts — and it is what lets you swap generators without re-earning trust in your codebase.

See also: "How to write evals for an AI coding agent" for building the suite this article assumes; "How to use evals to make a codebase safe for AI to modify" for the safety framing; and "Regression evals: catching AI agent drift" for the model-upgrade gate.

Grounded in the EDD codex — Part III (execution-based grading, fail_to_pass/pass_to_pass, weak tests and contamination), Part IV (agent harnesses, outcome-over-path grading, pass^k reliability), and Part VI (error analysis, CI/regression gates, the generator/verifier asymmetry).