Eval-Driven Development

How to use evals to make a codebase safe for AI to modify

The reason most teams won't let an agent touch their codebase unattended is simple: they have no way to know the change is safe. Evals supply that way. They turn "I hope the agent didn't break anything" into a property you can check before the diff merges, and a strong enough check is what earns the agent permission to modify code in the first place.

The safety property. A change is acceptable if and only if it passes the evals. That biconditional is the whole game. When your evals genuinely encode what "not broken" means, "does it pass?" becomes a trustworthy proxy for "is it safe to merge?" — and the strength of your eval suite is exactly the amount of autonomy you can responsibly hand an agent.

Why an executable check, and not a review, is the unit of safety

Human review doesn't scale to the rate at which an agent produces diffs, and it's the wrong tool for the most dangerous failures — the silent ones. The codex's founding move for code evals is execution-based grading: a change is correct iff it passes the tests, not iff it looks right. For coding work this is unusually clean, because the tests already are a partial spec. The job is to make that spec complete enough that passing it means safe.

The contract: comprehensive execution tests

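Borrow the pattern the SWE-bench lineage standardized. Every acceptable change must satisfy two opposing invariants at once:

  1. fail_to_pass: the tests that encode the new behavior fail before the change and pass after it.
  2. pass_to_pass: the tests that were already green before the change stay green after it.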

The second invariant is the safety wall. An agent that adds a feature is easy; an agent that adds a feature without regressing the ledger, the auth path, or the three callers downstream of the function it edited is what "safe to modify" actually means. The regression set is your standing guarantee, and it's why a comprehensive pass_to_pass list matters far more than a clever fail_to_pass one.
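The two invariants can be checked mechanically once you have per-test results from before and after the change. The following is a minimal sketch, not the harness the codex describes: the function name change_is_acceptable and the dict-of-booleans result format are assumptions made for illustration.

```python
from typing import Mapping, Sequence


def change_is_acceptable(
    before: Mapping[str, bool],
    after: Mapping[str, bool],
    fail_to_pass: Sequence[str],
    pass_to_pass: Sequence[str],
) -> bool:
    """Return True only if both invariants hold.

    `before` and `after` map a test ID to True (passed) or False (failed).
    A test with no recorded result is treated as failing.
    """
    # Invariant 1: each target test was red before the change and is green after.
    delivers = all(
        not before.get(t, False) and after.get(t, False) for t in fail_to_pass
    )
    # Invariant 2: each regression test was green before and is still green.
    preserves = all(
        before.get(t, False) and after.get(t, False) for t in pass_to_pass
    )
    return delivers and preserves


before = {"test_partial_refund_rounds_down": False, "test_full_refund": True}
fixed = {"test_partial_refund_rounds_down": True, "test_full_refund": True}
regressed = {"test_partial_refund_rounds_down": True, "test_full_refund": False}

targets = ["test_partial_refund_rounds_down"]
wall = ["test_full_refund"]
print(change_is_acceptable(before, fixed, targets, wall))      # True
print(change_is_acceptable(before, regressed, targets, wall))  # False
```

The second call is the case the regression wall exists for: the feature works, but a previously green test went red, so the change is rejected.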

Behavioral and grounding evals for what tests can't express

Execution tests cover most of what matters about a code change, but not all of it. Some safety-relevant properties can't be written as an assertion. Did the pull-request description name the real cause, or invent a plausible-sounding one? Does the refactor preserve the code's contract, meaning its input-validation and error-handling behavior? Contract drift of this kind is subtle: traditional metrics like cyclomatic complexity don't catch it, yet it makes the next change harder. For these, add behavioral and grounding evals: a validated LLM-as-judge for properties that execution can't verify, and trajectory checks for process constraints ("must not touch the billing module"). Anything a test can decide should stay in a test. A judge is itself code, and it must be validated against your own labels before it is allowed to gate anything.
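Both ideas in this section reduce to small checks. The sketch below assumes you can get the list of file paths an agent touched (for example from the diff) and that you have human labels for a sample of judge verdicts. The names forbidden_edits and judge_agreement are hypothetical, and the agreement threshold you require of a judge is your own call.

```python
from pathlib import PurePosixPath


def forbidden_edits(touched_paths, forbidden_dirs):
    """Return the touched paths that sit inside any forbidden directory.

    Matching is by whole path component, so "billing" does not match
    "billing_auth/x.py". Paths are assumed to be relative, "/"-separated,
    and free of ".." segments.
    """
    roots = [PurePosixPath(d) for d in forbidden_dirs]
    hits = []
    for raw in touched_paths:
        path = PurePosixPath(raw)
        if any(root == path or root in path.parents for root in roots):
            hits.append(raw)
    return hits


def judge_agreement(judge_verdicts, human_labels):
    """Fraction of cases where the judge's verdict equals the human label."""
    if not human_labels or len(judge_verdicts) != len(human_labels):
        raise ValueError("need equal-length, non-empty verdicts and labels")
    matches = sum(j == h for j, h in zip(judge_verdicts, human_labels))
    return matches / len(human_labels)


touched = ["invoices/refund.py", "billing/auth/tokens.py", "tests/test_invoices.py"]
print(forbidden_edits(touched, ["billing/auth", "tests"]))
# ['billing/auth/tokens.py', 'tests/test_invoices.py']

print(judge_agreement([True, False, True, True], [True, False, False, True]))
# 0.75
```

A trajectory check like this is deterministic, so it belongs with the tests. Only the grounding question ("did the description name the real cause?") needs a judge, and the agreement score is what tells you whether that judge is fit to gate.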

Don't hand the agent the answer — and don't let it grade itself

A safety check is only as honest as it is hard to game, and the codex documents three ways a green suite lies to you. All three are construction errors you control.

  1. Don't leak the answer into the task. A re-audit of "resolved" SWE-bench cases found roughly 60% had the fix sitting in the issue text or comments — those evals were grading reading comprehension, not engineering. If the solution is in the prompt, a pass tells you nothing about whether the agent can safely modify code it hasn't been handed the answer to.
  2. Harden the tests so a plausible-but-wrong patch fails. The same audit found ~48% of cases passed only because the suite was too weak to reject an incorrect patch. Weak tests are worse than no tests, because they grant permission on false evidence. Strengthen the grader until a patch that looks right but is wrong actually goes red.
  3. Isolate the sandbox so the agent can't game the grader. If the agent can edit the tests, the fixtures, or the environment, a sufficiently capable one will — reward hacking is the expected behavior, not the exception. Run the contract in a container where the test files and the grading harness are out of the agent's reach.
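Isolation is primarily a container and permissions concern, but a cheap tamper check can back it up: hash the protected files before the agent runs and compare afterwards. This is a sketch of that supplementary check only, not a substitute for the sandbox. The function names snapshot and tampered and the idea of a "protected" file list are assumptions for illustration.

```python
import hashlib
from pathlib import Path


def snapshot(root, protected):
    """Map each protected relative path to its SHA-256 hex digest.

    A path that does not exist as a regular file maps to None.
    """
    digests = {}
    for rel in protected:
        p = Path(root) / rel
        digests[rel] = (
            hashlib.sha256(p.read_bytes()).hexdigest() if p.is_file() else None
        )
    return digests


def tampered(before, after):
    """Return the protected paths whose content changed, appeared, or vanished."""
    return sorted(
        rel for rel in before.keys() | after.keys()
        if before.get(rel) != after.get(rel)
    )
```

In use, you would call snapshot on the test files and grader config before handing the workspace to the agent, call it again when the agent finishes, and reject the run outright if tampered returns anything. The check has to run outside the agent's reach, or it is just one more file the agent can edit.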

Gate autonomy on reliability, not a lucky run

Agents are non-deterministic, so one green run is weak evidence — exactly the evidence you must not grant autonomy on. The distinction the codex draws is between capability and reliability: pass@k asks whether the agent can pass (any of k tries succeeds); pass^k asks whether it will (every one of k tries succeeds). A 70%-reliable agent reads as ~97% at pass@3 but only ~34% at pass^3. For anything you'd let run unattended against your codebase, pass^k is the number that predicts how often you'll be cleaning up a bad merge.

Question                       | Metric | What it licenses        | Safe to autonomize on?
Can the agent ever do this?    | pass@k | A capability bet        | No
Will the agent always do this? | pass^k | Unattended modification | Yes
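The 97% and 34% figures follow from simple arithmetic, under the assumption that each trial succeeds independently with the same probability p. The sketch below makes that assumption explicit; the function names pass_at_k and pass_hat_k are chosen here for illustration.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent trials succeeds."""
    return 1 - (1 - p) ** k


def pass_hat_k(p: float, k: int) -> float:
    """Probability that all k independent trials succeed (pass^k)."""
    return p ** k


p = 0.7  # per-trial success rate of the agent
print(round(pass_at_k(p, 3), 3))   # 0.973
print(round(pass_hat_k(p, 3), 3))  # 0.343
print(round(pass_hat_k(p, 8), 3))  # 0.058
```

The last line shows why a demanding pass^k is a strict gate: the same 70%-reliable agent clears eight consecutive trials only about 6% of the time. Real runs are rarely fully independent, so treat these as estimates and measure pass^k empirically where you can.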

Pair the binary gate with partial credit for diagnosis. An agent that localizes the bug but botches the fix is further along than one that flails — and on a long task, "found the cause, didn't merge" is the right outcome: useful signal, no unsafe change. Partial credit is for triaging the agent's work, not for lowering the merge bar.

A sample acceptance contract

Putting the pieces together, the contract that licenses an agent to modify a module looks like this — execution invariants, behavioral checks, and a reliability requirement in one artifact:

# acceptance contract for: "agent may modify this module"
# a change is acceptable iff ALL of the following hold

fail_to_pass:          # the new behavior the change must deliver
  - tests/test_invoices.py::test_partial_refund_rounds_down

pass_to_pass:          # the regression wall — must STAY green
  - tests/test_invoices.py::test_full_refund
  - tests/test_invoices.py::test_no_negative_balance
  - tests/test_ledger.py::*          # everything downstream of the change

behavioral:            # what the tests can't express, graded separately
  - grounding: "PR description names the real cause, not a guess"
  - trajectory: "must NOT edit billing/auth/ or the test files"

reliability:
  trials: 8            # run the whole contract 8x
  require: pass^8      # accept only if it passes EVERY time, not once
  on_partial: diagnose # localized-but-unfixed earns partial credit, not merge

Read it as a permission slip: the agent may change this module, and the change is accepted only when every line holds, eight times running. Nothing about the agent's confidence, its explanation, or a single happy-path run enters into it.
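One way to read the reliability section of the contract is as a decision function over repeated trials. The sketch below is an assumed interpretation, not a reference implementation: the Trial fields, the decide function, and the outcome strings "merge", "diagnose", and "reject" are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Trial:
    fail_to_pass_ok: bool          # the new behavior was delivered
    pass_to_pass_ok: bool          # the regression wall stayed green
    behavioral_ok: bool            # grounding and trajectory checks passed
    localized_cause: bool = False  # the agent identified the real cause


def decide(trials: Sequence[Trial], required_trials: int = 8) -> str:
    """Apply a pass^k gate: merge only if every required trial fully passes."""
    if len(trials) < required_trials:
        raise ValueError("not enough trials to evaluate the contract")
    if all(
        t.fail_to_pass_ok and t.pass_to_pass_ok and t.behavioral_ok
        for t in trials
    ):
        return "merge"
    # Partial credit: a correct diagnosis is useful signal but never a merge.
    if any(t.localized_cause for t in trials):
        return "diagnose"
    return "reject"


clean = [Trial(True, True, True, True) for _ in range(8)]
flaky = clean[:7] + [Trial(False, True, True, True)]
print(decide(clean))  # merge
print(decide(flaky))  # diagnose
```

Seven green runs out of eight still do not merge. The single failure downgrades the result to a diagnosis, which matches the contract's rule that partial credit triages the work without lowering the bar.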

Evals are necessary, not sufficient — pair them

An honest, all-green suite can still ship a broken product, because a static eval set is a finite, closed spec. Evals are one corner of making a codebase safely modifiable by AI, not the whole structure. They pair with two companions: human oversight (LoopRails) for the judgment calls and escalations evals can't encode, and agent security (BRACE) for the threat model — a sandbox that contains a misbehaving or compromised agent so it can't reach past the grader in the first place. Evals are the executable contract; oversight is the human in the loop; security is the blast radius. The three together are what make code safely modifiable by AI.

The safe-to-modify checklist
  • State the property out loud: a change is acceptable iff it passes the evals.
  • Every change satisfies fail_to_pass and a comprehensive pass_to_pass regression wall.
  • Add behavioral/grounding evals for what tests can't express; validate any judge first.
  • No solution leakage — the answer is not in the task prompt.
  • Tests hardened so a plausible-but-wrong patch goes red.
  • Sandbox isolated — the agent can't edit tests, fixtures, or the grader.
  • Gate autonomy on pass^k reliability, not a single lucky run; give partial credit for diagnosis.
  • Pair the suite with human oversight and agent security — evals alone are necessary, not sufficient.

Start here: "How to write evals for an AI coding agent" builds the harness this contract runs on, and "Regression evals" keeps the pass_to_pass wall standing after a model upgrade. The eval-driven development overview and the codex hold the evidence behind every claim above.

Grounded in the EDD codex — Part III (execution-based grading, fail_to_pass/pass_to_pass, solution leakage and weak tests, maintainability/contract debt), Part IV (reliability, pass@k vs pass^k, partial credit, trajectory checks, reward hacking), and Part VI (the practice — CI gates, regression vs capability evals, evals as necessary-not-sufficient).