How to use evals to make a codebase safe for AI to modify
The reason most teams won't let an agent touch their codebase unattended is simple: they have no way to know the change is safe. Evals are that way. They turn "I hope the agent didn't break anything" into a property you can check before the diff merges — and a strong enough check is what earns the agent permission to modify code in the first place.
The safety property. A change is acceptable if and only if it passes the evals. That biconditional is the whole game. When your evals genuinely encode what "not broken" means, "does it pass?" becomes a trustworthy proxy for "is it safe to merge?" — and the strength of your eval suite is exactly the amount of autonomy you can responsibly hand an agent.
Why an executable check, and not a review, is the unit of safety
Human review doesn't scale to the rate at which an agent produces diffs, and it's the wrong tool for the most dangerous failures — the silent ones. The codex's founding move for code evals is execution-based grading: a change is correct iff it passes the tests, not iff it looks right. For coding work this is unusually clean, because the tests already are a partial spec. The job is to make that spec complete enough that passing it means safe.
The contract: comprehensive execution tests
Borrow the pattern the SWE-bench lineage standardized. Every acceptable change must satisfy two opposing invariants at once:
- The fix tests must pass (fail_to_pass): proof the change delivered the new behavior it was asked for.
- The regression tests must stay green (pass_to_pass): proof it didn't quietly break anything that already worked.
The second invariant is the safety wall. An agent that adds a feature is easy; an agent that
adds a feature without regressing the ledger, the auth path, or the three callers
downstream of the function it edited is what "safe to modify" actually means. The regression
set is your standing guarantee, and it's why a comprehensive pass_to_pass list
matters far more than a clever fail_to_pass one.
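As a rough illustration, the two invariants can be checked mechanically from per-test results collected before and after the patch. The sketch below assumes results arrive as plain dictionaries mapping test IDs to pass/fail; the `Contract` class and `check_change` function are hypothetical names invented for this example, not part of SWE-bench or any test runner.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contract:
    """Hypothetical container for the two opposing test lists."""
    fail_to_pass: tuple[str, ...]  # tests the change must turn green
    pass_to_pass: tuple[str, ...]  # tests that must stay green


def check_change(contract, before, after):
    """Return a list of violations; an empty list means both invariants hold.

    `before` and `after` map test IDs to True (passed) or False (failed).
    A test missing from a mapping is treated as failed.
    """
    violations = []
    for test in contract.fail_to_pass:
        if before.get(test, False):
            violations.append(f"{test}: already green before the change, proves nothing")
        if not after.get(test, False):
            violations.append(f"{test}: fix test still failing")
    for test in contract.pass_to_pass:
        if not before.get(test, False):
            violations.append(f"{test}: not green before the change, cannot guard regressions")
        if not after.get(test, False):
            violations.append(f"{test}: regression")
    return violations


contract = Contract(
    fail_to_pass=("tests/test_invoices.py::test_partial_refund_rounds_down",),
    pass_to_pass=("tests/test_invoices.py::test_full_refund",),
)
before = {
    "tests/test_invoices.py::test_partial_refund_rounds_down": False,
    "tests/test_invoices.py::test_full_refund": True,
}
after = {
    "tests/test_invoices.py::test_partial_refund_rounds_down": True,
    "tests/test_invoices.py::test_full_refund": True,
}
print(check_change(contract, before, after))  # prints []
```

Checking the "before" state as well as the "after" state is a design choice in this sketch: a fix test that was already green, or a regression test that was already red, cannot serve as evidence for either invariant.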
Behavioral and grounding evals for what tests can't express
Execution tests cover most of what matters about a code change, but not all of it. Some safety-relevant properties can't be expressed as an assertion. Did the pull-request description name the real cause or invent a plausible-sounding one? Is the refactor's contract (its input-validation and error-handling behavior) actually preserved, or has it drifted in the subtle way that traditional metrics like cyclomatic complexity don't catch but that makes the next change harder? For these, add behavioral and grounding evals: a validated LLM-as-judge for the unverifiable, and trajectory checks for process constraints ("must not touch the billing module"). Keep everything a test can decide on a test. A judge is code you have to validate against your own labels before it's allowed to gate anything.
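Two of the checks described above lend themselves to a small sketch: measuring how often a judge agrees with human labels before letting it gate anything, and flagging a trajectory that touched a forbidden directory. Everything here is illustrative. The function names, the 0.9 agreement threshold, and the example paths other than `billing/auth` are assumptions, and the paths are assumed to be normalized and repo-relative.

```python
from pathlib import PurePosixPath


def judge_agreement(judge_verdicts, human_labels):
    """Fraction of items where the judge's verdict matches the human label."""
    if not human_labels or len(judge_verdicts) != len(human_labels):
        raise ValueError("need one judge verdict per human label, and at least one label")
    matches = sum(j == h for j, h in zip(judge_verdicts, human_labels))
    return matches / len(human_labels)


def judge_may_gate(judge_verdicts, human_labels, min_agreement=0.9):
    """Allow the judge to gate only above an agreement bar (0.9 is an assumed default)."""
    return judge_agreement(judge_verdicts, human_labels) >= min_agreement


def trajectory_violations(touched_paths, forbidden_dirs):
    """Return the touched paths that fall inside any forbidden directory."""
    violations = []
    for raw in touched_paths:
        path = PurePosixPath(raw)
        # is_relative_to compares whole path components, so "billing/authz.py"
        # is not mistaken for something inside "billing/auth".
        if any(path.is_relative_to(d) for d in forbidden_dirs):
            violations.append(raw)
    return violations


print(trajectory_violations(
    ["invoices/refunds.py", "billing/auth/tokens.py", "billing/authz.py"],
    ["billing/auth"],
))  # prints ['billing/auth/tokens.py']
```

Raw agreement is the simplest possible validation metric; with heavily imbalanced labels, a chance-corrected measure would likely be a better bar.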
Don't hand the agent the answer — and don't let it grade itself
A safety check is only as honest as it is hard to game, and the codex documents three ways a green suite lies to you. All three are construction errors you control.
- Don't leak the answer into the task. A re-audit of "resolved" SWE-bench cases found roughly 60% had the fix sitting in the issue text or comments — those evals were grading reading comprehension, not engineering. If the solution is in the prompt, a pass tells you nothing about whether the agent can safely modify code it hasn't been handed the answer to.
- Harden the tests so a plausible-but-wrong patch fails. The same audit found ~48% of cases passed only because the suite was too weak to reject an incorrect patch. Weak tests are worse than no tests, because they grant permission on false evidence. Strengthen the grader until a patch that looks right but is wrong actually goes red.
- Isolate the sandbox so the agent can't game the grader. If the agent can edit the tests, the fixtures, or the environment, a sufficiently capable one will — reward hacking is the expected behavior, not the exception. Run the contract in a container where the test files and the grading harness are out of the agent's reach.
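Isolation itself is the container's job, but a cheap backstop is to detect tampering after the fact: fingerprint the protected files before the agent runs and compare afterwards. The sketch below is one way to do that with SHA-256 digests. The function names and the glob patterns in the usage comment are assumptions for illustration, and this check complements a sandbox rather than replacing one.

```python
import hashlib
from pathlib import Path


def snapshot(root, protected_globs):
    """Map each protected file's repo-relative path to its SHA-256 digest."""
    root = Path(root)
    digests = {}
    for pattern in protected_globs:
        for path in sorted(root.glob(pattern)):
            if path.is_file():
                rel = path.relative_to(root).as_posix()
                digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def tampered(before, after):
    """Return sorted paths that were added, removed, or modified between snapshots."""
    return sorted(
        path for path in before.keys() | after.keys()
        if before.get(path) != after.get(path)
    )


# Usage sketch (patterns are hypothetical):
#   before = snapshot(repo, ["tests/**/*.py", "conftest.py"])
#   ... agent runs ...
#   if tampered(before, snapshot(repo, ["tests/**/*.py", "conftest.py"])):
#       reject the run regardless of test results
```

Taking the "before" snapshot outside the agent's reach matters as much as the comparison: if the agent can rewrite the recorded digests, the check proves nothing.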
Gate autonomy on reliability, not a lucky run
Agents are non-deterministic, so one green run is weak evidence — exactly the evidence you must not grant autonomy on. The distinction the codex draws is between capability and reliability: pass@k asks whether the agent can pass (any of k tries succeeds); pass^k asks whether it will (every one of k tries succeeds). A 70%-reliable agent reads as ~97% at pass@3 but only ~34% at pass^3. For anything you'd let run unattended against your codebase, pass^k is the number that predicts how often you'll be cleaning up a bad merge.
| Question | Metric | What it licenses | Safe to autonomize on? |
|---|---|---|---|
| Can the agent ever do this? | pass@k | A capability bet | No |
| Will the agent always do this? | pass^k | Unattended modification | Yes |
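The 70% example can be reproduced with a few lines of arithmetic. This sketch assumes each trial is independent and succeeds with the same probability `p`, which is a simplification; the function names are illustrative.

```python
def pass_at_k(p, k):
    """Probability that at least one of k independent trials succeeds."""
    return 1 - (1 - p) ** k


def pass_hat_k(p, k):
    """Probability that all k independent trials succeed (pass^k)."""
    return p ** k


print(f"pass@3 = {pass_at_k(0.7, 3):.1%}")   # prints pass@3 = 97.3%
print(f"pass^3 = {pass_hat_k(0.7, 3):.1%}")  # prints pass^3 = 34.3%
```

The gap widens with k: the same per-trial success rate looks better under pass@k and worse under pass^k as more trials are required.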
Pair the binary gate with partial credit for diagnosis. An agent that localizes the bug but botches the fix is further along than one that flails — and on a long task, "found the cause, didn't merge" is the right outcome: useful signal, no unsafe change. Partial credit is for triaging the agent's work, not for lowering the merge bar.
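One way to keep partial credit separate from the merge bar is to make the gate return a three-way verdict instead of a score. The sketch below assumes the harness supplies one boolean per full run of the contract, plus a `localized_cause` flag; how that flag is determined (for example, comparing the files the agent edited against a human-marked fault location) is left open and is an assumption of this example.

```python
from enum import Enum


class Verdict(Enum):
    MERGE = "merge"          # every trial passed: change is accepted
    DIAGNOSED = "diagnosed"  # cause found, fix not reliable: useful signal, no merge
    REJECT = "reject"        # neither a reliable fix nor a diagnosis


def triage(trial_passed, localized_cause):
    """Decide what to do with an agent's work.

    trial_passed: list of bools, one per run of the whole contract.
    localized_cause: True if the agent identified the real fault location.
    """
    if trial_passed and all(trial_passed):
        return Verdict.MERGE
    if localized_cause:
        return Verdict.DIAGNOSED
    return Verdict.REJECT


print(triage([True] * 8, localized_cause=False))            # prints Verdict.MERGE
print(triage([True] * 7 + [False], localized_cause=True))   # prints Verdict.DIAGNOSED
```

Because `DIAGNOSED` is a distinct outcome rather than a fraction of a pass, no amount of partial credit can add up to a merge.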
A sample acceptance contract
Putting the pieces together, the contract that licenses an agent to modify a module looks like this — execution invariants, behavioral checks, and a reliability requirement in one artifact:
    # acceptance contract for: "agent may modify this module"
    # a change is acceptable iff ALL of the following hold
    fail_to_pass:        # the new behavior the change must deliver
      - tests/test_invoices.py::test_partial_refund_rounds_down
    pass_to_pass:        # the regression wall: must STAY green
      - tests/test_invoices.py::test_full_refund
      - tests/test_invoices.py::test_no_negative_balance
      - tests/test_ledger.py::*    # everything downstream of the change
    behavioral:          # what the tests can't express, graded separately
      - grounding: "PR description names the real cause, not a guess"
      - trajectory: "must NOT edit billing/auth/ or the test files"
    reliability:
      trials: 8              # run the whole contract 8x
      require: pass^8        # accept only if it passes EVERY time, not once
      on_partial: diagnose   # localized-but-unfixed earns partial credit, not merge

Read it as a permission slip: the agent may change this module, and the change is accepted only when every line holds, eight times running. Nothing about the agent's confidence, its explanation, or a single happy-path run enters into it.
Evals are necessary, not sufficient — pair them
An honest, all-green suite can still ship a broken product, because a static eval set is a finite, closed spec. Evals are one corner of making a codebase safely modifiable by AI, not the whole structure. They pair with two companions: human oversight (LoopRails) for the judgment calls and escalations evals can't encode, and agent security (BRACE) for the threat model — a sandbox that contains a misbehaving or compromised agent so it can't reach past the grader in the first place. Evals are the executable contract; oversight is the human in the loop; security is the blast radius. The three together are what make code safely modifiable by AI.
- State the property out loud: a change is acceptable iff it passes the evals.
- Every change satisfies fail_to_pass and a comprehensive pass_to_pass regression wall.
- Add behavioral/grounding evals for what tests can't express; validate any judge first.
- No solution leakage — the answer is not in the task prompt.
- Tests hardened so a plausible-but-wrong patch goes red.
- Sandbox isolated — the agent can't edit tests, fixtures, or the grader.
- Gate autonomy on pass^k reliability, not a single lucky run; give partial credit for diagnosis.
- Pair the suite with human oversight and agent security — evals alone are necessary, not sufficient.
Start here: "How to write evals for an AI coding agent" builds the harness this contract runs on, and "Regression evals" keeps the pass_to_pass wall standing after a model upgrade. The eval-driven development overview and the codex hold the evidence behind every claim above.
Grounded in the EDD codex — Part III (execution-based grading,
fail_to_pass/pass_to_pass, solution leakage and weak tests,
maintainability/contract debt), Part IV (reliability, pass@k vs pass^k, partial credit,
trajectory checks, reward hacking), and Part VI (the practice — CI gates, regression vs
capability evals, evals as necessary-not-sufficient).