Eval-Driven Development

Practice

Regression evals: catching AI-agent drift

Your agent worked yesterday. The same task fails today, and nothing in your code changed. That gap is drift: the behavior of an AI system shifts underneath you, quietly, between the moment it passed review and the moment a user hits the regression. Regression evals are the standing guard that catches it first.

Where drift comes from

A deterministic service only changes when you change it. An agent has at least four other surfaces that can move without a single line of your code being touched:

Only one of these four shows up in a diff. The other three are exactly why "it passed when I merged it" is not a durable guarantee, and why you need a check that runs on a clock as well as on a commit.

Regression evals vs capability evals

An eval suite for an agent really wants to be two suites with opposite target scores. The distinction comes straight from frontier-lab practice and it is the backbone of this whole article.

Regression evals                        | Capability evals
----------------------------------------|-------------------------------------------------
Behaviors that already work             | Behaviors you can't do yet
Target: near 100% pass; keep it there   | Target: starts low, a bet on the next few months
Any drop is a defect to block           | A rise is progress to celebrate
Runs on every change and on a schedule  | Runs to track the climb over time

Regression evals "should maintain nearly 100% pass rate" so that backsliding is loud. Capability evals deliberately start low — they are bets on what the model will be able to do soon, and you watch the number climb. Same harness, opposite expectations. Confuse the two and you either tolerate regressions or panic over a capability eval that was always meant to be red.

The scheduled run is the part teams skip and regret. A pull-request gate only fires when you change something. A nightly run against a fixed dataset is what catches the silent provider update — the case where your code is frozen and the behavior moved anyway.
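One way to make the scheduled run actionable is to record which code revision each run used, so a drop with an unchanged revision is immediately recognizable as movement from outside the repository. The sketch below is illustrative only: RunRecord, classify_change, and the 0.02 tolerance are hypothetical names and values, not part of any particular harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    commit_sha: str   # code revision the suite ran against
    pass_rate: float  # fraction of golden-set tasks that passed, 0.0 to 1.0


def classify_change(previous: RunRecord, current: RunRecord,
                    tolerance: float = 0.02) -> str:
    """Label the difference between two consecutive regression runs.

    The tolerance is an assumed noise allowance; tune it to your own suite.
    """
    drop = previous.pass_rate - current.pass_rate
    if drop <= tolerance:
        return "stable"
    if previous.commit_sha == current.commit_sha:
        # Same code, worse score: something outside the repo moved.
        return "external-drift"
    # The code changed too, so suspect the diff first.
    return "code-regression"


last_night = RunRecord(commit_sha="abc123", pass_rate=0.99)
tonight = RunRecord(commit_sha="abc123", pass_rate=0.91)
print(classify_change(last_night, tonight))
```

A pull-request gate can only ever see the "code-regression" branch; the "external-drift" branch is reachable only from a run that fires on a clock.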

Building the golden set: every failure becomes a permanent case

A regression suite is only as good as its memory. The discipline is simple and unforgiving: every real failure becomes a permanent case in the golden set, so the exact thing that broke can never silently regress again. Bug tracker, support queue, production traces, the agent's own bad transcripts — each one becomes a task with a starting state, an instruction, and a grader made of tests. The dataset grows alongside the agent. A good start is twenty to fifty tasks drawn from real failures, and it only compounds from there.
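As a rough sketch of what a golden-set entry could look like, the record below carries the three parts named above (starting state, instruction, grader tests) plus a note of where the failure came from. The GoldenCase type, its field names, and the add_case helper are assumptions made for illustration, not a prescribed schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class GoldenCase:
    case_id: str
    source: str           # where the failure surfaced, e.g. "support-queue"
    starting_state: dict  # world or repo state the agent starts from
    instruction: str      # what the agent is asked to do
    grader_tests: list[str] = field(default_factory=list)  # tests that decide pass/fail


def add_case(golden_set: dict[str, GoldenCase], case: GoldenCase) -> None:
    """Add a case permanently; refuse duplicates and cases with no grader."""
    if case.case_id in golden_set:
        raise ValueError(f"case {case.case_id!r} already exists")
    if not case.grader_tests:
        raise ValueError("a case without grader tests cannot catch a regression")
    golden_set[case.case_id] = case


def dump(golden_set: dict[str, GoldenCase]) -> str:
    """Serialize the set so it can be versioned alongside the agent."""
    return json.dumps([asdict(c) for c in golden_set.values()],
                      indent=2, sort_keys=True)


golden_set: dict[str, GoldenCase] = {}
add_case(golden_set, GoldenCase(
    case_id="refund-double-charge",          # hypothetical example case
    source="support-queue",
    starting_state={"orders": [{"id": 1, "charged": 2}]},
    instruction="Refund the duplicate charge on order 1.",
    grader_tests=["tests/test_refund.py::test_single_charge_remains"],
))
print(dump(golden_set))
```

Keeping the set append-only in practice, with removals reviewed like code, is what gives the suite its memory.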

Grade the outcome, not the path. For a coding agent the tests are the spec, so a task passes when its fail_to_pass tests pass and its pass_to_pass tests stay green. For stateful agents, compare the final world-state to an annotated goal state rather than reading the transcript — it sidesteps brittle text matching. Resist asserting an exact sequence of tool calls: agents routinely find valid approaches you didn't anticipate, and path-matching produces evals that fail on good work.
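The two outcome graders described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: test results arrive as a mapping from test name to pass/fail, and world-state is a flat dict; the function names are hypothetical.

```python
def grade_coding_task(results: dict[str, bool],
                      fail_to_pass: list[str],
                      pass_to_pass: list[str]) -> bool:
    """Pass only if the target tests now pass and nothing else broke.

    A test missing from the results counts as a failure.
    """
    fixed = all(results.get(name, False) for name in fail_to_pass)
    unbroken = all(results.get(name, False) for name in pass_to_pass)
    return fixed and unbroken


def grade_final_state(final_state: dict, goal_state: dict) -> bool:
    """Pass when every annotated goal key is present with the expected value.

    Keys the goal does not mention are ignored, so the agent is free to
    reach the goal by any route and leave unrelated state untouched.
    """
    return all(key in final_state and final_state[key] == value
               for key, value in goal_state.items())


results = {"test_bug_fixed": True, "test_login": True, "test_logout": True}
print(grade_coding_task(results, ["test_bug_fixed"], ["test_login", "test_logout"]))

final = {"order_1_status": "refunded", "emails_sent": 1, "audit_log_len": 7}
goal = {"order_1_status": "refunded", "emails_sent": 1}
print(grade_final_state(final, goal))
```

Neither grader looks at the transcript or the tool-call sequence, which is the point: any valid path to the same outcome passes.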

The detail that makes regression evals trustworthy is reliability. Agents are non-deterministic, so a single green run is weak evidence. Run each task several times and report pass^k — the probability it passes every time — not pass@k, the probability it passes at least once. A 70%-reliable agent reads as roughly 97% at pass@3 but only about 34% at pass^3. For a regression gate, pass^k is the honest number: it tells you whether the behavior will hold, not merely whether it can. On τ-bench, pass^8 fell below 25% in retail even for capable models — a gap a single run would have hidden completely.
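The 97% and 34% figures follow directly from the definitions if you assume independent trials with a fixed per-trial success probability p. The sketch below shows that arithmetic, plus one plausible way to estimate pass^k from recorded trials; the function names are illustrative, and the independence assumption is a simplification.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent trials passes."""
    return 1.0 - (1.0 - p) ** k


def pass_hat_k(p: float, k: int) -> float:
    """Probability that all k independent trials pass (pass^k)."""
    return p ** k


def empirical_pass_hat_k(trials_per_task: list[list[bool]]) -> float:
    """Fraction of tasks whose every recorded trial passed."""
    if not trials_per_task:
        raise ValueError("need at least one task")
    return sum(all(trials) for trials in trials_per_task) / len(trials_per_task)


print(round(pass_at_k(0.7, 3), 3))   # 0.973
print(round(pass_hat_k(0.7, 3), 3))  # 0.343

# Two tasks, three trials each: one is solid, one is flaky.
print(empirical_pass_hat_k([[True, True, True], [True, False, True]]))  # 0.5
```

The flaky task in the last example would count as a pass under pass@3 and as a failure under pass^3, which is exactly the difference a regression gate cares about.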

Online drift monitoring: the other half of the loop

Offline regression evals are a closed, finite spec. They catch the failures you have already seen. They cannot, by construction, see a failure mode you have not yet captured. That is what online evaluation is for. Sample a slice of real production traffic — commonly around 5 to 10% — and score it asynchronously with code checks and validated judges, watching for drift, novel inputs, and silent provider model updates that your fixed dataset might not exercise.
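A simple way to pick the sampled slice is to hash a stable trace identifier, so the same trace is always in or out of the sample regardless of which worker sees it. This is a sketch under assumptions: the in_sample name, the trace-ID format, and the 5% default are illustrative choices.

```python
import hashlib


def in_sample(trace_id: str, rate: float = 0.05) -> bool:
    """Deterministically decide whether a trace belongs to the scored sample."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


trace_ids = [f"trace-{i}" for i in range(10_000)]
sampled = [t for t in trace_ids if in_sample(t)]
print(f"sampled {len(sampled)} of {len(trace_ids)} traces")
```

Because the decision depends only on the ID, the scoring job can run asynchronously, and rerunning it later scores the same traces again.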

The two layers are complementary, not redundant: use offline to go fast, use online to be right. The loop closes when a production failure becomes tomorrow's golden-set case — the online layer discovers the regression, the offline layer makes sure it never comes back. Tracing is the substrate that makes this possible; you can only score the execution you captured, so instrument the retrieval step and the generation step separately and attach scores to the spans.
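To illustrate why the retrieval and generation steps are worth instrumenting separately, here is a small in-memory model of a trace with per-span scores. It is not any tracing library's API: Span, Trace, weakest_step, and the score names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str        # e.g. "retrieval" or "generation"
    input: str
    output: str
    scores: dict[str, float] = field(default_factory=dict)  # metric name -> 0.0..1.0


@dataclass
class Trace:
    trace_id: str
    spans: list[Span] = field(default_factory=list)


def weakest_step(trace: Trace) -> str:
    """Name of the span with the lowest score, i.e. where to look first."""
    scored = [s for s in trace.spans if s.scores]
    if not scored:
        raise ValueError("no span in this trace has been scored")
    return min(scored, key=lambda s: min(s.scores.values())).name


trace = Trace("trace-42", [
    Span("retrieval", "refund policy?", "doc-17, doc-3", {"relevance": 0.2}),
    Span("generation", "refund policy? + docs", "You can ...", {"faithfulness": 0.9}),
])
print(weakest_step(trace))  # retrieval
```

With a single score on the whole trace, this failure would look like a bad answer; with per-span scores it is visibly a retrieval problem that the generation step handled faithfully.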

What to gate, and what to alert

Not every signal deserves the power to block a merge. The rule of thumb:

  1. Gate on regression evals. They sit near 100% by design, so a drop is an unambiguous defect signal. Block the merge; on the scheduled run, page someone.
  2. Alert on capability evals. A red capability eval is the expected state, not a failure. Track the trend; never let it block a deploy.
  3. Gate only what code can grade cleanly. Deterministic checks and validated graders gate. An unvalidated LLM judge does not earn gate authority until it agrees with your own labels.
  4. Alert on online drift, then triage. Sampled-production scores are noisy proxies; a dip opens an investigation and feeds the golden set — it does not auto-block, because the input distribution is always moving.
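The four rules above reduce to a small routing function. The sketch below is one possible encoding, with hypothetical signal and grader names; what level of judge agreement counts as "validated" is a threshold you would have to choose yourself, so it is left as a caller decision.

```python
from enum import Enum


class Action(Enum):
    GATE = "gate"    # may block a merge or page someone
    ALERT = "alert"  # tracked and triaged, never blocks


def action_for(signal: str, grader: str = "code",
               judge_validated: bool = False) -> Action:
    """Decide whether a failing signal is allowed to block."""
    if signal in ("capability", "online"):
        return Action.ALERT
    if signal != "regression":
        raise ValueError(f"unknown signal: {signal!r}")
    if grader == "code":
        return Action.GATE
    if grader == "llm_judge":
        return Action.GATE if judge_validated else Action.ALERT
    raise ValueError(f"unknown grader: {grader!r}")


def judge_agreement(judge_labels: list[bool], human_labels: list[bool]) -> float:
    """Fraction of cases where the judge matches your own labels."""
    if not human_labels or len(judge_labels) != len(human_labels):
        raise ValueError("need equal-length, non-empty label lists")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


print(action_for("regression", grader="code"))
print(action_for("regression", grader="llm_judge", judge_validated=False))
print(judge_agreement([True, True, False, True], [True, False, False, True]))  # 0.75
```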

Here is a minimal CI-plus-schedule config: a gated regression job that runs on every pull request and nightly, and an alert-only capability job alongside it.

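# .github/workflows/agent-evals.yml
name: agent-evals
on:
  pull_request:            # every prompt, tool, or dependency change
  schedule:
    - cron: "0 7 * * *"    # daily at 07:00 UTC, to catch silent provider model updates

jobs:
  regression:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run the golden set
        # --trials 5 reports pass^5 rather than one lucky run.
        # --gate fails the job below the threshold: that blocks the merge on a
        # pull request and pages someone on the scheduled run.
        # Comments stay out of the command because a trailing "\ # ..." breaks
        # the line continuation, and the threshold is quoted so the shell does
        # not treat ">" as a redirect.
        run: |
          edd run --suite regression \
                  --trials 5 \
                  --gate "pass_rate>=0.98"
  capability:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run the stretch set (alert-only)
        run: edd run --suite capability --trials 5 --no-gate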

One caution that costs teams the most time: the harness is the most common failure point, not the model. Read the transcripts when a regression eval flips red. A grader bug can drop a score as easily as real drift can — in one documented case a capable model scored 42% on a benchmark until the grading bugs were fixed, then jumped to 95%. Before you blame a provider for drift, confirm your own grader did not move.
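One cheap way to confirm the grader did not move is to keep a handful of frozen transcripts with known verdicts and run the grader over them before trusting a red result. The sketch below uses toy graders and fixtures invented for illustration; grader_mismatches and both grader functions are hypothetical.

```python
from typing import Callable


def grader_mismatches(grader: Callable[[str], bool],
                      fixtures: list[tuple[str, bool]]) -> list[str]:
    """Return the frozen outputs on which the grader disagrees with its known verdict."""
    return [output for output, expected in fixtures if grader(output) != expected]


def strict_grader(output: str) -> bool:
    # Toy grader with a bug: exact string match rejects equivalent answers.
    return output == "96.12"


def tolerant_grader(output: str) -> bool:
    # Toy fix: compare numerically within a small tolerance.
    try:
        return abs(float(output) - 96.12) < 0.01
    except ValueError:
        return False


# Frozen agent outputs paired with verdicts a human has already confirmed.
fixtures = [("96.12", True), ("96.1249", True), ("12.0", False), ("n/a", False)]

print(grader_mismatches(strict_grader, fixtures))    # ['96.1249']
print(grader_mismatches(tolerant_grader, fixtures))  # []
```

If this check reports mismatches, the red eval is at least partly a harness problem, and the transcripts are the place to look before opening a drift investigation.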

Regression-eval checklist
  • Two suites: regression (target ~100%, gate) and capability (starts low, alert-only).
  • Run on every change and on a schedule, to catch silent provider updates.
  • Every real failure becomes a permanent golden-set case so it can't regress again.
  • Grade the outcome (tests, or final-state comparison); avoid exact tool-call matching.
  • Report pass^k across multiple trials, not a single lucky run.
  • Add online evals on ~5–10% of traffic; close the loop into the golden set.
  • Gate deterministic checks and validated judges; alert on unvalidated judges and on online drift.
  • When an eval flips red, read the transcript — rule out a grader bug before blaming drift.

Where this fits

Regression evals are the standing guarantee that an agent which worked yesterday still works today, across every surface that can move without your involvement. They are also the property that lets an agent change your codebase without quietly breaking it. If you are building this from scratch, start with the harness, then split it into these two suites.

See also: how to write evals for an AI coding agent for the harness this builds on, using evals to make a codebase safe for AI to modify for the safety payoff, and the overview of eval-driven development for how the pieces connect. For keeping a human in the loop on what the agent is allowed to do unattended, looprails.dev covers the oversight side.

Grounded in the EDD codex — Part VI (regression vs capability evals, CI and scheduled gates, golden sets, reading transcripts), Part IV (agent trajectories, outcome grading, pass^k reliability), and Part V (online evals on sampled traffic, tracing, the offline→online loop).