Eval-Driven Development

Research reference

The eval-driven development codex

A living, citation-grounded reference for the practice of eval-driven development. It captures what is known to work, what was tried and failed, and the measurement and statistical foundations the practice rests on. Sources were retrieved and checked during research; specifics that could not be fully verified are flagged inline.

The eight parts

  1. Part I: Foundations of LLM & AI-system evaluation. Vocabulary, metrics, and the statistics that make "evals as spec" credible: eval vs test vs benchmark, pass@k, error bars, prompt-format fragility.
  2. Part II: LLM-as-judge & graded evaluation. Using a model to grade: pointwise vs pairwise, rubric design, position/verbosity/self-preference bias, and when not to trust the judge.
  3. Part III: Code generation & coding agents. Execution-based grading and unit-tests-as-spec; the HumanEval → SWE-bench lineage; contamination, weak tests, and what they miss.
  4. Part IV: Agents: trajectories, tool use & capability. Outcome vs trajectory grading, state comparison, reliability (pass^k), capability/time-horizon evals, and why the harness is the usual failure point.
  5. Part V: RAG, production & online evaluation. The RAG triad (context relevance, groundedness, answer relevance), runtime guardrails, and the offline → online loop.
  6. Part VI: Eval-driven development as a practice. The loop itself: error analysis, the layered stack, CI gates, regression vs capability evals, and the TDD analogy (and its limits).
  7. Part VII: The eval tooling landscape. A vendor-neutral survey (Promptfoo, Inspect AI, DeepEval, Braintrust, Langfuse, Ragas, Phoenix, and more) and how to choose.
  8. Part VIII: Validity, contamination & failure modes. How evals mislead: contamination, saturation, Goodharting, sandbagging, leaderboard illusions, and why a green suite can still ship a broken product.
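
Parts I and IV name two metrics that are easy to confuse: pass@k and pass^k. The sketch below is not taken from the codex. It assumes `n` independent samples per task, of which `c` pass, and uses the standard combinatorial estimators: pass@k as the chance that at least one of `k` drawn samples passes, pass^k as the chance that all `k` pass. The function names `pass_at_k` and `pass_hat_k` are illustrative.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    # n = samples drawn per task, c = samples that passed, k = budget being scored
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Capability: estimated probability that at least one of k samples passes."""
    _check(n, c, k)
    # 1 - (ways to pick k samples that all fail) / (ways to pick any k samples)
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Reliability: estimated probability that all k samples pass."""
    _check(n, c, k)
    # (ways to pick k samples that all pass) / (ways to pick any k samples)
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical task: 10 samples, 5 passed, scored at k = 3.
    print(f"pass@3 = {pass_at_k(10, 5, 3):.3f}")
    print(f"pass^3 = {pass_hat_k(10, 5, 3):.3f}")
```

For the same 50% per-sample pass rate the two numbers diverge sharply as `k` grows, which is why a capability score and a reliability score should not be read interchangeably. In practice both are averaged over tasks.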

The load-bearing synthesis

One thread runs through all eight parts: an eval is a runnable experiment that encodes a specification and returns a measurable pass signal — and it is only as trustworthy as its construct validity, its data hygiene, and the rigor of how you read its result. The principles that follow are the distilled spine of the codex.

  1. An eval is an experiment, not a vibe — dataset + success criterion + grader, read as an estimate with uncertainty.
  2. Eval ≠ test ≠ benchmark ≠ metric. A high benchmark score is not your app passing its evals.
  3. Execution-based grading is the gold standard where it exists — for code, the tests are the spec.
  4. Capability vs reliability: pass@k (can it ever?) vs pass^k (will it every time?). Ship on reliability.
  5. LLM-as-judge approximates human preference — only if validated, de-biased, and kept off verifiable correctness.
  6. The spec is discovered, not pre-written ("criteria drift"): build evals from real failures via error analysis.
  7. Layer the stack and gate in CI: code assertions → judge → human; regression evals near 100%, capability evals start low.
  8. The generator/verifier asymmetry is why eval-driven development (EDD) works: verifying is easier than generating.
  9. Offline to go fast, online to be right — close the loop with production traffic.
  10. Agents are graded over whole trajectories, with partial credit; the harness is the usual failure point.
  11. RAG decomposes into a triad you can spec reference-free — but meta-evaluate the evaluator.
  12. No single tool; pair a CI eval framework with an observability platform, and favor portable OSS.
  13. Every eval is a Goodhart target — contamination, saturation, gaming, sandbagging, style-biased judges.
  14. Read the result like a statistician — error bars, multiple samples, a range over prompt formats.
  15. Evals are necessary, not sufficient — pair with held-out/dynamic tests and rotate so the spec isn't the optimization surface.
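
Principles 1 and 14 can be made concrete with a small sketch: an eval as dataset + success criterion + grader, with the pass rate reported as an estimate with an interval instead of a bare number. Everything here is an assumption for illustration: `toy_system` stands in for a real model call, `exact_match` is one possible grader, and the Wilson score interval is one reasonable choice of error bar among several.

```python
from math import sqrt
from typing import Callable


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z = 1.96 gives roughly 95%)."""
    if n <= 0 or not 0 <= passes <= n:
        raise ValueError("need n > 0 and 0 <= passes <= n")
    p = passes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def run_eval(
    dataset: list[tuple[str, str]],
    system: Callable[[str], str],
    grader: Callable[[str, str], bool],
) -> dict:
    """Run every (input, expected) case through the system and grade it."""
    passes = sum(grader(system(x), expected) for x, expected in dataset)
    n = len(dataset)
    return {"n": n, "passes": passes, "rate": passes / n,
            "ci95": wilson_interval(passes, n)}


def exact_match(output: str, expected: str) -> bool:
    # Success criterion: output equals the reference after trimming whitespace.
    return output.strip() == expected.strip()


def toy_system(prompt: str) -> str:
    # Hypothetical stand-in for a model call: it just upper-cases the input.
    return prompt.upper()


# Hypothetical dataset; the last case is one the toy system cannot satisfy.
dataset = [("abc", "ABC"), ("Hi", "HI"), ("x1", "X1"), ("mixed", "Mixed")]

if __name__ == "__main__":
    report = run_eval(dataset, toy_system, exact_match)
    low, high = report["ci95"]
    print(f"{report['passes']}/{report['n']} passed, "
          f"rate={report['rate']:.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

With only four cases the interval is very wide, which is the point: a small suite's pass rate says little on its own, and comparing two runs means comparing intervals, not point estimates.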

How to read it

Each part opens with Highlights (the load-bearing takeaways) followed by an annotated bibliography. Citations are keyed per part ([F-n], [J-n], …) and tagged [EMPIRICAL], [VENDOR], [PRACTITIONER], or [POSITION]. Documented failures sit beside successes by design.