Eval-Driven Development

Eval-driven development glossary

The core vocabulary of EDD, in plain language. For where each idea comes from and the evidence behind it, see the codex.

Eval
A runnable experiment that checks an AI system: a dataset of inputs, a success criterion, and a grader. Unlike a conventional test, it grades non-deterministic output, so its result is a statistical estimate (a pass rate with uncertainty), not a single pass/fail.
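To make the "statistical estimate" point concrete, here is a minimal sketch of an eval in Python. Everything named here (DATASET, toy_system, exact_match_grader, run_eval) is hypothetical and chosen for illustration. The toy system is deterministic so the example is reproducible; a real system would vary between runs. The Wilson score interval is one common way to put an uncertainty range on a pass rate, and it assumes the graded trials are independent.

```python
import math
from typing import Callable, List, Tuple

# Hypothetical golden set: (input, expected output) pairs.
DATASET: List[Tuple[str, str]] = [
    ("2+2", "4"),
    ("3*3", "9"),
    ("10-7", "3"),
    ("8/2", "4"),
]


def toy_system(prompt: str) -> str:
    """Stand-in for the AI system under test; it gets one case wrong."""
    answers = {"2+2": "4", "3*3": "9", "10-7": "3"}
    return answers.get(prompt, "?")


def exact_match_grader(output: str, expected: str) -> bool:
    """Code grader: the success criterion here is an exact string match."""
    return output.strip() == expected


def run_eval(system: Callable[[str], str], trials: int = 1) -> Tuple[int, int]:
    """Run every case `trials` times and return (passed, total)."""
    passed = 0
    total = 0
    for _ in range(trials):
        for prompt, expected in DATASET:
            passed += int(exact_match_grader(system(prompt), expected))
            total += 1
    return passed, total


def wilson_interval(passed: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a pass rate (z = 1.96 is about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


passed, total = run_eval(toy_system, trials=5)
low, high = wilson_interval(passed, total)
# 15 of 20 graded runs pass; the interval is roughly 0.53 to 0.89.
print(f"pass rate {passed / total:.2f}, 95% interval ({low:.2f}, {high:.2f})")
```

The wide interval is the point of the sketch: with only 20 graded runs, a 75% pass rate is compatible with a true rate anywhere from about one half to nearly nine in ten.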
Eval-driven development (EDD)
Using evals as the executable spec and guardrail for AI-assisted and agent software: define the evals, then have the AI iterate until they pass.
Grader
The thing that decides pass or fail for an eval case. Three kinds, cheapest first: code/execution, LLM-as-judge, and human review.
Golden set
The curated dataset of inputs (and expected behaviour) an eval suite runs against. Best grown from real failures and production traces, not imagined cases.
LLM-as-judge
Using a language model to grade output against a rubric. Useful for subjective quality, but biased (position, verbosity, self-preference) and unreliable on verifiable correctness — so it must be validated against human labels.
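One way to validate a judge against human labels is to have both grade the same outputs and compare the verdicts. The sketch below assumes pass/fail verdicts have already been collected as two equal-length lists (the labels shown are made up for illustration). It reports raw agreement and Cohen's kappa, which corrects agreement for chance; what counts as "good enough" agreement is a judgment call that is not specified here.

```python
import math
from typing import Sequence, Tuple


def agreement_and_kappa(judge: Sequence[bool], human: Sequence[bool]) -> Tuple[float, float]:
    """Return (observed agreement, Cohen's kappa) for two binary raters."""
    if len(judge) != len(human) or len(judge) == 0:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    # Chance agreement: both say pass, or both say fail, by coincidence.
    p_judge = sum(judge) / n
    p_human = sum(human) / n
    expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined.
        return observed, math.nan
    return observed, (observed - expected) / (1 - expected)


# Illustrative labels only: True means "pass".
judge_labels = [True, True, True, False, False, True, True, False]
human_labels = [True, True, False, False, False, True, True, True]

observed, kappa = agreement_and_kappa(judge_labels, human_labels)
# Observed agreement is 0.75; kappa is about 0.47.
print(f"agreement {observed:.2f}, kappa {kappa:.2f}")
```

Raw agreement can look high when most outputs pass, which is why a chance-corrected figure is shown alongside it.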
pass@k
The probability that at least one of k attempts succeeds. Measures capability — whether a system can ever produce a correct result.
pass^k
The probability that all k attempts succeed. Measures reliability — whether a system does the task every time. The honest number for anything you ship.
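The gap between pass@k and pass^k is easy to see with a little arithmetic. The sketch below assumes each attempt succeeds independently with the same probability p, which is a simplification; the function names are illustrative.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds."""
    if not 0.0 <= p <= 1.0 or k < 1:
        raise ValueError("need 0 <= p <= 1 and k >= 1")
    return 1.0 - (1.0 - p) ** k


def pass_hat_k(p: float, k: int) -> float:
    """Probability that all k independent attempts succeed (pass^k)."""
    if not 0.0 <= p <= 1.0 or k < 1:
        raise ValueError("need 0 <= p <= 1 and k >= 1")
    return p ** k


# A system that succeeds 80% of the time on a single attempt:
print(round(pass_at_k(0.8, 5), 4))   # about 0.9997
print(round(pass_hat_k(0.8, 5), 4))  # about 0.3277
```

Under these assumptions, the same system looks nearly perfect by pass@5 and succeeds on all five attempts only about a third of the time.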
Regression eval
An eval for behaviour that already works, kept near 100% pass to catch silent backsliding from a prompt change, model upgrade, or dependency update.
Capability eval
An eval for a behaviour the system cannot yet perform reliably. It deliberately starts at a low pass rate and is tracked as a bet on what is becoming possible, rather than blocking the build.
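The different roles of regression and capability evals can be sketched as a build gate. This is an illustration under assumptions, not a prescribed setup: the EvalResult type, the gate function, the eval names, and the 0.98 threshold standing in for "near 100%" are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class EvalResult:
    name: str
    kind: str         # "regression" or "capability"
    pass_rate: float  # fraction of graded runs that passed


def gate(
    results: List[EvalResult], regression_threshold: float = 0.98
) -> Tuple[List[str], Dict[str, float]]:
    """Return (names of blocking failures, tracked capability pass rates)."""
    # Only regression evals can block the build.
    blocking = [
        r.name
        for r in results
        if r.kind == "regression" and r.pass_rate < regression_threshold
    ]
    # Capability evals are reported for tracking, whatever their pass rate.
    tracked = {r.name: r.pass_rate for r in results if r.kind == "capability"}
    return blocking, tracked


results = [
    EvalResult("refund_policy_answers", "regression", 1.00),
    EvalResult("date_parsing", "regression", 0.91),
    EvalResult("multi_step_booking", "capability", 0.35),
]

blocking, tracked = gate(results)
print("blocking:", blocking)  # ['date_parsing']
print("tracked:", tracked)    # {'multi_step_booking': 0.35}
```

Here the low-scoring capability eval does not fail the build, while the regression eval that slipped below the threshold does.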
Error analysis
Reading real outputs and categorizing how they fail, until no new failure type appears. The highest-leverage activity in EDD — the eval set is built from what you find.
Criteria drift
The catch-22 that you need criteria to grade outputs, but grading outputs is what reveals the real criteria. As a result, eval rubrics are discovered and refined iteratively, not fully written up front.
Benchmark
A standardized dataset, metric, and protocol for comparing models (for example MMLU or SWE-bench). Distinct from an eval, which is specific to your application.
Contamination
When benchmark or eval data leaks into a model’s training data, inflating scores through memorization rather than genuine capability. A reason to prefer fresh, private, or held-out evals.
Goodharting (eval gaming)
Named for Goodhart's law ("when a measure becomes a target, it ceases to be a good measure"): once an eval is optimized against, it can be satisfied without the underlying quality, for example via contamination, reward hacking, or style over substance.
RAG triad
Three reference-free axes for evaluating retrieval-augmented generation: context relevance (retrieval), groundedness/faithfulness (claims trace to the context), and answer relevance.
Trajectory evaluation
Grading the steps an agent takes (tool calls, order, state changes), as opposed to only its final outcome. Use it when the process matters; otherwise grade the outcome and allow partial credit.
Online eval
Scoring a sample of live production traffic, as opposed to offline evals against a fixed set. Catches drift and novel failures; its findings feed back into the golden set.
Eval harness
The machinery that runs an eval suite: it provides inputs and tools, executes cases (often repeatedly and in parallel), records every step, grades, and aggregates. Flaws in the harness are frequently the biggest source of misleading results.

Definitions distilled from the EDD codex.