Eval-driven development glossary

The core vocabulary of EDD, in plain language. For where each idea comes from and the evidence behind it, see the codex.

Eval: A runnable experiment that checks an AI system: a dataset of inputs, a success criterion, and a grader. Unlike a conventional test, it grades non-deterministic output, so its result is a statistical estimate (a pass rate with some uncertainty), not a single pass/fail.
Eval-driven development (EDD): Using evals as the executable spec and guardrail for AI-assisted and agent software: define the evals, then have the AI iterate until they pass.
Grader: The thing that decides pass or fail for an eval case. Three kinds, cheapest first: code/execution, LLM-as-judge, and human review.
Golden set: The curated dataset of inputs (and expected behaviour) an eval suite runs against. Best grown from real failures and production traces, not imagined cases.
LLM-as-judge: Using a language model to grade output against a rubric. Useful for subjective quality, but biased (position, verbosity, self-preference) and unreliable on verifiable correctness — so it must be validated against human labels.
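pass@k: The probability that at least one of k attempts succeeds. Measures capability: whether a system can produce a correct result at all within k tries. If attempts are independent with per-attempt success rate p, pass@k = 1 - (1 - p)^k, so it rises as k grows.
pass^k: The probability that all k attempts succeed. Measures reliability: whether a system does the task every time. Under the same independence assumption, pass^k = p^k, so it falls as k grows. The honest number for anything you ship.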
Regression eval: An eval for behaviour that already works, kept near 100% pass to catch silent backsliding from a prompt change, model upgrade, or dependency update.
Capability eval: An eval for a behaviour the system cannot reliably do yet. It deliberately starts at a low pass rate and is tracked as a bet on what is becoming possible, rather than blocking the build.
Error analysis: Reading real outputs and categorizing how they fail, until no new failure type appears. The highest-leverage activity in EDD — the eval set is built from what you find.
Criteria drift: The catch-22 that you need criteria to grade outputs, but grading is what reveals the real criteria. It means eval rubrics are discovered and iterated, not fully written up front.
Benchmark: A standardized dataset, metric, and protocol for comparing models (for example MMLU or SWE-bench). Distinct from an eval, which is specific to your application.
Contamination: When benchmark or eval data leaks into a model’s training data, inflating scores through memorization rather than genuine capability. A reason to prefer fresh, private, or held-out evals.
Goodharting (eval gaming): From "when a measure becomes a target, it ceases to be a good measure": once an eval is optimized against, it can be satisfied without the underlying quality — via contamination, reward hacking, or style-over-substance.
RAG triad: Three reference-free axes for evaluating retrieval-augmented generation: context relevance (retrieval), groundedness/faithfulness (claims trace to the context), and answer relevance.
Trajectory evaluation: Grading the steps an agent takes (tool calls, order, state changes), as opposed to only its final outcome. Use it when the process matters; otherwise grade the outcome and allow partial credit.
Online eval: Scoring a sample of live production traffic, as opposed to offline evals against a fixed set. Catches drift and novel failures; its findings feed back into the golden set.
Eval harness: The machinery that runs an eval suite: provides inputs and tools, executes cases (often repeatedly and in parallel), records every step, grades, and aggregates. Frequently the biggest source of misleading results.
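
The pass@k and pass^k entries above, and the point that an eval result is a statistical estimate, can be made concrete with a small sketch. It assumes a single eval case has been run n times with c successes recorded. The function names (pass_at_k, pass_hat_k, wilson_interval) are hypothetical helpers chosen for illustration, not part of any EDD tooling. The combinatorial estimators and the Wilson score interval are standard choices, offered here as one reasonable option rather than a prescribed method.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n recorded attempts, c of which succeeded.

    Chance that a random size-k subset of the n attempts contains at
    least one success: 1 - C(n - c, k) / C(n, k).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate pass^k from the same counts.

    Chance that a random size-k subset of the n attempts consists only
    of successes: C(c, k) / C(n, k).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return math.comb(c, k) / math.comb(n, k)


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a pass rate (z = 1.96 is roughly 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(
        p * (1 - p) / trials + z * z / (4 * trials * trials)
    ) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


if __name__ == "__main__":
    n, c, k = 10, 7, 3  # illustrative counts, not real results
    print(f"pass@{k} = {pass_at_k(n, c, k):.3f}")
    print(f"pass^{k} = {pass_hat_k(n, c, k):.3f}")
    low, high = wilson_interval(c, n)
    print(f"pass rate = {c / n:.2f}, interval = ({low:.2f}, {high:.2f})")
```

With these illustrative counts (7 successes in 10 attempts, k = 3), pass@3 works out to 119/120 while pass^3 is only 35/120: the same system looks near-perfect on capability and weak on reliability. The wide interval on the raw pass rate is a reminder that ten attempts support only a rough estimate.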

Definitions distilled from the EDD codex.