Free kit
The eval-driven development kit
Three copy-paste artifacts to start practising EDD today: a one-page checklist, a starter eval suite, and an LLM-as-judge rubric. Steal them and fill them in before you let an agent run. Each is free to download, with no gate.
Get the kit and new essays by email
Grab the templates below, and get each new how-to, comparison, and checklist as it ships. No spam, unsubscribe anytime.
1 · The one-page checklist
A pre-flight before you ship an AI feature or let an agent change code. Download .md →
2 · A starter eval suite
The shape that matters — dataset + criteria + graders, read statistically. Adapt it to your harness (how to build one). Download .yaml →
# Starter eval suite (tool-agnostic). Fill in and wire into CI.
suite: my-feature
dataset:
  # Build this from REAL failures and production traces, not imagined cases.
  cases:
    - id: case-001
      input: "…the real input that failed…"
      context: "…retrieved docs / state, if any…"
criteria:
  - check: "output is valid, parseable JSON"
    grader: code        # cheapest, most reliable
  - check: "answer is faithful to the provided context"
    grader: llm-judge   # validate vs human labels first
  - check: "no personal data is leaked"
    grader: code
run:
  trials: 5             # run each case repeatedly…
  report: pass^k        # …and report reliability, not a lucky run
gate:
  regression: ">= 100% pass"   # already-working behaviours: block on break
  capability: track            # harder bets: measure the trend, don't block
online:
  sample: 0.05                       # also score ~5% of live traffic
  feed_failures_back_into: dataset   # prod failures become golden-set cases
3 · An LLM-as-judge rubric
For the subjective quality a deterministic check can't decide. Pin the judge's temperature to 0, randomize the order in which answers are shown, and validate the judge against human labels before you trust it. Download .md →
ROLE
You are evaluating the output of <task>. First reason briefly, then give a verdict.
Do not reward longer answers or the answer shown first.
REFERENCE (when available)
<the known-good answer, or the source the output must be faithful to>
CRITERIA — judge each PASS or FAIL with a one-line reason:
1. <criterion> — PASS if <concrete, observable anchor>; otherwise FAIL.
2. <criterion> — PASS if <concrete, observable anchor>; otherwise FAIL.
OUTPUT
reasoning: <2–3 sentence critique>
verdict: <pass | fail> # pass only if all required criteria pass
VALIDATE before automating: have one domain expert grade 30–50 outputs;
iterate this prompt until judge-vs-expert agreement is high (kappa); re-check for drift.
How to use the kit
Start with the definition if EDD is new to you. From there, the checklist is your map, the suite is the thing you wire into CI (see how to write evals for a coding agent), and the rubric is for the parts only a judge can grade (see writing grading rubrics for agent behavior). Assess where you stand with the maturity scorecard.
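The two statistics the kit leans on, `report: pass^k` in the suite and judge-vs-expert kappa in the rubric, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the kit's own implementation. It assumes pass^k means the probability that k trials of a case all pass, estimated per case as C(c, k) / C(n, k) for c passes in n trials, and that "kappa" refers to Cohen's kappa between two raters. The function names and sample data are hypothetical.

```python
from math import comb


def pass_hat_k(results: list[bool], k: int) -> float:
    """Estimate pass^k for one case from its recorded trials.

    Assumption: pass^k is the chance that k trials drawn from the
    recorded ones all pass, i.e. C(c, k) / C(n, k).
    """
    n = len(results)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of trials")
    c = sum(results)  # number of passing trials
    return comb(c, k) / comb(n, k)  # comb(c, k) is 0 when c < k


def suite_pass_hat_k(trials_by_case: dict[str, list[bool]], k: int) -> float:
    """Average the per-case pass^k estimates over the whole suite."""
    if not trials_by_case:
        raise ValueError("suite has no cases")
    scores = [pass_hat_k(r, k) for r in trials_by_case.values()]
    return sum(scores) / len(scores)


def cohens_kappa(judge: list[str], expert: list[str]) -> float:
    """Chance-corrected agreement between judge and expert labels."""
    if not judge or len(judge) != len(expert):
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge)
    observed = sum(j == e for j, e in zip(judge, expert)) / n
    labels = set(judge) | set(expert)
    # Agreement expected if both raters labelled at random with their own rates.
    expected = sum((judge.count(l) / n) * (expert.count(l) / n) for l in labels)
    if expected == 1.0:  # both raters used one identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical results: 5 trials per case, as in `trials: 5`.
    trials = {
        "case-001": [True, True, True, True, True],
        "case-002": [True, True, False, True, True],
    }
    print("pass^5:", suite_pass_hat_k(trials, k=5))
    print("pass^1:", suite_pass_hat_k(trials, k=1))

    # Hypothetical judge and expert verdicts on the same six outputs.
    judge = ["pass", "pass", "fail", "fail", "pass", "fail"]
    expert = ["pass", "pass", "fail", "pass", "pass", "fail"]
    print("kappa:", cohens_kappa(judge, expert))
```

With `trials: 5` and k = 5, a single failed trial takes that case's pass^k to 0, which is the point of reporting reliability instead of a lucky run. What counts as "high" kappa is a threshold you would have to choose for your own task.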
The reasoning behind every line here — and 130+ cited sources — is in the EDD codex.