Articles
Plain, practitioner pieces on building AI-assisted software measurably — the spec and the guardrail, not the vibe check.
What is eval-driven development?
The canonical definition: evals as the executable spec and guardrail for AI-assisted software.
Eval-driven development vs. test-driven development
What carries over from TDD, what breaks, and how evals and tests work together.
Eval-driven development vs. TDD and BDD
Where EDD sits in the driven-development lineage — and why it is closest to BDD.
Evals vs. tests vs. benchmarks: what’s the difference?
Four different things, often conflated — and which one is actually your spec.
Why unit tests aren’t enough for AI-generated code
Tests stay essential, but AI code needs behavior, grounding, and maintainability checked too.
How to write evals for an AI coding agent
From your first failure log to a CI-gating eval suite.
Eval-driven development with Claude Code, Cursor, and Copilot
Make an eval suite the gate for agent changes, whichever assistant you use.
How to build an eval harness for an LLM app
Datasets from real traffic, layered graders, CI gating, online evals, statistics.
LLM-as-judge evals: when and how (and when not)
When to grade with a model, how to make the judge trustworthy, and when not to use one at all.
Writing grading rubrics for agent behavior
Score anchors, binary criteria, outcome vs. trajectory, and validating the rubric.
Regression evals: catching AI-agent drift
Model upgrades and prompt tweaks shift behavior silently. Catch it before users do.
How to use evals to make a codebase safe for AI to modify
Evals are the guardrail that lets an agent change your code without breaking it.
An eval-driven development maturity model
A five-level scorecard, from vibe checks to a calibrated, online eval suite.
The eval-driven development codex
130+ annotated, cited sources across eight parts — the research behind the practice.
Tools & resources
The EDD kit
Copy-paste checklist, starter eval suite, and LLM-as-judge rubric. Free to download.
Maturity scorecard
Five questions to find your EDD level — and your next step.
The eval tooling landscape
A vendor-neutral survey of eval tools and how to choose.
FAQ
Short answers to the questions that come up most.
Glossary
Plain-language definitions of the core EDD terms.
Forthcoming
Evals for a support agent: a worked teardown
What evals cost to run — and how to keep it cheap