
How to build an eval harness for an LLM app

By Brenn Hill · Updated June 2026

An eval harness is the machine that turns "looks good" into a number you can gate a release on. For an LLM app it has six moving parts: a dataset, a set of graders, a loop that runs and scores, a way to aggregate that admits uncertainty, a CI gate, and an online feedback channel. This is the blueprint, built in the order you should build it.

1 — Build the dataset from real traffic, not imagined cases

The highest-leverage move is not infrastructure; it is error analysis. Pull twenty to fifty real interactions from your logs, support queue, and bug tracker, read them, and write a short note on the first thing that goes wrong in each trace. Group those notes into a failure taxonomy. Your first dataset should be those failures. The spec is discovered by reading outputs, not invented at a whiteboard, so write graders for the errors you find, not the ones you imagine. Keep growing the set: every production failure becomes tomorrow's golden-set case, so the dataset grows alongside the app. Store it as plain text, one case per line (for example JSONL), with each line holding an input and its expected behavior, so it is portable and diffable in version control.
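A minimal sketch of that storage format, assuming one JSON object per line. The field names (`id`, `input`, `expected_behavior`, `failure_tag`) and the sample cases are hypothetical, chosen only to show the round trip and the taxonomy count.

```python
import json
from collections import Counter
from io import StringIO

# Hypothetical schema and cases: illustrative assumptions, not a standard.
CASES = [
    {"id": "t-001", "input": "Summarize ticket 4411",
     "expected_behavior": "mentions the refund", "failure_tag": "missed_key_fact"},
    {"id": "t-002", "input": "Summarize ticket 4412",
     "expected_behavior": "no internal IDs in output", "failure_tag": "leaked_id"},
    {"id": "t-003", "input": "Summarize ticket 4413",
     "expected_behavior": "mentions the deadline", "failure_tag": "missed_key_fact"},
]

def dump_jsonl(cases, fh):
    """Write one case per line so version-control diffs stay readable."""
    for case in cases:
        fh.write(json.dumps(case, sort_keys=True) + "\n")

def load_jsonl(fh):
    """Read cases back, skipping blank lines."""
    return [json.loads(line) for line in fh if line.strip()]

def failure_taxonomy(cases):
    """Count cases per failure tag, most common first."""
    return Counter(c["failure_tag"] for c in cases).most_common()

# Round-trip through an in-memory buffer; a real harness would use a file.
buf = StringIO()
dump_jsonl(CASES, buf)
buf.seek(0)
loaded = load_jsonl(buf)
print(failure_taxonomy(loaded))  # [('missed_key_fact', 2), ('leaked_id', 1)]
```

Sorting keys on write keeps each line stable across runs, so adding one case shows up as a one-line diff.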

2 — Choose graders, layered cheapest-first

A grader is the function that turns one output into a score. Stack three layers and stop at the cheapest one that can decide the case:

Layer	What it grades	Cost & caveat
Code assertions	Anything deterministic: format, schema, a blocked pattern, an exact value, latency.	Cheap, fast, reproducible. Catches a large share of obvious failures before you pay for a model. Always reach here first.
LLM-as-judge	The unverifiable: is this faithful to the source, is this on-topic, is this a good explanation.	Approximates human preference (roughly 80% agreement) but carries position, verbosity, and self-preference bias. Treat the judge prompt as code.
Human review	The judge's own calibration, plus ambiguous or high-stakes cases.	Slow and expensive; reserve it. Use it to label the data that validates everything above it.

Prefer binary pass/fail with a written critique over a one-to-five scale — arbitrary numeric scores are a common sign of a weak eval. And never point a judge at something a test could decide; a deterministic check is cheaper, faster, and not subject to bias.
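One way the cheapest-first layering could look in code. This is a sketch under assumptions: the `Verdict` type, the check functions, and `stub_judge` are hypothetical names, and the judge is a stand-in that makes no model call.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Verdict:
    passed: bool
    layer: str     # which layer decided the case
    critique: str  # written reason kept next to the binary result

# A code check returns a Verdict when it can decide, or None to defer.
Check = Callable[[str], Optional[Verdict]]

ID_PATTERN = re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}")

def no_leaked_id(output: str) -> Optional[Verdict]:
    if ID_PATTERN.search(output):
        return Verdict(False, "code", "output contains an ID-like token")
    return None

def length_ok(output: str) -> Optional[Verdict]:
    words = len(output.split())
    if not 40 <= words <= 120:
        return Verdict(False, "code", f"word count {words} is outside 40-120")
    return None

judge_calls: List[str] = []

def stub_judge(output: str) -> Verdict:
    """Placeholder for an LLM judge; records the call and always passes."""
    judge_calls.append(output)
    return Verdict(True, "llm_judge", "stub verdict; replace with a real judge")

def grade(output: str, code_checks: List[Check],
          judge: Callable[[str], Verdict]) -> Verdict:
    """Run the cheap checks first; pay for the judge only if none decided."""
    for check in code_checks:
        verdict = check(output)
        if verdict is not None:
            return verdict
    return judge(output)

CODE_CHECKS: List[Check] = [no_leaked_id, length_ok]

leaky = grade("ticket 1234abcd-9f0e was refunded", CODE_CHECKS, stub_judge)
print(leaky.layer, leaky.passed, "| judge calls:", len(judge_calls))
```

The leaky output is rejected by a regular expression, so the judge is never invoked for it. Only outputs that clear every deterministic check reach the slower, bias-prone layer.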

3 — The harness loop: run, score, aggregate, report

The loop is small. For each case in the dataset, run the app, apply the graders, and record the score. The one rule that separates a real harness from a demo: run each case several times. LLM outputs are non-deterministic, so a single green run is weak evidence. Aggregate to a pass rate, and report it the way a statistician would, with a sample count and a standard error, e.g. "82% (SE 2.4%), n=250", not as a bare number that looks far more precise than it is. When you compare two versions, compare them case-by-case (a paired difference), because aggregate wins can hide per-input regressions. Pin and version everything: the prompts, the dataset, the grader code, and the model identifier, so that a "pass" is reproducible next month.
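The run, aggregate, and compare steps can be sketched in a few functions. This is an illustration under assumptions: the toy app and grader are stand-ins, the function names are hypothetical, and the standard error is the simple binomial one, which treats every trial as independent.

```python
import math

def run_case(app, grader, case, samples=5):
    """Run one case several times; return one boolean per trial."""
    return [bool(grader(app(case["input"]), case)) for _ in range(samples)]

def pass_rate_with_se(trials):
    """Pass rate, binomial standard error, and trial count."""
    n = len(trials)
    p = sum(trials) / n
    return p, math.sqrt(p * (1 - p) / n), n

def paired_difference(rates_a, rates_b):
    """Per-case difference (B minus A): mean, standard error, regressed ids."""
    diffs = {cid: rates_b[cid] - rates_a[cid] for cid in rates_a}
    n = len(diffs)
    mean = sum(diffs.values()) / n
    var = sum((d - mean) ** 2 for d in diffs.values()) / (n - 1)
    regressed = sorted(cid for cid, d in diffs.items() if d < 0)
    return mean, math.sqrt(var / n), regressed

# Toy stand-ins for the real app and grader.
def toy_app(text):
    return text.upper()

def toy_grader(output, case):
    return output == case["expected"]

trials = run_case(toy_app, toy_grader, {"input": "ok", "expected": "OK"}, samples=3)

p, se, n = pass_rate_with_se([True] * 170 + [False] * 30)
print(f"{p:.0%} ({se:.1%}), n={n}")  # prints: 85% (2.5%), n=200

# Per-case pass rates for two versions (made-up numbers).
version_a = {"c1": 1.0, "c2": 1.0, "c3": 0.2}
version_b = {"c1": 1.0, "c2": 0.6, "c3": 1.0}
mean, diff_se, regressed = paired_difference(version_a, version_b)
print(f"mean diff {mean:+.2f}, regressed cases: {regressed}")
```

In the made-up comparison, version B wins on average while case `c2` gets worse, which is the kind of regression an aggregate number hides. Repeated samples of the same case are correlated, so the independent-trials error bar is a lower bound; a clustered estimate would be wider.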

One more thing the harness must expose: the transcripts. The eval system is the most common failure point in the whole stack — graders are wrong more often than you would guess. There is a documented case of a capable model scoring 42% on a benchmark until the grading bugs were fixed, after which it jumped to 95%. Read the passes and the failures, not just the summary number.
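A small sketch of how a harness might queue transcripts for reading, assuming each record carries a `passed` flag and a `transcript` string (both hypothetical field names). It draws from passes as well as failures, since a buggy grader can hide in either pile.

```python
import random

def sample_for_review(records, k=5, seed=0):
    """Pick up to k passing and up to k failing transcripts to read by hand."""
    rng = random.Random(seed)  # seeded so the review queue is reproducible
    passes = [r for r in records if r["passed"]]
    fails = [r for r in records if not r["passed"]]
    return (rng.sample(passes, min(k, len(passes)))
            + rng.sample(fails, min(k, len(fails))))

# Made-up records: ten passes and three failures.
records = [{"id": i, "passed": i % 4 != 0, "transcript": f"run {i}"}
           for i in range(1, 14)]
queue = sample_for_review(records, k=5)
print(len(queue), "transcripts queued for review")
```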

4 — Gate in CI: split regression from capability

Wire the harness into the pipeline so it runs on every prompt change, code change, and model upgrade — these are unit tests for your LLM application. Keep two kinds of evals with opposite targets:

Regression evals — things that already work. Hold them near 100% pass and block any change that drops them. This is your first line of defense against silent drift when a provider quietly updates a model behind the same name.
Capability evals — hard cases you cannot pass yet. Let them start low, as bets on what is becoming possible, and watch the number climb release over release.

Set explicit thresholds and let the build fail below them. Because verifying a solution is generally easier than producing one, a grader does not need to be as strong as the model it guards to be a useful gate — that asymmetry is the whole reason this works.
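A gate can be as small as a function that compares pass rates with floors and returns an exit code. The sketch below is an assumption about shape, not a specific tool's behavior; the suite names and the threshold values mirror the sample config later in this article.

```python
THRESHOLDS = {"regression": 0.99, "capability": 0.40}

def gate(pass_rates, thresholds=THRESHOLDS):
    """Return readable failures; an empty list means the build may proceed."""
    failures = []
    for suite, floor in thresholds.items():
        rate = pass_rates.get(suite)
        if rate is None:
            # A missing suite fails closed rather than passing silently.
            failures.append(f"{suite}: no results reported")
        elif rate < floor:
            failures.append(f"{suite}: {rate:.1%} is below the {floor:.0%} threshold")
    return failures

def main(pass_rates):
    """Print failures and return a process exit code (0 passes, 1 blocks)."""
    failures = gate(pass_rates)
    for line in failures:
        print("GATE FAIL", line)
    return 1 if failures else 0

# A CI entry point would hand this value to sys.exit().
exit_code = main({"regression": 0.97, "capability": 0.46})
print("exit code:", exit_code)
```

Failing closed on a missing suite is a deliberate choice here: a harness that crashed before reporting should not read as green.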

5 — Close the loop with online evals

Offline evals catch known regressions and let you iterate fast; they cannot see novel inputs, distribution shift, or a silent provider model change. So sample a slice of production traffic — commonly something like five to ten percent — and score it asynchronously with the same code checks and judges you run offline. This depends on tracing: you cannot score what you did not capture, so instrument requests (OpenTelemetry-style spans work well) and attach scores to them, ideally scoring the retrieval step and the generation step separately. Mine the cheap implicit signals too — retries, regenerations, edits — since only a small fraction of users ever click a thumbs-up. Use offline to go fast; use online to be right. Every novel production failure flows back into step 1 as a new golden-set case.
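One way to pick the slice is to hash the trace id, so the same request is always in or out of the sample regardless of which worker sees it. This is a sketch under assumptions: the trace fields and grader names are hypothetical, and the scoring here runs inline, not asynchronously.

```python
import hashlib

def in_sample(trace_id: str, rate: float = 0.05) -> bool:
    """Deterministically select roughly `rate` of traces by hashing the id."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:6], "big") / 2**48  # uniform in [0, 1)
    return bucket < rate

def score_trace(trace: dict, graders: dict) -> dict:
    """Attach one score per named grader to a captured trace."""
    scores = {name: grader(trace) for name, grader in graders.items()}
    return {**trace, "scores": scores}

# Hypothetical graders, one per step, so retrieval and generation
# are scored separately.
GRADERS = {
    "retrieval_nonempty": lambda t: len(t["retrieved"]) > 0,
    "generation_nonempty": lambda t: bool(t["output"].strip()),
}

trace = {"trace_id": "req-42", "retrieved": ["doc-7"], "output": "Refund issued."}
if in_sample(trace["trace_id"], rate=1.0):  # rate=1.0 only so the demo always scores
    print(score_trace(trace, GRADERS)["scores"])
```

Hash-based selection needs no shared state, and raising the rate later keeps every previously sampled trace in the sample.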

If your app does retrieval, you get a reference-free rubric for free: context relevance, groundedness, and answer relevance can be graded without gold answers. Just remember to meta-evaluate the evaluator — reference-free judges have been shown to overlook real failure modes, and answer relevance is not the same as answer correctness.
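Meta-evaluating a judge can start with a plain comparison against human labels on the same cases. The sketch below assumes binary verdicts; the function name and the sample labels are made up for illustration.

```python
def judge_agreement(human, judge):
    """Compare binary judge verdicts with human labels, case by case."""
    if not human or len(human) != len(judge):
        raise ValueError("need two non-empty label lists of equal length")
    pairs = list(zip(human, judge))
    tp = sum(1 for h, j in pairs if h and j)          # both say pass
    tn = sum(1 for h, j in pairs if not h and not j)  # both say fail
    fp = sum(1 for h, j in pairs if not h and j)      # judge passed a human fail
    fn = sum(1 for h, j in pairs if h and not j)      # judge failed a human pass
    return {
        "agreement": (tp + tn) / len(pairs),
        "true_pass_rate": tp / (tp + fn) if tp + fn else None,
        "true_fail_rate": tn / (tn + fp) if tn + fp else None,
    }

# Made-up labels for eight cases.
human_labels = [True, True, True, False, False, True, False, True]
judge_labels = [True, True, False, False, True, True, False, True]
report = judge_agreement(human_labels, judge_labels)
print(report)
```

Reporting the pass and fail sides separately matters when most cases pass: raw agreement can look high while the judge still waves through a large share of the real failures.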

An illustrative config that puts steps 1 through 4 in one file:

# evals/summarize.yaml: a portable, config-as-spec eval
dataset: data/golden_set.jsonl   # one case per line; built from real traffic
prompt: prompts/summarize.txt
provider: your-model-under-test

graders:                         # layered, cheapest first
  - type: code                   # deterministic assertions
    name: no_leaked_id
    assert: output_does_not_match("[0-9a-f]{8}-[0-9a-f]{4}")
  - type: code
    name: length_ok
    assert: word_count_between(40, 120)
  - type: llm_judge              # only for what code cannot decide
    name: faithful_to_source
    rubric: prompts/faithfulness_rubric.txt
    temperature: 0
    pass_if: binary_yes

run:
  samples: 5                     # repeat each case; report a rate plus error bars
  report: [pass_rate, std_error, n]

gate:
  regression: 0.99               # already-working cases must stay near 100%
  capability: 0.40               # hard cases start low; watch the number climb

6 — Tooling: pick a category, favor portable OSS

There is no single eval tool. The space splits into layers, and most teams need at least two: a lightweight CI/test framework that fails the build on regression, plus an observability platform that captures production traces and scores them. Choose by category, not by brand:

CI / offline test frameworks — the ones that gate the build. Promptfoo (config-as-spec, also red-teaming), DeepEval (pytest-native), and Inspect AI (strong for agentic and sandboxed evals) live here; Ragas covers the retrieval-specific metrics.
Observability / tracing platforms — the ones that capture and score live traffic. Langfuse, Arize Phoenix, and Braintrust are common; their offline-eval depth is generally shallower than a dedicated framework.

Two cautions from the field. Read the license, not the marketing — "open core" often means the collaboration features are paid, and some "open" platforms are source-available rather than truly OSS. And do not assume a famous harness is actively maintained; flagship projects have gone quiet or entered maintenance mode. Favor portable, OSS-licensed building blocks for the spec layer so a vendor change does not strand your evals. The load-bearing capability is simply a versioned dataset plus a scorer suite that runs deterministically in CI and fails on regression — keep that portable and the rest is interchangeable.

The one-page checklist

Dataset of 20–50 real cases from logs and bug trackers; grow it from every new failure.
Graders layered cheapest-first: code assertions, then validated LLM-judge, then human.
Prefer binary pass/fail with a critique over arbitrary 1–5 scores.
Run each case multiple times; report pass rate with sample count and error bars.
Pin and version prompts, data, graders, and model id so a pass is reproducible.
Read transcripts of passes and failures — the grader is the likeliest bug.
CI gate: regression evals near 100%, capability evals start low.
Sample production traffic for online evals; feed failures back into the dataset.
Pick tools by category (CI framework plus observability); favor portable OSS.

Where this goes next

Once the harness exists it becomes the substrate for everything else: it tells you when a model upgrade is safe, when to raise an agent's autonomy, and whether yesterday's fix stayed fixed. But an honest, all-green suite is necessary, not sufficient — it is a finite, closed spec, so pair it with held-out tests and real-world feedback and rotate the set so it does not become the only thing you optimize.

See also: writing evals for an AI coding agent, for the execution-based case where the tests are the spec; the EDD overview, for why this is the discipline that makes a codebase safely modifiable by AI; and the codex (Parts I, V, VI, and VII), for the evidence behind each step.


Grounded in the EDD codex — Part I (evals as experiments, error bars, offline vs online), Part V (RAG triad, production/online evals, tracing), Part VI (error analysis, layered graders, CI gates, reading transcripts), and Part VII (the CI-framework vs observability tooling split, licensing and maintenance caveats).