AI
AI evaluation in production: evals as regression tests
A working AI evaluation program: golden sets that gate releases, drift monitoring, human review sampling, and incident thresholds that trigger a rollback.
Executive summary
AI evaluation is the discipline of testing an AI system's behavior against defined expectations, continuously: before release as a regression gate, and after release as production monitoring. It exists because AI systems fail differently from other software. Traditional services fail loudly; an LLM application drifts in tone, accuracy, or policy compliance while every health check stays green. This article describes the evaluation stack that catches silent failure: version-controlled golden sets run in CI, scoring methods from exact match to LLM-as-judge with their failure modes, drift monitoring over production samples, calibrated human review, and incident thresholds agreed in advance, so a bad eval trend triggers a rollback instead of a debate.
An AI system without evals is a service without tests that also rewrites its own behavior whenever your vendor ships an update. Nobody would run the first. Most organizations are currently running the second.
The mental model that makes evaluation tractable is that evals are regression tests for behavior. Everything you know about test discipline transfers: version control, CI gates, coverage thinking, flake management. One difference is brutal, though, and it shapes the whole program. AI systems fail silently. Latency is fine, error rate is zero, dashboards are green, and the answers have quietly gotten worse. Nothing pages. Your observability stack watches whether the system is up; evals watch whether it is right. You need both, and they are built with the same reflexes.
Golden sets: the executable specification
A golden set is 50 to 500 input cases with expected outputs or grading criteria, version-controlled next to the application, run automatically on every change to prompt, model, retrieval index, or tool configuration.
What goes in it:
- Representative traffic. Anonymized real inputs, stratified across the intents and document types the system actually sees, not the ones the design doc imagined.
- Known failures. Every production incident and every bug report becomes a case, permanently. This is where regression protection actually comes from. It is the AI equivalent of the test you write with every bugfix, and skipping it has the same consequence: you fix the same bug twice.
- Edge and adversarial cases. Ambiguous inputs, out-of-scope requests the system must refuse, prompt-injection attempts. That last category quietly makes the eval suite part of your LLM security posture.
- Policy probes. Cases that check the behaviors your governance rules require: no medical advice, cite sources, escalate on X.
Three rules keep a golden set honest. Cases are added by pull request, with
a rationale. Nobody deletes a failing case to make a release pass; that is
@Ignore on a failing test and it deserves the same shame. And the set is
refreshed quarterly from production sampling.
A static golden set measures an application that no longer exists.
Scoring: pick the weakest grader that works
| Method | Fits | Cost | Failure mode |
|---|---|---|---|
| Exact / pattern match | Classification, extraction, structured output | Trivial | Useless for prose |
| Programmatic assertions | JSON validity, citation presence, length, banned phrases | Cheap | Checks form, not truth |
| Semantic similarity | Paraphrase-tolerant matching to references | Cheap | Similar ≠ correct |
| LLM-as-judge | Open-ended quality: faithfulness, tone, helpfulness | Moderate | Bias, drift of the judge itself |
| Human review | Ground truth, calibration | Expensive | Does not scale; reviewers disagree too |
The principle: use the cheapest grader that measures the property. Extraction tasks should never need a judge. Conversational quality can rarely avoid one.
LLM-as-judge deserves its own paragraph, because it is where most programs drift from signal into theater. The practices that keep it honest, informed by the MT-Bench line of research: grade against a written rubric with anchored examples, never “rate 1 to 10”; prefer pairwise comparison of old output versus new for release decisions, because relative judgments are more stable than absolute ones; pin the judge model version, since an unpinned judge is a moving ruler; randomize answer position to control position bias; and calibrate against humans on a recurring sample, tracking agreement like the metric it is. When judge-human agreement sags below roughly 80% on binary criteria, stop believing the dashboard and fix the rubric.
The release gate
Wire the golden set into CI so behavior changes are blocked, not discovered:
# ci fragment — eval gate
eval-gate:
triggers: [prompt/**, models.yaml, retrieval/index-config/**]
run: evals run --suite golden/v2026.04 --judge claude-sonnet-4-5@pinned
pass_criteria:
correctness: ">= 0.90"
faithfulness: ">= 0.95" # RAG: answers grounded in sources
refusal_accuracy: ">= 0.98" # out-of-scope and injection cases
no_regression: "pairwise loss rate <= 0.05 vs main"
Thresholds are policy, not physics. Set them per consequence class, a customer-facing system earns stricter gates than an internal drafting tool, and record the reasoning as a decision record so the numbers survive the person who chose them.
Of the four criteria above, the pairwise no_regression check is the one I
would keep if forced to keep one. Absolute scores drift as judges and
rubrics evolve. “The new version loses to the old one on 12% of cases” is a
release conversation with teeth.
Provider model upgrades go through the same gate, without exception. A model version bump is a dependency upgrade. Nobody ships a major library upgrade without running the tests, and model vendors perform the equivalent of major upgrades on their own schedule, sometimes with release notes and sometimes without.
Production: sampling, drift, and the second loop
Pre-release evals prove the change you made is safe. Production evaluation catches the changes you did not make: provider-side updates, shifting user populations, retrieval corpora going stale or getting poisoned.
If the gateway pattern is in place the machinery is straightforward, because all traffic is already logged in one place.
Score a random slice of production request/response pairs with the same judges and rubrics as CI. Trend the scores and alert on sustained movement, not single bad days; a bad Tuesday is noise, a bad month is drift.
Segment before you average. Aggregate scores hide localized failure, fine overall while completely broken for one document type or one customer tier. Slice by intent, input language, and source corpus, or you will discover the broken segment when that customer does.
Watch the proxy signals that need no grader at all: human override rates at HITL gates, regeneration and thumbs-down rates, escalation volume, refusal rates. A rising override rate is often the earliest drift signal you get, and it costs nothing to collect.
For RAG systems, add grounding checks, faithfulness scoring of answers against retrieved chunks. That machinery is covered in RAG architecture for the enterprise.
Human review: calibrated, not heroic
Humans appear in three deliberate roles. Drive-by spot-checking, the senior engineer squinting at a few outputs when something feels off, is none of them.
- Calibration. A fixed weekly random sample, double-graded against the judge, agreement tracked. This is the work that licenses you to trust the automated scores at all.
- Adjudication. 100% review of flagged cases: judge-failed, user-reported, gate-overridden.
- Surge. Temporarily elevated sampling after any model, prompt, or index change, tapering as confidence returns.
Give reviewers a rubric, not a feeling, and measure inter-reviewer agreement now and then. When humans disagree with each other, your judge was never the problem. Your criteria were.
Incident thresholds: decide before it breaks
The difference between a mature program and a dashboard nobody acts on is one pre-agreed line: at what eval score, override rate, or complaint volume do we roll back?
Decide while calm. At 2 AM, with a degraded system and a full incident channel, “is this bad enough to roll back” becomes a negotiation, and negotiations default to waiting. A written threshold converts the negotiation into a lookup.
Wire the thresholds into the same paging and process as any other production incident: declared on numbers, owned by a commander, followed by a postmortem whose action items always include, at minimum, new golden-set cases. The playbook from infrastructure incident response applies without modification. The only novelty is that rollback may mean reverting a prompt or re-pinning a model version rather than redeploying a binary, which is exactly why versions must be pinned in the first place.
What to write down
Four artifacts make an evaluation program real: the golden set, in the repo with its change history; the rubrics, with anchored examples; the thresholds, release gates and incident triggers alike, with their rationale; and the calibration log showing judge-human agreement over time.
Not coincidentally, that set is also the Measure-function evidence the NIST AI RMF asks for, generated as a byproduct of engineering you needed anyway. Compliance as exhaust, not as a project.
This is what Level 3 of the integrity maturity model feels like from the inside, and it is worth saying plainly: not more dashboards. Fewer surprises. Every mature operational discipline eventually converges on that trade, and evaluation is simply the newest one to get there.
Frequently asked questions
- What is a golden set in AI evaluation?
- A golden set is a version-controlled collection of representative inputs paired with expected outputs or grading criteria, used to score system behavior the same way every time. It is the executable specification of the system: when a prompt, model, or retrieval change ships, the golden set is what proves behavior did not regress. Start with 50 to 200 cases drawn from real traffic and known failures, and grow it from every incident.
- Are LLM-as-judge evaluations reliable?
- Reliable enough to be useful, not reliable enough to go unaudited. Model graders scale to thousands of cases and correlate reasonably with human judgment on well-specified criteria, but they carry position bias, verbosity bias, and self-preference. The working pattern is judge for scale, humans for calibration: re-grade a sample of judge decisions by hand on a schedule and track agreement. When agreement drops, fix the rubric before you trust another score.
- How do I detect drift in an LLM application?
- Sample production traffic continuously, score it against the same criteria as your golden set, and watch the trend rather than individual scores. Drift arrives from provider-side model updates, shifting user behavior, and changing retrieval corpora. Score trends, override rates at human gates, and complaint volume together give early warning. The golden set alone cannot catch drift, because it only measures the inputs you already predicted.
- How many production outputs should humans review?
- Enough to bound your uncertainty, weighted by risk. A practical pattern: a fixed random sample weekly for calibration, plus every case flagged by judges or users, plus a temporary surge after any model or prompt change. For a moderate-volume system that lands around 1 to 5 percent random plus targeted review. The random slice matters most, because targeted-only review confirms the failures you already suspected and misses the rest.