Prompt Evaluation Checklist
This playbook judges a prompt after it has run: it scores the outputs from many runs against a fixed set of inputs and a written rubric. It is the post-output companion to the Prompt Review Checklist, which checks a single prompt before you send it. Use it when one prompt feeds real work and you need to know whether a change made it better, worse, or just different.
When to use this
- You changed the wording of a prompt that runs in production and need to know whether quality moved up, down, or stayed flat.
- You are choosing between two prompt versions for the orders export summary and want a number, not a gut feeling.
- The same prompt gives a great answer on Monday and a thin one on Tuesday, so you suspect run-to-run variance rather than a real change.
- You are about to swap the model behind a prompt and want to catch any regression before users see it.
- A stakeholder asks "did the new prompt actually help?" and you have no evidence beyond one screenshot.
- You added a generic instruction to a prompt and want to confirm it did not quietly break a case it used to pass.
- You are building a recurring eval so future prompt edits get scored automatically instead of spot-checked by hand.
What it helps clarify
- Whether a prompt change is a real improvement or noise, by scoring the same fixed inputs several times and comparing.
- Which specific inputs the prompt fails on, so you fix the failure rather than the average.
- What "good output" means in writing, as a rubric anyone on the team can apply and get the same verdict.
- Whether your evaluation set still reflects real usage, or has gone stale and is giving you confident false signals.
- Where a fix on one case quietly broke another, by re-running the whole set instead of the one example you were looking at.
- Whether the gain is large enough to ship, separating a meaningful move from random variation between runs.
Why an average is not evidence
Maya has a prompt that summarizes the internal orders export for the on-call engineer. In the chat window it is excellent. She tweaks one line to make the summaries terser, the next two summaries she tries look great, and she ships. Three days later an on-call engineer reads a summary that lists an order total that does not appear anywhere on the export page. The terser wording also dropped the instruction that kept the model grounded, and the gap only shows on the empty-page case nobody opens during a demo. That is how prompts fail. They do not break loudly. They drift, and the drift surfaces as a ticket.
The trap is that prompt changes feel safe. You add a sentence like "be concise and accurate" and assume you made things better. Research published in 2026 (the paper When "Better" Prompts Hurt) measured exactly this and found that the assumption is often false. Adding generic rules to a user prompt measurably dropped a model's retrieval-augmented performance, because the looser instructions nudged it to answer beyond the provided sources. The prompt looked stronger on paper. It was worse in practice. The only way anyone caught that was by scoring a fixed set of inputs before and after the change. A change that reads like an obvious improvement is exactly the kind that needs a number, because the intuition that it helped is the thing that is wrong.
That is the job of this playbook. It is the post-output half of the prompt pair. The Prompt Review Checklist looks at one prompt before you send it and asks whether it has enough context, constraints, and examples. This checklist starts once outputs exist and asks a harder question: across many runs, on inputs you did not cherry-pick, does this prompt do the job, and did your last edit help or hurt? Anthropic's own evaluation guidance frames good evals as the difference between guessing and knowing, and the guidance is consistent across the field. Optimizing a prompt is an empirical process you can measure. You change the wording, you re-score the fixed set, you read the failures, you decide.
The cost of skipping it is subtle because the failure is quiet. One edit improves the tone and silently regresses the grounding on an edge case nobody runs in a demo. A month later the edge case is the bug report. An evaluation harness would have caught it the day you made the change, while the diff was still in front of you and the fix was cheap.
How to fill it in well
The harness is only as good as the three pieces underneath it: the fixed input set, the rubric, and the run conditions. Most weak evaluations fail at one of these three, so spend your time here.
The fixed input set. This is a file of real inputs that does not change between runs. Current guidance from Anthropic and from prompt-eval tooling converges on a small, honest start: 20-50 cases is enough to catch real regressions, because effect sizes are large early on. The mix matters more than the count. A strong set is sampled from production and from things that have actually broken: mostly routine cases, a slice of edge cases, and a few documented past failures. A weak set is twenty inputs you invented at your desk, all of which the prompt already handles. Worked contrast for Maya's orders export: a strong entry is "18 real export pages plus 4 empty-page edge cases plus the 2 pages where it once invented a total, saved to a folder, never edited." A weak entry is "a few example pages I typed up that look about right." The weak set will report a confident score and miss the empty-page bug entirely, because that case is not in it.
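One way to hold the set to "never edited" is to fingerprint it and to count its mix. The sketch below is a minimal illustration, not a prescribed tool: the `kind` field, the case ids, and both function names are assumptions made for the example, and real entries would also carry the export page text.

```python
import hashlib
import json
from collections import Counter


def fingerprint(cases):
    """Return a SHA-256 hex digest of the case list in a canonical JSON form."""
    canonical = json.dumps(cases, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def mix(cases):
    """Count cases per kind so the routine / edge / past-failure balance is visible."""
    return Counter(case["kind"] for case in cases)


# Hypothetical entries that mirror the 18 + 4 + 2 worked contrast above.
cases = (
    [{"id": f"routine-{i:02d}", "kind": "routine"} for i in range(18)]
    + [{"id": f"empty-{i}", "kind": "edge"} for i in range(4)]
    + [{"id": f"invented-total-{i}", "kind": "past_failure"} for i in range(2)]
)

locked = fingerprint(cases)  # record this next to the prompt version
print(dict(mix(cases)))
print(fingerprint(cases) == locked)
```

Recording the digest alongside each eval run makes it obvious when two scores were produced on different versions of the set and should not be compared.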
The rubric. A rubric is the written definition of good output plus a pass bar for each dimension. The test of a good rubric is simple. Two people scoring the same output should reach the same verdict. Start with two or three dimensions that matter most and add more only when they prove they catch something. For a summary prompt those are usually correctness (every number traces to the source), grounded (nothing invented), and format. A strong rubric line reads "Grounded: no order, status, or total appears that is not on the source page, pass bar 100%." A weak rubric line reads "should be accurate." The first one is checkable by a person or a model grader. The second one is a mood, and two scorers will disagree on it, which means your scores are noise.
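To show what "checkable" can mean for the grounded line, here is a deliberately crude sketch that flags any number in a summary that does not appear verbatim in the source page. The function names, the regular expression, and the sample strings are all assumptions for illustration. It only covers numbers, and it compares text literally, so "88.1" and "88.10" would count as different.

```python
import re

# Matches integers and decimals such as 1042, 88.10 or 1,204.50.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def ungrounded_numbers(summary, source):
    """Return numbers that appear in the summary but nowhere in the source text."""
    source_numbers = set(NUMBER.findall(source))
    return [n for n in NUMBER.findall(summary) if n not in source_numbers]


def grounded_pass(summary, source):
    """Pass only when every number in the summary can be found in the source."""
    return not ungrounded_numbers(summary, source)


# Hypothetical export page and two candidate summaries.
source = "Order 1042 shipped. Total: 88.10"
print(grounded_pass("Order 1042 shipped, total 88.10.", source))
print(ungrounded_numbers("Order 1042 shipped, total 91.00.", source))
print(ungrounded_numbers("Total for today is 12.00.", ""))  # empty-page case
```

A check like this cannot judge invented statuses or order names, so it complements a human or model grader rather than replacing one.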
The run conditions. Pin the model version, the temperature, and the system prompt, and write them down. Set temperature to 0 when you can, so a score change comes from the prompt rather than the dice. Even at temperature 0, outputs are not guaranteed to be identical from run to run, so score each input several times, 3-5 runs, not once. This is the step people skip, and it is the one that separates a real result from a lucky draw. A single run that scores well might be the one good answer in five. Scoring three to five times surfaces the variance so you can see whether the prompt is reliably good or occasionally good. Keep every per-case score, not just the average, and compare the new prompt against the old one on the same inputs in the same run. The number that decides things is the per-case difference between versions, because that is what cancels out the noise that both versions share.
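The per-case comparison can be sketched in a few lines. This is an illustration under stated assumptions: scores are 0 or 1 per run, each case is scored four times under each prompt, and the case ids, scores, and function names are invented for the example.

```python
from statistics import mean


def per_case_mean(runs):
    """runs maps case id -> list of scores from repeated runs; returns case id -> mean."""
    return {case_id: mean(scores) for case_id, scores in runs.items()}


def paired_differences(old_runs, new_runs):
    """Per-case mean(new) minus mean(old), over cases scored under both prompts."""
    old_means = per_case_mean(old_runs)
    new_means = per_case_mean(new_runs)
    shared = sorted(old_means.keys() & new_means.keys())
    return {case_id: new_means[case_id] - old_means[case_id] for case_id in shared}


# Hypothetical scores: four runs per case, same inputs for both prompt versions.
old_runs = {
    "routine-01": [1, 1, 1, 1],
    "routine-02": [0, 1, 0, 1],
    "empty-01": [1, 1, 1, 1],
}
new_runs = {
    "routine-01": [1, 1, 1, 1],
    "routine-02": [1, 1, 1, 1],
    "empty-01": [1, 0, 0, 0],
}

diffs = paired_differences(old_runs, new_runs)
for case_id, delta in diffs.items():
    print(f"{case_id}: {delta:+.2f}")
```

Reading the differences case by case shows the new wording helped `routine-02` and hurt `empty-01`, which a single pooled average would blur together.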
Pitfalls and using it on a team
The most common way an evaluation feels done while leaving the work undone is the stale golden set. The set was built six months ago, production phrasing has drifted since, and the eval now reports a clean 0.94 on inputs that no longer look like what users actually send. The score is confident and wrong. Refresh the set on a schedule, fold in new failures as they happen, and treat the input file as a living asset that is versioned as carefully as the prompt it tests.
The second pitfall is scoring on a single run. One run gives you one sample of a distribution. It tells you almost nothing about reliability, and it makes a noisy prompt look stable. The fix is to score 3-5 times per input and look at the spread, not only the mean.
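A small sketch of what a single run hides, assuming 0-or-1 scores and four runs per case. The data and the `flaky_cases` helper are hypothetical. Here the first run of every case passes, yet one case fails half the time.

```python
from statistics import mean


def flaky_cases(runs, tolerance=0.0):
    """Return case ids whose repeated-run scores differ by more than `tolerance`."""
    return sorted(
        case_id
        for case_id, scores in runs.items()
        if max(scores) - min(scores) > tolerance
    )


# Hypothetical repeated-run scores for two cases.
runs = {
    "routine-01": [1, 1, 1, 1],
    "empty-01": [1, 0, 1, 0],
}

first_run_mean = mean(scores[0] for scores in runs.values())  # what one run would show
all_runs_mean = mean(score for scores in runs.values() for score in scores)

print(first_run_mean, all_runs_mean, flaky_cases(runs))
```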
The third is hiding behind the average. A prompt change that lifts the mean from 0.81 to 0.85 can be a change that helped twenty cases and broke two, and one of the two broken cases is a safety regression. Keep per-case scores. Treat any input that used to pass a safety or refusal check and now fails as a blocking issue, even a single instance, the same way you would block a deploy on one failed test. And actually read the low-scoring transcripts. An average never tells you why something broke. Anthropic's eval guidance is direct on this: you will not know whether your graders even work unless you read the transcripts yourself.
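The blocking rule can be expressed as a simple gate. The sketch below assumes each case has already been reduced to a pass or fail verdict per prompt version. The case ids, the `safety_cases` set, and the function names are illustrative, not part of any particular tool.

```python
def regressions(old_pass, new_pass):
    """Case ids that passed under the old prompt and do not pass under the new one.

    A case missing from new_pass is treated as a failure, not silently skipped.
    """
    return sorted(
        case_id
        for case_id, passed in old_pass.items()
        if passed and not new_pass.get(case_id, False)
    )


def gate(old_pass, new_pass, safety_cases):
    """Summarize regressions and block shipping if any safety case flipped to fail."""
    broken = regressions(old_pass, new_pass)
    blocking = [case_id for case_id in broken if case_id in safety_cases]
    return {"regressions": broken, "blocking": blocking, "ship_allowed": not blocking}


# Hypothetical verdicts: the pass rate rises from 2/5 to 4/5, but a refusal case breaks.
old_pass = {"routine-01": True, "routine-02": False, "routine-03": False,
            "edge-01": False, "refusal-01": True}
new_pass = {"routine-01": True, "routine-02": True, "routine-03": True,
            "edge-01": True, "refusal-01": False}

result = gate(old_pass, new_pass, safety_cases={"refusal-01"})
print(result)
```

The pass rate doubled in this example, and the gate still says no, which is the behavior the paragraph above asks for.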
A fourth, quieter pitfall is trusting the model grader blindly. A model-based grader is fast and scales, but it is non-deterministic and carries biases (it tends to reward longer, more confident answers). Before you trust it, calibrate it against your own judgment on a sample and check that its labels match yours. The field treats roughly 75-90% agreement between the model grader and a human as the bar for using it at scale. Below that, the grader is adding noise, not removing it.
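Calibration comes down to a plain agreement rate over a sample that a human has also labeled. The sketch below assumes pass/fail labels in matching order. The labels are made up for the example, and the function name is hypothetical.

```python
def agreement_rate(human_labels, grader_labels):
    """Fraction of items where the model grader's label matches the human label."""
    if len(human_labels) != len(grader_labels):
        raise ValueError("label lists must be the same length")
    if not human_labels:
        raise ValueError("need at least one labeled item")
    matches = sum(h == g for h, g in zip(human_labels, grader_labels))
    return matches / len(human_labels)


# Hypothetical calibration sample: 20 outputs labeled by a person and by the grader.
human = ["pass"] * 12 + ["fail"] * 8
grader = ["pass"] * 12 + ["fail"] * 5 + ["pass"] * 3  # grader is too lenient on 3 failures

rate = agreement_rate(human, grader)
disagreements = [i for i, (h, g) in enumerate(zip(human, grader)) if h != g]
print(f"agreement: {rate:.0%}, re-read items: {disagreements}")
```

The list of disagreeing items is often more useful than the rate itself, because reading them shows which way the grader is biased.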
On a team, the evaluation becomes shared infrastructure. The input set and rubric live in the repo next to the prompt, so anyone who edits the prompt re-runs the same set and inherits the same bar. Recent product work shows how much this is worth: Anthropic's Outcomes feature, which re-runs a task when a separate grading agent scores it below a rubric, improved PowerPoint generation quality by about 10% without changing the model at all. The rubric and the re-scoring loop did all the work, with the same model. That is the same lever this checklist hands you. When you finish here, your next step chains naturally: a prompt that passes evaluation and feeds code generation flows into the Prompt Review Checklist before each send, and any code that prompt produces flows into Reviewing AI Code Safely in /docs before it merges. On your next real prompt edit, do one small thing: save five real inputs to a file and score the old and new prompt on them before you ship. That single habit catches the regression that an average would have hidden.
The checklist
Work top to bottom. The first four items build the harness, the next four run and score it, and the last two decide and protect against drift.
- Fixed input set: You have 20-50 real inputs locked in a file, drawn from production or past failures, that do not change between runs.
- Rubric in writing: Each dimension you care about (correctness, grounded in the given context, format, tone, safety) has a written definition and a pass bar, so two people score the same output the same way.
- Reference or expected output: For each input you have either a known-good answer or clear criteria, so a scorer can tell pass from fail without guessing.
- Same conditions every run: Temperature, model version, and system prompt are pinned and recorded, so a score change comes from the prompt and not the setup.
- Multiple runs per input: You score each input 3-5 times, not once, so run-to-run variance is visible instead of hidden behind a single lucky answer.
- Per-case scores, not just an average: You keep the score for every input, so a fix that helps the mean while breaking one case cannot hide.
- Baseline compared honestly: The old prompt and the new prompt are scored on the same inputs in the same run, and you compare the per-case difference.
- Failures read by a human: You actually read the low-scoring transcripts, because an average never tells you why something broke.
- Safety cases cannot regress: No input that used to pass a safety or refusal check is allowed to flip to fail, even once.
- Decision recorded: You write down ship, hold, or revert, with the number and the date, so the next person inherits evidence instead of a hunch.
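For the last item, one possible shape for a decision record is sketched below. The field names, the version label, the scores, and the date are all hypothetical. The only parts taken from the checklist are the three allowed decisions and the requirement to store the number and the date.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date

ALLOWED_DECISIONS = ("ship", "hold", "revert")


@dataclass(frozen=True)
class EvalDecision:
    prompt_version: str      # hypothetical label for the prompt being judged
    decision: str            # one of ALLOWED_DECISIONS
    mean_old: float          # mean score of the baseline prompt
    mean_new: float          # mean score of the candidate prompt
    safety_regressions: int  # count of safety cases that flipped to fail
    decided_on: date

    def __post_init__(self):
        if self.decision not in ALLOWED_DECISIONS:
            raise ValueError(f"decision must be one of {ALLOWED_DECISIONS}")

    def to_json(self):
        """Serialize to JSON with the date in ISO format."""
        data = asdict(self)
        data["decided_on"] = self.decided_on.isoformat()
        return json.dumps(data, sort_keys=True)


# Illustrative values only.
record = EvalDecision(
    prompt_version="orders-summary-v2",
    decision="hold",
    mean_old=0.81,
    mean_new=0.85,
    safety_regressions=1,
    decided_on=date(2026, 1, 15),
)
print(record.to_json())
```

Committing a record like this next to the prompt keeps the evidence and the decision in the same history as the change they describe.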
Usage notes
- Build the input set from real traffic and past failures first. A golden set written from imagination tests the cases you already thought of, not the ones that break.
- Score every input several times. A single run hides variance, and Anthropic's eval guidance is blunt that an average you never read past is not evidence.
- Start with two or three rubric dimensions and add more only when they earn their place. A ten-dimension rubric nobody applies consistently is worse than three that everyone scores the same way.
- Re-run the whole set on every prompt change, not the one example you were debugging, because a fix on one case routinely breaks a neighbor.
- If you have not yet sent the prompt, you want the Prompt Review Checklist instead. This playbook starts once outputs exist. The companion Reviewing AI Code Safely guide in /docs covers judging generated code, which is a different verdict.
- Refresh the input set on a schedule. Production phrasing drifts, and a frozen set slowly stops matching real usage while still reporting confident scores.
Downloadable version
A ten-point checklist for judging a prompt by its outputs across many runs against a fixed set and a rubric.
Preview
Build the harness
- Lock a fixed input set of 20-50 real inputs, drawn from production and past failures, that do not change between runs.
- Write the rubric down: each dimension (correctness, grounded, format, tone, safety) has a definition and a pass bar.
- Attach a reference answer or clear pass criteria to each input so a scorer can tell pass from fail.
- Pin and record the conditions: model version, temperature, and system prompt stay the same every run.
Run and score
- Score each input 3-5 times, not once, so run-to-run variance is visible.
- Keep the score for every input, not just the average, so a fix that breaks one case cannot hide.
- Score the old prompt and the new prompt on the same inputs in the same run, and compare the per-case difference.
- Read the low-scoring transcripts yourself, because an average never tells you why something broke.
Decide and protect
- Block any change where an input that used to pass a safety or refusal check flips to fail, even once.
- Record the decision (ship, hold, or revert) with the number and the date so the next person inherits evidence.