AI Code Review Checklist

This is a tick-through bar you run on a diff an assistant produced, after the code exists and before you keep, stage, or merge it. It is the review gate, separate from the brief that scoped the work and the tests that verify it. You work down the list once per change, and the output is a clear keep, fix, or split decision you are willing to put your name on.

Playbook, Review, Security

When to use this

You are about to keep, stage, or merge a change an assistant generated, and it is larger than a single line.
The diff adds an import, a dependency, or a file you did not ask the assistant to touch.
The assistant gave you a confident chat summary and you have not yet scrolled the actual changed lines.
A second AI reviewer (Copilot code review, Cursor Bugbot, Claude Code /security-review) already commented and you are tempted to treat that as a sign-off.
The generated change touches money, auth, data deletion, or anything you would have to debug yourself at 2 a.m.
A pull request has grown past a few hundred changed lines and you can feel yourself starting to skim.

What it helps clarify

Whether the diff actually does what the chat summary claimed, hunk by hunk.
Which new dependencies are real and intended, and which are hallucinated names you must not install.
What the code assumes about empty inputs, nulls, the last page, and duplicate requests, and where each assumption breaks.
What the change touched beyond the request, so unrequested scope is visible before it ships.
Whether the change is small enough to get a real human read, or large enough that it must be split first.
A defensible keep, fix, or split decision, with the assistant's role recorded for the next reviewer.

Why this checklist earns its place

An assistant does not write code it knows is correct. It predicts plausible code for your request, and plausible is exactly the kind that reads cleanly and still breaks on the one input that matters. The chat summary describes intent. The diff is what compiles and runs. This checklist exists to close the gap between the two, one change at a time, before the change becomes something you have to debug in production.

Take Maya, adding a paginated export to an internal orders service. The assistant returns a confident, well-structured reply. Without a review pass she skims it, sees a sensible function, and accepts. Two defects ride along that the summary never mentions. The diff quietly added import fast-csv-stream, a package that does not exist, while the real library fast-csv was already installed. And the pagination defaulted to an unbounded page size, so an empty or absent page parameter loaded every row into memory on the exact input the feature exists to handle. The summary said "added efficient pagination". The patch did the opposite.

The dependency failure is not an outlier. A 2025 study presented at USENIX Security generated 576,000 code samples across sixteen models and found that roughly one in five (19.7%) of the 2.23 million packages they referenced did not exist. Worse for defenders, the hallucinations are predictable: when the researchers re-ran identical prompts, 58% of hallucinated names recurred on more than one run, and 43% appeared on every single run. Attackers register the most common phantom names and publish malware under them, a supply-chain tactic the industry calls slopsquatting. One accepted install command can pull a malicious package straight into your build. The Imports verified item is the line of defense against that, and it is non-negotiable on every change.
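The first half of the import check can be sketched in a few lines: pull the package names out of the added lines of a unified diff and flag any that are not already declared dependencies. This is a minimal sketch, not a complete parser. The function names are hypothetical, the regular expression only covers common `import` and `require` forms, and a flagged name still has to be looked up on the registry by hand (for example with `npm view`) before anyone installs it.

```typescript
/** Reduce an import specifier to a package name, or null for relative and node: paths. */
function packageName(specifier: string): string | null {
  if (specifier.startsWith(".") || specifier.startsWith("/") || specifier.startsWith("node:")) {
    return null;
  }
  const parts = specifier.split("/");
  // Scoped packages keep two segments: "@scope/name/sub" becomes "@scope/name".
  return specifier.startsWith("@") ? parts.slice(0, 2).join("/") : parts[0];
}

/** Collect package names imported on added lines ("+...") of a unified diff. */
function addedPackages(diff: string): Set<string> {
  const found = new Set<string>();
  for (const line of diff.split("\n")) {
    // Skip "+++ b/file" headers. Limitation: an added line whose own content
    // starts with "++" is skipped too.
    if (!line.startsWith("+") || line.startsWith("+++")) continue;
    const pattern = /(?:\bfrom\s+|\brequire\(\s*|\bimport\s*\(?\s*)["']([^"']+)["']/g;
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(line)) !== null) {
      const pkg = packageName(match[1]);
      if (pkg !== null) found.add(pkg);
    }
  }
  return found;
}

/** Return newly imported packages that are missing from the declared dependencies. */
function undeclaredPackages(diff: string, declared: readonly string[]): string[] {
  const known = new Set(declared);
  return Array.from(addedPackages(diff))
    .filter((pkg) => !known.has(pkg))
    .sort();
}
```

Run against Maya's diff with `fast-csv` as the only declared dependency, a check like this would surface `fast-csv-stream` as the one name to verify. It deliberately over-flags: a quoted name in a comment can match, which is acceptable for a review aid whose job is to prompt a human lookup.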

The size failure points at the 2026 reality. Teams now generate far more code than before, but a human reads roughly the same number of lines per hour. SmartBear's long-running analysis of Cisco's reviews found defect density (defects found per line) dropping sharply once a review runs past 400 lines, and Google's data shows median time-to-review doubling for every additional 100 changed lines. The constraint moved from writing code to reviewing it. A checklist run on a small diff is leverage; the same checklist waved at a 900-line pull request is theater. That is why Size reviewable sits near the end as a gate, not a suggestion.
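A size gate is easy to make mechanical. The sketch below counts added and removed lines in a unified diff and compares the total to a ceiling. The 400-line default is an assumption taken from the figure quoted above, not a universal rule, and the function names are illustrative.

```typescript
interface DiffSize {
  added: number;
  removed: number;
  total: number;
}

/** Count added and removed lines in a unified diff, ignoring file headers. */
function countChangedLines(diff: string): DiffSize {
  let added = 0;
  let removed = 0;
  for (const line of diff.split("\n")) {
    // "+++ b/file" and "--- a/file" are headers, not changes. Limitation: a
    // removed line whose content starts with "--" (such as a SQL comment)
    // looks like a header and is not counted.
    if (line.startsWith("+++") || line.startsWith("---")) continue;
    if (line.startsWith("+")) added += 1;
    else if (line.startsWith("-")) removed += 1;
  }
  return { added, removed, total: added + removed };
}

// Assumed ceiling, based on the 400-line figure in the text; tune per team.
const REVIEW_CEILING = 400;

/** True when the diff is at or under the ceiling of changed lines. */
function isReviewable(diff: string, ceiling: number = REVIEW_CEILING): boolean {
  return countChangedLines(diff).total <= ceiling;
}
```

A count like this only says when to stop and split. It says nothing about whether a diff under the ceiling is correct.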

How to fill it in, with a good entry versus a weak one

Each item is a statement you either tick honestly or cannot. The discipline is to make every tick observable, tied to something you saw in the diff or ran in the terminal, rather than a feeling that the code looks fine. The difference between a real review and a rubber stamp shows up item by item.

Imports verified. A weak tick reads "dependencies look standard". A strong entry reads "the diff added fast-csv-stream; it is not on the registry and the real library is fast-csv, already installed, so I removed the line and repointed the code." The weak version trusts the name; the strong version checked the registry and the lockfile.
Assumptions named. A weak tick is "logic looks correct". A strong entry is "the default page size is 0, which means unbounded, so an empty page parameter loads every row; I set a default and a hard cap of 1000 and added the empty and last-page cases." Notice the strong version names the specific assumption and the input that breaks it. "Looks reasonable" is a bar plausible code passes by design, so replace it with "what does this assume, and where does that assumption fail?"
Scope honest. A weak tick is "nothing looks off". A strong entry is "two of three changed files match the request; the third was a new import block I did not ask for, now removed." A review that does not actively hunt for unrequested change will not find it, because plausible scope creep reads like helpfulness.
Security checked. A weak tick is "auth is handled". A strong entry is "the export route is auth-gated, page params are coerced to integers, and no user-controlled string reaches the query." Name the check, not the hope.
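The strong entries for Assumptions named and Security checked describe a fix that can be sketched directly: coerce page parameters to integers, fall back to a bounded default, and enforce the hard cap. This is one possible shape under stated assumptions, not the code from Maya's service. The cap of 1000 comes from the entry above, while the default of 100, the 1-based page index, and the function names are hypothetical.

```typescript
const DEFAULT_PAGE_SIZE = 100; // hypothetical default; the text only says "a default"
const MAX_PAGE_SIZE = 1000; // hard cap named in the strong entry above

/** Parse a strictly positive decimal integer, or return null for anything else. */
function toPositiveInt(raw: string | undefined): number | null {
  if (raw === undefined) return null;
  const trimmed = raw.trim();
  // Digits only: rejects "", "-1", "1e3", "0x10", and "abc".
  if (!/^\d+$/.test(trimmed)) return null;
  const value = Number.parseInt(trimmed, 10);
  return value >= 1 && Number.isSafeInteger(value) ? value : null;
}

/** Resolve the page size: missing, empty, zero, or invalid input never means unbounded. */
function resolvePageSize(raw: string | undefined): number {
  const value = toPositiveInt(raw);
  return value === null ? DEFAULT_PAGE_SIZE : Math.min(value, MAX_PAGE_SIZE);
}

/** Resolve the 1-based page number (an assumed convention), defaulting to the first page. */
function resolvePage(raw: string | undefined): number {
  const value = toPositiveInt(raw);
  return value === null ? 1 : value;
}
```

Falling back to a default on bad input is a design choice. Rejecting the request with a client error is just as defensible, as long as no input can select an unbounded page.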

Work the list in order, because the order is doing work. You state the intended behavior first so every later item has something to measure against. You read the full diff before you judge any single change. You verify imports before you run anything, since the cost of a bad install is paid the moment you type it. You decide on size before you invest a deep read, so you do not sink twenty minutes into a 900-line change that should have been three pull requests. The example above shows the realistic shape: most items tick, a few do not, and the unticked ones are precisely the work the review found. A checklist where every box is pre-ticked did not review anything.

One caution on the second-AI-pass item. Running Copilot code review, Cursor Bugbot, or Claude Code /security-review first is genuinely useful: Bugbot runs parallel passes focused on logic and security bugs while ignoring style, and Claude's reviewer dispatches several specialized agents. Teams report a layered pipeline cutting human review time noticeably by clearing mechanical issues before a person looks. The trap is reading a clean second-AI pass as approval. Copilot code review leaves a "Comment", never an "Approve", stays silent on roughly 29% of reviews, and its own documentation notes it can miss SQL injection, XSS, and architectural problems. Tick the box only to record that the mechanical layer ran, then make the keep, fix, or split call yourself.

Pitfalls, edge cases, and using it on a team

Most ways this checklist fails feel like progress while skipping the part that mattered.

Reviewing the summary. The chat explanation sounds right, so the diff goes unread and every later item is ticked on faith. The fix is the hard rule behind Full diff read: no tick without scrolling the changed lines.
Skipping the import check on a "boring" change. A new package name looks ordinary, so it is installed unverified. With roughly one in five referenced packages turning out to be phantom in the study above, the boring changes are exactly where slopsquatting lands.
The diff too big to read. A 900-line generated change cannot get a real review, so the reviewer skims and approves. Treat Size reviewable as a stop sign: split first, then review the pieces.
Pre-ticking the whole list. A checklist where nothing is ever unticked is a ritual. The unticked items are the deliverable; if a real review never turns one up, you are reading too lightly.
Trusting the bot as a sign-off. A green Bugbot or Copilot pass is a checklist of mechanics, and documented cases show these tools flagging real bugs even after a human approved. That cuts both ways: lean on them for mechanics, and keep intent and architecture on the human.

Two honest edge cases sit beyond the simple rules.

The large diff you cannot split. A generated migration or a mechanical rename can legitimately touch hundreds of lines. Here you review by category rather than line by line: confirm the transformation is uniform, spot-check a representative sample, and lean on a test suite that proves behavior held, then tick Size reviewable with that approach noted.
The non-deterministic agent. When an agent ran a loop of edits and commands, the same prompt may not reproduce the diff, and the agent cannot reliably explain why it chose a given path. Review the artifact that exists, the diff and the test results, and treat the agent's narration as a hint to verify rather than a record to trust.
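For the mechanical rename case, "confirm the transformation is uniform" can be partly automated. The sketch below assumes you have already paired each removed line with its replacement (the `LineChange` shape and function names are hypothetical) and returns the pairs that differ by anything other than the rename. It assumes the old name starts and ends with a word character, so the `\b` word boundaries behave as intended.

```typescript
interface LineChange {
  before: string;
  after: string;
}

/** Escape a string so it can be used literally inside a RegExp. */
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

/**
 * Return the changes that are NOT explained by renaming oldName to newName.
 * An empty result means every paired line differs only by the rename.
 */
function nonUniformChanges(
  changes: readonly LineChange[],
  oldName: string,
  newName: string,
): LineChange[] {
  const wholeWord = new RegExp(`\\b${escapeRegExp(oldName)}\\b`, "g");
  return changes.filter(
    // A replacer function avoids "$" patterns in newName being interpreted.
    (change) => change.before.replace(wholeWord, () => newName) !== change.after,
  );
}
```

Anything this returns gets a line-by-line read. The uniform remainder can then be spot-checked and left to the test suite, as described above.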

On a team, one person's standard does not survive thirty pull requests a day, so the checklist has to live in artifacts. Adopt a pull request template that records the assistant, the prompt or plan, the new dependencies (ideally none), and what the author verified, which turns each review into checking a claim rather than starting cold. Make the size ceiling a shared policy, not a personal habit, since the defect-density and review-speed data only pay off when everyone keeps changes small. And run the same checklist across the team so a busy day does not quietly lower the bar. For the reasoning behind every item, send people to the Reviewing AI Code Safely guide; for the steps that precede and follow this gate, see the AI Coding Session Brief and Testing AI-Generated Code. Then start small on your next real change: open the full diff, confirm every new import is real, and keep the pull request small enough that the next person can run this exact list.

The checklist

Run it top to bottom on one diff, once. Every item is a tickable statement; if you cannot tick it honestly, that is the work the review just found.

Behavior stated: you wrote, in one line, the behavior you asked for, so any change that does not serve it is suspect.
Full diff read: you scrolled every hunk of the actual changed lines, not the chat summary of them.
Imports verified: every new import resolves to a real package on the registry, and it is the library you meant, not a look-alike name.
Assumptions named: for each change you can state what it assumes about inputs, state, and failure, and you checked the empty, null, last-page, and duplicate cases.
Scope honest: nothing was changed, renamed, or added beyond what you asked for, or the extra change is justified and intended.
Complexity earned: no abstraction, config flag, or layer was introduced that the task did not need.
Security checked: the change handles untrusted input safely (no injection, no secrets in code, authorization still enforced on the new path).
Build and tests green: the build passes, existing tests pass, and you ran the edge cases the happy path skipped.
Size reviewable: the change is small enough for a human to actually read, or it has been split into reviewable steps.
Decision recorded: keep, fix, or split is decided, and the description names the assistant, the prompt, and what you verified.
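If a team wants the result of a pass in a machine-readable form, the checklist above maps onto a small record. The sketch below is one hypothetical encoding: the identifiers are invented for illustration, and the decision rule (split when size is unticked, fix when anything else is unticked, keep otherwise) is an assumption drawn from how this playbook describes the size gate and the unticked items. The final Decision recorded item is the output of the function rather than an input to it.

```typescript
type ItemId =
  | "behaviorStated"
  | "fullDiffRead"
  | "importsVerified"
  | "assumptionsNamed"
  | "scopeHonest"
  | "complexityEarned"
  | "securityChecked"
  | "buildAndTestsGreen"
  | "sizeReviewable";

type Decision = "keep" | "fix" | "split";

interface ItemResult {
  ticked: boolean;
  note: string; // what was observed in the diff or the terminal
}

type Review = Record<ItemId, ItemResult>;

interface Outcome {
  decision: Decision;
  unticked: ItemId[];
}

/** Derive keep, fix, or split from the ticked state of the nine review items. */
function decide(review: Review): Outcome {
  const unticked = (Object.keys(review) as ItemId[]).filter((id) => !review[id].ticked);
  // Size is a gate: an unreadable diff is split before anything else is judged.
  if (!review.sizeReviewable.ticked) return { decision: "split", unticked };
  return { decision: unticked.length === 0 ? "keep" : "fix", unticked };
}
```

The `note` field is there to keep each tick observable: a tick with an empty note is the "looks fine" entry the earlier sections warn against.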

Worked example

Change under review: add server-side pagination to the orders export, keep the CSV format (Maya, internal orders service).

[x] Behavior stated: "paginate the export, same CSV columns, nothing else changes."
[x] Full diff read: 140 lines across src/api/orders.ts and two helpers; read every hunk.
[ ] Imports verified: diff added fast-csv-stream. It does not exist on the registry. The real library is fast-csv, already a dependency. Hallucinated name, removed, repointed at fast-csv.
[ ] Assumptions named: default page size of 0 means unbounded, so an empty page parameter loads every row. Set a real default and a hard cap of 1000.
[x] Scope honest: no other files or behavior changed once the bad import was removed.
[x] Complexity earned: no new abstraction, plain query change.
[x] Security checked: export is auth-gated, page params are integers, no user input reaches the query string.
[ ] Build and tests green: last-page case threw before the fix, passes after; empty and negative-page cases added.
[x] Size reviewable: under 200 lines, one focused pull request.
[x] Decision recorded: keep after two fixes; description names Claude Code, the prompt, and what was verified.

Decision: fix two items, then keep.

Usage notes

Run it on the diff, not the chat. The summary is the assistant's claim; the changed lines are the evidence, and the gap between them is where the review earns its place.
The import check is non-negotiable on every change. A 2025 USENIX Security study found that roughly one in five packages referenced in generated code did not exist, so treat each new dependency as unverified until you confirm it.
If the size item cannot be ticked, stop and split before you review further. A pull request past a few hundred lines stops getting a real read, so a smaller diff is the fix, not a longer sitting.
Treat a second AI pass as a checklist of mechanics, never an approval. Copilot code review leaves a Comment and never an Approve, and its own docs say architecture and intent sit beyond a diff-level reviewer.
This playbook pairs with the Reviewing AI Code Safely guide, which explains the why behind each item. It follows the AI Coding Session Brief (which scopes the work) and hands off to Testing AI-Generated Code (the verification step).
Revisit the list whenever a real defect slips through. If a recurring miss is not on it, add the item so the next review catches what this one did not.

Copyable output

# AI Code Review

Change under review:
Generated with:
Prompt / plan:

## Read
- [ ] Behavior stated in one line
- [ ] Full diff read, hunk by hunk

## Verify
- [ ] Every new import is a real, intended package
- [ ] Assumptions named; empty / null / last-page / duplicate checked
- [ ] No unrequested scope, no unearned complexity
- [ ] Untrusted input handled safely; authorization still enforced

## Run
- [ ] Build and existing tests pass
- [ ] Edge cases the happy path skipped were run

## Decide
- [ ] Size is reviewable, or change was split
- [ ] Decision: keep / fix / split
- [ ] Verified by:

Downloadable version

A one-pass bar for reviewing a diff an assistant produced, before you keep, fix, or merge it.

Read the change

I wrote the behavior I asked for in one line, so any change that does not serve it is suspect.
I scrolled every hunk of the actual diff, not the chat summary of it.
I can name what each changed file is doing and why it is in this diff.

Verify the substance

Every new import resolves to a real package on the registry, and it is the library I meant, not a similar name.
For each change I can state what it assumes about inputs, state, and failure.
I checked the empty result, the null, the last page, and the duplicate request, not only the happy path.
Nothing was changed, renamed, or added beyond the request, or the extra change is justified.
No abstraction, flag, or layer was introduced that the task did not need.
Untrusted input is handled safely, no secrets are hard-coded, and authorization is still enforced on the new path.

Run and decide

The build passes and the existing tests pass.
I ran the edge cases the happy path skipped.
The change is small enough for a human to read, or I split it into reviewable steps.
A second AI pass, if any, was treated as a checklist of mechanics, not an approval.
The decision is recorded as keep, fix, or split, with the assistant, the prompt, and what I verified named in the description.
