Guide · AI in the Developer Workflow

Reviewing AI Code Safely

AI-generated code is a draft until you have read the diff. The assistant predicts plausible code, and plausible is exactly the kind that slips through. Your control sits at the review gate, where you read every hunk, question each assumption, and decide what ships.

Review · Diffs · Security
~10 min

When to use this

  • You are about to keep, stage, or merge code an assistant produced.
  • The change is larger than a line, or it touches code you would have to debug yourself later.
  • The diff adds a dependency, an abstraction, or a file you did not ask the assistant to touch.
  • A second AI (Copilot review, Cursor Bugbot, Claude Code /review) already commented, and you are tempted to treat that as a sign-off.

Key ideas

The diff is the work
Review the change itself, not the chat summary of it. The summary describes intent; the diff is what compiles and runs. Read every hunk before you accept, because the gap between what the assistant says it did and what the patch actually does is where defects live.
Verify every import
A 2025 study generated 576,000 code samples that referenced 2.23 million packages and found that roughly one in five of those packages (19.7%) did not exist. Attackers register those hallucinated names, a tactic called "slopsquatting", and 58% of hallucinations recur on the same prompt. Confirm each new dependency is real and intended before you install it.
Review is the bottleneck now
AI lets a developer merge far more pull requests, but reviewers absorb the same volume per hour as before. PR review time has climbed sharply industry-wide while perceived speed outran real speed. Keep diffs small so a human can actually read them; a PR over 400 lines stops getting a real review.
Ask what assumptions it made
The useful question is "what assumptions is this code making, and are they safe?" not "does this look reasonable?". Plausible code reads fine and still mishandles an empty list, a null, a duplicate request, or a timezone. Probe the assumptions the happy path hides.
A second AI is still a draft
Copilot code review, Cursor Bugbot, and Claude Code /security-review read broad repository context, yet their own docs are explicit that human review supplements the tool. Copilot leaves a "Comment", never an "Approve". Treat their findings as a checklist that catches mechanical issues, while a human still decides on intent.

Why this matters

An assistant does not write code that it knows is correct. It predicts the next token from everything in its context, so it produces code that looks like correct code for your request. Most of the time that overlaps with working code. The danger lies in the cases where the plausible output and the correct output diverge, and the output still reads cleanly enough to wave through.

Picture a developer, Maya, asking an assistant to add a paginated export to an orders endpoint. The chat reply is confident and well structured. She skims it, sees a sensible-looking function, and accepts. Two things were wrong that the summary never mentioned. The diff quietly added import fast_csv from 'fast-csv-stream', a package that does not exist on the registry. And the pagination used a default page size that loaded every row into memory on the last page. The chat said "added efficient pagination". The patch did the opposite on the one input that mattered.

That first failure is not rare. A 2025 study presented at USENIX Security generated 576,000 code samples across 16 Python and 14 JavaScript models, referencing 2.23 million packages, and found roughly one in five (19.7%) of those packages did not exist. Open-source models hallucinated at 21.7%, commercial ones at 5.2%, and 58% of those hallucinations recurred on the same prompt. Attackers now register the most common hallucinated names and publish malware under them, a supply-chain tactic the industry calls slopsquatting. A single accepted install command can pull a malicious package into your build.

The second failure points at the broader 2026 reality. Teams ship far more AI-written code than before, but a human still reads roughly the same number of lines per hour. Industry analysis of millions of pull requests found PR review time rising sharply even as merge volume climbed, and developers who felt faster were measurably slower once review was counted. The constraint moved from writing code to reviewing it. Your leverage is the review gate, and the mechanism that makes that gate work is the diff itself, which the next section breaks down.

How it works

Reviewing AI code is reading a diff with a specific set of questions, on a change small enough that the questions can actually be answered. Four moving parts decide whether the review is real.

  • The diff, not the summary. The chat reply describes intent. The diff is what compiles and runs. Read the changed lines directly, hunk by hunk, because the divergence between the two is precisely where the defect hides.
  • The assumptions. Every block of code assumes something about its inputs, its state, and how it fails. Plausible code makes plausible assumptions that are wrong at the edges. Ask of each change: what does this assume, and what happens when that assumption breaks?
  • The dependencies. A new import is a new trust boundary. Confirm each package exists, is the one you intended, and is not a near-name of a popular library. This is the slopsquatting check.
  • The size. A reviewer is effective up to a few hundred changed lines and then starts skimming. The review only happens if the change is small enough to fit a human read.
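
The dependency check can be partly mechanised. What follows is a minimal sketch, assuming the diff is available as unified-diff text and that the set of already-trusted dependencies was read from your manifest or lockfile beforehand. The names `newPackageImports` and `unknownPackages` and the regular expression are illustrative, not part of any tool named on this page, and the output is only a list of names to look up on the registry by hand.

```typescript
// Collect registry package names from import/require lines that a unified diff adds.
// Hypothetical helper: it covers common ES import and require() forms only.
function newPackageImports(diff: string): string[] {
  const pattern = /(?:from\s+|import\s+|require\(\s*)["']([^"']+)["']/;
  const found = new Set<string>();
  for (const line of diff.split("\n")) {
    // Added lines start with "+"; "+++" marks the file header, not content.
    if (!line.startsWith("+") || line.startsWith("+++")) continue;
    const match = pattern.exec(line);
    if (match === null) continue;
    const spec = match[1];
    // Relative paths, absolute paths, and node: built-ins are not registry packages.
    if (spec.startsWith(".") || spec.startsWith("/") || spec.startsWith("node:")) continue;
    // Reduce "pkg/sub/path" to "pkg" and "@scope/pkg/sub" to "@scope/pkg".
    const parts = spec.split("/");
    found.add(spec.startsWith("@") ? parts.slice(0, 2).join("/") : parts[0]);
  }
  return Array.from(found);
}

// Names the diff introduces that are absent from the dependencies you already trust.
function unknownPackages(diff: string, known: ReadonlySet<string>): string[] {
  return newPackageImports(diff).filter((name) => !known.has(name));
}
```

A name in the result is not proof of a hallucination. It is a prompt to confirm that the package exists and is the one you intended before any install command runs.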

The distinction that decides success is mechanics versus intent. Mechanical issues are off-by-one errors, a missing null check, an unhandled promise, a wrong comparison operator. Intent issues are whether the change solves the right problem, fits the architecture, and does not quietly broaden scope. Automated tooling is good at mechanics and clears the routine layer fast, which lets your scarce attention land on intent. The opposite order, where a human hunts for a missing semicolon and a linter is left to judge architecture, wastes the part of the review only a person can do.

That is why the practical pattern in 2026 is layered. A linter and a scanner run first. A second AI reviewer (Copilot code review, Cursor Bugbot, Claude Code /security-review) runs next and surfaces mechanical and security candidates. Teams report this front filter cutting human review time by 30-50%. The human reads last, focused on the questions a model cannot answer: is this the right change, are these assumptions safe, did it touch something it should not have. A vital caveat sets up the worked scenario: the second AI is itself a draft. Copilot code review leaves a "Comment", never an "Approve", and its docs say plainly it will miss bugs, security issues, and most architectural concerns. It is a checklist that flags mechanics, and a human still makes the call.
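
The layering can be expressed as a small decision rule. This is a sketch under stated assumptions, not how any of the named tools work internally: the `Finding` and `HumanReview` shapes and the `gate` function are hypothetical. The point it illustrates is that automated layers can block or comment, and only a human who read the full diff can approve.

```typescript
type Verdict = "comment" | "approve" | "request_changes";

// One result from an automated layer (hypothetical shape).
interface Finding {
  layer: "linter" | "scanner" | "ai_review";
  message: string;
  blocking: boolean;
}

// The human pass that runs last (hypothetical shape).
interface HumanReview {
  reviewer: string;
  readFullDiff: boolean;
  verdict: "approve" | "request_changes";
}

function gate(findings: readonly Finding[], human?: HumanReview): Verdict {
  // Any blocking automated finding stops the change, whatever the human said.
  if (findings.some((finding) => finding.blocking)) return "request_changes";
  // A clean automated pass is never an approval on its own.
  if (human === undefined || !human.readFullDiff) return "comment";
  // Only a human who read the whole diff decides on intent.
  return human.verdict;
}
```

With no human review attached, a clean automated pass yields `"comment"`, which mirrors the "Comment", never "Approve" behaviour described above.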

A worked scenario

Back to Maya and the orders export, this time reviewed properly. The task was scoped: add server-side pagination to the orders export and keep the existing CSV format. The assistant produced a 140-line diff touching src/api/orders.ts and two helpers. Here is the pass she runs.

  1. Read the behavior back. Maya states what she asked for in one line at the top of the diff view: "paginate the export, same CSV columns, nothing else changes." Now any change that does not serve that line is suspect.
  2. Open the full diff. She reads every hunk, not the chat summary. The changes to the two helpers match the request. The new import block in src/api/orders.ts does not.
  3. Check the imports. The diff added fast-csv-stream. She searches the registry and the project lockfile. It does not exist. The real library is fast-csv, already a dependency. This is a hallucinated name, and installing it blindly is the slopsquatting risk. She removes the line and points the code at the existing fast-csv.
  4. Probe the assumptions. She asks the assistant directly: "what does this pagination assume about the last page and an empty result set?" The answer reveals that a default page size of 0 is treated as unbounded, so an empty or absent page-size parameter loads every row. She sets a real default and a hard cap of 1000.
  5. Run the edge cases. She runs the build and tests, then exercises the empty result, the last partial page, and a negative page number by hand. The last-page case threw before her fix and passes after.
  6. Commit small and record it. The change goes in as one focused pull request well under 400 lines, with a description that names the assistant, the prompt, and what she verified.
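
The fix from step 4 might look like the sketch below. It is illustrative only: the hard cap of 1000 comes from the scenario, while the default of 100, the function names, and the choice to clamp an invalid page number to page 1 (rather than reject it) are assumptions made for this example.

```typescript
const DEFAULT_PAGE_SIZE = 100; // assumed value; the scenario only fixes the cap
const MAX_PAGE_SIZE = 1000; // hard cap from the scenario

interface PageWindow {
  limit: number;
  offset: number;
}

// Absent, empty, zero, negative, or non-numeric sizes fall back to the default
// instead of meaning "unbounded"; oversized requests are capped.
function resolvePageSize(raw: unknown): number {
  const n = Number(raw);
  if (!Number.isInteger(n) || n <= 0) return DEFAULT_PAGE_SIZE;
  return Math.min(n, MAX_PAGE_SIZE);
}

// Assumption: an invalid or negative page number is clamped to the first page.
function resolvePage(raw: unknown): number {
  const n = Number(raw);
  return Number.isInteger(n) && n >= 1 ? n : 1;
}

// Turn raw query parameters into a bounded limit/offset pair for the query.
function pageWindow(rawPage: unknown, rawSize: unknown): PageWindow {
  const limit = resolvePageSize(rawSize);
  return { limit, offset: (resolvePage(rawPage) - 1) * limit };
}

// pageWindow(undefined, undefined) → { limit: 100, offset: 0 }
// pageWindow("3", "5000")          → { limit: 1000, offset: 2000 }
// pageWindow(-2, 0)                → { limit: 100, offset: 0 }
```

Because the limit can never be zero or unbounded, the empty-parameter and last-page cases from step 5 become ordinary bounded queries rather than special cases.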

The before-and-after is concrete. Before review: one hallucinated dependency and an out-of-memory path on the exact input the feature exists to handle. After a fifteen-minute read: a real dependency, a bounded query, three edge cases covered, and a pull request the next reviewer can actually read. The summary alone would have hidden every one of those. The same diff also leads straight into the traps below, because it offers several ways to feel finished while leaving the work undone.

Pitfalls and edge cases

Each of these traps feels like progress while quietly skipping the part of the review that mattered.

  • Reviewing the summary. The chat explanation sounds right, so the diff goes unread. The fix is a hard rule: no accept without scrolling the changed lines, because the summary is the assistant’s claim and the diff is the evidence.
  • Skipping the import check. A new package name looks ordinary, so it gets installed. With roughly one in five generated package references naming a phantom package, treat every new dependency as unverified until you confirm it on the registry and confirm it is the library you meant, not a look-alike.
  • The diff that is too big to read. A 900-line AI change cannot get a real review, so the reviewer skims and approves. The fix is to cap pull requests, keep them under 400 lines, and split a large generated change into reviewable steps before anyone looks at it.
  • Reading for "looks reasonable". Plausible code passes that bar by design. Replace it with "what assumptions does this make and where do they break?", which forces the empty-list, null, and duplicate-request cases into view.
  • Trusting the second AI as a sign-off. A clean Copilot or Bugbot pass feels like approval. It is a checklist of mechanics. Bugbot itself is documented finding real bugs even after human approval, which cuts both ways: the tools catch things you miss, and you catch the intent and architecture they miss.
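
The look-alike part of the import check can be approximated with a name comparison. This is a rough heuristic sketched for illustration: `editDistance` is a standard Levenshtein distance, while `lookAlikes`, the threshold of 2, and the "trusted name plus a hyphenated suffix" rule are assumptions chosen for this example. It will flag some legitimate packages, so treat a hit as a reason to look, not a verdict.

```typescript
// Levenshtein distance: minimum single-character edits to turn a into b.
function editDistance(a: string, b: string): number {
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Trusted names that a new package name resembles without matching exactly.
function lookAlikes(candidate: string, trusted: readonly string[], maxDistance = 2): string[] {
  return trusted.filter((name) => {
    if (name === candidate) return false; // an exact match is the real library
    // "fast-csv-stream" extends the trusted "fast-csv" with a plausible suffix.
    const extendsTrusted =
      candidate.startsWith(name + "-") || name.startsWith(candidate + "-");
    return extendsTrusted || editDistance(candidate, name) <= maxDistance;
  });
}

// lookAlikes("fast-csv-stream", ["fast-csv", "express"]) → ["fast-csv"]
// lookAlikes("expres", ["fast-csv", "express"])          → ["express"]
```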

Two genuine edge cases sit beyond the simple rules. The first is the large diff you cannot split. A generated migration or a mechanical rename can legitimately touch hundreds of lines. Here the move is to review by category rather than line by line: confirm the transformation is uniform, spot-check a representative sample of the changes, and lean on a test suite that proves behavior held, rather than pretending to read every identical edit.
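
For a mechanical rename, "confirm the transformation is uniform" can be checked rather than eyeballed. The sketch below assumes a unified diff in which each removed line has a corresponding added line in the same order. `nonUniformEdits` is a hypothetical helper, and it does not handle content lines that themselves begin with `---` or `+++`. Whatever it returns is the short list that still needs a human read.

```typescript
interface Mismatch {
  removed: string;
  added: string;
}

// Pair removed and added lines in order and report every pair where the
// added line is NOT simply the removed line with oldName replaced by newName.
function nonUniformEdits(diff: string, oldName: string, newName: string): Mismatch[] {
  const removed: string[] = [];
  const added: string[] = [];
  for (const line of diff.split("\n")) {
    // Skip the "--- a/file" and "+++ b/file" headers.
    if (line.startsWith("---") || line.startsWith("+++")) continue;
    if (line.startsWith("-")) removed.push(line.slice(1));
    else if (line.startsWith("+")) added.push(line.slice(1));
  }
  const mismatches: Mismatch[] = [];
  const pairs = Math.max(removed.length, added.length);
  for (let i = 0; i < pairs; i++) {
    // An unpaired line (pure addition or deletion) counts as non-uniform.
    const before = i < removed.length ? removed[i] : "";
    const after = i < added.length ? added[i] : "";
    if (before.split(oldName).join(newName) !== after) {
      mismatches.push({ removed: before, added: after });
    }
  }
  return mismatches;
}
```

An empty result supports the claim that the change is a pure rename. It does not replace the test suite, which is still what shows that behavior held.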

The second is the non-deterministic agent. When an agent ran a loop of edits and commands, the diff may not be reproducible from the same prompt, and the assistant cannot reliably tell you why it made a given choice. Do not ask it to justify intent after the fact and trust the answer. Review the artifact that exists, the diff and the test results, and treat the agent’s narration as a hint to verify rather than a reliable record. These pressures intensify once a change leaves your machine and meets a team, which is where scale comes in.

Reviewing AI code on a team and at scale

One developer can hold a review standard in their head. The moment a team is generating AI code into a shared repository, the standard has to live in artifacts and habits, or review volume buries it. A lead, Priya, owns this view, and her job is to make the gate survive contact with thirty pull requests a day.

The first durable artifact is a PR template that records the assistant’s role. Each description names which tool generated the change, the prompt or plan behind it, the new dependencies (ideally none), and what the author verified. This turns review from guesswork into checking a claim, and it lets the next person repeat the pass instead of starting cold. It also makes AI-authored changes visible in the history, which matters when a defect is traced back months later.
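
One way to make that template checkable is to treat the description as a small record. This is a sketch, assuming the four items named above map to four fields. The `AiPrRecord` shape and the `missingFields` helper are hypothetical and not tied to any hosting platform's API.

```typescript
// The four things the template asks every AI-assisted PR to state.
interface AiPrRecord {
  tool: string; // which assistant generated the change
  prompt: string; // the prompt or plan behind it
  newDependencies: string[]; // ideally empty
  verified: string[]; // what the author actually checked
}

// Return the names of fields that are absent or empty.
function missingFields(record: Partial<AiPrRecord>): string[] {
  const missing: string[] = [];
  if (record.tool === undefined || record.tool.trim() === "") missing.push("tool");
  if (record.prompt === undefined || record.prompt.trim() === "") missing.push("prompt");
  // An empty list is a valid answer ("none"); only an omitted field is missing.
  if (record.newDependencies === undefined) missing.push("newDependencies");
  if (record.verified === undefined || record.verified.length === 0) missing.push("verified");
  return missing;
}
```

The design choice worth noting is that "New deps: none" must be stated explicitly. An omitted field and an empty list are treated differently so that silence is never read as "no new dependencies".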

The second is a shared size discipline. Keeping pull requests under 400 lines is a team policy, not a personal habit, because the research is consistent: the number of defects a reviewer finds per line drops sharply once a review runs past a few hundred lines, and small changes merge markedly faster. Pairing that ceiling with a layered pipeline (a linter, a scanner, then a second-AI reviewer for mechanics) keeps human attention on intent. The 30-50% time saving comes from that ordering: the AI filter first and the person last.
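
A size ceiling is easy to enforce mechanically. The following is a minimal sketch, assuming the pull request is available as unified-diff text. The function names are illustrative, and counting a change of exactly 400 lines as acceptable is a choice made here, since the policy above does not settle that boundary.

```typescript
const MAX_CHANGED_LINES = 400; // team ceiling from the policy above

// Count added and removed lines in a unified diff, ignoring file headers.
function countChangedLines(diff: string): number {
  let count = 0;
  for (const line of diff.split("\n")) {
    if (line.startsWith("+++") || line.startsWith("---")) continue;
    if (line.startsWith("+") || line.startsWith("-")) count++;
  }
  return count;
}

// True when the change is small enough for a human to read in full.
function isReviewable(diff: string, limit: number = MAX_CHANGED_LINES): boolean {
  return countChangedLines(diff) <= limit;
}
```

A check like this belongs before the human review step, so that an oversized generated change is split before anyone is asked to read it.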

The third is a shared review checklist that everyone applies the same way: read the diff not the summary, verify every new import, name the assumptions, flag unrequested changes, and confirm the edge cases were tested. A team checklist makes the review repeatable across people and resilient to a busy day, the way the AI Code Review Checklist on this site lays it out.

The durable principle holds whatever the size of the team: AI-written code is a proposal, and a human read of the actual diff is what turns a proposal into committed code you are willing to own. Start with one small thing on your next AI change. Open the full diff, confirm every new import is real, and keep the pull request small enough that the person after you can do the same.

Workflow

  1. State the behavior you asked for, then open the full diff before keeping anything.
  2. Read every hunk, and confirm each new import resolves to a real, intended package on the registry.
  3. For each change, ask what assumptions it makes about inputs, state, and failure, and whether they hold.
  4. Question new dependencies, new abstractions, and any file you did not ask it to touch.
  5. Run the build, the tests, and the edge cases (empty, last page, invalid input), not only the happy path the assistant showed you.
  6. Run an automated layer (linter, scanner, second-AI review) to clear routine issues before your eyes reach intent.
  7. Commit in small, reviewable steps and record the assistant’s role and the prompt in the description so the next reviewer can repeat this pass.

Common mistakes

  • Accepting a diff because the chat explanation sounded right, without scrolling through the actual changed lines.
  • Running an install command for a package name the assistant hallucinated, instead of the existing library you meant.
  • Letting a 900-line AI diff land in one pull request, so the reviewer skims and the real review never happens.
  • Reading for "does this look reasonable" instead of probing the edge case the happy path quietly skips.
  • Treating a green Copilot or Bugbot comment as approval, when the tool only flags mechanics and misses the architectural decision.

Examples

Review prompt
Review this diff critically before I accept it. List every new import in src/api/orders.ts and confirm each package actually exists on the registry. What assumptions does this code make about its inputs and state, and where would each one break? Where is this more complex than the task needed? What did it change that I did not ask for? Which edge case (empty result, last page, invalid input, duplicate request) is untested?
PR description stub
What changed: cap the orders export at 1000 rows and paginate. Generated with: Claude Code, plan mode, then agent. Prompt: "add server-side pagination to the orders export, keep the CSV format". Reviewed: read full diff, ran build + tests, checked empty and last-page cases. New deps: none.

Notes

  • This page covers reviewing a change an assistant produced before you keep it. Writing the prompt that produced the change is covered in Giving AI the Right Context, and agreeing the approach before any code is written is covered in Planning Before Coding.
  • Pairs with the AI Code Review Checklist for a tick-through bar, and with Testing AI-Generated Code, which is the verification step this review hands off to.