Guide · AI in the Developer Workflow

Refactoring with AI

Refactoring changes the structure of code while its observable behaviour stays identical. An assistant can rename, extract, and reshape in seconds, and that speed is exactly what makes it risky. Your control comes from three safeguards: a green test suite that locks current behaviour before you touch anything, a plan that moves in small steps, and the final diff you read line by line before you keep it.

Refactoring · Characterization tests · Behaviour

~10 min

When to use this

The code works but is hard to read, test, or extend, and you can describe the structure you want instead.
You have tests that cover the behaviour, or you can write them before you change a single line.
A rename, an extraction, or a signature change has to ripple consistently across many files and call sites.
The assistant proposes a large, tidy-looking rewrite and you cannot tell at a glance what behaviour it changed.

Key ideas

Behaviour stays put: The point of a refactor is that nothing observable changes: same return values, same thrown errors, same side effects, same public interface. Your tests are how you prove that. They pass before you start, and they still pass after every step. A failing test during a refactor means the move broke something, not that you should change the test.
Two hats, never both: Restructuring and changing behaviour are separate jobs, a distinction Martin Fowler calls the two hats. Wear one hat per commit. Either you are refactoring with green tests throughout, or you are adding function and writing new tests. Never do both at once, so a reviewer can see at a glance which kind of change they are reading.
Lock behaviour with characterization tests: When code has no tests, write characterization tests first. A characterization test, a term from Michael Feathers, records what the code actually does today, not what it should do. You run the code, capture its current output, and pin it. That captured behaviour becomes the safety net every later move has to keep green.
Small moves over rewrites: Models clear only about 40% of complex refactoring tasks in benchmarks, and AI-authored changes carry roughly 1.7x the issue rate and 1.4x the critical-defect rate of hand-written code. They handle single-file edits but collapse on repository-wide reasoning. A series of small, verified moves keeps each step inside what the model and your review can actually check.
Read the diff, not the description: A refactored file often looks tidier than the original even when it is broken. The dangerous failures are quiet: a dropped validation branch, an inverted boolean, a changed error type. Read the diff against your stated constraints and reject anything that alters a return value, an error path, or a side effect, however clean it looks.

Why this matters

Refactoring is the one kind of change where the user is supposed to notice nothing. You improve the shape of the code, and every output, error, and side effect stays exactly as it was. That is also what makes an AI refactor dangerous. The assistant predicts plausible code, and a refactored file that looks tidier than the original reads as a success even when it quietly changed behaviour.

Picture a developer asking an assistant to clean up a sprawling validateOrder function. The reply is genuinely nicer: shorter, well-named, the nesting flattened. The developer skims it, sees no obvious problem, and merges. What the cleanup silently dropped was one branch, the guard that rejected orders with a negative quantity, because the model folded several conditions into one and lost an edge case along the way. There was no failing test, because nobody had written one for negative quantities. Weeks later a refund flow lets a negative line item through, and the team is debugging a money bug that the refactor introduced and the diff hid behind cleaner formatting.

The research describes this exact shape. Two independent studies in 2024 and 2026 found state-of-the-art models clearing only about 40% of complex refactoring tasks, and the common failures are the quiet ones: dropping a branch that held input validation, inverting a boolean so a condition reads backwards, mishandling JavaScript's this binding when extracting a function. A 2026 SonarSource survey found 88% of developers reporting negative AI effects on technical debt, and 53% pointing to code that looked correct but was unreliable.

The payoff of doing it well is that you keep all of the speed and remove almost all of the risk. The shared spine of this whole topic is that the assistant predicts from its context, and your control comes from three levers: the context you supply, the plan you agree before it writes, and the diff you review before you keep it. A refactor leans hard on all three, with one addition that is specific to this kind of change. Because behaviour must stay fixed, you get a fourth check the model cannot fake: a test suite that was green before and has to be green after. The mechanism that turns that into a reliable safety net is what the next section breaks down.

How it works

A safe refactor is a chain of behaviour-preserving transformations. Each step changes the structure and leaves the observable behaviour identical, which means a passing test before a step should still pass after it. Get the pieces in the right order and the model can move fast without you losing control. The pieces are these.

The safety net. A green test suite that exercises the behaviour you are about to reshape. It is what tells you a step preserved behaviour or broke it. No net, no safe refactor.
Characterization tests. When the code has no tests, you write tests that record what it actually does today, including its quirks, before you change anything. They pin current behaviour so the refactor has something to stay faithful to.
The constraint. An explicit statement of what must not change: the signatures, the return shapes, the thrown error types, the side effects, the public interface. This is the rule the model is forbidden to break.
The small move. One rename, one extraction, one inlined variable at a time, each followed by a full test run. Small moves keep every step inside what your review and the model can actually check.
The diff. The exact lines that changed, read against the constraint, not against the prose summary the model wrote about itself.
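The safety net and the characterization tests can be sketched in a few lines. This is a minimal illustration, not a real test suite: the body of parseTotals here is a hypothetical stand-in, and the names observe, cases, pinned, and driftedCases are invented for the example. It records either the return value or the thrown error type for each awkward input, then reports which cases no longer match after a move.

```typescript
type Line = { price: number; qty: number };

// Hypothetical stand-in for the untested legacy function; the real body is not shown in this guide.
function parseTotals(lines: Line[]): number {
  let total = 0;
  for (const line of lines) {
    if (line.qty < 0) throw new RangeError("negative quantity");
    total += line.price * line.qty;
  }
  return total;
}

// What a call did: the value it returned, or the type of error it threw.
type Outcome =
  | { kind: "returned"; value: number }
  | { kind: "threw"; errorName: string };

function observe(target: (lines: Line[]) => number, input: Line[]): Outcome {
  try {
    return { kind: "returned", value: target(input) };
  } catch (e) {
    return { kind: "threw", errorName: e instanceof Error ? e.name : "non-Error" };
  }
}

// Inputs chosen to include the awkward cases, not only the happy path.
const cases: { name: string; input: Line[] }[] = [
  { name: "empty cart", input: [] },
  { name: "two lines", input: [{ price: 5, qty: 2 }, { price: 1, qty: 3 }] },
  { name: "negative quantity", input: [{ price: 5, qty: -1 }] },
];

// Captured once, before the first move. A real suite would store these
// outcomes as committed expected values instead of recomputing them.
const pinned: Outcome[] = cases.map((c) => observe(parseTotals, c.input));

// After each move, list the cases whose outcome no longer matches the pin.
function driftedCases(candidate: (lines: Line[]) => number): string[] {
  return cases
    .filter((c, i) => JSON.stringify(observe(candidate, c.input)) !== JSON.stringify(pinned[i]))
    .map((c) => c.name);
}
```

An empty result from driftedCases means the move preserved every pinned behaviour. A non-empty result names exactly which case to investigate before going any further.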

The distinction that decides success is Martin Fowler's two hats. When you wear the refactoring hat, every change is behaviour-preserving and the tests stay green throughout; a red test means you made a mistake, not that the test is wrong. When you wear the adding-function hat, you are changing behaviour, so you write new tests and existing ones may legitimately change. The two hats never share a commit. Mixing them is how a behaviour change hides inside the noise of a hundred moved lines, where no reviewer will find it.

The second decisive fact is where models break. They do reasonably well on a localized, single-file edit. They collapse on repository-level reasoning: tracking every call site of a renamed function across files, updating a signature consistently everywhere it is used, rewiring all the callers. Most AI refactoring failures are incomplete compound transformations, the extraction that happened but left three callers pointing at the old shape. The practical consequence is that you ask the assistant to map the blast radius first, so you can see whether the move is a single-file edit it can handle or a cross-file change where you need to verify every call site yourself. With the order in place, the next section walks one refactor through end to end.
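One way to cross-check the blast radius the assistant reports is a plain text search of your own. The sketch below is an assumption-laden illustration: findCallSites and the in-memory repo contents are hypothetical, and a textual match will miss aliased imports or dynamic calls that a compiler or language server would find.

```typescript
type CallSite = { file: string; line: number; text: string };

// Find lines that appear to call fnName, skipping the line that defines it.
// Assumes fnName is a plain identifier with no regex metacharacters.
function findCallSites(files: { [path: string]: string }, fnName: string): CallSite[] {
  const call = new RegExp("\\b" + fnName + "\\s*\\(");
  const definition = new RegExp("\\bfunction\\s+" + fnName + "\\b");
  const hits: CallSite[] = [];
  for (const path of Object.keys(files)) {
    const lines = files[path].split("\n");
    for (let i = 0; i < lines.length; i++) {
      if (call.test(lines[i]) && !definition.test(lines[i])) {
        hits.push({ file: path, line: i + 1, text: lines[i].trim() });
      }
    }
  }
  return hits;
}

// Hypothetical file contents, standing in for a real checkout module.
const repo: { [path: string]: string } = {
  "cart.ts": "export function parseTotals(order) {\n  return 0;\n}",
  "checkout.ts": "const a = parseTotals(order);\nlog(a);\nconst b = parseTotals(draft);",
  "refund.ts": "const r = parseTotals(refundOrder);",
};

const callSites = findCallSites(repo, "parseTotals");
// Three hits: checkout.ts lines 1 and 3, refund.ts line 1.
```

If the assistant's list and your own search disagree, settle that before any edit is applied.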

A worked scenario

Maya, a developer, inherits a checkout module with a parseTotals function in cart.ts. It is two hundred lines, deeply nested, and has no tests. She wants to extract the validation logic into a clean validateOrder(order) so it can be reused, and she wants behaviour to stay byte-for-byte identical. Here is how she runs it.

Build the safety net first. Because the code is untested, she does not refactor yet. She asks the assistant to write characterization tests that pin the current behaviour, including the awkward cases: empty cart, negative quantity, missing field. The prompt is explicit, "do not change the code, capture what it does today, and run the tests so I can see them pass." Two of the captured behaviours look like bugs, but she keeps them for now; the refactor's only job is to preserve behaviour, and fixing bugs is the other hat.
State the constraint up front. She writes it at the top of the refactor prompt: keep the function signature, return the same shape, throw the same RangeError on bad totals, preserve the empty-cart and negative-quantity behaviour the characterization tests just locked. Do not edit the tests.
Ask for the blast radius. Before any edit, she asks the assistant to list every file and call site the extraction touches. It names cart.ts and three callers in checkout.ts and refund.ts. Now she knows this is a cross-file move, not a single-file edit, so the call sites need checking.
One move, then run. She has it perform the extraction only, no other cleanup, and produce one diff. She runs the full suite. Nineteen of nineteen characterization tests stay green. Checking the call sites from the blast-radius list, though, she finds one caller in refund.ts that still references the old inline shape: the incomplete compound transformation described earlier. She has the assistant fix that one caller and runs the suite again.
Read the diff against the constraint. The diff is forty lines. She checks the extracted validateOrder against the original line by line: same conditions, same error type, no dropped branch. The negative-quantity guard is intact. She rejects nothing.
Commit the refactor alone. She commits the extraction as one behaviour-preserving change, with a message that says so. The two suspected bugs she found earlier go on a separate ticket, to be fixed under the adding-function hat in their own commit.

The before and after is the lesson. Without the characterization tests, a dropped guard like the one described in Why this matters would have shipped invisibly. With them, the refactor was provably faithful, the one stale caller was caught in minutes, and the structure improvement landed clean. Maya and this same validateOrder change carry into the pitfalls below.
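The extraction at the centre of this scenario might look roughly like the sketch below. The real parseTotals is two hundred lines and is not shown in this guide, so both function bodies, the Order and Item types, and the outcome helper are hypothetical. The point is the shape of the move: validation leaves the loop, the error types and their order stay the same, and the old and new versions are compared on the awkward cases.

```typescript
type Item = { sku?: string; price: number; qty: number };
type Order = { items: Item[] };

// Before (hypothetical): validation is tangled into the totals loop.
function parseTotalsBefore(order: Order): number {
  let total = 0;
  for (const item of order.items) {
    if (item.sku === undefined) throw new TypeError("missing sku");
    if (item.qty < 0) throw new RangeError("negative quantity");
    total += item.price * item.qty;
  }
  if (total > 1000000) throw new RangeError("bad total");
  return total;
}

// After: the same checks, in the same order, extracted into validateOrder.
function validateOrder(order: Order): void {
  for (const item of order.items) {
    if (item.sku === undefined) throw new TypeError("missing sku");
    if (item.qty < 0) throw new RangeError("negative quantity");
  }
}

function parseTotalsAfter(order: Order): number {
  validateOrder(order);
  let total = 0;
  for (const item of order.items) {
    total += item.price * item.qty;
  }
  if (total > 1000000) throw new RangeError("bad total");
  return total;
}

// Describe what a call did as a string, so two versions can be compared.
function outcome(fn: (o: Order) => number, order: Order): string {
  try {
    return "returned " + fn(order);
  } catch (e) {
    return e instanceof Error ? "threw " + e.name + ": " + e.message : "threw non-Error";
  }
}

// The awkward cases from the scenario: empty cart, negative quantity, missing field.
const awkward: Order[] = [
  { items: [] },
  { items: [{ sku: "A", price: 5, qty: -1 }] },
  { items: [{ price: 5, qty: 1 }] },
  { items: [{ sku: "A", price: 5, qty: 2 }] },
];

const allMatch = awkward.every(
  (o) => outcome(parseTotalsBefore, o) === outcome(parseTotalsAfter, o)
);
```

A check like allMatch supplements the diff review and does not replace it. It proves agreement only on the inputs listed.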

Pitfalls and edge cases

Each trap here feels like progress while quietly turning the refactor into a behaviour change you cannot see.

Refactoring with no net. Reshaping untested code feels efficient because nothing turns red. Nothing turns red because nothing is watching. The fix is to write characterization tests first, capturing current behaviour before the first move, so a shifted behaviour has something to fail against.
Editing the test to make it pass. When a test fails mid-refactor, the tempting shortcut is to adjust the test. That converts your safety net into a rubber stamp. While wearing the refactoring hat, a red test means the move broke something; revert the move, do not soften the check.
Accepting the clean-looking rewrite. A large diff that reads tidier than the original earns trust it has not proven. Tidier formatting hides dropped branches and inverted conditions. The fix is to keep moves small enough to read fully, and to review against the constraint and the tests rather than against how clean it looks.
Mixing the two hats. Slipping a small behaviour fix into a big structural commit feels harmless and saves a commit. It also buries the one line that matters under a hundred that do not. Keep the hats in separate commits so the real change is visible.
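To make the clean-looking rewrite concrete, here is a small hypothetical pair: a nested original and a tidied version in which several conditions were folded into one. The function names and the disagreements helper are invented for illustration. Sweeping both versions over a small grid of inputs shows where they diverge, which skimming the tidier code does not.

```typescript
// Original (hypothetical): nested and awkward, but it rejects negative quantities.
function isValidLineOriginal(price: number, qty: number): boolean {
  if (price >= 0) {
    if (qty !== 0) {
      if (qty < 0) {
        return false;
      }
      return true;
    }
  }
  return false;
}

// Tidied (hypothetical): reads better, but the qty < 0 guard was lost in the fold.
function isValidLineTidied(price: number, qty: number): boolean {
  return price >= 0 && qty !== 0;
}

type Input = { price: number; qty: number };

// Try every combination of the given values and collect inputs where the two versions disagree.
function disagreements(
  a: (price: number, qty: number) => boolean,
  b: (price: number, qty: number) => boolean,
  values: number[]
): Input[] {
  const out: Input[] = [];
  for (const price of values) {
    for (const qty of values) {
      if (a(price, qty) !== b(price, qty)) out.push({ price, qty });
    }
  }
  return out;
}

const diffs = disagreements(isValidLineOriginal, isValidLineTidied, [-1, 0, 1, 2]);
// Every disagreement has qty === -1: the dropped negative-quantity guard.
```

A grid this small only works for simple inputs. The habit that carries over is to include boundary values such as negatives and zero in whatever cases you pin.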

Then the genuine edge cases.

The cross-file rename the model leaves half-done. Repository-wide changes are where models fail most, by updating the definition and missing some callers. The handling is to refuse to treat a cross-file refactor as one step: have the assistant enumerate every call site first, apply the change, then verify with a compile or a full test run that no caller still references the old shape, exactly as Maya caught the stale refund.ts caller.

The behaviour that turns out to be a bug. Characterization tests pin what the code does, including its mistakes. If your refactor would "fix" one of those mistakes, that is a behaviour change wearing a refactor's clothes. The 2026 wrinkle is that an assistant will often quietly correct an obvious-looking bug while restructuring, which breaks behaviour preservation and any caller that depended on the old quirk. Keep the suspected bug behaviour green through the refactor, and fix it deliberately afterward under the adding-function hat, with its own test and its own commit. Keeping that discipline consistent when several people refactor the same repo is where scale comes in.

Doing it on a team and at scale

One developer can hold the two-hats rule in their head. The moment a second person, a second tool, or a second month is involved, the discipline has to live in shared artifacts, or the history fills with commits that quietly mixed structure and behaviour and nobody can bisect.

The first durable artifact is the test suite itself, treated as the contract. A team that refactors with AI keeps behaviour-preservation tests in version control and runs them in CI, so a refactor that breaks behaviour fails the pipeline rather than relying on a tired reviewer to spot it. Governed refactoring, validated by tests and CI, is what separates a safe AI refactor from a suggestion that merely looked right in an editor. Organizations with systematic testing protocols for refactored code report far fewer post-deployment issues.

The second artifact is the small pull request. Review capacity, not code volume, is the bottleneck in 2026, and an assistant can produce a thousand-line refactor in seconds that no human can deeply review. The widely cited bar is to treat 200 lines as the target and 400 lines as a hard ceiling, since reviews of larger diffs turn superficial and PRs over 400 lines often stall for days. Smaller, single-purpose refactor PRs cut cycle time by 30–40% and let a reviewer actually check the diff against the behaviour constraint. A lead, Priya, sets a PR size gate in CI and adds two lines to the PR template: which assistant produced the change, and the prompt or task that drove it, so a reviewer can ask the right question.
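A size gate of this kind can be sketched as a small script. The guide does not say how Priya's gate is built, so everything here is an assumption: the names countChangedLines and sizeGate are hypothetical, a PR's size is taken to be added plus deleted lines, and the input is assumed to be the text produced by git diff --numstat, where each row is added, deleted, and path separated by tabs and binary files report a dash for both counts.

```typescript
type Verdict = "ok" | "over-target" | "over-ceiling";

// Sum added + deleted lines from `git diff --numstat` output.
function countChangedLines(numstat: string): number {
  let total = 0;
  for (const row of numstat.split("\n")) {
    const parts = row.split("\t");
    if (parts.length < 3) continue; // blank or malformed row
    const added = parseInt(parts[0], 10);
    const deleted = parseInt(parts[1], 10);
    if (isNaN(added) || isNaN(deleted)) continue; // binary files report "-"
    total += added + deleted;
  }
  return total;
}

// Thresholds follow the 200-line target and 400-line ceiling mentioned above.
function sizeGate(numstat: string, target: number = 200, ceiling: number = 400): Verdict {
  const changed = countChangedLines(numstat);
  if (changed > ceiling) return "over-ceiling";
  if (changed > target) return "over-target";
  return "ok";
}

// Hypothetical numstat output for a refactor PR.
const sampleNumstat = "120\t30\tcart.ts\n-\t-\tlogo.png\n40\t15\trefund.ts\n";
const verdict = sizeGate(sampleNumstat); // 205 changed lines, so "over-target"
```

In a pipeline, an over-ceiling verdict would fail the build, while an over-target verdict might only post a warning. That split is a design choice for the team to make.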

The third is a shared statement of the rule, in a versioned file the tools read, such as CLAUDE.md or AGENTS.md. A short entry, "refactors must be behaviour-preserving, keep tests green, never edit a test to pass during a refactor, list call sites before a cross-file change," travels with the repo and reaches every contributor's assistant. The tools support the small-step rhythm directly: Cursor's Plan Mode, toggled with Shift+Tab, writes a reviewable Markdown plan into .cursor/plans that you edit before the agent touches code, and Claude Code's plan mode (also Shift+Tab, or claude --permission-mode plan) reads the codebase without writing, so you agree the sequence of moves before any diff exists.

The durable principle to keep, whatever the size of the team: a refactor is only safe when behaviour is locked before you start and provably unchanged after you finish, and when each move is small enough that a human can read the diff. Start with one small thing. The next time you reach for a refactor on untested code, write the characterization tests first and refactor nothing until they are green.

Workflow

01 Confirm the tests covering this code pass, and write characterization tests for any behaviour they leave uncovered.
02 State the target structure and the hard constraint that behaviour must not change, naming the signatures, outputs, and error types that must stay identical.
03 Ask the assistant to map every file and call site the move touches before it edits anything, so you can see the blast radius.
04 Let it apply one move at a time, a single rename or one extraction, and run the full test suite after each move.
05 Read the diff end to end and reject anything that alters a return value, an error path, or a side effect.
06 Commit each green refactoring on its own, separate from any later behaviour change, so the history stays bisectable.
07 Run the suite once more on the combined result and confirm the public interface is byte-for-byte what it was before you started.

Common mistakes

Refactoring code that has no tests, so a quietly shifted behaviour ships with no failing check to catch it.
Bundling a structure change and a behaviour change into one commit, which hides the real change inside the noise of the move.
Accepting a large rewrite because it reads cleaner, when the assistant has dropped an edge-case branch or inverted a condition the tests do not cover.
Letting the assistant rename across files without checking every call site, so one stale caller compiles but breaks at runtime.
Editing a test to make it pass during a refactor, which converts the safety net into a rubber stamp.

Examples

Scoped refactor prompt

Extract the validation block in order.ts into a pure function validateOrder(order). Behaviour must stay identical: same return shape, same thrown errors (RangeError for bad totals), same edge cases for empty carts. The existing tests in order.test.ts must still pass; do not edit them. First, list every file and call site this touches. Then show one diff for the extraction only. Make no behaviour changes. If you find a behaviour you cannot preserve, stop and tell me.

Characterization test request

parseTotals() in cart.ts has no tests and I want to refactor it safely. Do not change it yet. First write characterization tests that pin its CURRENT behaviour, including whatever it does on empty input, negative quantities, and missing fields. Run them so I can see they pass against the code as it stands today. Flag any output that looks like a bug; I will decide separately whether to keep or fix it.

Notes

This page covers behaviour-preserving structural change: locking behaviour, moving in small steps, and reading the diff. It does not cover adding features or fixing bugs, which are the adding-function hat and belong in their own commits.
Pairs with Giving AI the Right Context, where the constraint that behaviour must not change is part of the context, and with Reviewing AI Code Safely, where you check the diff the refactor produced.