Guide · AI in the Developer Workflow

Debugging with AI

A bug is a question about why your code behaves the way it does, and the assistant predicts a plausible answer from the context you give it. You keep control by agreeing on the cause before any code changes: the assistant ranks hypotheses, and your verification check (a log line, a breakpoint, or a failing test) decides which one is real.

Debugging · Hypotheses · Reproduction

~10 min

When to use this

You have an error or a failing test and want to reason through the cause and confirm it before changing code.
A bug is intermittent or hard to reproduce, and you need structure rather than a random change.
A stack trace points at a framework or a dependency and you cannot tell which of your own lines actually triggered it.
You fixed the symptom once already and it came back, which means the real cause was never confirmed.

Key ideas

Understand before fixing: Ask what the error actually means and which line raised it before asking for any change. The assistant is strong at translating a cryptic trace into plain language, which makes this the cheapest first step. Skip it, and a misread stack trace sends the model confidently down the wrong path.
Reproduce first: Pin down the exact input, command, and conditions that trigger the bug. A failure you cannot reproduce is a failure you cannot confirm you fixed. Reproduction is also the only honest test of any fix, so it earns its time twice.
Hypotheses, then evidence: Have the model list the likely causes ranked most likely first, then check each one against logs or a small experiment. Let the evidence decide which cause is real, even when another answer sounds more confident. Pattern matching across millions of repos is not causal reasoning about yours.
Runtime beats inspection: For timing, state, and race conditions, a printed value at the failing line beats any amount of reading. Cursor Debug Mode formalizes this in 2026: it adds log statements, you reproduce, and it reads the captured runtime data before proposing a fix. Confirm with the running program, not the explanation.
Lock the fix with a test: Write a regression test that fails before the fix and passes after it. That test is the proof the cause was real and the guard that the bug stays dead. A fix with no failing-then-passing test is a guess that happened to quiet the symptom.

Why this matters

The expensive way to learn this lesson is to skip straight to a fix. A developer hits a red stack trace, pastes it into the assistant, and types "fix this." The assistant predicts a plausible patch, the test that was red turns green, and the work moves on. Two days later the same failure resurfaces in production on a slightly different input, because the change quieted the symptom without ever addressing the cause.

This happens for a specific reason. The assistant predicts plausible output from patterns across millions of public repositories. Plausible is exactly the dangerous kind, because the explanation reads as confident and complete even when it is wrong about your code. Pattern matching across many codebases is a different operation from causal reasoning inside one. Your bug lives in your specific state, your versions, and your data, none of which the model can see unless you put it in the context.

Consider parseDate() in a real repo. It throws on inputs that carry a timezone offset like "2026-03-14T10:00+02:00". Ask for an immediate fix and the assistant may wrap the parse in a try/catch and return a fallback date. The exception disappears, the obvious test passes, and a silent wrong date now flows downstream into reports. The trace went quiet; the bug went into hiding.
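As a rough illustration of that failure mode, here is a minimal sketch. The names `parseDateStrict`, `parseDateQuiet`, and `FALLBACK` are hypothetical stand-ins invented for this example, not code from any real repo, and the fallback value is an assumption.

```typescript
// Hypothetical stand-in for the parser described above: it rejects
// any input that ends in a "+HH:MM" or "-HH:MM" offset.
function parseDateStrict(input: string): Date {
  if (/[+-]\d{2}:\d{2}$/.test(input)) {
    throw new Error(`unsupported offset in "${input}"`);
  }
  // Inputs without an offset are treated as UTC in this sketch.
  return new Date(input.endsWith("Z") ? input : `${input}Z`);
}

// Assumed fallback value, chosen only for illustration.
const FALLBACK = new Date(0);

// The "just fix it" patch: the exception disappears, and so does the signal.
function parseDateQuiet(input: string): Date {
  try {
    return parseDateStrict(input);
  } catch {
    return FALLBACK; // no error, but the caller now holds the wrong instant
  }
}

// 10:00 at +02:00 is 08:00 UTC, which is what a correct parser should return.
const expectedMs = Date.UTC(2026, 2, 14, 8, 0);
const quietResult = parseDateQuiet("2026-03-14T10:00+02:00");

// True: the call "succeeds" while returning a different instant.
const silentlyWrong = quietResult.getTime() !== expectedMs;
```

Nothing in `parseDateQuiet` fails, so only a comparison against the expected instant exposes the wrong value.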

The payoff of treating debugging as a method is that the cause is confirmed with evidence before a single line changes, so the fix is the right one and a regression test keeps it fixed. The discipline is the same one good debuggers have always used, and the assistant accelerates each step: it reads a trace faster, it lists more hypotheses, and it writes the probe and the test for you. The mechanism that turns "fix this" into a reliable cycle is the analyze-first loop, which the next section breaks down.

How it works

Debugging with an assistant is a loop with distinct stages, and skipping a stage is where it goes wrong. The durable principle in 2026 is analyze first, fix second: force the model to explain and rank before it proposes any change.

Understand. Ask what the error means and which line raised it. Translating a cryptic trace into plain language is the model's strongest move, and it costs almost nothing.
Reproduce. Pin the exact input, command, and conditions. A reproduction is the smallest reliable recipe that triggers the failure. Without it you cannot confirm a fix worked.
Hypothesize. Have the model list the likely causes, ranked most likely first. Ranking forces it to commit to an order you can test rather than hedge across everything.
Confirm. Pick one hypothesis and test it with a single verification check: a printed value, a breakpoint, or a narrowed input. The evidence, not the confidence of the prose, decides which cause is real.
Fix and lock. Make the smallest change that addresses the confirmed cause, rerun the reproduction, then add a regression test that fails on the old code and passes on the new one.
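The Reproduce step above can be sketched as a tiny harness that runs the suspect code on one pinned input and records the outcome as data. This is an illustration under assumptions: `reproduce`, `Outcome`, and the `toMinutes` suspect are hypothetical names made up for the example.

```typescript
// Outcome of one reproduction attempt, kept as data so it can be compared
// before and after a fix.
type Outcome<T> =
  | { kind: "returned"; value: T }
  | { kind: "threw"; message: string };

// Runs the suspect function on one pinned input and never throws itself.
function reproduce<I, O>(suspect: (input: I) => O, pinnedInput: I): Outcome<O> {
  try {
    return { kind: "returned", value: suspect(pinnedInput) };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { kind: "threw", message };
  }
}

// Hypothetical suspect standing in for the code under investigation.
function toMinutes(offset: string): number {
  const m = /^([+-])(\d{2}):(\d{2})$/.exec(offset);
  if (m === null) throw new Error(`bad offset: ${offset}`);
  const sign = m[1] === "-" ? -1 : 1;
  return sign * (Number(m[2]) * 60 + Number(m[3]));
}

// The pinned input that triggers the failure, and one that does not.
const failing = reproduce(toMinutes, "+0200");
const passing = reproduce(toMinutes, "+02:00");
```

Keeping the pinned input in a script like this means the same call can be rerun unchanged after the fix, which is what makes the later confirmation meaningful.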

The distinction that decides success is inspection versus runtime evidence. Reading code is enough for many logic bugs. For timing, shared state, async ordering, and race conditions, a value printed at the failing moment beats any amount of reading, because the bug only exists while the program runs. This is the idea behind Cursor's Debug Mode: the agent adds log statements that stream to a local debug server, you reproduce the bug, and it reads the captured logs to find the root cause from runtime evidence before making a targeted fix, then removes the instrumentation. Claude Code reaches the same place differently, with a debugger subagent configured with the Edit and Bash tools so it can reproduce, patch, and retest in a tight loop; without Bash it can theorize but never confirm.

The trap built into the loop is what one team named the plausible hypothesis trap. The model's top-ranked cause sounds authoritative, so you implement it and skip the confirmation step. When the symptom happens to vanish, you bank a fix you never proved. The guard is mechanical: never apply a hypothesis you have not confirmed with a check that would have ruled out the others. The next section walks one bug through the whole loop so the steps stop being abstract.
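One way to make that guard concrete is to write down what each hypothesis predicts the probe will show, and accept a cause only when the observation leaves exactly one hypothesis standing. The sketch below is illustrative: the `Hypothesis` shape, the labels, and the predicted values are assumptions built around the timezone-offset example used on this page.

```typescript
// One candidate cause plus what it predicts the probe will show.
interface Hypothesis {
  label: string;
  // Returns true when the observed probe value is consistent with this cause.
  predicts: (observedOffsetMinutes: number) => boolean;
}

// Ranked most likely first; labels and predictions are illustrative.
const ranked: Hypothesis[] = [
  { label: "regex never matches +HH:MM", predicts: (v) => Number.isNaN(v) },
  { label: "offset applied with wrong sign", predicts: (v) => v === -120 },
  { label: "offset parsed correctly", predicts: (v) => v === 120 },
];

// Keeps only the hypotheses the observation did not rule out.
function survivors(hypotheses: Hypothesis[], observed: number): Hypothesis[] {
  return hypotheses.filter((h) => h.predicts(observed));
}

// A check confirms a cause only when exactly one hypothesis survives it.
function confirmedCause(hypotheses: Hypothesis[], observed: number): string | null {
  const [only, ...rest] = survivors(hypotheses, observed);
  return only !== undefined && rest.length === 0 ? only.label : null;
}
```

A probe whose result every hypothesis would predict equally well returns `null` here, which is the signal to design a sharper check rather than apply the top-ranked guess.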

A worked scenario

A developer, Maya, owns a small utility module in a Node service. A user reports that timestamps from a partner API come out wrong in the daily export. The failing call is parseDate("2026-03-14T10:00+02:00") in src/utils/date.ts, which throws on inputs with a timezone offset. Here is the loop she runs.

Gather the context. Maya pastes the exact error, the full stack trace, the body of parseDate(), the recent diff that touched it, and the environment: Node 22, the failure is consistent, only on inputs containing a +HH:MM offset.
Understand before fixing. Her first prompt ends with "Do not fix it yet. Explain what the error means and which line raises it." The assistant identifies that an offset regex returns null, and the next line reads a property off that null.
Rank the hypotheses. She asks for the three most likely causes, most likely first. Top: the regex matches Z and bare times but not the +HH:MM form. Second: the offset is parsed but applied with the wrong sign. Third: the input string is trimmed incorrectly upstream.
Confirm with one check. She asks for the single smallest probe. The assistant suggests logging offsetMinutes just before it is used. She reproduces with the exact partner string and the log prints NaN, which matches hypothesis one and rules out the wrong-sign theory, since a sign error would still produce a finite number.
Make the smallest fix. "Confirmed. Handle the +HH:MM offset form. Do not touch the other branches." The diff is a few lines that extend the regex and convert hours and minutes to signed minutes.
Lock it and clean up. She has the assistant write a test that passes "2026-03-14T10:00+02:00" and asserts the UTC result. It fails on the old code, passes on the new. She deletes the temporary log line, then commits the fix and the test together, noting the cause in the message.
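The steps above might look roughly like this in code. It is a hypothetical reconstruction, not Maya's actual module: the function names, the regular expressions, and the split into an offset helper are all assumptions made to show a "before" version that fails, a small fix, and a regression test that distinguishes them.

```typescript
// Assumed input shape for this sketch: "YYYY-MM-DDTHH:MM" plus optional "Z" or offset.
const ISO_PARTS = /^(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2})(Z|[+-]\d{2}:?\d{2})?$/;

// "Before": the pattern expects "+HHMM" with no colon, so "+02:00" never matches.
function offsetMinutesOld(suffix: string): number {
  if (suffix === "" || suffix === "Z") return 0;
  const m = /^([+-])(\d{2})(\d{2})$/.exec(suffix);
  const sign = m?.[1] === "-" ? -1 : 1;
  return sign * (Number(m?.[2]) * 60 + Number(m?.[3])); // NaN when m is null
}

// "After": the smallest change, accepting an optional colon between hours and minutes.
function offsetMinutesFixed(suffix: string): number {
  if (suffix === "" || suffix === "Z") return 0;
  const m = /^([+-])(\d{2}):?(\d{2})$/.exec(suffix);
  if (m === null) return NaN;
  const sign = m[1] === "-" ? -1 : 1;
  return sign * (Number(m[2]) * 60 + Number(m[3]));
}

// Shared parser body; the offset helper is passed in so both versions can be tested.
function parseDateWith(offsetOf: (suffix: string) => number, input: string): Date {
  const m = ISO_PARTS.exec(input);
  if (m === null) throw new RangeError(`unparseable date: ${input}`);
  const [, y, mo, d, h, mi, suffix = ""] = m;
  const offset = offsetOf(suffix);
  // A temporary probe such as logging `offset` here is what would print NaN.
  if (Number.isNaN(offset)) throw new RangeError(`bad offset in: ${input}`);
  const localAsUtc = Date.UTC(Number(y), Number(mo) - 1, Number(d), Number(h), Number(mi));
  return new Date(localAsUtc - offset * 60000);
}

// Regression test: the exact partner string must map to 08:00 UTC.
function regressionTest(parse: (input: string) => Date): boolean {
  try {
    return parse("2026-03-14T10:00+02:00").getTime() === Date.UTC(2026, 2, 14, 8, 0);
  } catch {
    return false;
  }
}

const failsOnOld = !regressionTest((s) => parseDateWith(offsetMinutesOld, s));
const passesOnNew = regressionTest((s) => parseDateWith(offsetMinutesFixed, s));
```

The test is only worth keeping because `failsOnOld` and `passesOnNew` are both true: a test that also passed on the old code would prove nothing about the cause.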

The whole session took fifteen minutes and the export is now correct for offset inputs, with a test that will catch any regression. The contrast is sharp: the "just fix it" version would have wrapped the parse in a try/catch, returned a fallback, and shipped a silently wrong timestamp into every export. Maya and her offset bug carry into the pitfalls below, because the loop has sharp edges that the clean run hid.

Pitfalls and edge cases

Each trap below feels like progress in the moment while quietly breaking the work. Each item names the trap and gives its fix.

Skipping reproduction. Fixing a bug you cannot trigger on demand means you cannot tell whether your change worked or whether the failure simply did not happen to occur. Build the reproduction first, even a one-line script, and keep it as the test.
Trusting the confident explanation. The top-ranked cause reads as fact. Treat every ranked hypothesis as a claim that needs a check that would have falsified the alternatives, and discard the ones the evidence rules out.
The shotgun fix. Letting an agent edit five files chasing one theory leaves you unable to say which change fixed the bug or what else it broke. Change one thing, rerun the reproduction, then decide the next step.
Symptom over cause. A try/catch that swallows the error, or a default that hides a bad value, makes the trace go quiet without addressing why the value was wrong. Fix the cause the evidence pointed to, not the place the error surfaced.
Leaving instrumentation behind. Log lines and debug servers added during the hunt are noise in the diff and can leak data. Remove them before you commit, the same way Cursor Debug Mode strips its own logs once the fix is verified.

Two genuine edge cases stretch the simple loop.

The non-deterministic bug. Race conditions and timing failures do not reproduce on demand, so a single run proves nothing. Here runtime instrumentation is the only honest tool: add logging that captures the order of events and the shared state, then reproduce several times until the bad interleaving shows up in the captured data. A fix you "confirmed" on one lucky green run is not confirmed at all.
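The idea of capturing event order and shared state across repeated runs can be sketched with a toy lost-update race. Everything here is a simplification for illustration: the two tasks, the `runOnce` scheduler, and the trick of deriving the interleaving from the bits of a run number are assumptions standing in for the timing jitter a real system would supply.

```typescript
// One logged step: which task acted, what it did, and the shared state afterwards.
interface TraceEvent { task: string; action: "read" | "write"; shared: number }
type Step = () => void;

// A task is a list of steps; the scheduler may switch tasks between steps.
// Each task reads the shared counter, then later writes back what it read plus one.
function incrementTask(task: string, state: { counter: number }, trace: TraceEvent[]): Step[] {
  let seen = 0;
  return [
    () => { seen = state.counter; trace.push({ task, action: "read", shared: state.counter }); },
    () => { state.counter = seen + 1; trace.push({ task, action: "write", shared: state.counter }); },
  ];
}

// Runs both tasks once. Bit i of `run` decides which ready task takes step i.
function runOnce(run: number): { final: number; trace: TraceEvent[] } {
  const state = { counter: 0 };
  const trace: TraceEvent[] = [];
  const queues = [incrementTask("A", state, trace), incrementTask("B", state, trace)];
  let i = 0;
  while (queues.some((q) => q.length > 0)) {
    const ready = queues.filter((q) => q.length > 0);
    const chosen = ready[((run >> i) & 1) % ready.length];
    const step = chosen === undefined ? undefined : chosen.shift();
    if (step !== undefined) step();
    i += 1;
  }
  return { final: state.counter, trace };
}

// Reproduce several times and keep the traces of the runs that lost an update.
const badRuns: { run: number; trace: TraceEvent[] }[] = [];
for (let run = 0; run < 8; run++) {
  const result = runOnce(run);
  if (result.final !== 2) badRuns.push({ run, trace: result.trace });
}
```

In this toy setup some runs finish with the correct count and some do not, and the captured traces of the bad runs show both reads happening before either write. That ordering, not the final number alone, is the evidence that points at the cause.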

The hallucinated fix. A 2026-specific wrinkle is that the assistant may propose importing a package to solve the bug, and the package does not exist. The USENIX Security 2025 study of package hallucinations analyzed 576,000 generated code samples and found that 19.7% of the 2.23 million package references in them pointed to packages that do not exist, and attackers register those names to attack the people who copy the suggestion. Confirm any new dependency resolves to a real, intended package on the registry before you install it. These edge cases get harder once more than one person owns the code, which is where the next section goes.
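A first-pass existence check against the public npm registry could be sketched as below. The helper names `registryUrl` and `verdictFor` are hypothetical, and the sketch assumes the registry's metadata endpoint answers 200 for a published package and 404 for an unknown one; anything else is treated as inconclusive.

```typescript
type Verdict = "exists" | "missing" | "unknown";

// Builds the registry metadata URL for a package name.
// Scoped names such as "@scope/pkg" get their slash percent-encoded.
function registryUrl(pkg: string): string {
  return `https://registry.npmjs.org/${pkg.replace("/", "%2F")}`;
}

// Maps the HTTP status of that URL to a verdict.
function verdictFor(httpStatus: number): Verdict {
  if (httpStatus === 200) return "exists";
  if (httpStatus === 404) return "missing";
  return "unknown"; // rate limits or outages are not proof either way
}

// Possible usage in an async context on a runtime with a global fetch
// (left as a comment because it needs network access):
//   const res = await fetch(registryUrl("some-package"));
//   const verdict = verdictFor(res.status);
```

A verdict of "exists" only rules out the name being unregistered. It says nothing about whether the package is the intended one, since a squatted name also exists, so the publisher, repository link, and download history still need a human look.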

Debugging on a team and at scale

One person can hold a debug session in their head. The moment a teammate has to reproduce the bug, review the fix, or pick up where you left off, the session needs a durable record, or the knowledge evaporates the moment the trace scrolls away.

The cheapest durable artifact is a short debugging note attached to the issue or PR: the symptom, the exact reproduction, the hypotheses considered, which one the evidence confirmed and how, and the fix. A lead, Priya, makes this part of the team's definition of done for a bug fix, so the next person who sees a similar trace finds the reasoning rather than just the patch. The same note belongs in the PR description, which should also record that an assistant proposed the fix and include the prompt that found the cause, so a human reviewer can repeat the verification rather than trust the diff.
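If a team wants the note to have a consistent shape, one option is a small typed record rendered to plain text. The `DebugNote` fields and the `renderNote` layout below are one possible arrangement, assumed for illustration, not a prescribed format.

```typescript
// Fields mirror the note described above; names are illustrative.
interface DebugNote {
  symptom: string;
  reproduction: string;
  hypotheses: string[];     // in the order they were ranked
  confirmedIndex: number;   // position in `hypotheses` of the confirmed cause
  evidence: string;         // the check that confirmed it
  fix: string;
  assistantPrompt?: string; // prompt that found the cause, if an assistant was used
}

// Renders the note as plain text suitable for an issue or PR description.
function renderNote(note: DebugNote): string {
  const lines = [
    `Symptom: ${note.symptom}`,
    `Reproduction: ${note.reproduction}`,
    "Hypotheses considered:",
    ...note.hypotheses.map(
      (h, i) => `  ${i + 1}. ${h}${i === note.confirmedIndex ? " (confirmed)" : ""}`,
    ),
    `Evidence: ${note.evidence}`,
    `Fix: ${note.fix}`,
  ];
  if (note.assistantPrompt !== undefined) {
    lines.push(`Assistant prompt: ${note.assistantPrompt}`);
  }
  return lines.join("\n");
}

// Example built from the offset bug in the worked scenario.
const offsetBugNote = renderNote({
  symptom: "parseDate throws on inputs with a +HH:MM offset",
  reproduction: 'parseDate("2026-03-14T10:00+02:00")',
  hypotheses: ["regex does not match +HH:MM", "offset applied with wrong sign"],
  confirmedIndex: 0,
  evidence: "offsetMinutes logged as NaN",
  fix: "extend the regex to accept +HH:MM",
});
```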

Scale also shapes how much you let an agent run unattended. A Claude Code debugger subagent runs in its own context, and one running in the background cannot show a permission prompt, so it auto-denies any tool it was not already granted; the safe pattern is to keep exploration read-only and defer Edit and Bash changes to the parent session where a human can see them. Cursor's Bugbot reviews pull requests automatically, and Cursor reports that, as of 2026 and at its default effort, about 80% of the bugs it identifies are resolved by merge time, which makes it a useful first pass. Cursor's own framing keeps the human in the loop here, since the team tracks an acceptance rate (the share of Bugbot findings a reviewer approves), and the widely repeated guidance around these tools is to let human review make the call. Treat any automated debugger, the model's own or a review bot, as a draft that ranks suspects, then have a person confirm the cause and decide what ships.

Keep the fixes small for the same reason you keep the hypotheses single. A bug fix that arrives as a focused diff under a few dozen lines is one a reviewer can actually read and reason about; a sprawling change buries the real correction among incidental edits and overwhelms the review that is supposed to catch the next mistake.

The durable principle to keep, whatever the size of the team: confirm the cause with evidence before you change code, and lock the fix with a test that failed a moment ago. Start with one small thing on your next bug. Before you ask for a fix, ask what the error means and which line raised it, and do not let the assistant touch the code until you can name the cause yourself.

Workflow

01. Paste the exact error, the full stack trace, the code around the failing line, the recent diff, and the environment (language version, OS, key dependency versions).
02. State the steps to reproduce and whether the failure is intermittent or consistent, then ask what the error means before requesting any fix.
03. Have the assistant rank the likely causes most likely first, and test the top one with a single log line, a breakpoint, or a narrowed input.
04. Read the runtime evidence and confirm which hypothesis it supports, discarding the ones the evidence rules out even if they sounded right.
05. Apply the smallest change that addresses the confirmed cause, then rerun the exact reproduction to verify the behaviour actually changed.
06. Add a regression test that fails without the fix and passes with it, and remove any temporary log lines or instrumentation you added.
07. Commit the fix and the test together in one small, reviewable step that records the cause and the prompt that found it.

Common mistakes

Pasting an error and asking it to "just fix this" before anyone has reproduced the bug or read the trace, so the fix targets a guess.
Accepting the first plausible explanation because it sounded confident, without a log line or test that actually rules the others out.
Applying the suggestion and moving on without rerunning the reproduction, so you never confirm the symptom is gone for the right reason.
Letting an agent edit several files chasing a hypothesis, then being unable to tell which change fixed it or what else it broke.
Fixing the visible symptom and skipping the regression test, so the same bug returns the next time the code path runs.

Examples

Debug prompt

Here is the error, the full stack trace, and parseDate() in src/utils/date.ts. It fails only on inputs with a timezone offset, on Node 22, consistently. Do not fix it yet. Explain what the error means and which line raises it. Rank the three most likely causes, most likely first. Then tell me the smallest log line or test input that would confirm the top one.
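For teams that reuse this prompt, a small builder keeps the analyze-first instructions in a fixed order regardless of which bug is pasted in. The `DebugContext` fields and the `buildAnalyzeFirstPrompt` function are hypothetical names for this sketch; the instruction wording is taken from the prompt above.

```typescript
// The context a debug prompt needs; field names are illustrative.
interface DebugContext {
  errorText: string;
  stackTrace: string;
  codeUnderSuspicion: string;
  environment: string; // e.g. runtime version, OS, key dependency versions
  consistency: "consistent" | "intermittent";
  hypothesisCount: number;
}

// Assembles the prompt so the "do not fix yet" instruction always
// comes before the request to rank causes.
function buildAnalyzeFirstPrompt(ctx: DebugContext): string {
  return [
    `Error: ${ctx.errorText}`,
    `Stack trace:\n${ctx.stackTrace}`,
    `Code:\n${ctx.codeUnderSuspicion}`,
    `Environment: ${ctx.environment}. The failure is ${ctx.consistency}.`,
    "Do not fix it yet.",
    "Explain what the error means and which line raises it.",
    `Rank the ${ctx.hypothesisCount} most likely causes, most likely first.`,
    "Then tell me the smallest log line or test input that would confirm the top one.",
  ].join("\n\n");
}
```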

Confirm before the fix

The log shows offsetMinutes is NaN for "2026-03-14T10:00+02:00", which matches hypothesis 1 (the offset regex never matches the "+HH:MM" form). Confirmed. Now make the smallest change that handles "+HH:MM" offsets. Do not touch the other branches. Then write a test that fails on the old code and passes on the new one.

Notes

This page covers debugging an existing bug end to end: understand, reproduce, hypothesize, confirm, fix, and lock with a test. It does not cover production observability stacks or agent-evaluation tooling, which are their own topic.
Pairs with Giving AI the Right Context, since a debug session is only as good as the trace and diff you supply, and with Reviewing AI Code Safely, which is where the proposed fix gets its final read.