
How We Measure Claude Code Acceptance Rate (and What 90 Days Tells Us)

Failures: 1
Outcome: Acceptance rate measured per-commit, per-cluster
Claude Code · Next.js 14 · TypeScript · Supabase · GitHub
By DOT · Founder @ DOTxLabs
Published May 7, 2026 · 6 min read

TL;DR: The methodology DOTxLabs uses to track Claude Code suggestion acceptance across shipped commits — what we count, what we don't, and what 90 days of internal measurement shows.

How We Measure Claude Code Acceptance Rate at DOTxLabs

We instrument Claude Code suggestion acceptance per shipped commit using a lightweight tally — accepted means landed-on-main with at most cosmetic edits, rejected covers everything else including ignored, dismissed, or rewritten. Across 90 days of internal tracking, the rate sits in the 60–80% band depending on cluster, with structural code (routes, schemas, RLS policies) trending higher than UI-detail work. The number tells us about the human-plus-tool system, not the tool alone.

Most acceptance rates published by AI coding tools are vendor-friendly numbers from instrumented IDE telemetry. They count a "suggestion accepted" the moment it appears on screen and the cursor moves on. That's not the same thing as code that ships.

This is how DOTxLabs measures it for our own work, and what 90 days of measurement has actually shown us.


What we count

A Claude Code suggestion is accepted if it lands in a commit on main with edits limited to:

  • Variable or function renames
  • Whitespace, semicolon, or import-order adjustments
  • Moving a block to a different file without changing its contents
  • Adding or removing a single comment

A suggestion is rejected if any of the following are true:

  • Dismissed in-editor (Esc or an equivalent fast-reject gesture)
  • Replaced before the file was saved
  • Saved but rewritten meaningfully before the commit
  • Saved into a draft branch that didn't merge
  • Kept structurally but with the actual logic replaced

The asymmetry matters. "Kept structurally" — where Claude Code suggested a function with the right name, signature, and shape but the wrong implementation — counts as rejected. The structure is often the cheap part. Getting the implementation right is the work.


How we tally

We tried two approaches. The first one didn't survive contact with reality.

What didn't work: end-of-day self-report

For the first 14 days, we asked ourselves at the end of each session: "roughly what fraction of Claude Code's suggestions did you accept today?" The numbers were uniformly in the 80–90% range. Suspiciously high.

The reason was survivorship bias. Suggestions that we rejected within 1–2 seconds — the ones we read, decided no, and dismissed before our brain even logged the interaction — were invisible in self-report. We were only counting suggestions that had survived long enough to be remembered, which biased the sample toward the ones we accepted.

We caught this by accident, when one of us happened to glance at the editor history during a session and counted thirteen dismissals in twenty minutes — none of which we'd remembered as suggestions at all.

What works: per-suggestion tally hotkey

Two keystrokes:

  • Cmd+Shift+A — log an acceptance
  • Cmd+Shift+R — log a rejection

Both write a single line to a per-day JSON log:

{ "ts": "2026-05-07T14:23:11Z", "kind": "accept", "cluster": "schema", "scope": "build" }
{ "ts": "2026-05-07T14:23:18Z", "kind": "reject", "cluster": "ui-detail", "scope": "build" }

The cluster is one of seven: schema, route, migration, rls, ui-detail, state, infra. The scope is build (work that's actually shipping) or experiment (we're trying something speculative). We exclude experiments from the headline number because they're not representative — the speculative-work acceptance rate is much lower and skews the average.
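The exact hotkey wiring depends on the editor, but the logging side is small. As a minimal sketch of what the hotkey could shell out to — the file names, paths, and argument handling below are illustrative assumptions, not our published tooling:

// tally.ts — hypothetical logger the hotkeys invoke.
// Usage: npx tsx tally.ts accept schema build
import { appendFileSync, mkdirSync } from "node:fs";
import { join } from "node:path";

type Kind = "accept" | "reject";
type Cluster = "schema" | "route" | "migration" | "rls" | "ui-detail" | "state" | "infra";
type Scope = "build" | "experiment";

const LOG_DIR = join(process.env.HOME ?? ".", ".claude-tally"); // assumed location

function tally(kind: Kind, cluster: Cluster, scope: Scope): void {
  const ts = new Date().toISOString();
  const day = ts.slice(0, 10); // one log file per day, e.g. 2026-05-07.jsonl
  mkdirSync(LOG_DIR, { recursive: true });
  appendFileSync(join(LOG_DIR, `${day}.jsonl`), JSON.stringify({ ts, kind, cluster, scope }) + "\n");
}

// The editor hotkeys (Cmd+Shift+A / Cmd+Shift+R) call this with the kind
// prefilled; cluster and scope come from whatever picker the editor offers.
const [, , kindArg, clusterArg = "ui-detail", scopeArg = "build"] = process.argv;
tally(kindArg as Kind, clusterArg as Cluster, scopeArg as Scope);

Appending one JSON object per line keeps the log trivially greppable, and a crashed session loses at most the last keystroke.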

Once a week, an aggregator script reads the daily logs and computes per-cluster rates against the commits that landed that week. The "landed" check is what catches the difference between "we kept it in the editor" and "it shipped to production." Anything that didn't make it to a main-merged commit gets recounted as rejected, regardless of what we tallied in the moment.
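As a rough sketch of what that weekly roll-up could look like — again with hypothetical file names, and with the landed-on-main reconciliation reduced to a comment, since that step depends on how your branches map to commits:

// aggregate.ts — weekly roll-up sketch; assumes the log layout from tally.ts above.
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

const LOG_DIR = join(process.env.HOME ?? ".", ".claude-tally");

type Entry = { ts: string; kind: "accept" | "reject"; cluster: string; scope: "build" | "experiment" };

// Read every daily log whose date falls inside the week being rolled up.
function entriesForWeek(start: string, end: string): Entry[] {
  return readdirSync(LOG_DIR)
    .filter((f) => f.endsWith(".jsonl") && f.slice(0, 10) >= start && f.slice(0, 10) <= end)
    .flatMap((f) =>
      readFileSync(join(LOG_DIR, f), "utf8")
        .split("\n")
        .filter(Boolean)
        .map((line) => JSON.parse(line) as Entry)
    );
}

// Per-cluster acceptance rate for build-scope work only.
// The landed-on-main check (re-counting tallied accepts whose code never
// merged as rejections) runs against git history and is omitted here.
function perClusterRates(entries: Entry[]): Record<string, number> {
  const counts: Record<string, { accept: number; total: number }> = {};
  for (const e of entries) {
    if (e.scope !== "build") continue; // experiments excluded from the headline number
    const c = (counts[e.cluster] ??= { accept: 0, total: 0 });
    c.total += 1;
    if (e.kind === "accept") c.accept += 1;
  }
  return Object.fromEntries(
    Object.entries(counts).map(([cluster, c]) => [cluster, c.accept / c.total])
  );
}

console.log(perClusterRates(entriesForWeek("2026-05-04", "2026-05-10")));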


What 90 days has shown

Three things, with varying degrees of confidence.

The headline number sits in a 60–80% band. Across 90 days of build work, the per-week acceptance rate has not gone below 60% and has not gone above 80%. Specific weeks vary inside that band based on the work mix. We do not publish a single point estimate because the variance is real and a single number would be misleading.

Cluster matters more than the model version. Schema work, RLS policies, and database migrations consistently land in the upper end of the band — Claude Code is well-calibrated when there's a clear contract. UI detail work — exact spacing, color choices, micro-interaction tuning — sits in the lower end, because what we're optimizing for is taste, not correctness, and taste is harder to communicate in a prompt than to demonstrate.

The rate has drifted up. The first 30 days were closer to the 60% end of the band. The latest 30 days are closer to the 80% end. We don't think the underlying model has changed enough in 90 days to explain that. The likelier explanation is that we got better at the system around the tool — better at scoping prompts, better at structuring repos so context is reachable, better at recognizing in advance which work to hand to Claude Code and which work to handle ourselves. The acceptance rate is not measuring the tool. It's measuring the human-plus-tool system.


What the measurement doesn't tell us

Three blind spots we know about.

It doesn't measure the suggestions we never asked for. A higher acceptance rate when we ask less often is not the same thing as a more capable tool. We track ask-rate alongside acceptance, and the ratio is what's actually informative — but it's noisier and we haven't published a clean methodology for it yet.

It doesn't measure quality of accepted code. We accept a suggestion if it ships. We don't currently track whether the code we accepted needed refactoring within the next 30 days. We're starting that measurement in May. Early sense: most accepted code is fine; some accepted code reveals itself as the wrong abstraction once a second use case lands. The rate of "regretted acceptances" is something we want a number on by the next quarter.

It doesn't generalize. Our acceptance rate is for our codebase, our voice, our review standard. Reading our number and using it to estimate yours is the same mistake as reading the vendor's marketing number and using it to estimate yours. The methodology generalizes; the number does not.


Why this is in the Build Log

Because the numbers in our quotes — how long a build will take, what the AI-leverage scope is, where the human time has to go — are downstream of this measurement. Without our own acceptance-rate data, we'd be quoting against vendor headline numbers, which would mean quoting low and absorbing the gap. We'd rather measure than guess.

If you're running an AI-assisted team and don't have your own number, you're calibrating against someone else's marketing. The fix is two hotkeys and a JSON log. It pays for itself inside a week.

If you want help wiring this kind of measurement into your team's workflow — or if you're trying to figure out where Claude Code earns its keep on your stack and where it doesn't — book a Stack Diagnostic.

Frequently asked questions

  • What counts as an accepted Claude Code suggestion?

    We count a suggestion as accepted when it lands in a shipped commit on main with no more than light edits — typically renaming variables, adjusting whitespace, or moving a block. Anything that required a meaningful rewrite, or that we kept only for its structure while replacing the logic, counts as rejected.

  • Why measure this at all?

    Because the headline acceptance rates published by tool vendors don't match what we see in production work. Without our own measurement, we'd be calibrating estimates and pricing against numbers that don't reflect how we actually ship. The internal number is what determines whether we keep paying for the tool, where we deploy it, and how we scope time on builds.

  • What's the biggest source of measurement error?

    Survivorship bias. Suggestions that get rejected fast — within 1–2 seconds of appearing — are easy to miss in self-reporting. We caught this by adding a tally hotkey for fast rejects; once we did, our reported rate dropped meaningfully. Don't trust acceptance rates that aren't instrumented.

  • Has the rate changed over the 90 days?

    It's drifted upward, but we don't think the model improved that much in 90 days. The likelier explanation is that we got better at scoping the prompts and at structuring the codebase so Claude Code has the context it needs. The rate is a measure of the human-tool system, not the tool alone.

Want this pattern in your stack?

We build production AI-first systems. If something here looked like the shape of a problem you have, let's scope it.

Book a Stack Diagnostic