You Cannot Review Your Way Out of This

Verification at AI scale is a selection problem, and every guardrail, eval, and playbook only works after a human decides what deserves attention.

Jun 16, 2026

Somewhere in your organization this week, an AI-drafted document was approved by someone who did not fully read it. Not from laziness; from arithmetic. Machine-generated drafts, summaries, analyses, and reports now arrive faster than the people responsible for them can check, and the gap widens every month, in every department, whether or not anyone has said it out loud.

Software engineering is simply where that gap is measured best, because no other knowledge work is so thoroughly instrumented. The numbers from that world deserve attention from everyone whose job produces documents rather than code, since developers are only the first to hit a wall the rest of us are approaching. LinearB’s 2026 benchmarks, built from 8.1 million pull requests across 4,800 engineering teams, found AI-assisted work waits 4.6 times longer for a first review than human-written work, and barely a third of it gets accepted, against 84.5 percent for the unassisted kind. A queue is forming in front of the one resource AI cannot multiply: a person willing to say this is correct.

The perception side is stranger. When METR ran a randomized trial with experienced open-source developers, participants believed AI had made them roughly 20 percent faster while the measured result was 19 percent slower. METR’s own 2026 follow-up suggests that slowdown has likely narrowed or reversed as the tools improve, so treat the direction as provisional; the durable finding is the gap itself. The people closest to the work could not tell whether it was going faster. If that is true in the most measurable profession on earth, consider the odds that anyone drafting contracts, reports, or board memos with AI knows their real number.

The standard interpretation of all this is that review is too slow. Leaders read these numbers and conclude they need faster reviewers, AI-assisted checking tools, better queue management. The vendors agree enthusiastically, because every one of those conclusions has a product attached to it.

The standard interpretation is wrong. The numbers describe a system where generation capacity and verification capacity have decoupled, and no amount of review acceleration reconnects them. Sonar’s 2026 State of Code survey of more than 1,100 developers found AI now accounts for roughly 42 percent of committed code, expected to reach 65 percent by next year. The same survey found 96 percent of developers don’t fully trust that code to be correct, only 48 percent always verify it before committing, and 38 percent say reviewing AI output takes more effort than reviewing a colleague’s work. Read those findings together. This is a machine that produces artifacts faster than anyone can check them, staffed by people who know the artifacts are unreliable and have quietly stopped checking anyway. Swap the code for contracts, research summaries, financial models, or marketing copy, and nothing in that sentence changes except the file format.

The question worth asking is what verification even means once output volume crosses review capacity. The answer is structurally different from what most organizations are building.

Two curves, only one of them moves

A recent economics paper on the automation frontier frames the problem as two competing cost curves. The cost of generating an artifact (code, a brief, a report, a vulnerability finding) falls exponentially as models improve. The cost of verifying that artifact stays roughly where it has always been, because it is bounded by human cognition. A senior developer can meaningfully review somewhere around 150 lines of code an hour. A careful reader of contracts, research memos, or financial models moves at a similarly fixed pace, and has for as long as those documents have existed. Attention does not get firmware updates, and there is no model release on any roadmap that changes it.

Engineers at Agoda reached the same conclusion from production data: AI tools measurably raised individual output while project-level velocity barely moved, because coding was never the real constraint. The constraint sits upstream, in specification and verification, the two activities that require human judgment. They frame this as a rediscovery of Fred Brooks’ forty-year-old argument that accelerating one stage of a pipeline buys you almost nothing if the binding stage is elsewhere. The industry has spent three years accelerating the stage that was never binding.

When one curve falls exponentially and the other stays flat, the gap between them compounds. This is the part most adoption conversations skip. An organization that responds to a compounding gap with linear capacity (more reviewers, longer review windows, a second approval step) has chosen to lose slowly rather than rethink the game. The arithmetic forecloses the strategy before the first hire is made.
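The compounding claim can be made concrete with a few lines of arithmetic. What follows is an illustrative sketch, not a model of any real organization: the 150 lines an hour comes from the figure cited above, while the starting volume, the monthly growth rate, the review hours per person, and the hiring pace are all hypothetical numbers picked to show the shape of the two curves.

```python
REVIEW_RATE_LINES_PER_HOUR = 150  # review pace cited in the text above
REVIEW_HOURS_PER_MONTH = 40       # assumption: hours one reviewer can give to review
START_VOLUME = 100_000            # assumption: lines generated in month 0
MONTHLY_GROWTH = 1.10             # assumption: generation grows 10% a month
START_REVIEWERS = 10              # assumption: reviewers on staff in month 0
HIRES_PER_MONTH = 1               # assumption: the "linear capacity" response


def unreviewed_share(month: int) -> float:
    """Fraction of generated lines that nobody has the capacity to review."""
    generated = START_VOLUME * MONTHLY_GROWTH ** month
    reviewers = START_REVIEWERS + HIRES_PER_MONTH * month
    capacity = reviewers * REVIEW_HOURS_PER_MONTH * REVIEW_RATE_LINES_PER_HOUR
    return max(0.0, 1 - capacity / generated)


if __name__ == "__main__":
    for month in (0, 12, 24, 36):
        print(f"month {month:2d}: {unreviewed_share(month):.0%} unreviewed")
```

Under these made-up inputs the reviewer headcount more than quadruples over three years and the unreviewed share still rises every year. Change the constants and the crossover moves, but as long as one side grows by a ratio and the other by an increment, the direction does not.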

The same Agoda piece offers a useful taxonomy of the postures available once you accept that. You can read every line AI produces, the white-box stance, which preserves full assurance and caps your throughput at exactly the 150 lines an hour you had before. You can ship whatever the machine generates, the black-box stance, which is fast right up until it meets a production system with real users and real consequences. Or you can verify selectively, at boundaries and against intent, accepting that assurance is now a budget to be allocated rather than a property to be assumed. The same three stances exist for every AI draft that crosses a desk, whether the desk belongs to an engineer, a lawyer, or an analyst, and whether anyone has named them or not. Most organizations are running the third stance already. Few have admitted it, and fewer still have decided deliberately where the budget goes, which means the allocation is being made anyway, by default, by whoever is most tired on a given afternoon.

I work inside a national law firm, where this arithmetic has a sharper edge than it does in software. A pull request that ships with a subtle bug costs you an incident. A factum that ships with a fabricated citation costs you a professional reputation, and possibly a client. The stakes are asymmetric: the price of being wrong exceeds the benefit of being right, which means verification cannot be quietly abandoned the way roughly half of the developers in the Sonar survey, the ones who do not always verify before committing, have abandoned it. Regulated industries never get the option of not checking; the only choice available is what to check.

That phrase, choosing what to check, is the actual answer to the scaling question, and everything else in this piece is an elaboration of it.

What the tools actually do

The toolchain that has grown up around this problem (guardrails, evals, LLM-as-judge pipelines, formal methods, provenance tracking) is usually marketed as verification capacity. The framing is that these tools let you check more. Watch how each one actually works and something close to the opposite is true: each earns its keep by deciding what will never be examined, and by making that decision survivable.

Guardrails verify boundaries instead of artifacts. A guardrail doesn’t ask whether an output is good; it asks whether the output crosses a line you drew in advance: leaks data, violates a policy, exceeds a risk threshold. Production guardrail systems, like the centralized service Singapore deployed for its public chatbots, work precisely because they refuse to evaluate quality. They evaluate violations, which is a vastly smaller problem. The honest description of a guardrail is a decision about which failures you can afford to detect cheaply and which ones you’ve accepted you won’t catch at all.
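A minimal sketch of that idea, with the caveat that it is not how Singapore's service or any particular product works: the rule names and patterns below are hypothetical, and a production system would use far more than two regular expressions. The point is only the shape of the check, which looks for lines crossed and has no opinion on whether the text is any good.

```python
import re
from dataclasses import dataclass

# Hypothetical boundary rules. Each one names a violation, not a quality bar.
RULES = {
    "email_address": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "privileged_marker": re.compile(r"privileged\s+and\s+confidential", re.IGNORECASE),
}


@dataclass(frozen=True)
class GuardrailResult:
    allowed: bool
    violations: tuple[str, ...]


def check_boundaries(text: str) -> GuardrailResult:
    """Report which pre-drawn lines the text crosses; say nothing about quality."""
    violations = tuple(name for name, pattern in RULES.items() if pattern.search(text))
    return GuardrailResult(allowed=not violations, violations=violations)


if __name__ == "__main__":
    print(check_boundaries("Contact jane@example.com for the memo."))
    # A confidently wrong sentence crosses no boundary, so it passes.
    print(check_boundaries("The limitation period is forty years."))
```

The second call is the instructive one. A sentence that is simply wrong crosses no line, so the guardrail lets it through, which is exactly the class of failure the design has accepted it will not catch.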

Evals and judge pipelines apply the same selective logic to quality. Rubric-based evaluation has become the working standard for assessing open-ended AI output at scale, and it earns its place. But a judge model scoring outputs against explicit criteria is a statistical instrument. It reports on the distribution of outputs, it has to be calibrated periodically against human experts, and it carries known biases toward verbose answers and toward output resembling its own. It stays silent on whether any individual artifact is correct. Teams that treat eval scores as verification have confused quality control sampling with inspection, and the difference matters enormously when one bad artifact can hurt you.
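The difference between sampling and inspection shows up as soon as you ask what a judged sample can actually support. The sketch below uses the Wilson score interval, a standard way to put a range around an observed pass rate; it is my choice for illustration, not a method any of the cited sources prescribes, and it assumes the judge's verdicts on the sample are themselves correct, which is the calibration problem described above.

```python
import math


def wilson_interval(passes: int, sample_size: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% range for the true pass rate, given a judged sample."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    if not 0 <= passes <= sample_size:
        raise ValueError("passes must be between 0 and sample_size")
    p = passes / sample_size
    denom = 1 + z**2 / sample_size
    center = (p + z**2 / (2 * sample_size)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / sample_size + z**2 / (4 * sample_size**2)) / denom
    )
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Hypothetical eval run: a judge passes 190 of 200 sampled outputs.
    low, high = wilson_interval(190, 200)
    print(f"population pass rate is plausibly between {low:.1%} and {high:.1%}")
```

The output is a statement about the population the sample was drawn from. It cannot say which of the unsampled artifacts are the bad ones, and that is the whole gap between an eval score and a checked document.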

With specifications, the selection happens upstream. The most interesting shift in engineering practice right now is the argument, made well in Aviator’s analysis of the review bottleneck, that the only verification with an external reference point is verification against a spec a human has approved. Tests check behavior the test author imagined. A reviewer checks a diff against their reconstruction of what it was supposed to do. A spec is the one artifact in the pipeline where intent lives outside the generation loop, and verifying against it means a human did the hard thinking once, up front, so the checking inherits that thinking.

Formal methods verify guarantees instead of confidence. Leonardo de Moura, who built the Lean theorem prover, draws the line cleanly: testing provides confidence, proof provides a guarantee, and it is genuinely hard to quantify how much confidence testing actually buys you. Proof is expensive and narrow; you reserve it for the properties where confidence isn’t good enough. Which is, again, a selection decision.

Every tool that works at scale works by shrinking the verification surface, and every shrinking decision is a human judgment about what matters. The tools carry out triage decisions. They cannot make them.

When the verifier needs a verifier

The obvious move, once human review can’t keep pace, is to have AI verify AI. It is also the move with the nastiest failure mode, and the failures are no longer hypothetical.

SRLabs’ analysis of AI-generated security findings describes what happens when an organization layers AI security tooling on top of AI-generated code and treats the findings as decisions: the system becomes self-referential, producing artifacts faster than anyone can anchor them in an actual threat model. Teams under that load optimize for closing items rather than reducing risk, because closed items are measurable and risk reduction isn’t. The curl project reached the logical endpoint in January, ending its bug bounty after years of fielding AI-generated vulnerability reports that looked plausible and dissolved under scrutiny. The bounty had become a denial-of-service attack on maintainer attention.

Academia hit the same wall from a different direction. An audit of NeurIPS 2025 submissions documented a hundred fabricated citations that sailed through peer review, including placeholder hallucinations literally citing “Firstname Lastname,” because reviewers check methodology and novelty, and nobody’s job has ever been confirming that cited papers exist. The gap was always there. AI made it economical to exploit at volume. A related analysis of peer review itself puts the conclusion bluntly: when the rate of claims rises exponentially against fixed human bandwidth, collapse is a mathematical inevitability, and abstaining from AI assistance doesn’t preserve the system’s integrity; it just guarantees the system drowns.

And the judge models themselves are attack surface. Security researchers have shown judges that ignore their instructions when fed adversarial output, repeating attack strings instead of evaluating them. The fix proposed in that research is a guardrail on the judge — which should give you pause. We are now building verifiers for the verifiers, and there is no level of that tower where the regress stops on its own. It stops where a human owns an assessment and signs their name to it. Nowhere else.

Verification as deterrence

Earlier this year, in a long conversation about verification architecture, I landed on a framing I haven’t been able to shake: at scale, verification stops behaving like a truth-recovery problem and starts behaving like a deterrence problem. Closer to mutually assured destruction than to forensic investigation.

Truth recovery assumes you can, in principle, examine an artifact and determine whether it is correct. That assumption holds at human scale. It fails at machine scale, for the cost-curve reasons above, and the failure is permanent because the gap compounds. What remains achievable is a different posture entirely: making bad artifacts expensive to produce, cheap to contain, and traceable to a producer who bears the cost.

Look at what actually works in the systems under the most pressure and you find deterrence mechanics where you would expect inspection. Curl didn’t get better at detecting slop reports; it changed the economics of submitting them. The proposed fix for fabricated citations is mandatory existence checks rather than smarter reviewers, shifting the verification burden onto the claim, at the point of submission, where it costs the producer instead of the reviewer. Spec-driven development works because the spec makes intent auditable, which makes deviation attributable. And provenance tracking (who generated this, from what, under what instructions) won’t establish that an artifact is correct, but it establishes who answers for it if it isn’t, and that knowledge changes behavior upstream of any review.
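An existence check at the point of claim can be sketched in a few lines. Everything named here is hypothetical: the in-memory set stands in for a real lookup (a DOI resolver, a case-law database), and the identifiers are placeholders. What the sketch shows is where the cost lands, which is on the named producer, before any reviewer spends attention.

```python
from dataclasses import dataclass

# Hypothetical stand-in for a real lookup service; the identifiers are placeholders.
KNOWN_SOURCES = frozenset({"doi:10.1000/example.1", "doi:10.1000/example.2"})


@dataclass(frozen=True)
class Submission:
    producer: str                # the person who answers for this artifact
    citations: tuple[str, ...]


class UnverifiableCitation(Exception):
    """Raised when a cited source cannot be found at submission time."""


def accept(submission: Submission, registry: frozenset[str] = KNOWN_SOURCES) -> Submission:
    """Admit a submission only if every cited source exists in the registry."""
    missing = [c for c in submission.citations if c not in registry]
    if missing:
        raise UnverifiableCitation(
            f"{submission.producer} cited sources that could not be found: {missing}"
        )
    return submission


if __name__ == "__main__":
    accept(Submission("A. Author", ("doi:10.1000/example.1",)))
    try:
        accept(Submission("B. Author", ("doi:10.1000/does-not-exist",)))
    except UnverifiableCitation as err:
        print(err)
```

The gate does not establish that a citation supports the claim it is attached to. It only makes the cheapest kind of fabrication fail early, with a name on it.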

The legal profession produced the cleanest illustration on record just this week. On June 8, a federal judge in Mississippi sanctioned every lawyer of record in a contract dispute after filings from both sides turned out to contain fabricated citations. The two out-of-state lead attorneys admitted using AI without verifying the output; local counsel had signed the briefs without reviewing them. The court revoked both out-of-state lawyers’ temporary admissions, barred them from appearing before the Northern District of Mississippi for two years, fined all four attorneys, cancelled the trial, and referred everyone to their state bars. The instructive part, read as a verification story, is the remedy the court chose: no call for better detection tooling, just consequence attached to the signature. The ruling’s principle, that responsibility “remains the sacred duty of the lawyer who signs the page,” is deterrence doctrine written by a judge who understood the problem more clearly than most AI strategy decks do.

The pipeline I sketched afterward runs adversarial passes from different model families against a primary output: fact checks, logic checks, an audit of unstated assumptions. The design process taught me something I didn’t expect. Certifying outputs as true was never on the table; no architecture I could draw gets you there. What a design like this can do is make certain classes of failure reliably expensive to get past, which changes what is worth attempting in the first place. Deterrence, functioning exactly as deterrence does.

This reframe matters because it redirects investment. An organization pursuing truth recovery buys review capacity and falls further behind every quarter, while one pursuing deterrence builds chokepoints, assigns ownership, and engineers consequence — and the second approach scales, because consequence does the verifying for you, continuously, at every point of production you can’t see.

The part no tool covers

All of which leaves the question every playbook quietly assumes is already answered: who decides what deserves verification in the first place?

Every mechanism in this piece runs on a prior human judgment. Which boundaries the guardrails enforce, what the rubric rewards, where formal proof is worth its cost, and what gets sampled, gated, or waved through. The tooling industry talks about these as configuration details. They are the entire game. A perfectly engineered verification stack pointed at the wrong things is expensive theater, and the security teams optimizing for closed findings while risk accumulates elsewhere are running exactly that theater right now.

The judgment involved is the capability I keep circling in this newsletter: signal discrimination. Knowing, when you face more artifacts than you can examine, which ones matter: by consequence, by blast radius, by the asymmetry between what a wrong artifact costs and what a right one earns. In my world, an AI-drafted internal summary and an AI-drafted court filing might come from the same tool on the same afternoon, and they belong in entirely different verification regimes. Nothing in the tooling knows that. A person has to, and that person has to be willing to own the call, which in most organizations is the genuinely scarce resource. The technology of verification is improving fast. The willingness to sign one’s name to “this is checked enough” is not, because the signature carries all the downside and none of the credit.

So the real playbook, stripped of vendor language, is short. Decide what failure costs, per artifact class, before deciding what verification it gets. Push the burden of proof onto the producer wherever you can: specs, provenance, existence checks at the point of claim. Use machines to sample distributions and patrol boundaries; reserve humans for the artifacts where consequence is asymmetric. And put a name on every assessment, because an unowned verification is a rumor with a checkmark on it.
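Written down as a policy, the playbook above might look something like the sketch below. The regime names, the cost units, and the two thresholds are hypothetical; setting them is precisely the judgment the tools cannot supply. The code only shows that the decision can be made once, in advance, per artifact class, and that an assessment without an owner can be made impossible to record.

```python
from dataclasses import dataclass
from enum import Enum


class Regime(Enum):
    MACHINE_SAMPLING = "sample the distribution and patrol boundaries"
    HUMAN_SPOT_CHECK = "human spot check"
    FULL_HUMAN_REVIEW = "full human review"


@dataclass(frozen=True)
class ArtifactClass:
    name: str
    cost_if_wrong: float      # hypothetical units, set by a person in advance
    benefit_if_right: float   # same units


@dataclass(frozen=True)
class Assessment:
    artifact_class: str
    regime: Regime
    owner: str                # the name attached to "this is checked enough"


def assign(artifact: ArtifactClass, owner: str) -> Assessment:
    """Pick a verification regime from the cost asymmetry; refuse unowned calls."""
    if not owner.strip():
        raise ValueError("every assessment needs a named owner")
    if artifact.benefit_if_right <= 0:
        raise ValueError("benefit_if_right must be positive")
    asymmetry = artifact.cost_if_wrong / artifact.benefit_if_right
    if asymmetry >= 10:       # assumption: illustrative threshold
        regime = Regime.FULL_HUMAN_REVIEW
    elif asymmetry >= 2:      # assumption: illustrative threshold
        regime = Regime.HUMAN_SPOT_CHECK
    else:
        regime = Regime.MACHINE_SAMPLING
    return Assessment(artifact.name, regime, owner)


if __name__ == "__main__":
    print(assign(ArtifactClass("internal summary", 1.0, 1.0), "A. Reviewer"))
    print(assign(ArtifactClass("court filing", 100.0, 5.0), "B. Partner"))
```

The two example classes mirror the internal summary and the court filing from earlier: same tool, same afternoon, different regimes, because someone decided beforehand what each kind of failure costs.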

None of that is a technology roadmap. It is judgment, exercised in advance, encoded into a system — which is to say it is work most organizations have not done, and the tools they’re buying cannot do it for them. The verification crisis everyone can now measure looks less like a case for better checking machines than like a bill arriving for a question that got skipped: what, in all this output, actually matters?

AI doesn’t make you better. It exposes whether you were already doing the work.

If this kind of analysis is useful, subscribe — one piece like this every week, written from inside an institution doing the work in real time.

Andrew Lewis was Here
