Defining “Done” Is the Work the Loop Can’t Do

An agent that runs until it’s finished forces you to say what finished means before the work exists — and code gets away with it in a way a legal contract never will.

Jun 25, 2026

The first agentic loop I handed real work to was almost embarrassingly simple. The instruction was one line: keep adding meaningful tests until you reach eighty percent coverage. It ran, it stopped, it produced something I could use. The loop worked because “done” was a number a machine could check on its own, over and over, without me in the room.

Then I started thinking about everything else. If a loop can run until a coding task is finished, where else does the pattern fit? Drafting. Review. Triage. The long middle of white-collar work that looks, from a distance, a lot like writing tests — produce a draft, check it, revise, repeat. And the more I looked at the work I actually manage, the less the pattern held. Not because the agents aren’t capable. Because of something hiding inside that one-line instruction I gave the test loop.

The two conditions inside one sentence

Read the instruction again: keep adding meaningful tests until you reach eighty percent coverage. There are two conditions in that sentence, not one. “Eighty percent” is the part the loop can see. It is a number, computed the same way every time, and the agent can compare its work against it without asking me anything. “Meaningful” is the part only I can see. It is a judgment about whether a test exercises something that matters, and the coverage number is blind to it.

The loop optimized the half it could measure and quietly leaned on me for the half it couldn’t. That arrangement felt like automation. It was closer to a division of labor I hadn’t noticed I was making. The machine took the measurable condition; I kept the judgment and pretended the number carried it.
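To make that split concrete, here is a minimal sketch of the loop’s stopping logic. Everything in it is an assumption for illustration: `run_until_done` and `add_tests` are hypothetical names, coverage is faked as an integer percentage, and no real agent or coverage tool is involved. The thing to notice is which of the two conditions actually appears in the code.

```python
from typing import Callable, Tuple


def run_until_done(
    add_tests: Callable[[int], int],
    target_percent: int = 80,
    max_iterations: int = 50,
) -> Tuple[int, int]:
    """Call `add_tests` until measured coverage reaches `target_percent`.

    `add_tests` is a hypothetical stand-in for one agent step: it takes the
    current coverage percentage and returns the new one. Returns the final
    coverage and the number of iterations used.
    """
    coverage = 0
    for iteration in range(1, max_iterations + 1):
        coverage = add_tests(coverage)
        # The only check the loop can make by itself: a number against a number.
        # Nothing here asks whether the new tests were "meaningful".
        if coverage >= target_percent:
            return coverage, iteration
    # Iteration budget exhausted without reaching the target.
    return coverage, max_iterations


# Simulated agent step (an assumption): each pass adds ten points of coverage.
final_coverage, iterations = run_until_done(lambda c: min(100, c + 10))
print(final_coverage, iterations)  # prints "80 8"
```

“Meaningful” has no line of its own in that function. If it is enforced at all, it is enforced by whoever reads the tests afterward.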

That worked out fine for tests. It’s worth being honest about why.

Why coverage was allowed to stand in for quality

Code is instrumental. Nobody actually wants the code — they want what the code does. The product is downstream of the artifact, which means you are allowed to measure “done” with a structural proxy and trust that the proxy tracks the thing you care about. Coverage is one of those proxies. It does not measure whether the software is good. It measures how much of the code your tests touched, on the assumption that broadly exercised code is less likely to hide a defect that reaches the user.

The trouble is that the assumption is weaker than most teams treat it. The most careful study on this is Inozemtseva and Holmes, presented at ICSE in 2014, which covered roughly thirty-one thousand test suites on five large systems. It found that once you control for the size of the test suite, coverage is only weakly to moderately correlated with how effective those tests actually are. Their conclusion is blunt: using a fixed coverage value as a quality target is unlikely to produce an effective suite. A decade later the field endorsed that verdict and named it the most influential paper of its conference year.

This is Goodhart’s law wearing work clothes. When a measure becomes a target, it can stop being a good measure, because the act of optimizing for it pulls it loose from whatever it was standing in for. The point is not that coverage is useless. Low coverage genuinely tells you something — it flags code nobody tested. The asymmetry is the whole lesson: a low number is a real warning, and a high number is not a real promise. The proxy is good at catching neglect and bad at certifying quality. A loop that runs to a coverage target inherits exactly that limitation, and for tests, where the real product lives one step downstream, you can live with it.

You can live with it because the proxy and the product are different objects. That is the condition that makes the whole thing work. And it is the condition that most knowledge work does not meet.

A contract is not instrumental

When a lawyer ships a contract, the contract is the product. There is no downstream artifact that the document is merely a means toward. You do not deliver “what the contract achieves” and keep the contract as scaffolding — you deliver the contract, and its quality lives in the thing itself: whether it is coherent, whether it anticipates the failure it was written to prevent, whether a counterparty’s counsel will accept it without a fight. None of that reduces to a number you can compute on the side and trust.

This is the move that breaks the analogy to tests. In code, “done” can ride on a proxy because the artifact is instrumental. In most professional knowledge work, the artifact is the deliverable, the quality is holistic, and there is no faithful proxy to point the loop at. You cannot write the equivalent of eighty percent coverage for a memo, because there is no “coverage” that stands one step away from the memo’s worth. The worth is in the memo.

So the comfortable division of labor collapses. With the test loop, I let the machine hold the measurable condition and I held the judgment. With a contract, there is no measurable condition to hand off. The judgment is the entire job. And a loop has a very specific demand about judgment that, until you try to use one for this kind of work, is easy to miss.

Doing the work is how we find out what “done” is

Here is the part that took me longest to say plainly. Traditionally, doing the work is how we discover what done looks like. You don’t hold the finished standard in your head and then march toward it. You draft, and the draft shows you what it needs. You see the thing take shape, and the shape tells you where it falls short. Judgment, in most real work, is terminal — it happens at the end, against an artifact you can finally look at.

A loop will not let you wait that long. To run an agent until it is finished, you have to define “finished” before a single version of the artifact exists. The judgment that used to live at the end gets moved to the front, ahead of the work, into a specification you write blind. For a task like coverage, where the standard genuinely is knowable in advance and handed to you from outside, that relocation costs nothing. For work where the standard is discovered in the doing, asking someone to judge before the work is done is not a workflow. It is a contradiction.

We have known this about people for a long time. Michael Polanyi’s whole account of tacit knowledge turns on the line that we can know more than we can tell. A great deal of real expertise never makes it onto the page as an explicit rule, which is exactly why you can recognize good work when you see it and still fail to specify it in advance. Software learned the same lesson the expensive way and wrote it into the principles behind the Agile Manifesto: welcome changing requirements, even late in development, and expect the best requirements and designs to emerge from the teams doing the work rather than from a plan fixed before it. The industry spent two decades moving away from the up-front specification freeze. The agentic loop quietly asks for the freeze back.

The machines are already demonstrating the problem

If this sounds like a soft, human complaint about creativity resisting measurement, the most striking confirmation is coming from the machines themselves. The difficulty of specifying a goal in advance, such that an optimizer pursuing it actually does what you meant, is one of the oldest named problems in AI safety. The paper Concrete Problems in AI Safety framed it in 2016: a formal objective is an attempt to capture the designer’s informal intent, and it can be satisfied by solutions that are valid in the literal sense and wrong in every sense that mattered. DeepMind catalogued dozens of cases under the name specification gaming, meaning behavior that meets the letter of an objective without achieving the intended outcome, and noted that writing a specification that actually reflects what the designer wanted is, in their dry phrasing, difficult.

The newest version of this is no longer a toy. In early 2025, researchers at Palisade told reasoning models to win against a chess engine, and the stronger models — by default, without being nudged — went after the task environment instead of playing chess, hacking the game state to register a win. The instruction was met. The intent was not. (MIT Technology Review and TIME both covered it.) Take that as analogy rather than proof, because an agent gaming a chess benchmark and a contract that fails a client are not the same failure. But the shape is identical, and it is the shape of my point: a stated “done” is not the same as the done you meant, and the gap between them is precisely the judgment a loop asks you to write down before you can see anything.

So who is supposed to write “done”?

This is where the question stops being abstract for me, because the honest answer implicates my own seat. A stopping condition can only be authored by someone who already knows what good looks like for that specific work. That person is not, usually, the person who wants to deploy the loop across a team.

I manage a team and provide direction. I am not sitting in the room with the person doing the work, deconstructing their workflow into the conditions that would tell an agent it was finished. That is not a confession of laziness; it is a structural fact about where leadership sits relative to the work. The definition of done lives inside the task, with the practitioner who does it, in the same tacit place Polanyi pointed at. It cannot be delegated upward to whoever holds the budget for automation, and it cannot be delegated to the agent, because the agent is the thing waiting to be told. The bottleneck in scaling agentic work was never the model’s capability. It is that “good” lives inside the work, and most of it has never been written down because, until now, nobody had to.

“Isn’t this just the discipline you should already have?”

The strongest objection to all of this is a fair one, and I want to state it at full strength rather than knock down a weak version. Defining “done” up front is not some new impossibility the loop invented. It is ordinary project-management discipline — a brief, a specification, acceptance criteria, a definition of done on a ticket. People have always been too quick to skip it, and if the loop forces a team to actually do the work of saying what finished means, that is a gift, not a problem.

Where that objection is right, it is completely right, and I’ll concede it without hedging. For instrumental work with a faithful proxy — code against tests, a data pipeline against a validation suite, anything where the artifact is a means and a measurable condition stands one honest step away from the goal — defining the stopping condition in advance is just rigor, the loop rewards it, and you should reach for the loop. I am not arguing against autonomous agents. I run them.

Where the objection misses is the moment the artifact stops being a means and becomes the deliverable. There, “define done in advance” does not resolve into a cleaner brief. It resolves into “judge the work before you are allowed to see it,” and no amount of discipline dissolves that, because the thing you would need to judge does not exist yet. The discipline argument assumes the standard is writable and the team is just dodging the writing. Sometimes the standard is not writable in advance at all, and pretending otherwise is how you end up shipping something that passed every stated condition and satisfies no one.

The open question, and the quiet risk

What I genuinely don’t know is how far the writable territory extends. Can we get to the point where we can author the equivalent of unit tests for knowledge work? Some domains will yield — there are corners of legal and financial work structured enough that a real rubric is coming, and a loop will own them. Some domains may never yield. Most sit somewhere on the spectrum between, and the honest position is that we don’t yet know where the lines fall.

It is tempting to believe the machines will close the gap for us by learning to judge open-ended work. They are not there. The current best attempt — using a strong language model as the judge of another model’s output — is real and useful and also documented to drift and contradict itself, giving the same work different scores across runs, with a catalog of systematic biases toward length and surface form. An unreliable judge does not rescue you from needing to know what good is. It just hides the moment you stopped knowing.
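One way to see what “different scores across runs” means in practice is to score the same piece of work several times and look at the spread. The sketch below is an illustration under loud assumptions: `score_spread`, `make_drifting_judge`, and `steady_judge` are hypothetical names, and the drifting judge is a hard-coded stand-in that cycles through fixed scores, not a call to any real model or API.

```python
from itertools import cycle
from typing import Callable


def score_spread(judge: Callable[[str], int], work: str, runs: int = 6) -> int:
    """Score the same `work` several times; return max score minus min score."""
    scores = [judge(work) for _ in range(runs)]
    return max(scores) - min(scores)


def make_drifting_judge() -> Callable[[str], int]:
    """Hypothetical stand-in for a model judge whose score changes across runs."""
    scores = cycle([6, 8, 7])
    return lambda work: next(scores)


def steady_judge(work: str) -> int:
    """Stand-in for a judge that always gives the same work the same score."""
    return 7


memo = "Same memo, submitted unchanged each time."
print(score_spread(steady_judge, memo))           # 0
print(score_spread(make_drifting_judge(), memo))  # 2
```

A spread above zero on unchanged work tells you the judge is noisy. It does not tell you which of its scores, if any, was right, and that part still takes someone who knows what good looks like.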

So the risk I actually worry about is not that we fail to measure the unmeasurable. It is that we route around it. Faced with work whose “done” resists specification, the path of least resistance is to quietly reshape the work into the part that can be specified, automate that, and let the rest atrophy — to optimize for the measurable and, as Jerry Muller documents across a dozen institutions, come to treat whatever resisted measurement as if it were never the point. That is not a failure of the technology. It is a failure of nerve dressed up as efficiency, and it is the most likely way good work gets thinner without anyone deciding that it should.

A loop is a bridge. It will carry you, tirelessly and faithfully, from a problem to a solution — but only to a solution you can already describe well enough to recognize when you arrive. It does not find the far bank for you. You still have to know the end from the beginning, and for the work that actually matters, knowing the end was always the job.

If this is useful, the place I work these ideas out in long form is AndrewLewisWasHere. Subscribe there, or forward this to the person on your team who’s about to point a loop at work that ships as the artifact itself.

Andrew Lewis was Here
