Vibe Coding Is Tool Use, and Tool Use Is Not Shameful

There is a sentence that gets said in code reviews, in lab meetings, in the replies under any post about AI-assisted programming. Someone points at a block of generated code — a needless re-implementation of something that already exists three files over, a function that is locally fine but globally structureless — and says: that is the technical debt, that is why you still need a human.

I want to take that sentence seriously, because I think it is doing something more specific than it appears to, and the specific thing it is doing is wrong. Not wrong because the observation is wrong; the observation is usually correct. The generated code often does carry debt. Wrong because of the move that comes after the observation: the step from “this tool did this task poorly” to “a human is, in the relevant sense, more reliable.” That step is a category error. It takes a measurement along one narrow axis and reports it as a verdict about rank.

This essay is about that error. It is not a defense of AI, and it is not a complaint about the people who distrust it. I use AI to write code daily. I also think large language models are close to useless at the thing I care about most in my own work: inventing a genuinely new architecture. Both of those sentences are true at once, and holding both is the whole point. The error I want to describe is what happens when someone holds only one of them and mistakes it for the complete picture.

1. What the sentence actually claims

Strip the code-review sentence down to its logical form and it has two parts. The first is an observation: the AI produced low-quality output on this task. The second is a conclusion: therefore the human collaborator occupies a higher position. The word “position” is doing quiet work. The observation was about a task. The conclusion is about a ranking — a standing that holds across tasks, the kind of standing that licenses the word “still” in “you still need a human.”

A task-level observation and a rank-level conclusion are different objects. The observation says: on the axis of not accumulating technical debt in a moderately large codebase, this tool scored low. That is a real measurement and I do not dispute it. But a score on one axis is not a position in an ordering. To get from the first to the second you need an unstated premise, something like “the axis I just measured is the axis that determines overall standing.” That premise is almost never examined, and when you do examine it, it does not survive.

This is, structurally, the same mistake I have written about before in a completely different setting. A journal’s EDICS list — its published taxonomy of topics — is a genuinely useful instrument for one job: coarse routing. It becomes a problem only when it is read as something it is not, an enforceable definition of the journal’s actual scope. The list is fine. The misreading is the failure. Here the structure repeats. “An LLM generates technical debt” is a fine observation. The misreading is the failure: treating a measurement of tool execution as evidence of a standing between a tool and a species.

And technical debt is only the version of the sentence I hear most often. There are others, and they should be put on the table honestly before the argument proceeds, because an essay that defends LLM-assisted coding against a single weak criticism is not worth writing. The next section takes the criticisms as a set.

2. The criticisms, taken as a set

The case against LLM-assisted coding is not one observation. It is roughly four, and they are worth naming precisely, because once they are named it becomes clear that three of them are the same instrument pointed three times, and the fourth is something else entirely, and worse for the people who reach for it.

The first is technical debt: the generated code is locally fine and globally structureless, re-implements what already exists, and accumulates the kind of disorder that a maintainer pays for later. The second is hallucination, in its sharpest current form. Roughly a fifth of generated code samples reference packages that do not exist, and attackers have learned to pre-register those hallucinated names (“slopsquatting”) so that a developer who installs the suggested dependency without checking pulls in malicious code. The third is the verification cost: a large fraction of AI-generated changes still need manual debugging after passing review and staging, and reviewers describe being buried under the volume of plausible-looking output they now have to scrutinize. The fourth is skill atrophy: the developer who leans on the tool reportedly gets worse, stops reading documentation, stops reading stack traces, and becomes, in one widely quoted phrase, a human clipboard shuttling errors out and fixes back in.

A fifth criticism, security — that generated code carries elevated rates of injection flaws and mishandled secrets — I am going to set aside rather than answer, and I want to be explicit about why. That criticism is real, but its range is production software facing an adversary. The work this essay is about is research code: experiment scripts, model implementations, data pipelines. That code has its own correctness burden, and it is heavy, but it is not the burden of withstanding an attacker. Answering the security criticism here would be answering a charge that was never aimed at the room I am standing in. I would rather say so than pretend a rebuttal.

That leaves four. Look at the first three together — debt, hallucinated dependencies, verification cost. Each is a measurement of one thing: how well the tool executes a generation task. Each is, in the vocabulary I am most used to, a single signal.

A single signal cannot support a global judgment. This is the most basic constraint in any scoring system, and anyone who has built one has run into it directly. When I designed a KV-cache eviction method, the central decision was that a single signal — attention weight, or recency, or one learned importance score — was not enough to decide which entries to keep. A formula that captures only one dimension fails on the entries that survive for the other reason. A “needle” in a haystack is salient but structurally isolated; a backbone concept is structurally central but locally unremarkable. Score on one axis and you correctly rank one of those two and badly misrank the other. The fix was not a better single signal. The fix was to stop pretending one axis was the whole space.
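A toy sketch can make the single-signal failure concrete. It is not the eviction method described above: the entry names, the two score fields, their values, and the max-based combination rule are all invented for illustration. It shows only that ranking on either axis alone evicts one of the two entries that matter, while a score that looks at both axes keeps both.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    name: str
    salience: float    # hypothetical: how sharply attention singles this entry out
    centrality: float  # hypothetical: how much other content depends on it


# Invented values: a "needle" (salient, isolated), a "backbone" concept
# (central, locally unremarkable), and two filler entries.
ENTRIES = [
    Entry("needle", salience=0.95, centrality=0.05),
    Entry("backbone", salience=0.10, centrality=0.90),
    Entry("filler_a", salience=0.30, centrality=0.20),
    Entry("filler_b", salience=0.25, centrality=0.30),
]


def keep_top(entries, score, budget):
    """Return the names of the `budget` highest-scoring entries."""
    ranked = sorted(entries, key=score, reverse=True)
    return {e.name for e in ranked[:budget]}


# One axis at a time: each ranking keeps one important entry and evicts the other.
by_salience = keep_top(ENTRIES, lambda e: e.salience, budget=2)
by_centrality = keep_top(ENTRIES, lambda e: e.centrality, budget=2)

# An illustrative two-axis rule: keep an entry if it is strong on either axis.
by_either = keep_top(ENTRIES, lambda e: max(e.salience, e.centrality), budget=2)

print(sorted(by_salience))    # ['filler_a', 'needle']
print(sorted(by_centrality))  # ['backbone', 'filler_b']
print(sorted(by_either))      # ['backbone', 'needle']
```

The max rule is only the simplest combination that makes the point; any real scoring function would need to be chosen and validated against the workload it serves.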

Stacking three single signals does not repair this. Debt, slopsquatting, and verification cost are three readings along what is essentially the same axis — quality of executed output — and three readings of one axis still describe one axis. They do not sum into the multi-dimensional thing a word like “superiority” requires. Used to rank an LLM against a human across the whole space of what either can do, they fail in exactly the way a single-signal cache score fails: approximately right about the narrow region they measured, silently and confidently wrong about everything that survives for a different reason.

The technical-debt criticism has a further problem, and it is the kind that turns the criticism around in the hand of the person holding it. An LLM does not generate structureless, redundant, debt-laden code by some defect native to the model. It generates that code because it was trained on a corpus of it. Security researchers make the parallel point about insecure code directly — models reproduce the vulnerable patterns present in their training data — and debt is the same story without the adversary. The disorder in the generated output is a statistical echo of the disorder in the human-written code the model learned from. So when someone points at a debt-laden generation and reads it as evidence that human practice is the more reliable party, they are pointing at a mirror. The debt did not originate with the model. It originated with the corpus, and we wrote the corpus. The criticism, followed honestly back to its source, is not evidence that human coding practice is debt-free. It is evidence that human coding practice has enough debt in it to teach.

I want to keep the size of this claim controlled. A training corpus is some subset of human-written code, not a verdict on every programmer, and the inference “the corpus carries debt, therefore humans carry debt” would overreach if pushed as a statement about individuals. The narrow version is enough, and the narrow version is all I need: the debt in LLM output is inherited, not invented, which means debt is a property shared across both sides of the comparison. A shared property cannot rank the two parties against each other. It can only show that the axis was never a dividing line to begin with.

None of this means the debt is not real. It is real, and I do not want the argument to slide past that. Generated code does carry structural disorder, and pretending otherwise would be its own kind of dishonesty. The point is about how much the disorder costs in the specific setting this essay is about. Research code is not a product with a ten-year maintenance horizon and twenty future maintainers. It is, mostly, code that runs to produce a result and is then archived. Its primary obligation is that the result be correct — that the disorder does not propagate into a misleading conclusion. As long as that boundary holds, structural debt in a research script is a real defect with a bounded consequence: it makes the code less pleasant to reuse, and it does not corrupt the science. That is a defect worth noting and not a defect worth organizing one’s tool choices around.

There is also a quieter irony in the hallucination criticism specifically, and it is worth pausing on, because it cuts toward the symmetry the rest of this essay depends on. Slopsquatting does not work because the LLM hallucinates a package name. It works because a human then installs that package without checking. The security analyses of the attack say this almost in passing: the mechanism is psychological, the developer trusts a suggestion because it sounds professional and solves the immediate problem. The hallucination is one half of the failure. An unverifying human is the other half, and the attack is inert without it. A criticism offered as evidence that the machine is the unreliable party turns out, on inspection, to require human unreliability to do its damage.

And here I want to be just as direct about where the burden actually falls. Hallucination is a real property of these models — they will invent a package, an API, a citation, and do it fluently. The correct response to a real defect you have been told about is not to hope it will not appear this time. It is to check. Once you know an LLM can hallucinate a dependency, installing one it suggested without verifying that the package exists and is what it claims to be is not the model failing you. It is you declining to do the one thing the known defect obviously requires. The hallucination is the model’s; the decision to act on unverified output is the human’s. Treating a known, announced failure mode as if it were a surprise is not a property of the tool. It is a lapse of the person using it.
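One way to make "check" a habit rather than an intention is to refuse to install anything a generated snippet imports until a human has vetted the name. The following is a minimal sketch under stated assumptions: the helper names, the sample snippet, the package name `fastjsonx`, and the allowlist are all hypothetical. It uses only the standard-library `ast` module to list the top-level modules a piece of source imports and reports the ones not on the allowlist.

```python
import ast


def imported_top_level_modules(source: str) -> set:
    """Collect the top-level module names imported by `source`."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                found.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # level > 0 is a relative import (from . import x): not a dependency.
            if node.level == 0 and node.module:
                found.add(node.module.split(".")[0])
    return found


def unvetted_imports(source: str, vetted) -> set:
    """Return imported module names that are not on the vetted allowlist."""
    return imported_top_level_modules(source) - set(vetted)


# Hypothetical generated snippet; "fastjsonx" is a made-up name standing in
# for a dependency the model may have hallucinated.
GENERATED = """
import numpy as np
import os.path
from fastjsonx.parse import loads
from . import helpers
"""

# Hypothetical allowlist: names a human has already looked up and approved.
VETTED = {"numpy", "os"}

print(unvetted_imports(GENERATED, VETTED))  # {'fastjsonx'}
```

The sketch only flags names for a person to look up. It does not confirm that a package exists, that it is what it claims to be, or that the import name matches the name on the package index, which it often does not.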

The fourth criticism — skill atrophy — is not a single signal, and I am not going to treat it as one. It deserves its own answer, because it is not even a measurement of the tool. It is a measurement of a usage pattern, misattributed to the tool.

The research that established the effect also dismantled the misattribution. The skill loss does not track whether a developer used an LLM. It tracks how. Developers who delegated whole problems — handed the task over, accepted what came back, moved on — scored worst. Developers who used the model for conceptual inquiry, who asked it to explain rather than to deliver, who stayed in the loop of reading and deciding, scored as well as or better than the unaided baseline. The variable that predicts atrophy is not the presence of the tool. It is the choice to be passive in front of it.

That distinction relocates the responsibility, and it relocates it onto the user. If you design a system architecture by passively accepting whatever an LLM returns — not reviewing it, not refactoring it, not deciding the shape yourself — and your own capacity dulls as a result, the cause of that dulling is a decision you made, not a property of the model. Attributing it to the LLM is misattribution in the precise sense: assigning to the instrument an outcome that was produced by how it was held. It is the most naked instance of the category error in the whole set, because here the observation is not even about the tool. It is about the person, dressed up as a fact about the tool.

So: four criticisms, and not one of them survives as a rank-level claim. Three are single readings of the execution axis and cannot, even stacked, become a multi-dimensional verdict. The fourth is not about the LLM at all. The error in every case is the same — asking an observation to carry a judgment of a different kind than the observation can support.

Having said all of that, I want to be careful not to overcorrect into the opposite dishonesty. Granting every criticism its real weight — the debt, the hallucinated dependency, the verification cost, the genuine risk of letting one’s own skill dull — and pricing each one in fully, the picture that remains is still, for me, decisively in favor of working this way. The thing that survives the accounting is efficiency, and it has to be stated in the right unit to be stated honestly. The efficiency is not the LLM’s; it is not a score the model earns as a standalone performer. It is the efficiency of a human researcher under assistance. The measured quantity is my own rate of producing working implementation code, and that rate, with an LLM in the loop, is far higher than it is without one — even after I have paid for the debt, run the verification, and stayed deliberately in the loop to keep my own skills intact. The defects are real and they are mine to manage. The productivity gain is real too, and it is large enough that managing the defects is plainly worth it. That is the honest ledger, and it is the reason I work the way I do.

There is a second problem with the code-review sentence, and it is the one I find more interesting, because it implicates the people making the argument rather than the argument’s form.

3. The jump is also pointed the wrong way

Suppose you grant everything in the code-review sentence. The generated code does carry debt. Suppose further that you want to defend the proposition that humans bring something an LLM does not. I think that proposition is true. But the person reaching for “an LLM generates technical debt” to defend it has picked the wrong battlefield, and picked it in a way that gives away how little they have looked at their own side.

Here is what I can say from direct experience. As a tool for producing code inside an established pattern, a language model is efficient in a way I genuinely value. As a collaborator on inventing a new structure — the part of my work where the question is not “implement this” but “what should the shape of this even be” — it is not merely weaker. It is, in my experience, comprehensively bad. And I can describe the failure precisely, because I ran the experiment without meaning to.

Some of my recent work involved designing a KV-cache eviction method by borrowing structure from a computational model of human memory I had built earlier. Before doing the design by hand, I asked a language model to draft a version — exactly the kind of delegation I am, in this essay, defending as reasonable. What came back was revealing. The model took my memory framework and ported it wholesale. It carried over the variables that model the defects of human memory — the distortion a memory accumulates, the fragility it enters when recalled — and it went one step further: it built in the idea that a cache entry should become unstable each time it is accessed.

Sit with that last part, because it is the whole point. A cache entry becoming unstable on access is a faithful translation of how human memory works — recall genuinely does destabilize a biological memory trace. But a KV-cache entry is exact digital storage. Reading it is just reading it; the vector does not deform, does not degrade, does not become fragile, no matter how many times attention touches it. “Access causes instability” has no referent in a KV cache. There is no phenomenon there for the mechanism to model. The model did not observe that property and decide to handle it — it could not have, because the property does not exist. It imported the mechanism because the source framework had it, and faithful porting is what the model does.

This is the failure, stated generally. The model’s weakness at architecture is not that it cannot carry a framework across. It carried mine across enthusiastically. The weakness is that it cannot decline to carry a piece across. It has no step where it stands in front of the new problem and asks, of each borrowed part, whether the thing that made this part necessary in the old domain even exists in the new one. That interrogation — deciding which parts of a borrowed framework are signal and which are defects of the original that the new domain does not share — is not interpolation, and it is not a high-quality average of a training distribution. It is the specific cognitive act that designing a new structure consists of, and it is precisely the act the model skips. When I built the method by hand, the central work was exactly that subtraction: recognizing that the memory framework’s distortion and fragility machinery had no counterpart in an exact cache and had to be left at the door. The published design is, in large part, a list of what was deliberately not imported — and a list of what to leave out is not a thing a wholesale porter can produce.

Notice what this does to the code-review sentence. The human advantage is real, but it is not located where the sentence points. The sentence points at not making technical errors. The actual advantage is at inventing structure. On the first axis the human margin is narrow and, if the last three years are any guide, narrowing. On the second axis the margin is enormous. Someone defending human contribution by citing technical debt has chosen the one axis where their case is weakest and has not noticed that the strong case was available the whole time. They are right that humans bring something. They have simply failed to say what.

4. Why a memory researcher does not romanticize human reliability

There is a version of the human-superiority argument that leans on reliability — humans as the steady reference, AI as the erratic newcomer. I want to address it from a particular angle, because my other line of work makes the angle unavoidable.

I study human memory, and that is the angle. The previous section described throwing the memory framework’s distortion and fragility machinery out of the cache design. It is worth being clear about what that machinery was for in the original framework: it was there because human memory genuinely has those properties. Memory is not exact storage. It distorts on recall. It becomes fragile when reactivated. It loses even intense memories without rehearsal, and it occasionally manufactures detail that was never encoded. The variables I discarded were not arbitrary — they were a fairly complete catalog of the ways biological memory fails.

Sit with that for a moment. In the one place where I got to design a memory system from scratch, the human memory system was the thing I treated as a catalog of failure modes to avoid. A system that distorts, fabricates, and decays under load is not the gold standard of reliability. It is the cautionary example. The intuition that human cognition is the steady reference point against which the machine looks erratic does not survive contact with what human cognition is actually made of.

I do not say this to invert the ranking — to claim the machine is the reliable one. That would be the same category error wearing the opposite jersey. I say it to remove a specific prop from under the human-superiority argument: the prop that treats human reliability as the fixed point. It is not a fixed point. It is a system with its own well-documented failure modes, and I have a paper trail proving I know them well enough to engineer around them.

And I have a paper trail of one more relevant kind. I once came close to publishing a confident mechanistic explanation — an account of why a stereo-matching model gave divergent numbers across GPU architectures — and the explanation was wrong. It was wrong in the most human-cognition way possible: it was a tempting story, internally coherent, dressed in plausible mechanism, and fully formed in my head before it had been adequately tested. I had it ready to go out. What stopped it was not some native human immunity to error. What stopped it was roughly thirty experiments I ran to falsify my own account, which did falsify it — and which then nearly led me to a second tempting story that a different prior project of mine had already refuted.

I publish that episode separately, as its own negative result, because the lesson in it is precise and it is the lesson this section needs. A confident, fluent, internally coherent error that passes every surface check is not a special pathology of language models. I generated one unaided. It was headed for a paper. The only thing between that error and print was verification — a process I had to choose to run and could just as easily have skipped. My reliability, in that instance, was not a property I had by virtue of being human. It was a property the verification produced. And verification is not something only a human output can be subjected to; an LLM’s output can be put through exactly the same thirty experiments. If “is capable of fluent, confident, wrong output” were the disqualifying property, it would disqualify me. It does not disqualify me, because the thing that actually does the work — checking — is available on both sides of the comparison and belongs to neither.

5. The honest picture is division of labor, not rank

What replaces the ranking is not a different ranking. It is a different shape of question.

The picture I am defending is division along dimensions. AI: efficiency in code generation, consistency, tirelessness, and strength on any task where the output does not have to be invented from nothing. Humans: the invention of new structure, the cross-domain transport of concepts, the judgment about whether a problem is even worth solving the way it is posed. These are different axes. Neither list is a position above the other. They are a description of what each party is for.

Vibe coding sits cleanly inside this picture, and that is the whole reason I am not embarrassed by it. Using a tool to do the thing the tool is good at is not a concession. The embarrassment that the code-review sentence is designed to produce depends entirely on the category error: it depends on you accepting that a tool’s execution defect on one axis is a referendum on your standing. It is not. The thing worth being embarrassed about is not using AI. It is failing to tell the two axes apart, and then handing out a superiority verdict on the basis of the wrong one.

I want to be honest about the strongest objection to this, in the spirit of not publishing a one-sided case. The objection is that the boundary I have drawn — humans win at inventing structure — is a snapshot, not a law. I am describing 2026. Three years ago, code generation was where humans were said to be permanently safe, and that margin narrowed faster than almost anyone predicted. Perhaps architectural invention is next, and my section 3 is just describing a line that is about to move.

I think the objection is partly right, and the part where it is right does not hurt the essay; it strengthens it. The location of the human advantage may well move. What does not move is the logical defect in the code-review sentence. Even if a model someday matches a human at architectural invention, the inference “this tool executed one task poorly, therefore a human ranks higher” will still be invalid, because a single-axis measurement still will not be a multi-axis verdict. My argument deliberately does not depend on where the boundary currently sits. It depends only on the claim that you cannot read a rank off one coordinate, and that claim is true regardless of which coordinate you pick or how the numbers change.
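The claim that a rank cannot be read off one coordinate can be stated with the standard notion of Pareto dominance: one party ranks above another only if it is at least as good on every axis and strictly better on at least one. The sketch below is illustrative only. The axis names and every number are invented placeholders, not measurements of any tool or person. It builds two scenarios that agree exactly on the debt axis and still disagree about whether any ranking exists.

```python
def dominates(a: dict, b: dict) -> bool:
    """True if `a` is at least as good as `b` on every axis and better on one."""
    if a.keys() != b.keys():
        raise ValueError("both parties must be scored on the same axes")
    return all(a[k] >= b[k] for k in a) and any(a[k] > b[k] for k in a)


# Scenario 1 (invented numbers): each party leads on a different axis.
tool_1 = {"debt_avoidance": 0.4, "generation_speed": 0.9, "structure_invention": 0.1}
human_1 = {"debt_avoidance": 0.6, "generation_speed": 0.3, "structure_invention": 0.9}

# Scenario 2 (invented numbers): identical debt scores, different other axes.
tool_2 = {"debt_avoidance": 0.4, "generation_speed": 0.2, "structure_invention": 0.1}
human_2 = {"debt_avoidance": 0.6, "generation_speed": 0.3, "structure_invention": 0.9}

# The debt axis reads the same in both scenarios...
same_debt_reading = (
    tool_1["debt_avoidance"] == tool_2["debt_avoidance"]
    and human_1["debt_avoidance"] == human_2["debt_avoidance"]
)

# ...but only scenario 2 supports a rank-level verdict.
print(same_debt_reading)           # True
print(dominates(human_1, tool_1))  # False
print(dominates(tool_1, human_1))  # False
print(dominates(human_2, tool_2))  # True
```

Since the same debt reading is compatible with both outcomes, the debt reading alone cannot settle which outcome holds.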

So I will end where the evidence actually leaves me. Humans bring something to this collaboration that is real, large, and — at least for now — not in serious dispute. It is the invention of structure, not the avoidance of error. A pride built on the second thing is built on a foundation that a single counterexample, or a single model release, can crack. A pride built on the first thing is, as far as I can currently see, built on rock. The people reaching for technical debt to defend human worth are defending something true with an argument that is false, and in doing so they obscure the better argument sitting right next to it. I use AI, without embarrassment, because I can tell which argument is which.