LLMs Are Cowboy Coding
Anthropic's harness engineering article highlights what's missing from most AI coding: scrutiny and evaluation.
Anthropic’s recent article on agent harness design highlights how much LLM-written code is the product of cowboy coding.
This brings to mind the classic critique of LLMs: their autoregressive nature means that once you start down a bad generation trajectory, it’s very hard to self-correct. That carries over directly to coding. An LLM can take a small, scoped task and get it functional, but functional isn’t the same as good. The code degrades as complexity accumulates, and none of this lends itself to a codebase that holds up over time.
A first pass at fixing this would be to ask the agent to evaluate its own work. But of course, agents are biased judges of their own output.

“When asked to evaluate work they’ve produced, agents tend to respond by confidently praising the work — even when, to a human observer, the quality is obviously mediocre”
Instead, Anthropic focuses on a pattern that has been around since the beginning of coding: plan, implement, get a third-party evaluation. Adversarially critique it until the polish is actually high. Then ship.
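The plan/implement/critique loop can be sketched in a few lines. This is a minimal illustration, not Anthropic’s actual harness: `implement` and `critique` are hypothetical stand-ins for calls to a coding agent and a separate reviewer agent.

```python
def implement(plan, feedback):
    """Stand-in for an agent writing code from a plan plus reviewer feedback."""
    return plan if not feedback else f"{plan} (revised: {'; '.join(feedback)})"

def critique(draft):
    """Stand-in for a separate reviewer agent; returns a list of issues."""
    return [] if "revised" in draft else ["tighten error handling"]

def ship(plan, max_rounds=3):
    """Iterate against the critic until it has no remaining issues, then ship."""
    draft = implement(plan, [])
    for _ in range(max_rounds):
        feedback = critique(draft)
        if not feedback:
            return draft          # critic approves: ship it
        draft = implement(plan, feedback)
    return draft                  # cap rounds so the loop always terminates
```

The key design choice is that the critic is a different component from the implementer, so the loop only exits when an outside party signs off rather than when the author declares itself done.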
This mimics how high quality codebases are built. One person takes a feature, puts up a PR, and it gets scrutiny. Higher scrutiny correlates with higher quality over time. It encodes that people care about the code and where it’s going. Look at quality open source projects that need to live for years. They generally have a high bar with real discussion on PRs.
The root issue is that LLMs are cowboy coding. Minimal review, minimal scrutiny, ship it as soon as it works. What’s missing is the staff engineer who comes in, picks apart the poor decisions and shortcuts, and brings knowledge of the whole codebase. That’s what Anthropic focused on building: the harness component that IS that staff engineer.
“Every component in a harness encodes an assumption about what the model can’t do on its own”
That gap exists today. What’s missing is scrutiny and evaluation. And it’s a hard problem. How do you encode taste? How do you encode craft? There are good bits in the article about how they apply this to frontend design, but there’s no universal review prompt that solves it.
Right now the pattern for most AI coding is: plan, implement, get it to work, ship. First draft, straight to production. But what if we actually spent time on evaluation? On iterating against real critique instead of accepting the first thing that passes? We’re all shipping first drafts right now. What we need is a better editor to sharpen them into something publishable.