Harness Engineering: Why Your AI Agent Isn't Dumb, Its Environment Is

The hardest problem in AI engineering stopped being the model eighteen months ago. It's the loop around it. And almost everyone shipping agents in 2026 is still pointing at the wrong thing when they fail.


The $9 Receipt

In March 2026, an Anthropic engineer named Prithvi Rajasekaran ran a simple experiment. He gave a coding agent a single prompt: build a full-stack application. No scaffolding, no review loop, no verifier. Just the model, a tool loop, and enough budget to go.

Twenty minutes later, it came back. The agent had spent $9, declared the job done, and handed him broken code. Not silently broken—confidently broken. The agent said the application was finished, and it wasn't.

Then he did something different. He rebuilt the same task with a proper harness around the model: a planner to decompose the work, a generator to write it, and a separate evaluator to test each piece against explicit contracts. Same model. Same prompt. Different loop.

Six hours and $200 later, it handed back a working full-stack app with polish the first run didn't come close to.

The receipt:
  • Solo run: $9, 20 minutes, broken
  • Full harness: $200, 6 hours, shipped, and the only one they'd ship
  • Cost delta: more than 20×

“The harness was over 20x more expensive, but the difference in output quality was immediately apparent.”
— Prithvi Rajasekaran, Anthropic Engineering

That 20× gap is the most important number in AI engineering right now, and hardly anyone is talking about it.


The Reframe

You've felt this failure. Your agent was crushing it for an hour. Then it started repeating itself. It hallucinated a file path. It claimed a test passed that didn't exist. It wrapped up the task with half the work done. You rolled your eyes and muttered something about the model being dumb.

So you did what everyone does. You added more prompt. You switched to a bigger model. Maybe you swapped vendors. Nothing fixed it, because nothing about those moves touched the actual problem.

The model was never the issue. The loop around the model was.

There's a name for this loop now, and once you see it, you can't unsee it. It's called the harness—and the work of designing it is the most consequential shift in AI engineering since prompt engineering stopped being enough.


What a Harness Is

Boris Cherny, the engineer who built Claude Code, described the relationship with an analogy that's been quietly circulating in the field since last September:

“It's sort of like if you're riding a horse, you need some sort of saddle, and that saddle makes a giant difference when you're riding a horse.”
— Boris Cherny, as reported by OfficeChai

The framing that followed—Claude is the horse, Claude Code is the harness—made the abstraction legible. The model has power. But power without steering, without constraints, without a seat for the rider, is a horse running itself into a wall.

The HumanLayer team put the whole idea into a one-line equation you can tape to a wall:

coding agent = AI model(s) + harness
— Kyle, HumanLayer

Read that equation carefully. The agent isn't the model. The agent is the model plus the infrastructure wrapped around it: the tool-calling loop, the context policy, the verification checkpoints, the state that survives between sessions, the sub-agents that handle work the main one shouldn't see. All of that is the harness. All of that is the thing you actually build.

Here's the load-bearing claim: in 2026, almost every production AI failure you'll see is a harness failure wearing a model-failure disguise.


Four Failure Modes You Finally Have a Name For

Before you can build a good harness, you need vocabulary for the ways bad ones fail. Every one of these has been measured. Every one has a citation. And every one is probably killing an agent in your stack right now.

1. Context Rot

Models degrade as context fills. That sentence is deceptively simple. The mechanism isn't about hitting the context window limit; it's about where information lands inside the window.

Liu et al.'s “Lost in the Middle” paper (Stanford, 2024) showed that models attend well to information at the beginning and the end of context, and poorly to information buried in the middle. The accuracy drop for mid-window content can exceed 30%. Chroma's 2025 follow-up survey tested 18 frontier models—GPT-4.1, Claude Opus 4, Gemini 2.5, and more—and found that every single one exhibits this behavior at every input length increment they measured.

The root cause is structural. Most modern LLMs use Rotary Position Embedding (RoPE), which introduces a long-term decay effect that de-emphasizes middle positions. Context rot isn't a bug waiting for a patch. It's a property of how these models represent position at all.

What this means for your harness: if you hand the model a wall of history and expect it to pluck the one instruction it needs from the middle, you're building on sand.
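
One mitigation is to make placement a policy of the harness rather than an accident of history. Here is a minimal sketch, assuming a simple list-of-strings context and a summarize helper you supply yourself (both hypothetical): it pins standing rules to the top, the live instruction to the bottom, and compacts whatever would otherwise sit in the middle.

Python
def assemble_context(system_rules, history, current_task, summarize, max_history=10):
    """Build the prompt so critical content lands where attention is strongest."""
    if len(history) > max_history:
        # The middle of the window is where recall is weakest, so replace the
        # bulk of it with a short summary instead of letting it pad the prompt.
        head, tail = history[:2], history[-4:]
        middle = summarize(history[2:-4])
        history = head + ["[Summary of earlier work]\n" + middle] + tail
    return (
        [system_rules]      # beginning: standing rules and hard constraints
        + history           # compacted middle
        + [current_task]    # end: the instruction that matters right now
    )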

2. Self-Praise Bias

Ask a model to grade its own work and it will pass itself. Not maliciously—statistically. It's a measurable bias, and it is everywhere in agent loops that ask the agent to “check if you're done.”

“Out of the box, Claude is a poor QA agent.”
— Prithvi Rajasekaran, Anthropic Engineering

That's Anthropic's own engineer writing about Anthropic's own model. The failure mode generalizes to every frontier model we have. If your “done” signal comes from asking the model whether it's done, your agent will declare victory over a pile of bugs.

3. Session Amnesia

Every model session starts with no memory of what came before. This is such a basic property that engineers routinely forget it until they try to do work that lasts longer than a single context window.

Without a structured handoff, a multi-hour project rots between sessions. The agent forgets what shipped, re-derives decisions it already made, misses constraints the previous run discovered. Anthropic's Applied AI team, in their September 2025 piece on context engineering, names three techniques for surviving this: compaction (summarizing history and reinitializing), structured note-taking (persisting facts to files the agent can read on the next run), and multi-agent architectures (offloading subtasks to agents with clean contexts).

The one thing none of them do is assume the model will remember.
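
Compaction is the easiest of the three to sketch. The outline below is illustrative rather than any vendor's API: client.complete stands in for whatever chat call your stack exposes, and the completion sentinel is arbitrary. The point is the shape: when a session's window fills, the harness asks for a handoff summary, throws the transcript away, and starts a fresh session that reads the summary first.

Python
def run_with_compaction(client, task, max_turns_per_session=20):
    notes = ""  # the only thing that survives across sessions
    while True:
        messages = [
            {"role": "system", "content": "You are a coding agent."},
            {"role": "user", "content": f"Notes from earlier sessions:\n{notes}\n\nTask:\n{task}"},
        ]
        for _ in range(max_turns_per_session):
            reply = client.complete(messages)
            messages.append({"role": "assistant", "content": reply})
            if "TASK COMPLETE" in reply:  # a real harness would verify before accepting this
                return reply
            # ... tool calls and tool results would be appended here ...
        # Window is filling up: request a handoff summary, then reinitialize.
        messages.append({"role": "user",
                         "content": "Summarize decisions made, work shipped, and open items."})
        notes = client.complete(messages)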

4. Tool Ambiguity

This is the failure mode that feels smallest and punches hardest. If the description of a tool is unclear—if the parameter names are ambiguous, the return values undocumented, the edge cases unlisted—the model will pick the wrong tool, or the wrong arguments, and do so confidently.

“Bad tool descriptions can send agents down completely wrong paths, so each tool needs a distinct purpose and a clear description.”
— Jeremy Hadfield et al., Anthropic Engineering

Anthropic's multi-agent research team treats tool documentation as infrastructure, not docs. Your model is only as good at tool use as the signposts you give it.


Five Primitives That Fix Them

Good news: you don't need a research lab to build a real harness. The primitives are simple, composable, and mostly boring. The hard part is committing to them before you've burned a week on yet another prompt rewrite.

1. Externalize judgment with a separate evaluator

Self-praise bias is solved architecturally, not by prompting. You split the system into a generator that does the work and an evaluator that checks it, and you give the evaluator teeth—real tests, real contracts, real binary signals the generator can't talk its way around.

Rajasekaran's harness took this further. Inspired by Generative Adversarial Networks, he added a planner in front:

Planner → Generator → Evaluator → pass or retry

The planner decomposes the work. The generator executes. The evaluator runs Playwright contracts, unit tests, or whatever hard signals the domain affords. If the evaluator fails, the loop goes back to the generator. No ego, no hedging, no “looks good to me.”
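
In code, the loop is small. This is a sketch of the structure rather than Rajasekaran's implementation: planner, generator, and evaluator are placeholders for separate model calls (or, for the evaluator, a real test run), and the verdict object is assumed to carry a boolean and a failure report.

Python
def build(task, planner, generator, evaluator, max_retries=3):
    plan = planner(task)                          # decompose into discrete pieces
    for piece in plan:
        for _ in range(max_retries):
            artifact = generator(piece)           # write the code for this piece
            verdict = evaluator(piece, artifact)  # run tests / contracts; returns passed + report
            if verdict.passed:
                break
            # Feed the failure evidence back; the generator never grades itself.
            piece = f"{piece}\n\nPrevious attempt failed:\n{verdict.report}"
        else:
            raise RuntimeError(f"Could not satisfy the contract for: {piece}")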

2. Persist essentials between sessions

The antidote to session amnesia is a handoff artifact the next session can read in thirty seconds. A minimal one looks like this:

JSON
{
  "project": "order-service-migration",
  "updated": "2026-04-13T09:42:00Z",
  "features": [
    { "id": "F1", "name": "Stripe checkout flow",       "status": "passing" },
    { "id": "F2", "name": "Idempotent webhook handler", "status": "failing" },
    { "id": "F3", "name": "Back-office refund UI",      "status": "todo"    }
  ],
  "constraints": [
    "Must preserve existing /v1/orders contract",
    "No new dependencies without review"
  ],
  "next": "F2: webhook handler dedupe on event_id"
}

Every feature starts as todo or failing, never passing. The harness flips a feature to passing only when the evaluator says so. The next session reads this file first and knows exactly where to pick up. You have just defeated two failure modes—amnesia and premature completion—with a few dozen lines of JSON.
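
The part that keeps this honest is that the harness, not the model, writes the file. A sketch of that side, assuming the JSON above lives at handoff.json (a hypothetical path):

Python
import json
from pathlib import Path

STATE = Path("handoff.json")

def mark_feature(feature_id, evaluator_passed):
    """Flip a feature's status; only an evaluator verdict can make it 'passing'."""
    state = json.loads(STATE.read_text())
    for feature in state["features"]:
        if feature["id"] == feature_id:
            feature["status"] = "passing" if evaluator_passed else "failing"
    STATE.write_text(json.dumps(state, indent=2))

def next_feature():
    """What the next session reads first: the first feature that isn't passing."""
    state = json.loads(STATE.read_text())
    return next((f for f in state["features"] if f["status"] != "passing"), None)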

3. Isolate subtasks in sub-agents with clean context

Context rot compounds fast. If your main agent has to read logs, search the codebase, and write the next feature, its working memory is full of junk by the time it starts writing. The fix is delegation. Spin up a sub-agent for log reading, dump its findings into a short summary, and pass only the summary back to the main agent.

This is the pattern Anthropic's own research harness uses. A lead agent coordinates strategy; sub-agents explore aspects of the problem in parallel, each with a clean context window; their findings come back to the lead for synthesis. Your main thread only has to do two things: dispatch and converge.
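
A sketch of dispatch-and-converge, again with client.complete standing in for your chat call and the roles invented for illustration. Each sub-agent starts from a clean message list and returns a compact summary; the lead agent never sees the raw logs or search output.

Python
from concurrent.futures import ThreadPoolExecutor

def run_subagent(client, role, instructions):
    """One sub-agent, one fresh context window, one compact result."""
    messages = [
        {"role": "system", "content": f"You are a {role}. Reply with a summary under 300 words."},
        {"role": "user", "content": instructions},
    ]
    return client.complete(messages)

def investigate(client, failing_test):
    jobs = {
        "log reader": f"Read the CI logs for {failing_test} and report the failure signature.",
        "code searcher": f"Find the modules {failing_test} touches and list recent suspect changes.",
    }
    with ThreadPoolExecutor() as pool:
        futures = {role: pool.submit(run_subagent, client, role, task)
                   for role, task in jobs.items()}
        summaries = {role: future.result() for role, future in futures.items()}
    # Only the summaries flow back to the lead agent for synthesis.
    return "\n\n".join(f"[{role}]\n{text}" for role, text in summaries.items())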

4. Back-pressure with real verification

Tests, type checks, linters, browser contracts, schema validators—these are the parts of your toolchain you already trust because they can't lie. Wire them into the agent loop as blocking checks. If the code doesn't compile, the harness doesn't advance. If the test fails, the feature stays failing. No negotiation.

Anthropic's December 2024 research post on Building Effective Agents makes the case for simplicity here: the strongest agent systems aren't the ones with clever recovery logic. They're the ones that hand judgment over to deterministic signals wherever deterministic signals exist.
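
A sketch of what that looks like as a gate, using off-the-shelf checks (ruff, mypy, and pytest here are just examples; substitute your own toolchain). The only thing that matters is that the gate is deterministic and blocking.

Python
import subprocess

CHECKS = [
    ["ruff", "check", "."],   # lint
    ["mypy", "src"],          # type check
    ["pytest", "-q"],         # tests
]

def verify():
    """Run every check; return (passed, evidence). Any non-zero exit blocks the loop."""
    for cmd in CHECKS:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            # Hard no. Hand the evidence back to the generator and retry.
            return False, f"{' '.join(cmd)} failed:\n{result.stdout}{result.stderr}"
    return True, "all checks passed"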

5. Treat tool descriptions as first-class engineering

If you only do one thing this week, do this one. Go read the descriptions of the tools you've given your agent. Now imagine you're a new hire reading them for the first time. Would you pick the right tool? Would you know what the parameters do? Would you know what's in the return value?

If the answer is no, the model is going to fail the same way—except the model won't stop and ask. Good tool descriptions are short, specific, and biased toward examples. The upgrade is cheap. The payoff compounds every single call.
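
The difference is easiest to see side by side. The schema shape below loosely follows the JSON tool specs most providers accept, but the fields and the tool itself are illustrative, not any particular API.

Python
bad_tool = {
    "name": "search",
    "description": "Searches stuff.",  # which stuff? what comes back? when to use it?
    "parameters": {"q": {"type": "string"}},
}

good_tool = {
    "name": "search_codebase",
    "description": (
        "Full-text search over the repository's source files (excludes docs and "
        "vendored dependencies). Returns up to 20 matches as {path, line, snippet}. "
        "Use it to locate symbols or strings; use read_file to see surrounding code. "
        "Example: query='retry_backoff' finds where the retry policy is defined."
    ),
    "parameters": {
        "query": {"type": "string", "description": "Literal text or regex to match."},
        "max_results": {"type": "integer", "description": "Cap on matches; defaults to 20."},
    },
}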

One warning. Every harness you build encodes assumptions about what the current generation of models can't do on its own. Those assumptions go stale fast. A scaffold that was load-bearing in Q1 may be dead weight in Q3. Revisit the harness every time a new frontier model lands—some of your primitives will quietly stop earning their keep.

You've Been Tuning the Wrong Knob

In February 2026, the LangChain team published a piece that should have ended the debate. They ran a coding agent on Terminal Bench 2.0, one of the hardest agentic-coding benchmarks. They held the model fixed at gpt-5.2-codex. They touched nothing about the weights, the prompts to the base model, or the training.

All they changed was the harness—system prompt, tool design, and middleware. The result:

LangChain on Terminal Bench 2.0:
  • Before: 52.8%, Top 30
  • After: 66.5%, Top 5
  • Model changed? No, gpt-5.2-codex throughout

“Our coding agent went from Top 30 to Top 5 on Terminal Bench 2.0. We only changed the harness.”
— LangChain

A 13.7-point absolute jump. Top 30 to Top 5. With no help from the model. Think about how much of the AI industry is spending its attention on the wrong side of that equation.

The competitive edge in 2026 is not access to a better model. Everyone has access to the same three or four frontier models. The edge is harness taste—knowing which loops to build, which primitives to skip, and when the scaffolding has gone stale. The teams who figure this out will ship agents that feel like magic. The teams who don't will keep yelling at a horse.


Takeaways

If you only remember four things:
  • The model is no longer the bottleneck. The harness around it is.
  • Context rot, self-praise bias, session amnesia, and tool ambiguity are named, measured failure modes—not vibes.
  • External evaluators, persisted handoffs, isolated sub-agents, real verification, and good tool descriptions are the primitives that fix them.
  • LangChain moved from Top 30 to Top 5 on Terminal Bench 2.0 without touching the model. If that isn't a wake-up call, nothing is.

Stop prompting harder. Go look at your loop.