The agent harness — the code around the model that actually makes an agent

The demo always looks the same. The model solves something impressive in a single turn. The founder walks through it at the conference. It trends. Then a month later, production is a graveyard of half-finished tasks, blown context windows, tool errors the user never saw, and a cost curve that looks like a hockey stick.

The problem is almost never the model. The problem is that someone built a demo around a model and called it an agent. The model is the engine. What they forgot to build was the car.

That car — the runtime scaffolding that wraps the model and turns a stateless completion API into something that can pursue a goal across dozens of steps — is the harness. And in 2026, the harness is where nearly all of an agent product’s real differentiation lives.

What a harness is and what it is not

A model call is stateless. You send a context window in; you get tokens out. That’s it. There is no memory between calls, no ability to act in the world, no notion of “the task I was working on ten minutes ago.” A model call is not an agent.

A harness is the code that takes a stateless model and makes it behave like an agent. Concretely:

It runs the loop — feed the model context, receive output, decide what to do next, repeat.
It manages what goes into that context and what gets dropped.
It dispatches tool calls and routes results back.
It decides when to stop and how to recover when things go wrong.

The model provides the intelligence. The harness provides the continuity, the authority, and the guardrails. A better metaphor than “engine and car”: the model is an extremely knowledgeable person who can only communicate by reading and writing documents. The harness is everything else in the organisation — the inbox, the policies, the approval chains, the file system, the calendar, the person who says “you’ve been in this meeting for four hours, let’s wrap up.”

The model just maps context to tokens. Everything that turns that into an agent — the loop, dispatch, context, authority, exit conditions, persistence — is the harness.

The anatomy of a harness

1. The agent loop

The core is the think-act-observe loop. The model receives a context (system prompt, conversation history, tool results, observations) and produces output — either a final answer, a tool call, or both. The harness executes the tool calls, observes the results, appends them to the context, and feeds the next turn back to the model. Then repeat.

This is the loop you find in every production harness: Claude Code’s agentic loop, SWE-agent’s task scaffold, OpenHands’ execution cycle. They differ wildly in sophistication, but they all have this at the bottom.

The loop itself is easy to write. The hard part is everything that happens inside it — all the other components below.

2. Tool dispatch and JSON-schema validation

A model proposes a tool call. The harness must: parse the call, validate it against the tool’s declared schema, execute the tool, and return the result (or error) in a form the model can parse on the next turn.

This seems mechanical until it isn’t. Silent tool errors — where the harness swallows a failure and returns an empty result, or a partial result with no error signal — are one of the most common sources of agent failure in production. The model sees success where there was none, continues on a bad premise, and compounds the error across the next ten steps.

Good harnesses treat tool results as first-class. They return typed errors. They surface whether a result was truncated. They include timestamps or version metadata when the tool output might be stale. They distinguish “the tool ran and returned nothing” from “the tool failed to run.”

Aider’s search/replace loop is a clean example: if the model’s proposed old_string doesn’t match the file, Aider returns an explicit mismatch error that the model sees and must correct before continuing. The harness is enforcing the contract.

3. Context management

The context window is the agent’s working memory, and it fills up. In a multi-step task, the raw conversation history of every prior turn, every tool call, and every tool result grows quickly. Left alone, you hit the window limit and the model starts losing its earliest context — usually the most important context, like the original task and the constraints the user stated.

Context management is how the harness handles this. It has several sub-problems:

Windowing is the most basic: keep the last N turns and truncate the rest. This is what most toy agents do. The problem is that “most recent” is not the same as “most important.” The model forgets why it started.

Compaction and summarisation is the more sophisticated approach. Instead of truncating, the harness periodically summarises the history — converting detailed tool-call transcripts into higher-level prose (“the agent has so far verified that the tests pass and has modified three files”) — and replaces the raw history with the summary. Claude Code does this explicitly; it generates a compacted summary of prior context before each new segment of work. The model loses granularity but retains the semantic shape of what happened.

Selective retention is the hardest. Not everything warrants equal weight. The original task spec, user-stated constraints, and discovered blockers should almost always stay. The intermediate scaffold of a succeeded sub-task probably does not. Good harnesses have policies for what to pin and what to compress. Most harnesses in the wild do not.

4. Permissions and sandboxing

The model proposes. The harness disposes.

This is the permission layer, and it is load-bearing. A model can propose any tool call — including ones that delete production data, make irreversible API calls, or install arbitrary code. The harness is the only thing standing between the model’s proposal and the real world.

In Claude Code, the permissions config explicitly enumerates which bash patterns are auto-approved, which require confirmation, and which are blocked. The model never calls the shell directly — it emits a Bash tool call, and the harness decides whether to run it. Devin runs tools inside a sandboxed cloud environment so that even an unchecked shell command can’t escape to production infrastructure.

The failure mode when permissions are too loose is obvious: the model does something irreversible. The failure mode when permissions are too tight is subtler: the model gets blocked, retries, gets blocked again, burns tokens, and either gives up or starts finding workarounds. Both failure modes are harness problems, not model problems.

SWE-agent handles this differently — it runs inside a Docker container with a defined set of commands (view, edit, search, run test, bash). The allowed vocabulary is narrow but fully trusted within its bounds. OpenHands takes a similar approach with a sandboxed execution environment that includes a browser, a shell, and a filesystem, all isolated from the host.

5. Sub-agents and delegation

Long or complex tasks benefit from decomposition. The harness can spawn a sub-agent — a separate model call (or separate agent process) that handles one piece of the work — and assemble the results. This looks superficially like parallel processing but it’s really about scope isolation: each sub-agent gets a clean context window scoped to its sub-task, rather than a single agent accumulating the entire history of a hundred-step task in one increasingly unwieldy context.

Claude Code’s subagents are the most visible example: a parent agent can spawn a headless child agent for a bounded task (“run this eval suite and report the results”), and the child’s work does not pollute the parent’s context. OpenHands’ architecture separates planning from execution similarly.

The harness manages the delegation contract: what the sub-agent is allowed to do, what it returns, and how failures propagate back to the parent.

6. Stop conditions and loop-runaway guards

A loop needs an exit. For a coding agent: task complete (verified), stuck (repeated failures with no progress), cost budget exceeded, turn limit reached. For an autonomous research agent: question answered (with evidence), source exhausted, time limit hit.

Loop runaway — the agent continuing to take actions after the task is done or after recovery is clearly hopeless — is one of the more embarrassing failure modes in production. It’s usually a sign that the stop conditions are too weak or that the model’s self-assessed “done” signal is being trusted without verification.

Strong harnesses have explicit turn limits and cost ceilings that fire independently of the model’s self-assessment. They also detect loop patterns: if the model has called the same tool with the same arguments three times in five turns, something is wrong.

7. Verification and feedback

The harness can give the model a second opinion about its own output. This is the verification layer — running tests, invoking a linter, calling a second model to check the first, or even asking the user. The result feeds back into the loop as another observation.

Claude Code runs the test suite after each edit batch if configured to do so. SWE-agent’s loop feeds test failure output back to the model as structured observations. The model isn’t just proposing code — it’s being shown evidence about whether the code works, and the harness is producing that evidence.

This is what separates a coding agent from a code generator. A generator outputs code once. An agent observes the results of its code, recognises failure, and tries again — and the harness is what makes that feedback loop possible.

8. Persistence and checkpointing

A task that runs for twenty minutes is a task that must survive a crash, a context limit, or a mid-session “I need to pause this.” Persistence is the harness serialising the agent’s state — current context, tool-call history, sub-agent results, pending actions — in a way that lets work be resumed cleanly.

Most toy harnesses skip this entirely. Production harnesses treat it as critical infrastructure. Claude Code stores transcripts to disk so that sessions can be resumed. Devin’s workspace persists across sessions because the sandboxed environment — files, shell state, browser state — is the checkpoint.

The deeper problem is what to checkpoint. The model weights are stateless by design. What matters is the accumulated context — the work history the model needs to continue intelligently — and the environment state — whatever the model has done to the world so far. Both need to be captured for a checkpoint to be genuinely resumable.

Real harnesses and what each bets on

The anatomy above is abstract. Real harnesses show where the bets land.

Claude Code has the most sophisticated harness of the coding agents. Its context compaction is explicit and documented. Its permissions model is configurable at the pattern level. Its tool set is expansive — the shell is the tool — and the harness manages authority over it. The investment is visible: Anthropic published a detailed system prompt and internals post that shows a harness designed to be transparent to the model and the user.

Cursor Composer bets its harness on the planner-then-apply pipeline. The harness runs a planning pass, then dispatches a fine-tuned Apply Model to convert the plan into exact file edits. The context management is handled partly by the indexer (pre-staging context) rather than in-loop. The permission layer is lighter — the user reviews diffs rather than the harness managing a policy file.

SWE-agent, the Princeton research system, made the harness the research contribution. The AgentComputer Interface (ACI) is a purpose-built tool set for software engineering tasks — constrained commands, structured outputs, explicit feedback loops after each action. SWE-agent’s results on SWE-bench improved substantially when the ACI was refined, with the same underlying model. That’s a clean demonstration that harness quality matters independently of model quality.

OpenHands (formerly OpenDevin) takes the broadest scope: a sandbox with a browser, a shell, and a filesystem, orchestrated by a harness that supports multiple agent strategies. The harness is explicitly designed to be swappable — you can run CodeAct, browsing agents, or custom strategies through the same execution infrastructure.

Aider runs the simplest harness that works. Its loop is tight: one model turn, one set of proposed edits, one commit. The context management is the repo map (a compressed repo structure in every system prompt). There is no compaction, no sub-agents, no sophisticated stop conditions. The simplicity is intentional — it makes Aider auditable and reproducible in ways that more complex harnesses are not.

Devin had, at the time of its announcement, the most complete harness in the field: sandboxed cloud environment, persistent workspace, planning layer, sub-task delegation, verification against tests. The harness was the product. Subsequent evaluation showed that harness quality — specifically, the verification and stop condition layers — was where the gap between demo performance and real-task performance lived.

Why the harness matters more than the base model for product quality

There’s a pattern that repeats: a new model release comes out, teams upgrade, and their agent doesn’t get meaningfully better. The model benchmarks say it should be better. Why isn’t it?

Almost always: because the gains the new model offers require the harness to use them. A model with better instruction-following is only better if the harness gives it clear instructions. A model with a longer context window is only better if the harness has a compaction strategy that doesn’t waste the extra space on redundant history. A model with better tool-use is only better if the harness is returning well-structured tool results.

Model improvements are latent capacity. The harness is what converts that capacity into task performance.

This is why the SWE-bench leaderboard is a harness benchmark as much as a model benchmark. The teams at the top are not just using the best model — they are running the most sophisticated harnesses. GPT-4o and Claude 3.5 Sonnet occupy the same benchmark, but the top-ranked systems are running completely different harnesses on top of them.

Harness failure modes

The common ways harnesses break in production:

Context rot. The context window fills with low-value content — verbose tool outputs, repeated background, accumulated redundant observations — and the model’s attention dilutes. The model starts ignoring the original task because it’s buried under twenty turns of intermediate steps. Fix: compaction policies and pinned-prompt sections.

Runaway loops. The agent keeps going after the task is done, or keeps retrying a stuck state. Fix: explicit turn limits, cost ceilings, and loop-pattern detection that fire independently of the model’s self-assessment.

Silent tool errors. The harness swallows a tool failure and the model proceeds on a false premise. Fix: structured error returns, typed failures, explicit “tool returned empty” signals.

Lost state on resume. The user pauses and resumes, but the resumed context doesn’t accurately reflect what was done. The agent re-does work or contradicts prior decisions. Fix: explicit checkpoint format that captures both context history and environment state.

Over-broad permissions. The agent is allowed to do irreversible things without human review. Fix: tiered permission model — auto-approve for read and low-risk operations, require confirmation for writes, block for destructive or hard-to-reverse actions.

How to evaluate a harness

Evaluating a harness is different from evaluating a model. The right metrics are task-level, not turn-level.

Trajectory cost — total tokens consumed per completed task. A harness that completes the same tasks in fewer tokens (not turns, tokens — one turn can be cheap or expensive) is a better harness, assuming quality is equal.

Tool-call validity rate — what fraction of tool calls the model emits are syntactically valid and match the schema. A well-structured harness with good schema documentation should be near 100%. Drops below 95% indicate the harness is not giving the model adequate schema information or error feedback.

Error recovery rate — when a tool call fails, does the model correct and succeed on a retry? A harness that returns rich, structured errors to the model should produce high recovery rates. A harness that returns opaque errors or swallows failures will produce low ones.

Task completion at turn limit — what fraction of tasks complete before the turn limit fires? A high number here means stop conditions and turn limits are well-calibrated. A low number means the harness is either too aggressive (stopping too early) or the agent is genuinely stuck and the harness isn’t detecting it.

State coherence on resume — after a checkpoint and resume, does the agent correctly continue without re-doing completed work? This is hard to measure automatically but worth a manual evaluation pass.

What to take away

The harness is not boilerplate. It is the product. The model is the commodity inference layer. The harness is where task completion, cost efficiency, and reliability are actually built.
Context management — compaction, selective retention, windowing — is the single most under-engineered component in most agent systems. Budget time for it.
The permission model is a values decision, not just an engineering decision. Where you draw the line between auto-approve and require-confirmation shapes what your agent can accomplish and what risks it imposes on users.
“The model got better but my agent didn’t” is a harness diagnosis, not a model complaint. Figure out which component — context, tools, stop conditions, verification — is the bottleneck before changing the model.
Evaluate the harness and the model separately. Hold one fixed and vary the other. The team that does this consistently will iterate faster than the team running model-plus-harness as a black box.

Two years from now, harness architecture will be as well-understood as REST API design is today — a set of known patterns with known trade-offs, taught in courses, documented in RFCs. We’re not there yet. Right now, the teams who understand harness anatomy are building qualitatively more reliable agents than the teams who are just prompt-engineering around a while loop.

Further reading: Anthropic’s Claude Code system prompt post is the most complete public documentation of a production harness. The SWE-agent paper is the clearest research argument that ACI design — the harness — drives benchmark results independently of model choice. OpenHands’ architecture docs show how a harness can be designed to be strategy-agnostic.