Long-horizon agent failure modes: context rot, drift, looping
An agent that handles ten turns gracefully will likely collapse at fifty. The reasons aren't bugs — they're emergent properties of how transformers handle growing context. Here's a field guide to how agents disintegrate, with the public failure stories that mapped the territory.
Every team that builds an agent eventually runs the same experiment. The 10-turn demo works beautifully — clean tool calls, sensible reasoning, the model executes the user’s intent. They push the horizon. At 30 turns it still mostly works. At 50 turns something weird starts happening. By 100 turns the agent is running in circles, ignoring the original goal, or worse, declaring victory on a task it never actually finished.
The instinct is to blame the model. “We need a smarter base model.” Sometimes that’s true, but usually it isn’t. The same model running the same prompt at 10 turns versus 100 turns is producing radically different output for reasons that have nothing to do with reasoning capacity and everything to do with how the conversation is represented in the context window. The transformer doesn’t get dumber. It just stops being able to find the signal in the noise it’s been accumulating.
This post catalogs the four failure modes that show up reliably in long-horizon agent runs — what they look like, why they happen, and the production mitigations that the teams shipping real agents have settled on.
Why this matters now
Three years ago, the typical “agent” demo was a 5-turn interaction. Two years ago, it was 20 turns. Today, products like Devin run for hours and Claude Code’s background sessions for entire workdays. The horizons have grown, but the failure modes that ambush you at longer horizons have not been solved by smarter models — they’ve been mitigated by better engineering. Understanding the mitigations is what separates teams that can confidently ship a long-horizon agent from teams whose demos work and whose production runs catastrophically fail.
The cost of getting this wrong has also escalated. A 5-turn agent that fails costs maybe a few cents of compute and a frustrated user. A 200-turn Devin run that drifts into a yak-shaving spiral can cost tens of dollars per run and erode customer trust permanently. The economics of long-horizon agents make the engineering investment in failure-mode mitigation pay off — sometimes within a single avoided runaway run.
The four canonical failure modes
What “context rot” looks like in a real trace
A concrete example from a production logs review at a SaaS company in early 2026. The agent — a Claude-powered customer support escalation handler — was answering a complex billing question. The customer’s original message stated they wanted a refund processed under invoice INV-8847, dated April 11.
By turn 35, after the agent had pulled several invoices, examined payment records, queried the customer’s transaction history, and tried two different refund-processing tools, the agent issued a final action: process a refund against INV-8523. That invoice was the customer’s most recent invoice — never the one they asked about. The context window contained the full conversation; the original request was still in there, near the top, accurately stating INV-8847. But the recent turns were dense with discussion of INV-8523, and that’s what the model latched onto.
The fix wasn’t a smarter model. It was an orchestrator hook that re-prepended the customer’s original request before every action turn, plus an explicit cross-check: “the user asked about invoice X. you are about to act on invoice Y. is that correct?” That single addition cut the rate of wrong-invoice actions in support agents by an order of magnitude.
What an actual production trace looks like
A useful exercise for grounding the failure modes is to look at what a production trace actually contains. A typical 100-turn agent run might log:
- 100 LLM completions, with input/output token counts per turn.
- 200-400 tool calls, with arguments, results, and durations.
- 20-50 summarization or compaction events, where the orchestrator collapsed older turns into a summary.
- A handful of orchestrator interventions: re-grounding prompts, retry-limit hits, escalations to a human.
- Final state: success, failure with reason code, or timeout.
The teams who debug long-horizon failures effectively can replay this trace. They can see at turn 47 that a re-grounding nudge changed the agent’s trajectory; at turn 78 that a retry budget tripped; at turn 91 that the verifier flagged an issue. Without this level of observability, you’re guessing about which mode failed. With it, the post-mortems become tractable.
The implication for builders: invest in tracing infrastructure as early as you invest in the agent itself. The most successful agent products treat the trace UI as a primary tool — both for engineers debugging issues and for support staff explaining behavior to customers. The trace is the agent’s source of truth.
Specifically, mode 1 in coding agents
Coding agents like Claude Code and Devin hit context rot in a specific way: tool results from earlier turns (files read, search results) accumulate and crowd out the active task. A coding session that started with a clear bug report can end up with 30K tokens of half-relevant code in the context, and the original bug report — three turns of conversation up — is no longer the model’s primary focus.
The mitigation for coding agents specifically is more aggressive about pruning old tool results once they’ve been acted on. Once the agent has read a file, made an edit, and verified the edit succeeded, the original file content can be summarized down to “edited file X to change Y” — the original full content doesn’t need to stay in context. Claude Code’s subagent pattern essentially institutionalizes this; subagent calls return compact results to the parent, which avoids accumulating raw tool noise.
Pattern in the wild — the customer support agent
A customer support agent we observed: ten turns into a complex billing dispute, the agent had pulled multiple invoices, cross-referenced two database records, and constructed a partial refund plan. Eleven turns in, the agent decided to “do more research” and started searching for unrelated transactions. Twelve turns in, it had pulled additional records that weren’t part of the original dispute. The original refund plan, partially built, was no longer being discussed.
This is goal drift in a customer-facing context. The agent was technically making “progress” — pulling more data, considering more options — but the progress was no longer toward the original goal. The user’s actual question (process the disputed refund) was being lost in a thicket of side investigation.
The mitigation that worked: re-prepending the customer’s verbatim original message at the top of each system message after turn 5, and explicitly checking “is your current action moving toward refunding INV-X.” The metric improved from a 73% successful-resolution rate to 91% after the intervention.
The “infinite loop” anti-pattern
A specific anti-pattern that comes up enough to deserve its own callout: agents that get stuck in a productive-looking loop. The agent keeps doing things — calling tools, producing output — but the things never make actual progress on the underlying task.
This is harder to catch than a tight retry loop because each iteration looks like work. The agent searches the web, summarizes results, considers options, calls another tool. The output stream is busy. But the task itself isn’t being moved forward; the agent is generating motion without progress.
The fix is to make progress measurable from the orchestrator’s side. Not “did the agent do something” but “is the agent measurably closer to the goal than it was N turns ago.” If progress can be quantified (tests passing, file modified, ticket status changed, query executed), the orchestrator can detect lack-of-progress even when the agent is busy. If progress can’t be quantified, the agent’s task probably needs to be re-scoped into something that can be.
A general note on how to find these failures
Before diving into each mode, a meta-comment on how teams typically discover them. The pattern is usually:
- The agent works beautifully in demos. The team ships.
- Users start using it on real workflows. 95% succeed.
- The 5% that fail get logged but the team initially thinks they’re noise.
- After a few weeks, a particular failure shows up enough times that it gets investigated.
- The investigation reveals a structural failure mode, not a one-off bug.
- The fix is an architectural change, not a model change.
This pattern repeats across teams and products with remarkable consistency. The implication is that you can short-circuit the discovery process by knowing the modes in advance — which is what this post is about. The teams that go in expecting context rot, drift, recovery loops, and premature termination get to step 6 in days; the teams that don’t know the modes exist take weeks to recognize the patterns in their logs.
A note on terminology
Different teams use slightly different vocabularies for these failure modes. Anthropic talks about “context utilization” and “working memory.” OpenAI’s documentation uses “task scope creep.” LangChain’s blog uses “agent drift.” The academic literature splits them more finely (eight named modes in some catalogs). The four-mode taxonomy below is a synthesis that captures the most consequential failures across these vocabularies — useful as a mental checklist when you’re debugging a long-horizon agent that isn’t behaving.
The modes are not mutually exclusive. A single failing agent run often shows two or three at once: drift early, recovery loops in response to the drift, premature termination when the loops eat the budget. The mitigations are largely independent though — fixing recovery loops doesn’t help with drift, fixing drift doesn’t help with premature termination — so addressing them individually is the right pattern.
Mode 1 — Context rot
The most common, and the one that shows up first as the agent’s horizon grows. Information that arrives early in the conversation gets de-weighted as the context grows around it. This isn’t a metaphor — it’s a measurable property of transformer attention. The “needle in a haystack” benchmarks have shown for years that retrieval accuracy degrades in the middle portions of long contexts, even on models advertised with 200K+ token windows.
For an agent, this means the original user instruction — sitting at turn 1 — gets progressively buried under turns 2 through 80 of tool calls, tool results, and intermediate reasoning. By turn 50, the model is paying much more attention to the most recent few exchanges than to the goal stated at the top.
Karpathy’s context engineering thesis is the cleanest framing of why this matters: the context window is the model’s RAM. Unlike actual RAM, packing it densely actively degrades retrieval. The job of the orchestrator isn’t to push everything into context just because the window can hold it — it’s to keep only the information that’s relevant for the next step.
The production fix is summarization on overflow. Pick a token threshold (typically 50-70% of the model’s context window), and when the conversation crosses it, replace the oldest N turns with an LLM-generated summary. Three engineering details matter:
- The summary needs to be structured, not prose. A “compressed turn” with fields like
actions_taken,findings,open_questions,errors_seenretains the useful signal while shrinking the byte count by 5-10x. - The summary call should use the same model family as the agent itself — cross-model summaries introduce subtle vocabulary drift.
- Pin certain pieces of context as “never-summarize”: the original user goal, the tool schemas, and any user-provided constraints. These get re-prepended after summarization.
Anthropic’s own engineering blog on long-horizon agents describes exactly this pattern as their default for Claude’s research-style features, with the additional twist that the summary is itself versioned in case the agent later needs to “expand” a compressed turn for re-examination.
Why long context windows don’t fix this
The intuitive response to context rot is “use a bigger context window.” Claude has 200K, Gemini has 1M+. Surely if the window is big enough, the early content just stays there?
The empirical answer is: yes, it stays in the window, but the model’s attention to it degrades. The needle-in-haystack benchmarks have shown for years that retrieval accuracy from the middle of long contexts is well below retrieval from the ends. The model isn’t using the early content equally even though it’s still present.
Karpathy’s framing here is precise: the context window is RAM that gets noisier as you add to it. Adding more bytes doesn’t preserve old bytes — it actively interferes with retrieval of them. The implication is that bigger context windows do not solve the long-horizon problem; they just shift the threshold at which it manifests. An agent that disintegrates at 50 turns in 8K context will disintegrate at 200 turns in 200K context. The architecture matters more than the window size.
Mode 2 — Goal drift
Subtler and harder to detect. The agent solves a problem, but not the problem the user asked about. The mechanism is a compounding version of the locally-optimal-action trap: at each step the model picks the most plausible next action given the most recent context, and the most recent context is its own intermediate work. Over enough steps the trajectory bends away from the user’s actual goal toward whatever was easiest to make progress on.
A 2026 Cognition blog post on Devin’s autonomy catalogs this as one of the dominant Devin failure classes. Asked to “fix the failing tests in auth.py”, Devin would start there, encounter a missing dependency, attempt to install it, find a permissions issue, debug that, refactor an unrelated config file, and after ninety minutes return a “done” with a PR that touches the dependency manager and nothing in auth.py at all. Each step was locally reasonable. The aggregate was off-task.
The mitigation pattern that works is periodic re-grounding to the original goal. Concretely:
every N turns (typically 5-10), inject a system message:
"Original goal: <verbatim original user instruction>
Current step count: <n>
What you've accomplished so far: <short list>
Re-evaluate: is your current action moving toward the original goal?
If not, propose a corrective plan."
This is mechanical, repetitive, and works disproportionately well. The act of forcing the model to verbalize its current trajectory against the original goal is enough to surface drift to the model itself, which can then self-correct. It’s the same trick “extended thinking” modes deploy at a finer grain — a structured pause to recompute the plan.
A second mitigation: have the orchestrator (not the agent) explicitly compute and display “progress toward goal” as a tool result. If the agent is asked to fix three tests and zero are passing after twenty turns, that’s a signal the orchestrator can inject. The model can ignore many things but it cannot ignore a tool result in its context.
Detecting goal drift in production
The hard part of goal drift is that it’s invisible to single-turn evaluation. The agent’s individual actions all look reasonable; only the trajectory reveals the drift. The teams that catch it early share a similar instrumentation pattern:
- Log the original goal as an artifact attached to every turn. Every span/trace from the agent carries the original user instruction as a structured field. Drift detection runs over this — comparing late-turn actions to the original goal embedding.
- Score “on-goal” per turn. A small evaluator model classifies each agent action as on-goal, neutral, or off-goal. Aggregate metrics: % off-goal per session, time-to-first-off-goal turn.
- Alert on rising off-goal rates. A deploy that increases off-goal rates is suspect even if eval-set scores are stable. Drift detection in production catches what offline evals miss.
The interesting failure case is when off-goal turns are individually correct but collectively wrong. The agent is doing useful work, just not the work that was asked. A pure correctness check at the turn level won’t catch this. The trajectory-level “on-goal” classifier does.
Mode 3 — Recovery loops
The third failure mode is the easiest to observe in logs but the most expensive to leave unmitigated. A tool call fails. The agent retries. The retry fails differently. The agent retries again with a different parameter. That fails. After ten tries the agent has spent thousands of tokens and made zero progress on the actual task.
A canonical example, also from Cognition’s Devin writeups: Devin attempts to run a Python script. The Python version is wrong. It installs a different version. The install fails. It tries a different Python manager. That fails too. It tries apt-get. It tries brew. It tries to compile from source. Two hours and $40 of compute later, the original task — which had nothing to do with Python versioning — has been completely forgotten in favor of an environment-debugging rabbit hole.
The pattern here is recursive error compounding: a recovery action fails in a new way, generating more error context, which the agent treats as a new problem to solve, which generates more errors. The context window fills up with retry chains. By the time a human notices, the agent has burned through its budget on a self-inflicted yak shave.
The production answer is hard budgets at every level of the orchestration loop:
The key insight is that the agent should never be the entity enforcing its own budget. A model that’s already drifted is not going to suddenly produce calibrated self-restraint. The orchestrator — the deterministic outer loop — owns the kill switch. Cursor checkpoints its agents in SQLite and the orchestrator can revert; Devin has explicit ACU (Agent Compute Unit) budgets per task that the model cannot exceed; Claude Code’s Agent SDK exposes both step and token budgets as first-class parameters.
A war story from production
A team running an internal-tools agent on Claude Sonnet 4.6 caught a textbook recovery loop in their logs. The agent was asked to update a configuration value in a YAML file. The first edit failed validation. The agent reformatted and retried. Failed again. Reformatted differently. Failed. After 14 retries, it started “fixing” unrelated keys in the same file in case those were the problem. By the time the orchestrator’s token budget tripped, the agent had touched 23 keys in a config file that originally needed one trivial change.
The fix wasn’t a smarter agent — it was a hard retry cap of 3 on any single tool error, with a forced escalation to a “re-plan with simpler scope” turn after the cap. After deploy, the bad-config-mutation rate dropped to near zero. The model wasn’t getting better; the orchestrator was just refusing to let it dig itself deeper.
Why “just be patient” fails
A subtler variant of the recovery loop is when the agent does make progress on retries — but the progress is so slow that the wall-clock time exceeds anything reasonable. We’ve seen agents that recover from environment issues over 40 tool calls, eventually completing the original task, but burning $20 in compute and 90 minutes of wall time on a task that should have taken 5 minutes.
This is not technically a loop (each step is different), but it is a failure of scope. The orchestrator should look at not just “is the agent making progress” but “is the cumulative cost still proportional to the task’s value.” A $20, 90-minute solution to a 30-second human task is a failure mode even when the final answer is correct.
The mitigation is a value-to-cost ratio check on long-running tasks. If the cost has exceeded a threshold (say, 5x the typical task cost in dollars or tokens), the orchestrator escalates to human review even if the agent is technically still making forward motion.
Recovery loops in tool-using agents
A specifically common variant in tool-using agents: a failing tool call generates an error message, which gets added to context, which the model treats as new information to act on. The model writes a “fix” — a different tool call. That tool call also fails (different error, same root cause). More context. The model keeps trying tools.
The pattern in the trace is unmistakable: a sequence of tool_use / tool_result pairs where each tool_result is a different error message. The agent is generating actions in response to feedback, but the feedback isn’t getting it closer to a working state. Each failure adds entropy to the context without adding useful information.
The clean fix is a tool-error budget separate from the general retry budget. After N tool errors of any kind within a session, force a re-plan turn that explicitly says “the recent tool calls have failed. step back, identify the root cause, and propose a different approach — not another tool call.” Often the re-plan correctly identifies a setup issue that the iterative tool-call sequence was masking.
Mode 4 — Premature termination
The most dangerous and the hardest to detect, because it’s silent by definition. The agent decides — wrongly — that it has accomplished the task and emits a “done” or final answer. The user (or the orchestrator) gets a plausible-looking output. The actual work is incomplete.
This is the failure mode that turns agentic systems from “buggy” to “untrustworthy.” Bugs are recoverable; trust isn’t.
The root cause is almost always context-driven: the agent’s recent context contains evidence of progress (a successful tool call, a passing test, a written file) and the model conflates progress with completion. By turn 60 the original acceptance criteria are buried somewhere up top, and the model’s working memory is full of “things I just did,” which looks like a finished task.
The mitigation pattern is explicit verifiers external to the agent:
- Acceptance criteria as code. Before the agent starts, codify the goal as a test, a query, or a deterministic check. The orchestrator runs the verifier after the agent claims done. If verification fails, the agent goes back into the loop with the verifier’s failure output.
- Sub-agent reviewers. A separate agent instance, with only the original goal and the proposed final output (not the full conversation), is asked: “does this output satisfy the goal? what’s missing?” This is the LangChain pattern for evaluator-optimizer chains, and works because the reviewer doesn’t carry the optimizer’s drift.
- Hard “ask a human” thresholds. For high-stakes agents, configure the orchestrator to escalate to a human after N steps without verifiable progress, even if the agent claims it’s done.
Devin’s public review pages document this failure pattern at length — a “done” PR that touches the wrong file, or claims to fix a bug that the test suite confirms is still failing. The fix in Devin 2.x was an explicit verifier loop: the agent’s claim of completion is only accepted after a separate verifier session signs off.
The compounding effect — why minutes turn into cliffs
There’s a quantitative regularity that’s worth internalizing if you’re going to ship long-horizon agents. The research community has converged on a finding that’s both empirical and intuitive: doubling task duration roughly quadruples failure rate. This shows up in multiple independent evaluations and has been observed in the data from Devin, Claude Code, and academic agent benchmarks alike.
The mechanism is straightforward. Each step has some probability p of being on-track. Over n steps, the probability of being on-track at the end is roughly p^n. Small per-step errors compound geometrically; small per-step recovery rates fail to keep pace. A model that’s 95% reliable per step is only 36% reliable over 20 steps and 0.6% reliable over 100. This is why agents that work at 10 turns degrade at 50 and disintegrate at 200.
The practical implication is that the right horizon for an agent is much shorter than the technical horizon the model can handle. Just because Claude or GPT can hold a coherent context over 200K tokens doesn’t mean a single autonomous run should go that long. The teams shipping reliable agents structure their work into shorter sub-tasks with explicit checkpoints, deliberate context resets between sub-tasks, and human review at sub-task boundaries.
The “cliff in the 30-40 minute range” that the long-horizon benchmark research describes is a function of this compounding. Around half an hour, the cumulative error rate of a long single-run agent crosses a threshold beyond which most tasks are unrecoverable. Beyond that point, the user is more likely than not to receive an output that’s off-task in some way they’ll have to catch.
A useful pattern from the teams that work around this: rather than running an agent for an hour straight, run it for ten minutes, checkpoint everything to disk, summarize the state, and start a new agent session with the summary as the initial context. The agent that started “fresh” with a summary outperforms the agent that ran continuously, because the summary itself acts as a context-window reset against drift.
The verifier-as-a-second-agent pattern
The verifier idea deserves its own paragraph because the implementation matters a lot. The naive version — “ask the same model ‘is this output correct?’” — doesn’t work well. The model that produced the output is exactly the model most likely to confirm it, because the confirmation reasoning is contaminated by the production reasoning.
The version that works: a different model instance (or even a different model entirely), given only the original goal and the proposed final output (not the production conversation), asked to verify against the goal. This separation matters because the verifier doesn’t carry the drift; it sees the input and output cleanly, like a code reviewer who didn’t write the code.
For the highest-stakes uses, a still-better pattern: the verifier is a deterministic check (a test that runs, a query that executes, a regex that matches). Code can’t drift. The combination of a deterministic verifier plus an LLM verifier as backstop catches most premature-termination cases.
What the 2026 consensus looks like
Two years into the agent build-out, the teams that ship reliably have settled on a remarkably similar architectural shape. Karpathy named it “context engineering”; LangChain calls it “agent memory hygiene”; Anthropic talks about “working memory.” The substance is the same.
The convergence is striking because it didn’t come from a single paper or a single product team. It emerged independently across multiple production-grade agent systems as engineers ran into the same failure modes and arrived at structurally similar mitigations. The mitigations work because they address the underlying constraints — attention degradation in long contexts, geometric error compounding over many steps, the model’s inability to enforce its own budget — rather than hoping for better base models to make the problems disappear.
The specific patterns:
- The context window is a managed resource, not a free buffer. The orchestrator decides what stays in, what gets summarized, and what gets dropped. This is the difference between an agent that works for 200 turns and one that disintegrates at 50.
- The agent is not in charge of itself. Budgets, retries, summarization triggers, and termination criteria all live in deterministic code outside the model. The model proposes; the orchestrator disposes.
- Verifiers are mandatory. Any non-trivial agent that claims completion without an external verifier is, with high probability, going to ship a wrong answer eventually. The cheap fix is a second model call grading the first; the expensive fix is a customer-trust incident.
- Drift is a logged event. The teams that catch goal drift early are the ones treating it as a metric — “fraction of tasks where the agent’s last action was on-goal” is a real KPI at the places that scaled their agent products in 2026.
The interesting wrinkle is that very little of this needs frontier-model intelligence. A model from 2024 with a well-engineered orchestrator outperforms a frontier 2026 model with a sloppy one on 100-turn tasks. The “agent stack” in 2026 is much more about the engineering wrapper than the underlying weights — exactly the inversion that early 2024 demos missed when they assumed bigger models would simply absorb the engineering problem.
Long-horizon agents that work in 2026 share a common characteristic: their authors treated them less like chatbots and more like distributed systems. Retries, budgets, summaries, checkpoints, verifiers, escalation paths. The transformer is a powerful but unreliable subsystem, and the job of the orchestrator is to make the whole thing reliable despite that. That framing — “the model is a component, not the product” — is the unifying lesson of two years of agent failure stories.
The teams that internalized this two years ago are shipping agents that get used. The teams that hoped the next model release would fix their failure modes are still hoping. The gap between those two cohorts is wider in 2026 than it was in 2024 — the bar for “reliable enough to ship” has risen, and the engineering discipline required has gotten more demanding, not less. The model is the easy part; the wrapper that keeps it reliable across hundreds of turns and thousands of users is the hard part. That’s where the actual work lives.
Further reading: Karpathy’s context engineering thread, LangChain’s writeup on context engineering for agents, Anthropic’s engineering Claude infrastructure 2025 post, and the Cognition blog on Devin’s evolution. For a deeper dive into eight specific failure modes — including reward hacking and alignment faking — see this field guide to LLM agent failures. For the orchestration patterns that make this manageable, our five patterns post goes deeper on the orchestrator-worker structure.