Why ReAct lost — and structured planning won

For most of 2023, every agent tutorial on the internet started the same way. A loop. The model thinks (“Thought: I need to find the population of France”). The model acts (“Action: search(‘France population’)”). The system returns an observation. The model thinks again. Repeat until done.

That pattern came from a single paper — Yao et al., ReAct: Synergizing Reasoning and Acting in Language Models — and for about eighteen months it was synonymous with “agent.” LangChain’s default AgentExecutor was a ReAct loop. Half the tutorials in 2023 were some flavour of ReAct. The pattern’s name became part of the vocabulary.

By 2026 the production landscape has moved on. The agents shipping inside Cursor, Devin, Replit, GitHub Copilot agent mode, Claude Code, and the LangGraph deep-agents reference architectures do not run a naive ReAct loop. They run some form of plan-then-execute, often with the plan compiled into a DAG of tasks, sometimes with the plan compiled into code. ReAct survives as a fallback inside individual nodes — but the top-level orchestration has shifted.

This is a post about why that shift happened. The short version: ReAct’s “reason at every step” property is a feature in toy examples and a tax in real workflows. The long version follows.

What ReAct gets right (and what it costs)

The ReAct paper (posted to arXiv in late 2022, published at ICLR 2023) landed because it combined two ideas, chain-of-thought reasoning and tool-augmented language models, into one loop, with plausible benchmark gains on HotpotQA, ALFWorld, and WebShop. The formulation is elegant. The model interleaves thoughts (internal reasoning) with actions (tool calls), and gets to “see” the result of each action before reasoning about the next.

Thought: I need to find the population of France.
Action: search("France population 2024")
Observation: France has a population of approximately 68.4 million.
Thought: Now I need to compare this to Germany.
Action: search("Germany population 2024")
Observation: Germany has a population of approximately 83.2 million.
Thought: Germany has the larger population. Final answer: Germany.
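
In code, the loop behind that trace is only a few lines. What follows is a minimal sketch, not the paper's reference implementation: `scripted_model` is a hypothetical stand-in for the LLM call, and `search` reads from a canned dictionary instead of a real search API.

```python
import re
from typing import Callable

# Hypothetical canned results standing in for a real search tool.
FAKE_INDEX = {
    "France population 2024": "France has a population of approximately 68.4 million.",
    "Germany population 2024": "Germany has a population of approximately 83.2 million.",
}

def search(query: str) -> str:
    return FAKE_INDEX.get(query, "No results.")

def scripted_model(history: str) -> str:
    """Stand-in for an LLM: chooses the next step from how many observations exist."""
    seen = history.count("Observation:")
    if seen == 0:
        return ('Thought: I need to find the population of France.\n'
                'Action: search("France population 2024")')
    if seen == 1:
        return ('Thought: Now I need to compare this to Germany.\n'
                'Action: search("Germany population 2024")')
    return "Thought: Germany has the larger population. Final answer: Germany."

ACTION_RE = re.compile(r'Action: search\("(.+?)"\)')

def react(question: str, model: Callable[[str], str], max_steps: int = 5) -> tuple[str, str]:
    history = f"Question: {question}\n"
    for _ in range(max_steps):
        step = model(history)  # the model re-reads the whole history on every call
        history += step + "\n"
        if "Final answer:" in step:
            return step.split("Final answer:")[1].strip().rstrip("."), history
        match = ACTION_RE.search(step)
        if match is None:
            break  # no parseable action: stop rather than spin
        history += f"Observation: {search(match.group(1))}\n"
    return "no answer", history

answer, trace = react("Which has the larger population, France or Germany?", scripted_model)
print(answer)
```

The detail to notice is that `history` is passed whole into every model call. Nothing in the loop ever trims it.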

Three things are true about this loop, and the third is what eventually killed it as a default.

It works for short tasks. On a two-step lookup like the above, ReAct is dramatically better than no-tools chain-of-thought. The model can integrate real-world data into its reasoning.
It’s interpretable. Every step has an explicit thought string. Debugging is straightforward — you read the trace.
It’s quadratic in tokens, and the trajectory is non-deterministic. The same prompt can take a different path on each run, and every step appends a thought, an action, and an observation to the context. By step 10, the context has all 30 of those. By step 20, it has 60. Worse, every call re-reads that entire history to choose the next action, so the total token cost summed across the run scales like O(n²) in the number of steps, not O(n).

That third property is the production killer.
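
To see the quadratic growth in numbers, here is a back-of-the-envelope sketch. It assumes, purely for illustration, a fixed `tokens_per_step` for each thought/action/observation triple and a fixed `base_prompt`; neither figure comes from a measurement.

```python
def react_prompt_tokens(steps: int, tokens_per_step: int = 300, base_prompt: int = 200) -> int:
    """Prompt tokens summed over all LLM calls, where call k re-reads the k-1 earlier steps."""
    return sum(base_prompt + (k - 1) * tokens_per_step for k in range(1, steps + 1))

def closed_form(steps: int, tokens_per_step: int = 300, base_prompt: int = 200) -> int:
    """Same total via the arithmetic series: n*base + t*n*(n-1)/2."""
    return steps * base_prompt + tokens_per_step * steps * (steps - 1) // 2

for n in (5, 10, 20):
    print(n, react_prompt_tokens(n))
# prints:
# 5 4000
# 10 15500
# 20 61000
```

Under these assumptions, doubling the run from 10 to 20 steps roughly quadruples the tokens read, which is the quadratic term taking over from the fixed prompt.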

The two failure modes nobody warned you about

1. Context bloat

A long ReAct trace is mostly observations, and observations are noisy. A web search returns a page snippet. A SQL query returns rows. A file read returns content. After 8-10 steps, the conversation context is mostly stale tool output that the model has to wade through to decide what to do next.

The empirical effect: as the trace grows past about 4,000 tokens, the model’s accuracy on choosing the right next action drops sharply. This is “lost in the middle” — the same phenomenon that plagues long RAG contexts — happening inside a single agent run.

2. Accidental loops

The ReAct loop has no explicit plan. The model decides what to do next based on what it has done so far. In production this shows up as two characteristic patterns:

Recursive backtracking. The model tries something, doesn’t like the observation, tries a variation, doesn’t like that, tries something close to the first thing again. Every iteration adds context. Eventually the agent times out or the user cancels.
“Confirmation” loops. The model retrieves a fact, then decides to verify it by retrieving again, then decides to double-check by retrieving a third time. Cost balloons; correctness doesn’t improve.

There is no architectural mechanism in ReAct that says “we already covered this; move on.” The only way out is to constrain it from the outside — max iterations, repetition detection — which is what every real ReAct deployment ended up doing, badly.
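
Those outside constraints usually amount to something like the guard below. This is an illustrative sketch with hypothetical names (`LoopGuard`, `max_repeats`), not code from any particular framework. It also hints at why such guards tend to work badly: exact-match repetition detection catches a re-issued query but not a paraphrased one.

```python
from collections import Counter
from typing import Optional

class LoopGuard:
    """External brake for an agent loop: caps total steps and repeated identical actions."""

    def __init__(self, max_steps: int = 10, max_repeats: int = 2) -> None:
        self.max_steps = max_steps
        self.max_repeats = max_repeats
        self.steps = 0
        self.seen = Counter()

    def check(self, action: str) -> Optional[str]:
        """Return a stop reason, or None if the action is allowed to run."""
        self.steps += 1
        if self.steps > self.max_steps:
            return "max_steps"
        key = " ".join(action.lower().split())  # normalise case and whitespace only
        self.seen[key] += 1
        if self.seen[key] > self.max_repeats:
            return "repeated_action"
        return None

guard = LoopGuard(max_steps=10, max_repeats=2)
actions = [
    'search("France population")',
    'search("france  population")',   # same query, different case and spacing
    'search("France population")',
]
reasons = [guard.check(a) for a in actions]
print(reasons)  # [None, None, 'repeated_action']
```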

What plan-then-execute changed

The first credible alternative was Plan-and-Solve prompting, which simply asked the model to write a plan first, then execute it. But the bigger conceptual shift was ReWOO (“Reasoning WithOut Observation”), which made the planner blind to observations.

The ReWOO architecture has three parts:

Planner. Writes a complete plan up front, referencing variables like #E1 for tool outputs.
Worker. Executes the planned tool calls, in parallel when they are independent.
Solver. Synthesises the final answer from the collected results.

The planner never sees observations, so context stays small.

The key trick: the planner writes the entire plan in one go, referencing variables it cannot see yet. Steps that don’t depend on each other can run in parallel. The solver reads all the results at the end and synthesises the final answer. The model is invoked twice — once to plan, once to synthesise — regardless of how many tool calls happen in between.
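
The variable-substitution trick can be sketched as follows. Everything here is a hypothetical stand-in: `planner` and `solver` return canned values where a real system would make one LLM call each, the `TOOLS` table fakes its results, and the worker runs sequentially for brevity even though independent steps could run in parallel.

```python
import re

# Hypothetical tools; a real worker would call search APIs, calculators, and so on.
FACTS = {"France population": "68.4 million", "Germany population": "83.2 million"}
TOOLS = {
    "search": lambda query: FACTS.get(query, "unknown"),
    "compare": lambda text: f"compared[{text}]",
}

def planner(question: str) -> list[tuple[str, str, str]]:
    """Stand-in for the single planning LLM call: (variable, tool, argument) triples."""
    return [
        ("#E1", "search", "France population"),
        ("#E2", "search", "Germany population"),
        ("#E3", "compare", "France=#E1; Germany=#E2"),  # refers to outputs not yet seen
    ]

def worker(plan: list[tuple[str, str, str]]) -> dict[str, str]:
    """Run each step, filling #E placeholders with earlier results. No LLM involved."""
    evidence: dict[str, str] = {}
    for var, tool, arg in plan:
        resolved = re.sub(r"#E\d+", lambda m: evidence[m.group(0)], arg)
        evidence[var] = TOOLS[tool](resolved)
    return evidence

def solver(question: str, evidence: dict[str, str]) -> str:
    """Stand-in for the single synthesis LLM call: here it just returns the last result."""
    return list(evidence.values())[-1]

question = "Which has the larger population, France or Germany?"
plan = planner(question)
evidence = worker(plan)
result = solver(question, evidence)
print(result)
```

The planner's output is a small, fixed-size artefact. The observations only ever enter the solver's context, once, at the end.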

The ReWOO paper reports 5x token efficiency and 4% accuracy improvement on HotpotQA versus ReAct. The 5x is the headline number; the savings should grow on longer tasks, because the quadratic blowup is exactly what ReWOO eliminates.

LLMCompiler: the DAG-of-tasks evolution

ReWOO’s planner writes a list. LLMCompiler (Kim et al., December 2023) extended this by having the planner write a directed acyclic graph of tasks, with explicit data dependencies. The DAG is then scheduled by a Task Fetching Unit that runs independent nodes concurrently.

The reported numbers from the LLMCompiler paper:

2.27× speedup on the ParallelQA benchmark vs ReAct.
4.65× cost reduction in dollar terms.
Up to 9% accuracy improvement on certain task classes.
Up to 3.7× speedup on broader tool-use benchmarks.

The conceptual leap is recognising that an agent’s plan is a program, and a program is best represented as a dependency graph. Once you have the DAG, you can schedule it like any other compute graph. The same ideas that powered Dask, Airflow, and TensorFlow’s graph execution apply directly.
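
A minimal sketch of that scheduling idea, using Python's standard-library `graphlib`. This is not LLMCompiler's Task Fetching Unit: it runs the graph in waves (everything currently ready, then the next batch) rather than dispatching each task the instant its inputs arrive, and the task names and `run_task` body are hypothetical.

```python
import asyncio
from graphlib import TopologicalSorter

async def run_dag(deps: dict[str, list[str]], run_task):
    """Run a task graph wave by wave; tasks within a wave execute concurrently."""
    sorter = TopologicalSorter(deps)  # maps each task to the tasks it depends on
    sorter.prepare()                  # raises CycleError if the plan is not a DAG
    results: dict[str, object] = {}
    waves: list[list[str]] = []
    while sorter.is_active():
        ready = list(sorter.get_ready())
        waves.append(sorted(ready))
        outputs = await asyncio.gather(*(run_task(name, results) for name in ready))
        for name, output in zip(ready, outputs):
            results[name] = output
            sorter.done(name)         # unlocks tasks that were waiting on this one
    return results, waves

# Hypothetical plan: two independent lookups feeding one comparison.
deps = {"pop_fr": [], "pop_de": [], "compare": ["pop_fr", "pop_de"]}

async def run_task(name: str, results: dict):
    await asyncio.sleep(0)            # stand-in for a tool call
    if name == "compare":
        return max(("pop_fr", "pop_de"), key=lambda k: results[k])
    return {"pop_fr": 68.4, "pop_de": 83.2}[name]

results, waves = asyncio.run(run_dag(deps, run_task))
print(waves)               # [['pop_de', 'pop_fr'], ['compare']]
print(results["compare"])  # pop_de
```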

A worked comparison

Take a concrete prompt: “Compare the populations of France, Germany, Italy, and Spain, and tell me which has the highest GDP per capita.”

Naive ReAct will run roughly this trajectory:

Thought / Action: search("France population") → observation.
Thought / Action: search("Germany population") → observation.
Thought / Action: search("Italy population") → observation.
Thought / Action: search("Spain population") → observation.
Thought / Action: search("France GDP per capita") → observation.
Thought / Action: search("Germany GDP per capita") → observation.
Thought / Action: search("Italy GDP per capita") → observation.
Thought / Action: search("Spain GDP per capita") → observation.
Thought / Final answer.

Nine sequential LLM calls. Each one carries the growing history. Summed over the run, token cost grows roughly like 1 + 2 + … + 9 = 45 step-sized chunks of context (on the order of 9²/2), because each call re-reads everything before it and the final call sees all prior context.

A plan-execute version runs:

Plan: “search population for each of FR/DE/IT/ES, search GDP per capita for each, then compute.” One LLM call.
Execute: 8 tool calls — but they all run in parallel because none depend on each other. Wall time is the slowest single call.
Solve: read all 8 results, produce final answer. One LLM call.

Two LLM calls plus a flat batch of tool calls. The token cost is roughly linear; the wall-clock latency is the latency of one search, not eight.
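
The execute phase is easy to simulate. In the sketch below, `fake_search` is a hypothetical stand-in that just sleeps for a fixed latency; the point is only that eight independent calls dispatched together finish in about the time of one.

```python
import asyncio
import time

COUNTRIES = ["France", "Germany", "Italy", "Spain"]
METRICS = ["population", "GDP per capita"]

async def fake_search(query: str, latency: float = 0.1) -> str:
    await asyncio.sleep(latency)  # stand-in for network latency of a real search
    return f"result for {query}"

async def execute_plan() -> tuple[dict[str, str], float]:
    queries = [f"{country} {metric}" for metric in METRICS for country in COUNTRIES]
    start = time.perf_counter()
    # No query depends on another, so all eight are issued at once.
    answers = await asyncio.gather(*(fake_search(q) for q in queries))
    elapsed = time.perf_counter() - start
    return dict(zip(queries, answers)), elapsed

tool_results, elapsed = asyncio.run(execute_plan())
print(f"{len(tool_results)} tool calls in {elapsed:.2f}s (sequential would be about 0.80s)")
```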

For this particular task you might see something like:

Approach	LLM calls	Total tokens	Wall time
ReAct	9	~12,000	~22 s
Plan-Execute	2	~3,500	~5 s
LLMCompiler (DAG)	2	~3,200	~3.5 s

Numbers are illustrative, but the structural ratio holds: plan-execute beats ReAct by roughly 3-5x on cost and latency when a task has several independent steps. It’s not even close.

When ReAct still wins

The honest counterpoint: there are tasks where you cannot plan ahead, because each step’s necessity depends on what previous steps return. “Debug this code” is the canonical example. The model runs the test, sees the error, decides what to fix; the next action is inherently conditional on the observation.

For these tasks, plan-and-execute degrades into “plan with one step, execute, replan.” That’s the LangGraph plan-and-execute with re-planning pattern, and it is genuinely useful when the task is fundamentally branching. But notice that even here, you’ve made the re-planning explicit and bounded, which kills the accidental-loop failure mode.
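
A bounded replan loop can be sketched like this. It is an illustration of the idea, not LangGraph's implementation: `run_step`, `replan`, and the fake test environment are hypothetical stand-ins for a tool call, an LLM call, and a real test runner.

```python
from typing import Callable

def plan_and_replan(
    first_step: str,
    run_step: Callable[[str], tuple[bool, str]],
    replan: Callable[[str, str], str],
    max_replans: int = 3,
) -> tuple[str, list[str]]:
    """Execute one step at a time, replanning after each failure, at most max_replans times."""
    step, executed = first_step, []
    for attempt in range(max_replans + 1):
        ok, observation = run_step(step)
        executed.append(step)
        if ok:
            return "done", executed
        if attempt < max_replans:
            step = replan(step, observation)  # explicit, counted replanning
    return "gave_up", executed

# Hypothetical environment: the tests pass only after both fixes are applied.
FIXES_NEEDED = ["fix import", "fix off-by-one"]

def make_env():
    applied: list[str] = []

    def run_step(step: str) -> tuple[bool, str]:
        applied.append(step)
        remaining = [fix for fix in FIXES_NEEDED if fix not in applied]
        if remaining:
            return False, f"failing: {remaining[0]}"
        return True, "all tests pass"

    def replan(step: str, observation: str) -> str:
        return observation.removeprefix("failing: ")  # stand-in for an LLM replanning call

    return run_step, replan

status, steps = plan_and_replan("run tests", *make_env())
capped_status, capped_steps = plan_and_replan("run tests", *make_env(), max_replans=1)
print(status, steps)
print(capped_status, capped_steps)
```

The second call shows the bound doing its job: with only one replan allowed, the run stops with `gave_up` after two steps instead of continuing indefinitely.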

In production, the right rule is:

[Figure: decision tree for choosing which loop topology your task actually needs.] Naive ReAct is reserved for the genuinely exploratory case, and even there it ships with hard caps.

The narrow remaining win for ReAct-style loops is in coding agents, where each tool observation (test output, lint result, compile error) materially changes the next action. Even there, the production-grade implementations — Cursor’s Composer, Claude Code, Devin — wrap the ReAct inner loop in an explicit outer plan and ratchet down the max iterations aggressively.

What the production agents actually do

A non-exhaustive survey:

Cursor Composer runs a top-level plan (“read these files, edit these regions, run tests”) with ReAct-style inner loops only within the “run tests until they pass” sub-task. The outer plan is structured; the inner loop is bounded to ~5 iterations.
Claude Code has explicit TodoWrite / planning tools the model is encouraged to use early. The pattern is plan-first, execute-with-tools, summarise-at-end. ReAct survives inside individual sub-tasks.
Devin plans in advance, stores the plan as inspectable state, and re-plans on failure. Long-horizon runs are explicitly bounded by plan-step count, not by token budget.
LangGraph’s deep-agents reference is built around a plan-execute supervisor with sub-graphs as workers.
Anthropic’s Building Effective Agents never recommends naive ReAct. The “orchestrator-workers” pattern they describe is structurally plan-and-execute.

ReAct is not gone from the codebase; it’s just been demoted from orchestration layer to inner-loop primitive. The framework chapter header that used to say “Agents (ReAct)” now says “Agents (Planning)” in everything but the legacy LangChain docs.

The contrarian opinion

The ReAct paper deserves credit for kicking off a wave of research, and its loop is genuinely a useful primitive for some sub-tasks. But the field’s adoption of ReAct as a default was a mistake — driven by tutorial momentum and the simplicity of the algorithm — that cost the ecosystem two years of building agents that were quietly more expensive and less reliable than they needed to be.

The teams that read the ReWOO and LLMCompiler papers in 2023 and 2024, and switched their orchestration to plan-then-execute at the time, were shipping cheaper and more reliable agents while the rest of the field was still iterating on ReAct prompts. By 2026 that gap shows up as shipped products versus unshipped demos.

Two lines for the wall above your IDE:

If you can plan, plan. The token savings and parallelism are not optional in production.
If you can’t plan, plan-and-replan with hard bounds. The naked ReAct loop is a debugging tool, not an architecture.

The pattern that won was the one that treated agents as programs — DAGs of tasks compiled by an LLM and executed by a runtime — rather than chatbots in a thinking loop. That framing is what unlocked the cost, latency, and reliability gains that production agents finally deliver in 2026. ReAct, for all its formative importance, did not.

Further reading: the original ReAct paper is still worth reading for the framing. ReWOO and LLMCompiler are the two papers that made the production case for plan-then-execute. LangChain’s Plan-and-Execute Agents blog post is the most accessible practitioner guide. Anthropic’s Building Effective Agents is the production reference.