The supervisor-worker pattern: when one agent isn't enough

If you spent any time on AI Twitter in 2024 you came away convinced that multi-agent systems were the future. CrewAI, AutoGen, MetaGPT, Swarm, LangGraph Multi-Agent — every week another framework promised a team of specialists collaborating like a small consulting firm.

Eighteen months later, the production landscape tells a much simpler story. Almost every shipping multi-agent system in 2026 — whether built on LangGraph, Microsoft’s Agent Framework, AWS Bedrock Agents, or hand-rolled Python — is one specific topology: a supervisor that plans and dispatches, and workers that execute and return. Everything else is either an elaborate version of this, or a demo that didn’t make it past the design review.

This post is about why that single pattern won, when you actually need it, and the four failure modes that bite teams who reach for it too early.

The pattern in one diagram

The supervisor-worker pattern (also called orchestrator-worker, hierarchical agents, or “manager-employee” in CrewAI’s vocabulary) has three moving parts: a central LLM that owns the plan, a set of worker agents with narrower scope, and a shared state object that travels between them.

The supervisor owns the plan and the global state. Workers are functions: deterministic system prompts, narrow tool sets, structured returns. The return path from each worker back to the supervisor (the dotted line in the diagram) is what makes it an agent: the supervisor reads worker output and decides what’s next.

The structural reason this pattern wins over peer-to-peer alternatives is that the supervisor is the explicit bottleneck. There is exactly one place where the next action gets decided. You can log it, you can debug it, you can let a human edit the plan before it runs. None of those are true in a system where agents “naturally collaborate” by passing messages to each other.
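To make the "one place where the next action gets decided" concrete, here is a minimal sketch of the loop in plain Python. Everything in it is illustrative: `decide_next` stands in for the supervisor's LLM call, the two workers are stubs, and `max_turns` is an assumed safety cap rather than anything a framework prescribes.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class State:
    """Shared state that travels between supervisor and workers."""
    request: str
    results: dict[str, str] = field(default_factory=dict)
    decisions: list[str] = field(default_factory=list)  # audit log of every choice


Worker = Callable[[State], str]


def sql_worker(state: State) -> str:
    # Stub: a real worker would call a model with a narrow tool set.
    return "rows: 3"


def writer_worker(state: State) -> str:
    return f"report built from {sorted(state.results)}"


WORKERS: dict[str, Worker] = {
    "sql_worker": sql_worker,
    "writer_worker": writer_worker,
}


def decide_next(state: State) -> str:
    """Stand-in for the supervisor LLM: returns a worker name or 'STOP'."""
    for name in ("sql_worker", "writer_worker"):
        if name not in state.results:
            return name
    return "STOP"


def run(state: State, max_turns: int = 10) -> State:
    for _ in range(max_turns):
        choice = decide_next(state)      # the single decision point
        state.decisions.append(choice)   # log it, diff it, let a human edit it
        if choice == "STOP":
            break
        state.results[choice] = WORKERS[choice](state)
    return state


final = run(State(request="Q3 revenue summary"))
print(final.decisions)
```

Because every transition passes through `decide_next`, the `decisions` list is a complete trace of the run, which is the property peer-to-peer message passing does not give you.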

When you actually need it

Single-agent systems handle far more than people give them credit for. A GPT-4-class model with 30 well-described tools, a clear system prompt, and a sensible eval set will beat most three-agent CrewAI setups on accuracy, latency, and cost. The supervisor pattern is not a free upgrade — it adds 1.5-2x latency and 2-3x token cost relative to a single agent on the same task, according to LangChain’s own production guidance.

The honest rule of thumb, after watching dozens of teams ship or fail to ship: reach for the supervisor pattern only when sub-tasks need fundamentally different system prompts and tool sets.

Concrete triggers from production systems:

Different output formats per sub-task. SQL worker returns rows; writer worker returns prose; chart worker returns SVG. Trying to coax one prompt to switch between these reliably is harder than splitting them.
Different tool surfaces with security boundaries. Your customer-service workflow might have a “lookup” agent that can read CRM data but can’t write to it, and an “update” agent that can write but only after explicit approval. The boundary is enforced by giving each worker a different tool set.
Different model tiers. A planner worth a Sonnet/Opus call dispatching to Haiku workers for cheap parallel reads is a routinely sensible architecture; cramming both into one prompt either over-pays or under-thinks.
Different memory/context scopes. The “research” worker should not see the user’s private files. The “personalize” worker should. Workers each get the slice of state they’re allowed to see.

If none of those apply — if your “sub-agents” all use the same model, the same tools, and produce text — you almost certainly want a single agent with a longer prompt or a prompt chain.
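The tool-surface and context-scope triggers can be sketched as data rather than prompt text. The example below is a hypothetical illustration, not any framework's API: `WorkerSpec`, `read_crm`, `write_crm`, and the state keys are all made-up names, and the point is only that the boundary is enforced in code, outside the model.

```python
from dataclasses import dataclass
from typing import Any, Callable


def read_crm(customer_id: str) -> dict:
    # Stub standing in for a read-only CRM lookup.
    return {"id": customer_id, "tier": "gold"}


def write_crm(customer_id: str, tier: str) -> str:
    # Stub standing in for a CRM write.
    return f"{customer_id} -> {tier}"


@dataclass(frozen=True)
class WorkerSpec:
    name: str
    tools: dict[str, Callable[..., Any]]  # the only tools this worker may call
    visible_keys: frozenset[str]          # the only state keys this worker may see

    def call_tool(self, tool: str, *args: Any) -> Any:
        if tool not in self.tools:
            raise PermissionError(f"{self.name} may not call {tool}")
        return self.tools[tool](*args)

    def view(self, state: dict[str, Any]) -> dict[str, Any]:
        """Return the slice of shared state this worker is allowed to read."""
        return {k: v for k, v in state.items() if k in self.visible_keys}


lookup = WorkerSpec("lookup", {"read_crm": read_crm}, frozenset({"customer_id"}))
update = WorkerSpec(
    "update",
    {"read_crm": read_crm, "write_crm": write_crm},
    frozenset({"customer_id", "approved_change"}),
)

state = {
    "customer_id": "c-42",
    "approved_change": "platinum",
    "private_files": ["tax_return.pdf"],
}

print(lookup.view(state))
print(update.call_tool("write_crm", "c-42", "platinum"))
```

If two specs end up with identical `tools` and `visible_keys`, that is a reasonable signal the split is not buying anything.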

The LangGraph supervisor in practice

LangGraph’s langgraph-supervisor package is the most-deployed reference implementation in 2026. The mental model is straightforward: workers are subgraphs (themselves often ReAct agents); the supervisor is a node that picks the next worker by name.

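from langgraph_supervisor import create_supervisor
from langgraph.prebuilt import create_react_agent

# Assumed to be defined elsewhere: haiku and sonnet (chat model instances),
# the tool functions (run_sql, list_tables, web_search, fetch_url), and
# postgres_checkpointer (a persistent LangGraph checkpointer).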

sql_worker = create_react_agent(
    model=haiku,
    tools=[run_sql, list_tables],
    prompt="Return rows as JSON. Never narrate.",
    name="sql_worker",
)

research_worker = create_react_agent(
    model=haiku,
    tools=[web_search, fetch_url],
    prompt="Cite every fact with a URL. Return JSON list.",
    name="research_worker",
)

writer_worker = create_react_agent(
    model=sonnet,
    tools=[],
    prompt="Compose a markdown report from inputs in state.",
    name="writer_worker",
)

supervisor = create_supervisor(
    model=sonnet,
    agents=[sql_worker, research_worker, writer_worker],
    prompt=(
        "Decompose the request, dispatch to workers in order, "
        "and synthesize via writer_worker last. Stop when done."
    ),
).compile(checkpointer=postgres_checkpointer)

Two things in this snippet that production teams take seriously:

The supervisor’s prompt is short and procedural, not flowery. “Decompose, dispatch, synthesize, stop” is the entire algorithm. The fancier the supervisor prompt, the more it second-guesses itself and racks up turns.
The checkpointer is real infrastructure. In production the checkpointer is Postgres, Redis, or DynamoDB — not the in-memory default. A supervisor run is resumable; if your process dies after worker 2, the next run picks up at worker 3 with all state intact. This is the single most underrated property of the pattern.
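The resumability claim is easier to see in a toy version. The sketch below is not LangGraph's checkpointer interface; it is a hand-rolled stand-in (`SqliteCheckpointer`, `STEPS`, and the `crash_after` flag are all hypothetical) that persists state after every worker and shows a second run picking up where the first one died.

```python
import json
import sqlite3
from typing import Optional


class SqliteCheckpointer:
    """Toy checkpointer: one JSON blob of state per thread id."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints "
            "(thread_id TEXT PRIMARY KEY, state TEXT)"
        )

    def save(self, thread_id: str, state: dict) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO checkpoints VALUES (?, ?)",
            (thread_id, json.dumps(state)),
        )
        self.conn.commit()

    def load(self, thread_id: str) -> dict:
        row = self.conn.execute(
            "SELECT state FROM checkpoints WHERE thread_id = ?", (thread_id,)
        ).fetchone()
        return json.loads(row[0]) if row else {"next_step": 0, "results": []}


STEPS = ["sql_worker", "research_worker", "writer_worker"]
calls: list[str] = []  # records which workers actually ran, across both runs


def run(thread_id: str, cp: SqliteCheckpointer, crash_after: Optional[int] = None) -> dict:
    state = cp.load(thread_id)  # resume from the last saved step, if any
    while state["next_step"] < len(STEPS):
        name = STEPS[state["next_step"]]
        calls.append(name)
        state["results"].append(f"{name}: ok")  # stub for the real worker call
        state["next_step"] += 1
        cp.save(thread_id, state)               # checkpoint after every worker
        if crash_after is not None and state["next_step"] == crash_after:
            raise RuntimeError("simulated crash")
    return state


cp = SqliteCheckpointer(sqlite3.connect(":memory:"))
try:
    run("thread-1", cp, crash_after=2)  # process "dies" after worker 2
except RuntimeError:
    pass
final = run("thread-1", cp)             # second run resumes at worker 3
print(calls)
```

Each worker runs exactly once across the two runs. The in-memory SQLite database is used only to keep the example self-contained; with a real restart the store has to live outside the process, which is why the in-memory default is not enough.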

The compiled graph in LangGraph terms is just a state machine where the supervisor node has edges to each worker node, and every worker has an edge back to the supervisor. The supervisor’s output schema decides which edge to take. That’s the whole magic.

When it shipped: three real examples

Block’s Goose agent (open-sourced January 2025) is a desktop coding/automation agent that internally uses a supervisor loop plus a handful of “extensions” (their term for workers). Goose’s production lesson, surfaced in Block’s writeup, is that the supervisor prompt should explicitly enumerate “stopping conditions” — without them, the loop runs forever debugging itself.

Anthropic’s “Research” feature for Claude (the multi-agent version that landed in 2025) was described in their engineering blog as a lead agent that plans and dispatches to parallel research workers, plus a final synthesis step. They reported that the multi-agent variant outperformed a single-agent Claude by 90.2% on their internal research eval, while also noting that multi-agent systems use roughly 15x the tokens of an ordinary chat interaction. That tradeoff is why this pattern lives in “research mode” features, not in every Claude turn.

LangGraph’s own customer success teams publish a pattern they call “deep agents” — a supervisor with a long-running scratchpad worker, a filesystem worker, and a planning worker. The LangGraph multi-agent tutorial is the canonical reference; if you’ve seen one supervisor-worker diagram on the internet in the last year, it was probably from that tutorial.

What’s missing from this list is the “fully autonomous agent team collaborating like humans” use case. Even Anthropic’s Research feature is not that: it is a bounded plan executed by parallel workers. The supervisor decides when to stop. The workers do not chat.

Cost and latency: the honest numbers

The most common surprise for teams adopting this pattern is the cost math. Here is a representative back-of-envelope for a moderately complex task (planning + 3 worker calls + synthesis), using mid-2026 prices:

Indicative numbers, drawn from LangChain’s planning guide and Anthropic’s research-system writeup. Cost grows faster than latency because parallel workers compress wall time but not tokens.

The latency math is forgiving — parallel workers cut wall time even when they don’t cut tokens. The cost math is brutal. A supervisor that dispatches three workers and synthesises pays for at least five LLM calls where a single agent would pay for two or three. If your unit economics need a margin, you cannot ship the supervisor pattern naively. The production cure is usually: planner runs on the expensive model, workers run on the cheap one.
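The arithmetic behind "planner on the expensive model, workers on the cheap one" can be worked through in a few lines. Every number below is made up for illustration: the per-million-token prices, the model labels `big` and `small`, and the token counts per call are assumptions, not quotes from any provider or from the guides cited above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    model: str
    input_tokens: int
    output_tokens: int


# Hypothetical USD prices per million tokens: (input, output).
PRICES = {"big": (3.00, 15.00), "small": (1.00, 5.00)}


def cost(calls: list[Call]) -> float:
    """Total USD cost of a list of LLM calls under PRICES."""
    total = 0.0
    for c in calls:
        price_in, price_out = PRICES[c.model]
        total += c.input_tokens / 1e6 * price_in
        total += c.output_tokens / 1e6 * price_out
    return total


# Single agent: two calls on the expensive model.
single_agent = [Call("big", 4000, 500), Call("big", 6000, 800)]

# Naive supervisor: plan + 3 workers + synthesis, all on the expensive model.
naive = (
    [Call("big", 3000, 400)]
    + [Call("big", 3000, 600)] * 3
    + [Call("big", 6000, 1000)]
)

# Tiered supervisor: same five calls, workers moved to the cheap model.
tiered = (
    [Call("big", 3000, 400)]
    + [Call("small", 3000, 600)] * 3
    + [Call("big", 6000, 1000)]
)

for label, calls in [("single", single_agent), ("naive", naive), ("tiered", tiered)]:
    print(f"{label}: ${cost(calls):.4f}")
```

Under these assumed numbers the naive supervisor costs about 2x the single agent and the tiered one about 1.3x. Swap in your own prices and token counts before drawing any conclusion; the shape of the calculation is the useful part.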

The four failure modes nobody warns you about

After enough postmortems, the patterns of supervisor-worker failure cluster into four buckets.

1. The supervisor becomes the bottleneck

Every worker output round-trips through the supervisor to decide the next step. If the supervisor is on a slow or expensive model, each turn pays that tax. The pathological version is a supervisor that re-reads all prior worker outputs every turn (“I should double-check the plan…”) and balloons context. Cure: explicit short-form state, structured worker returns, and a step_index in state so the supervisor doesn’t re-derive its position from scratch.
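One way to picture the cure is a state object that carries its own position and hands the supervisor a bounded view instead of the full transcript. This is a sketch under assumptions: `SupervisorState`, `StepResult`, and the 80-character summary cap are hypothetical choices, not part of any framework.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class StepResult:
    worker: str
    status: str   # e.g. "ok" or "error"
    summary: str  # short-form result; full payloads live elsewhere


@dataclass
class SupervisorState:
    plan: list[str]
    step_index: int = 0  # explicit position, so the model never re-derives it
    results: list[StepResult] = field(default_factory=list)

    def record(self, result: StepResult) -> None:
        self.results.append(result)
        self.step_index += 1

    def next_worker(self) -> Optional[str]:
        if self.step_index < len(self.plan):
            return self.plan[self.step_index]
        return None

    def prompt_view(self, max_summary_chars: int = 80) -> str:
        """Compact text handed to the supervisor each turn."""
        lines = [
            f"step {self.step_index} of {len(self.plan)}; "
            f"next: {self.next_worker() or 'none'}"
        ]
        for i, r in enumerate(self.results):
            lines.append(f"{i}. {r.worker} [{r.status}] {r.summary[:max_summary_chars]}")
        return "\n".join(lines)


state = SupervisorState(plan=["sql_worker", "research_worker", "writer_worker"])
state.record(StepResult("sql_worker", "ok", "x" * 500))  # oversized worker output
print(state.prompt_view())
```

The supervisor's context grows by one short line per step rather than by one full worker transcript per step.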

2. Workers hallucinate handoffs

A worker, asked to “research X then pass to writer,” writes its own markdown summary at the end, because the prompt felt incomplete without one. Now the writer worker has two inputs to merge — the supervisor’s brief and the previous worker’s accidental summary — and the synthesised report contradicts itself. Cure: workers return only structured data keyed to the supervisor’s plan. No “by the way” output.
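"No by-the-way output" is easiest to enforce at the boundary, by rejecting anything that is not exactly the agreed shape. The sketch below assumes a hypothetical return contract of three keys (`step_id`, `status`, `data`); the key names are illustrative, and a real system might use a schema library instead of hand-written checks.

```python
import json

# Hypothetical contract: a worker return has exactly these keys, no more.
ALLOWED_KEYS = {"step_id", "status", "data"}


def parse_worker_return(raw: str, expected_step_id: str) -> dict:
    """Accept a worker's output only if it matches the contract exactly."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("worker output is not JSON") from exc
    if not isinstance(payload, dict):
        raise ValueError("worker output must be a JSON object")

    extra = set(payload) - ALLOWED_KEYS
    missing = ALLOWED_KEYS - set(payload)
    if extra or missing:
        raise ValueError(f"bad keys: extra={sorted(extra)}, missing={sorted(missing)}")
    if payload["step_id"] != expected_step_id:
        raise ValueError("result is not keyed to the current plan step")
    return payload


good = '{"step_id": "s2", "status": "ok", "data": [{"url": "https://example.com"}]}'
chatty = '{"step_id": "s2", "status": "ok", "data": [], "summary": "By the way..."}'

print(parse_worker_return(good, "s2")["status"])
try:
    parse_worker_return(chatty, "s2")
except ValueError as err:
    print("rejected:", err)
```

Rejecting the extra `summary` key outright, rather than silently dropping it, surfaces the prompt problem in the worker instead of hiding it.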

3. Plan inflation

Given freedom, supervisors plan 12 steps when 3 would do. Each step is another LLM round trip. Cure: cap the plan length in the supervisor prompt (“Output at most 4 steps”) and reject longer plans in code. This is the single biggest cost lever in practice.
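Rejecting long plans in code might look like the following sketch. The cap of 4, the worker names, and the `propose` callable (a stand-in for the supervisor's planning call that receives feedback text) are assumptions for illustration.

```python
from typing import Callable

MAX_STEPS = 4
KNOWN_WORKERS = {"sql_worker", "research_worker", "writer_worker"}


class PlanRejected(ValueError):
    """Raised when a proposed plan breaks a hard limit."""


def validate_plan(plan: list[str], max_steps: int = MAX_STEPS) -> list[str]:
    if not plan:
        raise PlanRejected("empty plan")
    if len(plan) > max_steps:
        raise PlanRejected(f"{len(plan)} steps exceeds the cap of {max_steps}")
    unknown = [step for step in plan if step not in KNOWN_WORKERS]
    if unknown:
        raise PlanRejected(f"unknown workers: {unknown}")
    return plan


def plan_with_retry(propose: Callable[[str], list[str]], retries: int = 2) -> list[str]:
    """Ask for a plan; on rejection, feed the reason back and ask again."""
    feedback = ""
    for _ in range(retries + 1):
        try:
            return validate_plan(propose(feedback))
        except PlanRejected as exc:
            feedback = f"Previous plan rejected: {exc}. Output at most {MAX_STEPS} steps."
    raise PlanRejected("supervisor never produced an acceptable plan")


def fake_supervisor(feedback: str) -> list[str]:
    # Stand-in for the LLM: inflates the plan until it is told off.
    if not feedback:
        return ["research_worker"] * 12
    return ["sql_worker", "research_worker", "writer_worker"]


plan = plan_with_retry(fake_supervisor)
print(plan)
```

The prompt instruction and the code check do different jobs: the prompt makes short plans likely, and the validator makes long ones impossible to execute.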

4. Lost middle

When a supervisor accumulates 10+ worker results in context, the model develops the same “lost in the middle” pathology that plagues long-context RAG. The first and last worker results dominate; the middle ones get ignored in the final synthesis. Cure: don’t accumulate. Summarise as you go, or have workers write to a structured scratchpad and have the synthesiser read that, not the raw transcript.
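A structured scratchpad can be as simple as a keyed store that compresses on write, so the synthesiser reads a fixed-size entry per worker rather than a growing transcript. In this sketch `summarize` is a deliberately crude stand-in (whitespace collapse plus truncation) for what would normally be a cheap model call, and `Scratchpad` and the 120-character budget are hypothetical.

```python
from dataclasses import dataclass, field


def summarize(text: str, max_chars: int = 120) -> str:
    """Stand-in for an LLM summarisation call: collapse whitespace, truncate."""
    text = " ".join(text.split())
    if len(text) <= max_chars:
        return text
    return text[: max_chars - 3] + "..."


@dataclass
class Scratchpad:
    entries: dict[str, str] = field(default_factory=dict)

    def write(self, key: str, raw_output: str) -> None:
        # Summarise at write time so nothing large ever accumulates.
        self.entries[key] = summarize(raw_output)

    def render(self) -> str:
        """Text handed to the synthesiser: one short, labelled entry per worker."""
        return "\n".join(f"[{key}] {value}" for key, value in self.entries.items())


pad = Scratchpad()
for i in range(12):
    pad.write(f"worker_{i}", f"finding {i}: " + "detail " * 700)  # ~5k chars each

rendered = pad.render()
print(len(rendered))
```

With twelve workers the rendered scratchpad stays under two thousand characters here, against roughly sixty thousand for the raw outputs, and every worker gets the same amount of room regardless of where it ran in the sequence.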

Where this pattern is going

The interesting near-term evolution is supervisors that compile plans into code instead of step lists — closer to LLMCompiler than to plain LangGraph — and execute the code in a sandbox where workers are just function calls. Cloudflare’s code-mode work hints at this; so does LangGraph’s “compiled plan” preview. The token savings are real: a 12-step plan-list might be 800 tokens; the equivalent JS snippet calling worker functions is closer to 200. And the supervisor doesn’t have to re-derive the plan each turn — it just runs the code.
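A rough sketch of the plan-as-code idea, in Python rather than JS to match the rest of this post: the supervisor emits a short program, and workers are ordinary functions exposed to it. Everything here is hypothetical (the worker stubs, `WORKER_API`, `PLAN_SOURCE`), and `exec` with empty builtins is emphatically not a real sandbox; it only illustrates the shape of the approach.

```python
def run_sql(query: str) -> list[dict]:
    # Stub worker: a real one would run the query via a model or a database.
    return [{"region": "EMEA", "revenue": 120}, {"region": "APAC", "revenue": 80}]


def write_report(rows: list[dict]) -> str:
    # Stub worker: a real one would call the writer model.
    total = sum(row["revenue"] for row in rows)
    return f"Total revenue: {total}"


# The only names the generated plan is allowed to call.
WORKER_API = {"run_sql": run_sql, "write_report": write_report}

# What a supervisor might emit instead of a step list (hand-written here).
PLAN_SOURCE = """
rows = run_sql("SELECT region, revenue FROM sales")
result = write_report(rows)
"""


def execute_plan(source: str) -> object:
    """Run a generated plan with only the worker functions in scope.

    WARNING: stripping builtins does not make exec safe. Real deployments
    need process- or container-level isolation.
    """
    namespace = {"__builtins__": {}, **WORKER_API}
    exec(compile(source, "<plan>", "exec"), namespace)
    return namespace.get("result")


print(execute_plan(PLAN_SOURCE))
```

The plan runs end to end without another supervisor turn between workers, which is where the token savings described above would come from.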

The deeper shift is recognising that the supervisor-worker pattern is fundamentally an orchestration pattern, not a “society of agents.” The framing matters because it sets the right expectations: you are building a state machine with LLMs in the boxes, not assembling a team. State machines have testable transitions and inspectable state. Teams do not.

What to take away

Three lines worth tattooing on the wall above your IDE:

The supervisor-worker pattern is the only multi-agent topology that ships at scale in 2026. Everything else is either a variant of this or a demo.
Adopt it only when sub-tasks need genuinely different prompts and tools. Two workers with the same prompt and tool set is a single agent in disguise, and you’ll be paying for the disguise.
The supervisor’s prompt should be procedural, the plan should be short, and the workers should return structured data. Get those three right and the rest is plumbing.

Two years of multi-agent hype have left us with one production-grade pattern and a graveyard of swarms. That is not a tragedy — it is the normal trajectory of any new abstraction: the noisy maximalism phase, the sober winnowing, and the one pattern that survived because it matched how software is actually built.

Further reading: the LangGraph supervisor tutorial is the best implementation reference. Anthropic’s Building Effective Agents (December 2024) introduced the orchestrator-worker terminology; their multi-agent research system writeup is the most honest production retrospective in the space. AutoGen’s original paper is historical context — the framework itself is now in maintenance mode, with Microsoft’s Agent Framework as the successor.