Why multi-agent swarms keep failing to ship

If you sat through any AI conference in 2024 or 2025, you saw the demo. A “manager” agent gets a high-level task. It spawns a “researcher” agent to gather information, hands the result to a “writer” agent, then a “reviewer” agent edits the output. Each agent has a persona, a name, a job. The conversation between them, displayed in colour-coded panes, looks like a tiny office. The presenter says the magic word — “autonomous” — and the audience claps.

Now ask the presenter to ship that exact system to production. Ninety times out of a hundred, they will quietly switch to a different architecture before it touches a real customer. The “researcher / writer / reviewer” team in the demo will become a single LLM call with a careful prompt and a couple of tools. Or it will become a deterministic pipeline with a quality gate. Or, if the task really does need an agent, it will become a single orchestrator LLM dispatching sub-calls and re-planning — with no peer-to-peer conversation at all.

This isn’t a “frameworks are bad” essay. AutoGen, CrewAI, LangGraph, Microsoft’s Agent Framework — these are useful pieces of software, especially as prototyping environments. But the specific shape they push by default — peer agents talking to each other — has a remarkably consistent record of not making it to production. The reasons are worth dissecting, because they tell you something durable about how LLM systems should be designed.

What “multi-agent swarm” actually means

Some terms first, because the word “multi-agent” gets stretched.

In this piece, a multi-agent swarm is a system where:

Multiple LLM-driven agents exist as distinct conversational entities.
They communicate primarily in natural language, agent-to-agent.
Each agent has its own persona, role, or specialty, often expressed as a system prompt.
Coordination emerges from the conversation, not from explicit code.

That is the AutoGen GroupChat pattern, the CrewAI Crew pattern, and roughly half of the agent papers published in the last two years. It is not the same thing as orchestrator-workers, where one LLM-driven orchestrator decides what to do next and dispatches subtasks to workers that may themselves be LLM calls. The orchestrator never mistakes itself for a peer, and the workers never talk to each other directly. The two structures look similar from the outside but are fundamentally different inside.

Figure: Same components, different topology. The peer mesh on the left has quadratic communication paths and uses prose as protocol. The hub on the right has linear paths and uses structured returns.
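The path counts behind that comparison are simple arithmetic. A minimal sketch, assuming every peer in the mesh can address every other peer and every worker in the hub talks only to the orchestrator (the function names are illustrative, not taken from any framework):

```python
def mesh_paths(n_agents: int) -> int:
    """Undirected channels when every agent can talk to every other agent."""
    return n_agents * (n_agents - 1) // 2


def hub_paths(n_agents: int) -> int:
    """Channels when one orchestrator talks to each of the remaining workers."""
    return max(n_agents - 1, 0)


for n in (4, 8, 16):
    print(f"{n} agents: mesh={mesh_paths(n)} hub={hub_paths(n)}")
```

With four agents the mesh has six channels to reason about and the hub has three. At sixteen agents the gap is 120 against 15.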

Failure mode 1: cost is unbounded

A peer-to-peer agent conversation can, in principle, run forever. The manager asks the researcher for more details, the researcher consults the reviewer about scope, the reviewer asks the writer about format, the writer asks the manager what the priorities are. Each turn is an LLM call. Each call is tokens-on-the-meter.

In a swarm, the stopping condition is itself negotiated between agents in natural language. That is roughly the worst possible place to put a stopping condition, because LLMs are biased toward continuing the conversation. Their training data is full of dialogue, and dialogue that ends abruptly is rare. The default behaviour of any chat-trained model, given an open conversational loop, is to keep talking.

The empirical result, which anyone who has deployed CrewAI past the hello-world stage can confirm, is that token spend per task has a long tail. Most tasks finish in a reasonable number of turns. A small percentage spiral into conversations thousands of turns long and burn through your monthly inference budget before the on-call engineer wakes up to kill the loop.

Orchestrator-workers does not have this problem because the orchestrator’s plan is the stopping condition. The plan has steps. The plan completes when the steps complete. The orchestrator can decide to re-plan, but each re-plan is itself one call with bounded output. There is no “talk until done” mode.
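One way to make that bound explicit is to meter every call against a hard ceiling. The sketch below is illustrative only: `run_plan`, `plan`, `work`, and `should_replan` are hypothetical names standing in for your own planner and worker calls, and the budget counts calls rather than tokens to keep the example short.

```python
from typing import Callable


class BudgetExceeded(RuntimeError):
    """Raised when a task asks for more calls than its ceiling allows."""


def run_plan(
    task: str,
    plan: Callable[[str, list[str]], list[str]],  # hypothetical planner call
    work: Callable[[str], str],                    # hypothetical worker call
    should_replan: Callable[[str], bool],          # hypothetical check on a result
    max_calls: int = 10,
) -> list[str]:
    calls = 0

    def spend() -> None:
        # Every planner or worker call passes through this meter.
        nonlocal calls
        if calls >= max_calls:
            raise BudgetExceeded(f"budget of {max_calls} calls exhausted")
        calls += 1

    results: list[str] = []
    spend()  # the initial plan is one call
    steps = list(plan(task, results))
    while steps:
        step = steps.pop(0)
        spend()  # each worker call is one call
        result = work(step)
        results.append(result)
        if should_replan(result):
            spend()  # re-plans count against the same budget
            steps = list(plan(task, results))
    return results


def plan(task: str, results: list[str]) -> list[str]:
    return ["research", "draft"]


def work(step: str) -> str:
    return f"{step}: done"


print(run_plan("write report", plan, work, should_replan=lambda r: False))
try:
    # A pathological task that re-plans after every step still terminates.
    run_plan("write report", plan, work, should_replan=lambda r: True, max_calls=5)
except BudgetExceeded as exc:
    print(f"stopped: {exc}")
```

The point of the design is that the ceiling lives in code. No model output can negotiate it away.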

Failure mode 2: silent failure cascades

The harder problem is what happens when an agent makes a mistake.

In a swarm, the researcher returns a hallucinated citation. The writer incorporates it into the draft. The reviewer reads the draft and checks it for style, because that is the reviewer’s specialty. The hallucination ships.

This isn’t a contrived case. It is the most common failure in researcher-writer agent demos. There are two reasons:

Each agent trusts the natural-language outputs of the others. A structured return from a sub-call can be validated by code. A prose paragraph from a “researcher” agent cannot.
No agent has a wide enough view to catch the error. The manager sees the conversation summary. The reviewer sees the draft. Nobody sees the original facts plus the final claim, side by side, with enough specificity to spot the drift.

Orchestrator-workers structurally protects against this. The orchestrator keeps every sub-call’s structured output in its state. When it composes the final answer, it has access to the raw research result and the draft and the review. A consistency check at this point — deterministic code or one focused LLM call — catches the hallucination before it ships. This is not a clever trick; it falls out of having one node that holds the full picture.
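A deterministic version of that check can be very small. The sketch below assumes the research worker returns sources with ids and the draft cites them with markers such as `[s1]`. The marker format, the `Source` type, and `unsupported_citations` are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    source_id: str
    url: str


# Assumed citation marker format in the draft, e.g. "[s1]".
CITATION = re.compile(r"\[(s\d+)\]")


def unsupported_citations(sources: list[Source], draft_text: str) -> list[str]:
    """Return ids cited in the draft that the research step never returned."""
    known = {s.source_id for s in sources}
    cited = CITATION.findall(draft_text)
    return sorted({c for c in cited if c not in known})


sources = [
    Source("s1", "https://example.com/a"),
    Source("s2", "https://example.com/b"),
]
draft = "Claim one [s1]. Claim two [s9]."

missing = unsupported_citations(sources, draft)
if missing:
    print(f"blocking release, unsupported citations: {missing}")
```

This only catches drift between the research result and the draft. A source that the research worker itself fabricated would need a separate check, such as resolving the URL, which is not shown here.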

Failure mode 3: debugging is impossible

You ship a swarm. A user reports that the output was wrong. You open the logs. There are eleven panels of LLM-generated dialogue, full of exchanges like:

Manager: Researcher, I think the previous answer was a bit too focused on the technical side. Can you reconsider with the audience in mind?

Researcher: Good point, let me revise…

You read these for forty minutes. You still cannot tell where the output went wrong. The “decisions” are not localised — they emerge from many turns of agent dialogue, and any one of those turns is a plausible suspect. The natural-language exchanges look like they explain things, but they don’t — they’re post-hoc rationalisations the model produced because it was prompted to “explain your reasoning.”

Cognition has been particularly direct about this:

Whenever an agent does something, it makes a decision based on its context. When the agent’s context is different from the other agents, they get out of sync. When agents are out of sync, they make decisions that conflict.

The orchestrator-workers shape produces logs that look like a real state machine. The orchestrator at step 4 decided to call Worker B with these inputs. Worker B returned this structured result. The orchestrator at step 5 decided to ignore field X. Each decision is at one place, with the inputs available. Debugging looks like reading a trace, not reading a conversation.
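As a rough illustration of what such a trace can look like in code, here is a sketch that records each decision as a structured event and writes it out as JSON lines. The `TraceEvent` fields and the sample values are invented for the example.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any


@dataclass(frozen=True)
class TraceEvent:
    step: int
    actor: str               # "orchestrator" or a worker name
    action: str              # e.g. "dispatch", "return", "decide"
    inputs: dict[str, Any]   # what the actor saw
    output: dict[str, Any]   # what it produced


def to_jsonl(events: list[TraceEvent]) -> str:
    """Serialise the trace as one JSON object per line."""
    return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in events)


def decisions_at(events: list[TraceEvent], step: int) -> list[TraceEvent]:
    """Return the orchestrator's own events for one step."""
    return [e for e in events if e.step == step and e.actor == "orchestrator"]


trace = [
    TraceEvent(4, "orchestrator", "dispatch", {"worker": "worker_b", "query": "pricing"}, {}),
    TraceEvent(4, "worker_b", "return", {"query": "pricing"}, {"price": 12, "field_x": "stale"}),
    TraceEvent(5, "orchestrator", "decide", {"ignored_fields": ["field_x"]}, {"next": "compose"}),
]

print(to_jsonl(trace))
for event in decisions_at(trace, 5):
    print(event.action, event.inputs)
```

Answering "why was field X ignored?" becomes a lookup on step and actor rather than a reread of the conversation.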

Failure mode 4: natural language is a bad inter-process protocol

If you squint, a multi-agent system is a distributed system, and distributed systems have been studied for fifty years. The whole field learned, painfully, that:

Communication protocols need to be specified.
Errors need explicit codes, not freeform descriptions.
Schema mismatches need to fail loudly, not be papered over.
“Tell me again, but better” is not a transport mechanism.

Swarms forget all of this. The “API” between agents is whatever prose the previous agent decided to write. The “schema” is “the next agent will figure it out.” Errors are described in English (“I couldn’t find the answer”), which the next agent then re-interprets, possibly inventing details. There is no version field. There is no contract.

Anysphere’s posts on Cursor’s architecture keep coming back to this point. When they want one model to convey an edit to another model, they don’t use prose — they use a structured edit hint, well-defined enough that the second model can fail loudly if it doesn’t parse. That’s the difference between a system that ships and a system that doesn’t.
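To make "fail loudly" concrete, here is a minimal sketch of a worker reply contract with a version field and explicit status codes. It is not Anysphere's edit-hint format, which is not specified here. `WorkerReply`, `Status`, `ContractError`, and the field names are assumptions chosen for illustration.

```python
import json
from dataclasses import dataclass
from enum import Enum

SCHEMA_VERSION = 1


class Status(str, Enum):
    OK = "ok"
    NOT_FOUND = "not_found"
    TOOL_ERROR = "tool_error"


class ContractError(ValueError):
    """Raised when a reply does not match the agreed schema."""


@dataclass(frozen=True)
class WorkerReply:
    version: int
    status: Status
    findings: tuple[str, ...]


def parse_reply(raw: str) -> WorkerReply:
    """Parse a worker reply, rejecting anything that is off-contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ContractError(f"reply is not JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ContractError("reply must be a JSON object")
    if data.get("version") != SCHEMA_VERSION:
        raise ContractError(f"unsupported version: {data.get('version')!r}")
    try:
        status = Status(data.get("status"))
    except ValueError as exc:
        raise ContractError(f"unknown status: {data.get('status')!r}") from exc
    findings = data.get("findings")
    if not isinstance(findings, list) or not all(isinstance(f, str) for f in findings):
        raise ContractError("findings must be a list of strings")
    return WorkerReply(SCHEMA_VERSION, status, tuple(findings))


reply = parse_reply('{"version": 1, "status": "not_found", "findings": []}')
print(reply.status.value)

try:
    parse_reply("I couldn't find the answer")
except ContractError as exc:
    print(f"rejected: {exc}")
```

A prose apology is rejected at the boundary, so the next step never gets the chance to reinterpret it.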

“But multi-agent works for me”

You will find counterexamples. Hugging Face’s agentic research demos, some academic multi-agent benchmarks, and a handful of internal research tools at frontier labs all use multi-agent shapes. They work in those settings for specific reasons:

The task is narrow and the protocol is hard-coded. Anthropic’s research agents pass tasks to each other in strictly structured JSON, with validated schemas. That is not “agents in conversation” — that is orchestrator-workers with multiple levels.
The tolerance for cost is high and the latency budget is generous. Research benchmarks can afford to burn 50,000 tokens per task. Your product probably can’t.
The team understands they’re not building a product. Multi-agent research environments are for research. They are very useful in that role.

The trap is borrowing the shape into a customer-facing product without the supporting machinery. Schema discipline, cost ceilings, hard recovery paths — these are not optional in production, and they are exactly what the multi-agent framing leaves implicit.

Figure: An incomplete but representative scorecard. The pattern that has shipped is the same pattern across companies, frameworks, and domains.

What good agent design borrows instead

If you were starting a project today and wanted to avoid ending up with a multi-agent swarm that doesn’t ship, these contrast points are useful:

One LLM holds the plan. Other LLMs do work. Workers do not chat with each other.
Sub-call returns are structured. JSON, validated. If the model returns prose, the orchestrator’s parser fails loudly and either retries or escalates.
Cost is bounded by the plan, not by a conversation. Plans have a step budget. Re-plans count against the budget.
State is checkpointed in code, not in chat history. The plan and the working state live in your database, SQLite file, or KV store. They are not buried in a transcript.
Failures are caught by deterministic checks first. Other LLMs only enter the loop to recover from failures, not to second-guess every step.
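Of the points above, checkpointing is the one that benefits most from seeing the shape in code. A minimal sketch using Python's standard `sqlite3` module follows. The table layout and the `open_store`, `save_checkpoint`, and `load_checkpoint` helpers are assumptions for illustration, not a prescribed schema.

```python
import json
import sqlite3
from typing import Any, Optional


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the checkpoint store."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints ("
        " task_id TEXT PRIMARY KEY,"
        " step INTEGER NOT NULL,"
        " state TEXT NOT NULL)"
    )
    return conn


def save_checkpoint(conn: sqlite3.Connection, task_id: str, step: int, state: dict[str, Any]) -> None:
    """Overwrite the latest checkpoint for a task inside one transaction."""
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO checkpoints (task_id, step, state) VALUES (?, ?, ?)",
            (task_id, step, json.dumps(state)),
        )


def load_checkpoint(conn: sqlite3.Connection, task_id: str) -> Optional[tuple[int, dict[str, Any]]]:
    """Return (step, state) for a task, or None if it was never checkpointed."""
    row = conn.execute(
        "SELECT step, state FROM checkpoints WHERE task_id = ?", (task_id,)
    ).fetchone()
    if row is None:
        return None
    return row[0], json.loads(row[1])


conn = open_store()
save_checkpoint(conn, "task-1", 2, {"plan": ["research", "draft"], "done": ["research"]})
print(load_checkpoint(conn, "task-1"))
```

After a crash, the orchestrator resumes from the stored step and state instead of replaying a transcript and hoping the model reaches the same decisions again.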

These constraints look like a step back from “fully autonomous AI agents.” They are also exactly the constraints that the shipped products have arrived at, independently, from many directions.

What to take away

The shape of a system matters more than the model behind it. A weak model in a well-shaped pipeline routinely beats a strong model in a swarm.
Peer-to-peer agent conversations are a research idea, not a production pattern. They cost too much, fail silently, and resist debugging.
Orchestrator-workers gives you everything the swarm pitched, with protocols. One planner. Many workers. Structured returns. Checked consistency. Bounded loops.
If a framework’s first example is “give your agent a persona,” treat that as a warning sign. Personas are decorations on top of whatever the real architecture is. If the real architecture is a swarm, the personas don’t save it.

The multi-agent dream is not wrong, exactly — it is just early. Once we have agents that can communicate over real protocols, validate each other’s outputs structurally, and run inside hard cost and time bounds, swarms will probably make sense. We are not there. In 2026, the way to ship is the boring way: one planner, structured workers, hard ceilings, and an honest eval loop telling you when you have permission to add another moving part.

Further reading: Cognition’s “Don’t Build Multi-Agents” is the most useful primary source on this. Anthropic’s “Building Effective Agents” covers orchestrator-workers as a first-class pattern. LangChain’s post on when to use multi-agent systems is honest about the trade-offs, coming from a framework that supports both.