datarekha
Patterns May 25, 2026

The five patterns that ship real agents in 2026

Anthropic’s December 2024 paper became the production reference for agent design. Roughly eighteen months on, here’s which patterns actually survived contact with shipped products, and why the simple ones won.

12 min read · by datarekha · agents · anthropic · orchestration · production

When Anthropic published Building Effective Agents in December 2024, two things happened almost immediately. First, every AI Twitter account quoted the line about preferring “simple, composable patterns” over fancy multi-agent frameworks. Second, almost no one actually adopted it that way — the next eighteen months saw a Cambrian explosion of multi-agent “swarm” libraries promising autonomous agent teams.

By mid-2026, the receipts are in. The patterns from that paper are the ones running in production at Cursor, Replit, Devin, Anthropic’s own product surface, and most enterprise deployments. The swarms are still mostly demos.

This post walks through the five patterns, what each one actually solves, and the production behaviour that distinguishes them from the toy versions you see in every tutorial.

Why these five, and not the dozens of others

A useful definition first. In the Anthropic paper, an agent is “a system where an LLM dynamically directs its own processes and tool usage.” That’s the high bar — model decides what to do next. A workflow is “a system where LLMs and tools are orchestrated through predefined code paths.” That’s the lower bar — code decides what to do next, the LLM fills in the blanks.
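
The difference is easiest to see side by side. This is a minimal sketch, not anything from the paper: `LLM` is a hypothetical stand-in for a model call, and the `search:` / `finish:` action format is made up for illustration.

```python
from typing import Callable

# Hypothetical stand-in for a model call: takes a prompt, returns text.
LLM = Callable[[str], str]


def run_workflow(llm: LLM, ticket: str) -> str:
    """Workflow: the code fixes the path, the LLM fills in each step."""
    summary = llm(f"Summarise: {ticket}")
    return llm(f"Draft a reply to: {summary}")


def run_agent(
    llm: LLM,
    tools: dict[str, Callable[[str], str]],
    task: str,
    max_steps: int = 5,
) -> str:
    """Agent: the model picks the next action on every iteration."""
    context = task
    for _ in range(max_steps):
        # Assumed reply format: "<tool>: <argument>" or "finish: <answer>".
        decision = llm(f"Next action for: {context}")
        action, _, arg = decision.partition(":")
        action, arg = action.strip(), arg.strip()
        if action == "finish":
            return arg
        if action not in tools:
            context += f"\n[unknown tool {action!r}]"
            continue
        context += f"\n[{action}] {tools[action](arg)}"
    return "stopped: step budget exhausted"
```

In `run_workflow` you can read the control flow off the page. In `run_agent` you can't, because the model writes it at runtime.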

Most “agents” you’ll meet in production are actually workflows. That’s not a bug, that’s the point. The five patterns below are arranged roughly in order of how much control they hand to the model:

CONTROL HANDED TO THE MODEL →  (left to right)

  Pattern 1   Augmented LLM             single call, tools + retrieval
  Pattern 2   Prompt chain              fixed steps, with gates
  Pattern 3   Routing                   classify, dispatch to specialist
  Pattern 4   Parallel + aggregate      fan-out subtasks, vote / merge
  Pattern 5   Orchestrator + workers    LLM plans tasks, re-plans on results

The five patterns, ordered by how much autonomy the LLM gets. Everything to the left of pattern 5 is a workflow; pattern 5 is the only one most people would call “an agent.”

The trap most teams fall into: they want to start at pattern 5 because it sounds most impressive. The teams that ship reliably start at pattern 1 and add complexity only when the simpler pattern measurably fails.

Pattern 1 — The augmented LLM

This is one model call with three augmentations available to it: tools, retrieval, and memory. No loop. No follow-up. One in, one out.

        user query
              │
              ▼
         ┌──────────┐    function calls
         │   LLM    │ ───────────────► tools (search, calc, sql, ...)
         │  + tools │ ◄───────────────  results
         │  + RAG   │
         │  + mem   │ ◄──────► retrieval
         └──────────┘
              │
              ▼
         final answer

If you’re reading this and thinking “that’s just a chatbot with tool use”, then yes, exactly. The augmented LLM is the foundation of everything else. The production lesson is that most user-facing AI features don’t need anything fancier than this. Pattern 1 powers most of Claude’s product surface, ChatGPT’s standard chat, and the “ask anything” feature you’ve shipped in your SaaS dashboard.

The two operational habits that distinguish a good augmented LLM from a sketchy one:

  1. Each tool has a clear, single-purpose interface. search_docs(query: str, top_k: int = 5) -> list[Doc] is good. do_stuff(input: dict) -> Any is what your eval set will hate.
  2. The model knows when to not call a tool. This is overwhelmingly an instruction-tuning and system-prompt problem, not a framework problem. If your model calls web_search for “what’s 2+2”, your system prompt is failing it.
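
Here is roughly what both habits look like in code. Treat it as a sketch under assumptions: the `Doc` fields, the in-memory `_CORPUS`, and the shape of `SEARCH_DOCS_TOOL` are all hypothetical, and a real tool description would follow whatever schema your model provider expects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    text: str


# Hypothetical in-memory corpus standing in for a real search index.
_CORPUS = [
    Doc("a1", "Refunds are processed within five business days."),
    Doc("a2", "Password resets expire after one hour."),
    Doc("a3", "Refund requests need an order number."),
]


def search_docs(query: str, top_k: int = 5) -> list[Doc]:
    """Return up to top_k docs containing every query term (case-insensitive)."""
    terms = query.lower().split()
    hits = [d for d in _CORPUS if all(t in d.text.lower() for t in terms)]
    return hits[:top_k]


# Illustrative tool description handed to the model. The description text is
# where you tell the model when NOT to reach for the tool.
SEARCH_DOCS_TOOL = {
    "name": "search_docs",
    "description": (
        "Search the help-centre documents. "
        "Do not use for arithmetic or general knowledge."
    ),
    "parameters": {"query": "str", "top_k": "int, default 5"},
}
```

One function, one job, typed inputs and outputs. The eval set can check `search_docs` on its own, with no model in the loop.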

Pattern 2 — Prompt chaining

A fixed sequence of LLM calls where each step’s output feeds the next, with optional gate checks between steps.

  step 1          step 2           step 3
 ┌──────┐  out   ┌──────┐  out    ┌──────┐
 │ LLM  │ ─────► │ LLM  │ ──────► │ LLM  │ ──► final
 └──────┘  ▲     └──────┘  ▲      └──────┘
           │ gate          │ gate
           │ check         │ check
       (assert format)  (assert content)

The canonical example: outline → draft → polish a piece of writing. Each step’s prompt is purpose-built. Between steps, deterministic code asserts the output is well-formed (right JSON shape, contains required fields, hasn’t hit a refusal). On gate failure, you either re-prompt with the violation explained, or fail fast.
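
A minimal sketch of that outline → draft → polish chain, with a single gate on the first step. The `LLM` callable, the prompts, and the `sections` JSON shape are assumptions for illustration, not a prescribed format.

```python
import json
from typing import Callable, TypeVar

LLM = Callable[[str], str]  # hypothetical model call
T = TypeVar("T")


class GateError(ValueError):
    """Raised when a step's output fails its deterministic check."""


def outline_gate(raw: str) -> list[str]:
    """Assert the outline is a JSON object with a non-empty 'sections' list."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise GateError(f"not valid JSON: {exc.msg}") from exc
    sections = data.get("sections") if isinstance(data, dict) else None
    if not isinstance(sections, list) or not sections:
        raise GateError("missing or empty 'sections' list")
    return [str(s) for s in sections]


def run_step(llm: LLM, prompt: str, gate: Callable[[str], T], max_retries: int = 1) -> T:
    """Call the model, gate the output, re-prompt with the violation, then fail fast."""
    attempt_prompt = prompt
    for attempt in range(max_retries + 1):
        raw = llm(attempt_prompt)
        try:
            return gate(raw)
        except GateError as err:
            if attempt == max_retries:
                raise  # fail fast once retries are used up
            attempt_prompt = f"{prompt}\n\nYour previous output was rejected: {err}. Try again."
    raise AssertionError("unreachable")


def write_article(llm: LLM, topic: str) -> str:
    sections = run_step(
        llm, f"Outline an article on {topic} as JSON with a 'sections' list.", outline_gate
    )
    draft = llm("Draft the article with these sections: " + ", ".join(sections))
    return llm("Polish this draft:\n" + draft)
```

The gate is plain code, so it is cheap, deterministic, and testable without a model.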

When prompt chaining wins: when the task decomposes cleanly and the quality gain from focused prompts beats the latency cost of multiple serial calls. Notion’s AI summarisation features, GitHub’s issue triaging, and most “AI writing assistant” features are some form of prompt chain.

When it loses: when the task isn’t really decomposable and you’re just splitting one prompt across three calls. That’s three latencies for no quality gain.

Pattern 3 — Routing

A classifier (often a small cheap model, often not an LLM at all) picks which of N specialist handlers should answer.

  incoming query ──► Router (classifier / small LLM)
                       ├──► small / fast model     simple queries
                       ├──► code-tuned model       programming
                       └──► large model + RAG      hard / multi-hop

Routing is how production systems get away with using a small cheap model 80% of the time and a frontier model only when the query justifies the cost.

Anthropic’s own customer-facing product is a routing system: depending on the query, the same Claude conversation may use Haiku for fast simple turns and Sonnet/Opus for hard reasoning. The router itself runs in milliseconds.

The production lesson is that the router decision must be visible in your logs and easily overridable by a human operator. The day you find a class of queries being misrouted, you want a config change, not a model retrain.
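
One way to get both properties, sketched with a toy keyword classifier. The handler names, the regex rules, and the `OVERRIDES` list are hypothetical. In a real system the overrides would live in config and the classifier might be a small model.

```python
import logging
import re

logger = logging.getLogger("router")

# Operator-editable overrides, checked before the classifier runs.
# Patterns and handler names are hypothetical.
OVERRIDES: list[tuple[str, str]] = [
    (r"\binvoice\b", "large_rag"),
]


def classify(query: str) -> str:
    """Toy keyword classifier standing in for a small, cheap model."""
    if re.search(r"\b(def|class|traceback|compile)\b", query, re.IGNORECASE):
        return "code"
    if len(query.split()) > 30:
        return "large_rag"
    return "small_fast"


def route(query: str) -> str:
    """Pick a handler and log why, so misroutes can be traced and patched."""
    for pattern, handler in OVERRIDES:
        if re.search(pattern, query, re.IGNORECASE):
            logger.info("route=%s reason=override pattern=%r", handler, pattern)
            return handler
    handler = classify(query)
    logger.info("route=%s reason=classifier", handler)
    return handler
```

Fixing a misrouted class of queries is then one new line in `OVERRIDES`, and the log line tells you whether the override or the classifier made the call.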

Pattern 4 — Parallelization + aggregation

Run N copies of (or N variants of) the same step in parallel, then aggregate. Two main flavours:

  • Sectioning — split a task into independent subtasks (summarize_each_section, translate_each_paragraph), run them concurrently, concatenate.
  • Voting / sampling — run the same prompt N times with different temperatures or different system prompts, then merge or majority-vote the results. Used heavily for safety filters (“does any of N reviewers flag this?”) and for math/code where verification is cheap.
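
The aggregation step for the voting flavour is small enough to sketch in full. The `sample` callable is a hypothetical stand-in for "run the prompt once with variation i"; the reviewers are any functions that return a flag.

```python
from collections import Counter
from typing import Callable


def majority_vote(sample: Callable[[int], str], n: int = 5) -> tuple[str, float]:
    """Call sample(i) n times; return the most common answer and its vote share."""
    answers = [sample(i).strip() for i in range(n)]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / n


def any_flags(reviewers: list[Callable[[str], bool]], text: str) -> bool:
    """Safety-filter style aggregation: block if any reviewer flags the text."""
    return any(review(text) for review in reviewers)
```

The vote share is worth keeping around. A 3-of-5 win and a 5-of-5 win are different signals, and a low share is a reasonable trigger for escalating to a stronger model or a human.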

The classic production example is code review agents. Run three reviewers in parallel, each with a different lens (“security”, “performance”, “API contract changes”), then merge the issues. The latency is governed by the slowest reviewer, not the sum.
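
A sketch of the three-reviewer fan-out using `asyncio.gather`. The `review` coroutine is a hypothetical stand-in: the `asyncio.sleep` simulates model latency, and the per-lens delays are made up.

```python
import asyncio


async def review(lens: str, diff: str, delay: float) -> list[dict]:
    """Hypothetical reviewer; in practice an async model call with a lens-specific prompt."""
    await asyncio.sleep(delay)  # stands in for model latency
    return [{"lens": lens, "issue": f"{lens} concern in {len(diff.splitlines())}-line diff"}]


async def review_in_parallel(diff: str) -> list[dict]:
    """Run all reviewers concurrently and merge their issues into one list."""
    lenses = {"security": 0.03, "performance": 0.01, "api-contract": 0.02}
    results = await asyncio.gather(
        *(review(lens, diff, delay) for lens, delay in lenses.items())
    )
    # gather returns results in input order, so the merge is deterministic
    # even though the reviewers finish in a different order.
    return [issue for issues in results for issue in issues]
```

Calling `asyncio.run(review_in_parallel(diff))` waits about as long as the slowest simulated reviewer, not the sum of all three.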

Pattern 5 — Orchestrator + workers

This is the pattern most people mean when they say “agent.” A central LLM plans, dispatches subtasks to worker calls (which may themselves be patterns 1–4), reviews the outputs, and re-plans. The orchestrator decides the next step at every iteration.

  Orchestrator: plans → dispatches → reviews → re-plans
    plan: ordered task list (JSON, stored as state)
       ├──► Worker 1   e.g. RAG search (pattern 1)
       ├──► Worker 2   e.g. file-edit chain (pattern 2)
       └──► Worker 3   e.g. parallel reviews (pattern 4)
    results ──► re-plan

The orchestrator-workers loop. Workers themselves can be implementations of patterns 1–4 — the recursion is what makes the whole thing scale.

Cursor’s Composer is this pattern. Replit’s agent is this pattern. Devin is this pattern but with a longer horizon (it can run for hours). The defining production characteristics are:

  • State is persisted between turns. If the orchestrator crashes, the next process can resume from the last checkpoint. Cursor checkpoints to a local SQLite database; Devin checkpoints to its own cloud storage.
  • The plan is inspectable. Users see what the agent is about to do, and (in good products) can edit the plan before it runs. This is the difference between “agent” and “agent you can trust.”
  • Worker outputs always come back to the orchestrator as structured data. Markdown blobs are debugging hell at scale; structured outputs are what makes the orchestrator’s “re-plan” step tractable.
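
A minimal sketch of the loop with all three characteristics. This is not how any of the products above implement it. The `runs` table, the task dict shape, and `plan_fn` (which would be an LLM call in a real system) are assumptions for illustration.

```python
import json
import sqlite3
from typing import Callable

Worker = Callable[[dict], dict]            # task dict in, structured result out
Planner = Callable[[str, list], list]      # (goal, results so far) -> remaining tasks


def load_state(db: sqlite3.Connection, run_id: str) -> dict:
    """Load the last checkpoint for run_id, or a fresh state if none exists."""
    db.execute("CREATE TABLE IF NOT EXISTS runs (run_id TEXT PRIMARY KEY, state TEXT)")
    row = db.execute("SELECT state FROM runs WHERE run_id = ?", (run_id,)).fetchone()
    return json.loads(row[0]) if row else {"plan": None, "results": []}


def save_state(db: sqlite3.Connection, run_id: str, state: dict) -> None:
    """Checkpoint the plan and results as JSON so they stay inspectable."""
    db.execute(
        "INSERT OR REPLACE INTO runs (run_id, state) VALUES (?, ?)",
        (run_id, json.dumps(state)),
    )
    db.commit()


def run(db: sqlite3.Connection, run_id: str, goal: str, plan_fn: Planner,
        workers: dict[str, Worker], max_steps: int = 10) -> dict:
    state = load_state(db, run_id)          # resumes here after a crash
    if state["plan"] is None:
        state["plan"] = plan_fn(goal, [])
        save_state(db, run_id, state)
    for _ in range(max_steps):
        if not state["plan"]:
            break                            # nothing left to do
        task = state["plan"][0]
        result = workers[task["worker"]](task)
        state["results"].append(result)
        state["plan"] = plan_fn(goal, state["results"])  # re-plan on results
        save_state(db, run_id, state)        # checkpoint after every step
    return state
```

Because the plan is plain JSON in a table, showing it to the user or letting them edit it before the next step is a read and a write, not a new subsystem.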

What about multi-agent swarms?

You may have noticed I haven’t mentioned AutoGen-style “agents talking to each other” or CrewAI-style “agent team with manager, researcher, writer.” That’s deliberate. Most of those systems are pattern 5 under the hood — one orchestrator dispatching tasks to specialists — but framed as autonomous peers. The framing matters because it changes how the user trusts the system.

The autonomous-peer framing is worse for production because:

  1. Cost is unbounded. Two agents going back and forth can rack up thousands of tokens before producing a result. Pattern 5 has an explicit plan, which is an explicit bound.
  2. Failures cascade silently. When the “researcher” agent hallucinates a citation and passes it to the “writer” agent, the writer has no way to catch it. Pattern 5’s structured return + orchestrator review does.
  3. Debugging is impossible. You can’t replay a multi-agent conversation in your head. You can replay an orchestrator’s plan.
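
The cost point is the easiest to make concrete. A sketch, assuming each planned task carries a hypothetical `max_tokens` field and that you count tokens yourself after each call:

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when a run goes past its step or token ceiling."""


@dataclass
class Budget:
    max_steps: int
    max_tokens: int
    steps: int = 0
    tokens: int = 0

    def charge(self, tokens: int) -> None:
        """Record one step and its token cost; stop the run if either cap is passed."""
        self.steps += 1
        self.tokens += tokens
        if self.steps > self.max_steps or self.tokens > self.max_tokens:
            raise BudgetExceeded(
                f"steps={self.steps}/{self.max_steps} tokens={self.tokens}/{self.max_tokens}"
            )


def worst_case_tokens(plan: list[dict]) -> int:
    """With an explicit plan, the spend ceiling is known before anything runs."""
    return sum(task["max_tokens"] for task in plan)
```

A free-running conversation between peers has no equivalent of `worst_case_tokens`, because there is no plan to sum over.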

The teams that have shipped real multi-agent systems (Anthropic’s research agents, Google DeepMind’s Project Astra) all use them in narrow domains with hard-coded protocols between agents. Not “agents collaborate naturally.”

Picking your pattern

A working flowchart, derived from watching dozens of agent projects ship or not ship over the last 18 months:

  1. Can one prompt with tools + retrieval solve it?
       yes → Pattern 1: augmented LLM          no → next question
  2. Does it decompose into a known step sequence?
       yes → Pattern 2: prompt chain           no → next question
  3. Are there distinct query classes with different handlers?
       yes → Pattern 3: routing                no → next question
  4. Can the work be done in parallel independent units?
       yes → Pattern 4: parallelize            no → Pattern 5: orchestrator

A decision tree. The “right” pattern is the leftmost one that works. Add complexity only when measurement says you must.
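
The same tree as a function, mostly to show that order matters: the first question that gets a "yes" wins. `TaskProfile` and its field names are hypothetical labels for the four questions.

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    single_prompt_suffices: bool
    known_step_sequence: bool
    distinct_query_classes: bool
    independent_parallel_units: bool


def pick_pattern(p: TaskProfile) -> str:
    """Walk the questions in order and return the first pattern that fits."""
    if p.single_prompt_suffices:
        return "pattern 1: augmented LLM"
    if p.known_step_sequence:
        return "pattern 2: prompt chain"
    if p.distinct_query_classes:
        return "pattern 3: routing"
    if p.independent_parallel_units:
        return "pattern 4: parallelize"
    return "pattern 5: orchestrator"
```

Note that a task that could be chained and could be parallelized still comes back as pattern 2. The simpler answer is checked first.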

What to take away

Eighteen months of production agent work, compressed into three lines:

  • Start at pattern 1. Stay there as long as it works. Most user-facing AI features never need anything else.
  • When you do escalate, escalate one step at a time, with evals. The teams shipping agents in production hold a hard rule: no new complexity without a measured win on the eval set.
  • Pattern 5 is where the magic can live, but it’s also where most teams burn their compute budget on demos that won’t ship. Reach for it last, not first.
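
The “no new complexity without a measured win” rule fits in a few lines. A sketch only: the per-case scores and the `min_gain` threshold of 0.02 are assumptions, and a real gate would also want to look at variance, latency, and cost.

```python
from statistics import mean


def should_escalate(
    baseline: list[float], candidate: list[float], min_gain: float = 0.02
) -> bool:
    """Adopt the more complex pattern only if it beats the baseline by min_gain.

    Both lists hold per-case scores on the same eval set, in the same order.
    """
    if len(baseline) != len(candidate) or not baseline:
        raise ValueError("need per-case scores for the same non-empty eval set")
    return mean(candidate) - mean(baseline) >= min_gain
```

If the check comes back `False`, the simpler pattern stays, however good the demo of the fancier one looked.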

The original Anthropic paper still rewards a careful re-read. The mid-2026 update would be: the simple patterns won, the swarms didn’t, and the teams that internalised “start small, measure, then escalate” are the ones who actually have agents in front of users today.


Further reading: Anthropic’s Building Effective Agents (December 2024) is the original. The LangGraph tutorial on agent patterns is the best implementation reference. For the most thorough comparison of frameworks, see Microsoft’s Agent Framework comparison post.
