Devin's architecture, anatomised
Cognition's launch demo promised a 'fully autonomous software engineer.' The product that actually ships is more interesting — a planner-executor split running in a sandboxed VM, with explicit memory windows, hard pruning rules, and a shockingly disciplined view of where the model is allowed to be creative.
The Devin launch demo, in March 2024, looked like a magic trick. A natural-language ticket went in; a working pull request came out twelve minutes later. The agent had browsed StackOverflow, read API docs, written code, run tests, debugged a failure, and shipped. The internet’s response was roughly evenly split between “this is the future” and “this is a Potemkin village.” Both reactions, two years on, look slightly off.
What Cognition shipped, and what still underpins Devin and its SWE-bench Verified results, is much more disciplined than the demo suggested. The product is not a magic trick. It is a careful answer to one question: what does it take to run an LLM agent for hours without it drifting into nonsense?
The answer is “more constraint, not more autonomy.” This post takes Devin apart and looks at what the constraints actually are.
The shape of a long-horizon agent
Most production “agents” run for under a minute. A Cursor Composer call, a Replit fix-it loop, a Claude tool-use turn — each completes in seconds, maybe a minute at the outside. The work they do has a tight feedback loop: small step, observable result, immediate adjustment.
Devin runs for hours. That single fact reshapes everything about its architecture. When an agent has to operate over a 200-step trajectory, problems that don’t exist at 5 steps become dominant:
- Context overflow. A naive “append every tool result to the conversation” approach runs out of context in 20 minutes.
- Decision drift. At each step, the model makes a small choice based on a slightly fuzzier recollection of the original goal. By step 80, the agent is solving a problem its earlier self would not recognise.
- Compounding errors. A wrong call at step 30 corrupts the inputs to steps 31..200. Without an explicit recovery mechanism, the trajectory becomes unsalvageable.
- Unbounded cost. Two hours of an agentic loop is a lot of tokens. At frontier model prices, an unhinged Devin run is an expensive mistake to make twice.
Cognition’s posts on the agentic stack make these constraints explicit. The architecture below is what falls out when you take them seriously.
The planner-executor split
The most important architectural choice in Devin — the one most copy-cat projects skip — is that planning and execution are separate LLM calls with different prompts, different context budgets, and different models.
The planner sees the ticket, the current plan (if any), and the latest status report from the executor. It produces or revises a structured plan: ordered steps, each with a goal and a success criterion. The plan is stored as JSON in the agent’s persistent state, not as freeform text in a conversation. This means:
- The plan is inspectable in the UI. Users see exactly what the agent intends to do.
- The plan is editable in the UI. Users can correct, reorder, or remove steps before the executor touches them.
- The plan survives crashes. If the executor dies or the run is paused, the next invocation resumes from the same JSON.
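To make the "plan as structured state" idea concrete, here is a minimal sketch of a plan that round-trips through JSON. The `Plan` and `Step` classes, their field names, and the status values are assumptions for illustration; Cognition has not published Devin's actual schema.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class Step:
    goal: str
    success_criterion: str
    status: str = "pending"  # hypothetical values: pending | done | failed


@dataclass
class Plan:
    ticket: str
    steps: List[Step] = field(default_factory=list)

    def to_json(self) -> str:
        # Persisted form: plain JSON, so a UI can render and edit it.
        return json.dumps(asdict(self), indent=2)

    @classmethod
    def from_json(cls, raw: str) -> "Plan":
        data = json.loads(raw)
        return cls(ticket=data["ticket"], steps=[Step(**s) for s in data["steps"]])

    def next_step(self) -> Optional[Step]:
        # The executor only ever needs the first step that is still pending.
        for step in self.steps:
            if step.status == "pending":
                return step
        return None


plan = Plan(
    ticket="Add pagination to /users",
    steps=[
        Step("Add limit/offset params", "unit tests pass", status="done"),
        Step("Update the README", "README documents the new params"),
    ],
)

# Simulate a crash: only the serialized JSON survives the process.
saved = plan.to_json()
resumed = Plan.from_json(saved)
```

Because the resumed plan is rebuilt purely from the saved JSON, a user editing that JSON between runs (reordering or deleting steps) changes what the executor does next without touching any conversation history.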
The executor sees the current step of the plan, the relevant slice of working memory, and the available tools. It does not re-derive the plan. Its only outputs are a tool call, a sub-result, or a “step done” / “step failed” status that goes back to the planner.
This is pattern 5 from Anthropic’s agents paper, executed with discipline. The reason it works for long horizons is that the planning load is bounded. The planner is invoked every few executor steps, not every turn. When the planner does run, it gets a clean compressed summary — “step 17 completed; tests pass; next step is to update the README” — not the full history.
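The control flow described here can be sketched as a small loop. This is an illustration under stated assumptions, not Devin's implementation: `run_agent`, the event dictionaries, and the stub planner and executor are all hypothetical, and the sketch assumes the planner wakes only when the executor reports a step status.

```python
from typing import Callable, Dict, List

Event = Dict[str, str]


def run_agent(
    steps: List[str],
    executor: Callable[[str, int], Event],
    planner: Callable[[List[str], Event], List[str]],
    max_turns: int = 100,
) -> Dict[str, int]:
    """Drive the executor turn by turn; wake the planner only on status events."""
    counts = {"executor_calls": 0, "planner_calls": 0}
    turn_in_step = 0
    while steps and counts["executor_calls"] < max_turns:
        event = executor(steps[0], turn_in_step)
        counts["executor_calls"] += 1
        turn_in_step += 1
        if event["kind"] == "tool_call":
            continue  # ordinary turn: the planner is not consulted
        # "step_done" or "step_failed": the planner sees a short status report only.
        steps = planner(steps, event)
        counts["planner_calls"] += 1
        turn_in_step = 0
    return counts


def stub_executor(step: str, turn: int) -> Event:
    # Pretend every step takes three tool calls, then reports completion.
    if turn < 3:
        return {"kind": "tool_call", "detail": f"working on {step!r}"}
    return {"kind": "step_done", "summary": f"{step}: completed"}


def stub_planner(steps: List[str], event: Event) -> List[str]:
    # A real planner would be an LLM call that may rewrite the remaining steps.
    return steps[1:] if event["kind"] == "step_done" else steps


counts = run_agent(
    ["write code", "run tests", "update README"], stub_executor, stub_planner
)
```

With these stubs the executor runs four times per step while the planner runs once per step, which is the sense in which the planning load stays bounded as the trajectory grows.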
The sandbox: shell + browser, nothing else
Devin’s agent lives in a sandboxed VM. Two tool surfaces, in practice, are enough: a shell and a browser.
The shell is what makes it a software engineer. It can cd, ls, git, run npm install, run pytest, write files, read files. Anything a human engineer types into a terminal, Devin can type. This is a much narrower interface than “a thousand custom tools”: the shell is the universal interface for software work, and the model already knows it deeply from pretraining.
The browser handles the rest of the world. Docs sites, JIRA, GitHub’s web UI, Stripe dashboards. The browser interface uses DOM observations plus screenshots — Devin sees the page roughly the way a low-bandwidth human would. Cognition has been open about how brittle this surface is and what they invest to keep it working.
Crucially, the sandbox is ephemeral. Each run gets a fresh VM. Persistent state is only what Devin explicitly writes to the Knowledge store or checks back into the repo. This eliminates a whole class of “the agent modified some global thing and the next run is now broken” failures that plague tools that share state across runs.
Memory: explicit windows and hard pruning
The piece of Devin that most reflects the team’s hard-won experience is how it manages memory. The naive approach (append every tool call and result to the conversation) falls over within the first half hour of a run.
Devin keeps several distinct memory layers, each with a clear scope:
| Layer | Holds | Lifetime | Pruning |
|---|---|---|---|
| Working memory | Recent tool calls + results for the current step | Single executor step | Cleared at step boundary |
| Step summaries | One-paragraph distillation of each completed step | Current run | Compressed when plan changes |
| Run scratchpad | Files the agent decided to keep around | Current run | Explicit forget tool clears it |
| Knowledge | User-confirmed facts, past distillations | Across runs | Hand-curated; not auto-appended |
The hard rule is that the executor prompt only ever sees **working memory + step summaries + the current plan + the current step**. The full trajectory is never re-fed to the model. When the run is long, the older steps are not in the context at all; they exist only as the single-paragraph summaries the planner produced when those steps closed.
This is the architectural answer to context overflow. It is also the architectural answer to drift: by the time step 80 happens, the executor’s prompt is roughly the same shape as the prompt at step 5, just with a different “current step.” There is no slowly-rotting context window to drift inside.
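A minimal sketch of that prompt-assembly rule follows. The `Memory` class, `build_executor_prompt`, and the section labels are hypothetical names chosen for illustration; only the rule itself (raw observations are dropped at the step boundary, summaries persist) comes from the description above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Memory:
    working: List[str] = field(default_factory=list)         # current step only
    step_summaries: List[str] = field(default_factory=list)  # one per closed step

    def record(self, observation: str) -> None:
        self.working.append(observation)

    def close_step(self, summary: str) -> None:
        # Hard pruning: raw observations are dropped, only the summary survives.
        self.step_summaries.append(summary)
        self.working.clear()


def build_executor_prompt(memory: Memory, plan: List[str], current_step: str) -> str:
    # The prompt has the same four sections at step 5 and at step 80.
    sections = [
        "PLAN:\n" + "\n".join(plan),
        "COMPLETED:\n" + "\n".join(memory.step_summaries),
        "CURRENT STEP:\n" + current_step,
        "WORKING MEMORY:\n" + "\n".join(memory.working),
    ]
    return "\n\n".join(sections)


memory = Memory()
plan = ["1. write code", "2. run tests"]
memory.record("tool: cat app.py -> 400 lines of source ...")
memory.record("tool: pytest -> 3 failed")
memory.close_step("Step 1 done: handler added in app.py.")
memory.record("tool: pytest -> all passed")
prompt = build_executor_prompt(memory, plan, "2. run tests")
```

The bulky file dump from step 1 never reaches the step 2 prompt; only its one-line summary does, so prompt size tracks the number of closed steps rather than the number of tool calls.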
Knowledge: the part Devin makes the user own
Knowledge is where Devin departs most sharply from the “fully autonomous agent” pitch. The Knowledge layer is a curated set of facts the user explicitly adds: API tokens to use, project conventions to follow, “always run make fmt before committing,” “the production DB is read-only from this machine,” and so on.
The model does not silently learn these. The user types them in, often after a failed run, to make the next run go better. Cognition’s framing: treat the agent like an onboarded engineer. You’d hand them a runbook; you’d update the runbook when they tripped on something. Same thing.
This is the most honest piece of the architecture. The “agent that learns from feedback” framing breaks down at scale because implicit learning is unreliable and unauditable. Explicit knowledge is editable, reviewable, and shareable across teammates.
user adds a Knowledge entry:
┌────────────────────────────────────────────┐
│ key: db_migration_command │
│ value: always use `pnpm db:migrate`, never │
│ `npx prisma migrate`. Latter clears │
│ the seed data. │
│ tags: backend, db │
└────────────────────────────────────────────┘
│
▼
on the next run touching backend/db, Devin
retrieves matching entries and includes them
verbatim in the planner's system prompt
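The retrieval step in the diagram can be sketched in a few lines. This assumes simple tag-overlap matching, which is a guess; the actual matching logic is not public. `KnowledgeEntry`, `retrieve`, and `planner_system_prompt` are hypothetical names.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class KnowledgeEntry:
    key: str
    value: str
    tags: FrozenSet[str]


def retrieve(entries: Iterable[KnowledgeEntry], run_tags: Iterable[str]) -> List[KnowledgeEntry]:
    # Assumption: an entry matches when it shares at least one tag with the run.
    wanted = set(run_tags)
    return [e for e in entries if e.tags & wanted]


def planner_system_prompt(base: str, entries: List[KnowledgeEntry]) -> str:
    # Entries are included verbatim, never paraphrased by the model.
    if not entries:
        return base
    lines = [f"- {e.key}: {e.value}" for e in entries]
    return base + "\n\nProject knowledge (user-curated):\n" + "\n".join(lines)


store = [
    KnowledgeEntry(
        key="db_migration_command",
        value="always use `pnpm db:migrate`, never `npx prisma migrate`. "
              "Latter clears the seed data.",
        tags=frozenset({"backend", "db"}),
    ),
    KnowledgeEntry(
        key="css_convention",  # hypothetical second entry, for contrast
        value="use utility classes, no inline styles.",
        tags=frozenset({"frontend"}),
    ),
]

matched = retrieve(store, ["db"])
system_prompt = planner_system_prompt("You are the planner.", matched)
```

Since the store is plain data, it can be reviewed, diffed, and shared across teammates like any other project file, which is the property the section argues for.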
Failure handling: deterministic checks beat critic agents
A popular pattern in academic agent papers is the “critic” — a second LLM that reviews the first LLM’s work and catches mistakes. Devin mostly doesn’t do this. Cognition has argued publicly that critic agents add cost and rarely catch the failures that matter, because the failures that matter usually involve external state — a test failing, a file not existing, an HTTP 500 — which the world already tells you about, more reliably than any LLM critic can.
So Devin’s recovery loop leans on deterministic signals:
- Tests run after relevant code changes. A failing test is a hard signal the planner uses to re-plan.
- Lint and type check are first-class observations.
- HTTP / shell exit codes are surfaced as structured tool errors, not buried in stdout.
When something does go wrong, the planner gets a structured “step X failed because Y” event and decides whether to retry, work around, or ask the user. Asking the user is itself a tool: Devin pauses and asks when it doesn’t have a confident path forward. That escape valve is what keeps cost bounded on hard tickets.
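As a rough illustration of surfacing exit codes as structured errors, the sketch below wraps a shell command and converts a non-zero exit into a planner event. `ToolResult`, `run_command`, `to_planner_event`, and the event fields are hypothetical; only `subprocess.run` is a real standard-library call.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import List


@dataclass
class ToolResult:
    command: List[str]
    exit_code: int
    stdout: str
    stderr: str

    @property
    def ok(self) -> bool:
        return self.exit_code == 0


def run_command(command: List[str], timeout_s: float = 60.0) -> ToolResult:
    # The exit code is captured as a field rather than left buried in stdout.
    proc = subprocess.run(command, capture_output=True, text=True, timeout=timeout_s)
    return ToolResult(command, proc.returncode, proc.stdout, proc.stderr)


def to_planner_event(step: str, result: ToolResult) -> dict:
    # Deterministic signal: no second LLM is asked whether the step worked.
    if result.ok:
        return {"step": step, "status": "done"}
    return {
        "step": step,
        "status": "failed",
        "reason": f"exit code {result.exit_code}",
        "stderr_tail": result.stderr[-500:],  # keep the event small
    }


# Use the current Python interpreter as a stand-in for a failing test command.
failing = run_command([sys.executable, "-c", "import sys; sys.exit(3)"])
event = to_planner_event("run tests", failing)
```

The planner then only has to branch on `status` and `reason`, choosing between retrying, working around the failure, or pausing to ask the user.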
What’s actually shipping vs. the launch demo
Worth being honest about: the gap between the March 2024 launch demo and the product that’s running today is real. The launch demo implied zero-supervision tickets going end-to-end. The product that has earned its place in real engineering teams is more like a pair-programmer on a long leash: you brief it on a ticket, it works for an hour, then it returns with a draft PR, a plan-execution trace you can scroll, and the specific places it got stuck and made guesses you should review.
That product is more useful than the demo. The demo sells “the agent that replaces engineers.” The product sells “the agent that does the boring, well-scoped middle of an engineering ticket while the human holds the start (the ticket framing) and the end (the review).” Cognition has been increasingly explicit about this framing in their later SWE-bench writeups.
Why this design is being widely copied
In the year since Devin’s architecture became visible — partly through Cognition’s writing, partly through reverse-engineering — most serious long-horizon agent projects have converged on the same shape:
- Planner / executor split as separate LLM calls. Even OpenHands (formerly OpenDevin), the most prominent open-source clone, has adopted this.
- Sandboxed VM with shell + browser. OpenHands, Cosine Genie, Replit’s agent — same primitives.
- Explicit, scoped memory layers. Replit’s Agent v3 documentation makes the layering visible to the user; Devin’s does the same.
- Knowledge / project files as the durable layer. Cursor’s .cursorrules, Claude Code’s CLAUDE.md, Devin’s Knowledge: all the same idea, namely durable, user-editable hints, not implicit learning.
The convergence isn’t accidental. When you ship a long-horizon agent and watch real users hit the limits, you arrive at this shape because it’s the smallest shape that actually works.
What to take away
- Long-horizon agents are a memory architecture problem. The model matters, but it is not the bottleneck. The bottleneck is what you show it and what you keep out.
- Plans must be structured state, not chat history. This is the difference between “the agent decides what to do next” and “you can edit, inspect, and resume the run.”
- Sandboxed shell + browser is the universal tool surface. Custom tools have their place, but the shell already encodes most of what software engineering looks like.
- Knowledge is user-curated, not learned. This breaks the “autonomous agent” marketing story but makes the product trustworthy.
- Deterministic checks beat critic LLMs. When the world is available to give you signal — exit codes, test results, HTTP status — use it. Don’t ask another LLM what the first one got wrong.
The most interesting thing about Devin in 2026 is how un-autonomous it has chosen to be. The team that promised an autonomous engineer chose constraint, structure, and human-editable state at every turn — and that’s what made it ship.
Further reading: Cognition’s posts on SWE-bench, multi-agent design, and computer use are the closest thing to a primary source on this architecture. The OpenHands repo is the most readable open-source implementation of the same shape.