datarekha
Agents May 10, 2026

Coding agents in 2026: Cursor, Devin, Sweep, Aider, Claude Code compared

Two years after Devin launched and froze the term 'AI software engineer' in the popular imagination, here's where everyone actually landed. The dominant tools didn't converge — they specialized, and the ones that won did so by picking a workflow lane and dominating it.

14 min read · by datarekha · coding-agents, cursor, devin, sweep, aider

When Devin launched in March 2024 with a slick video and a 13.86% score on SWE-bench, it briefly captured the entire AI-coding imagination. “Autonomous software engineer” was the phrase, and every venture pitch deck for the next eighteen months promised some variant of it. Two years later the conversation has changed completely. Devin still exists, still has paying customers, and has nudged its SWE-bench number up — but it’s not the lodestar product. The space split into specialized lanes, and the products that won did so by picking a workflow and engineering the daylights out of it.

This post compares the five tools that anyone shipping code in 2026 has probably touched: Cursor, Claude Code, Devin, Aider, and Sweep. The lens isn’t features — it’s what workflow the tool actually serves, and where each one wins or loses.

The five lanes that emerged

Figure: coding agents, five lanes, five workflows.
  • Lane 1, In-IDE (inline edit, scoped agent): Cursor, Composer 2.5, 65.7% on Verified.
  • Lane 2, Terminal (CLI sessions, subagents): Claude Code, 87.6% on Verified with Opus 4.7.
  • Lane 3, Async (cloud agent, long horizon): Devin, ~45% on Verified, routine tasks.
  • Lane 4, CLI diff (local, git, transparent): Aider, 52.7% on edits, low token use.
  • Lane 5, Ticket→PR (GitHub-first, async PRs): Sweep, pivoted to JetBrains.
The space didn’t converge to one product. Each tool dominates a specific workflow lane, and many engineers use three of them in a single day.

Cursor — the IDE won

By far the largest user base, and the tool that finally answered the question “what does a code-native IDE look like.” Cursor’s bet was that most coding doesn’t require an autonomous agent — it requires extremely good inline assistance, with a scoped agent (the Composer) available when you genuinely want longer-running work.

In May 2026, Cursor shipped Composer 2.5 at $0.50 per million input tokens and $2.50 per million output tokens for the Composer model itself, with the “Fast” variant (powered primarily by Claude Sonnet 4.7) at the standard frontier-model rates. The subscription is now stratified: Free, Pro at $20/month, Pro+ at $60, Ultra at $200, and Business/Enterprise on top. Cursor’s Background Agent — the closest thing they ship to a long-running async agent — scores 65.7% on SWE-bench Verified using Sonnet 4.6.

Two architectural decisions are worth calling out:

The Auto mode is the load-bearing product. Cursor routes most user requests through a routing model that picks between cheap and expensive backends. Auto mode requests don’t draw from the user’s credit pool — they’re unlimited on Pro. This is how Cursor controls margin while making the experience feel free at the point of use. The router is the moat.

The Composer is deliberately not Devin. Composer can edit multiple files, run terminal commands, see errors, and iterate. But the user is always in the loop — every accepted change is a visible diff, every command needs an approval (configurable). The product framing is “a junior engineer with their hand on your keyboard,” not “an autonomous agent.” That framing matters, because it means users don’t expect Composer to “finish” — they expect to drive it.
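The “every command needs an approval (configurable)” behavior can be pictured with a small sketch. This is only an illustration of the general idea, not Cursor's implementation: `ApprovalPolicy`, the allowlist contents, and the `ask_user` and `execute` callbacks are all hypothetical names invented for the example.

```python
import shlex
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ApprovalPolicy:
    """Hypothetical policy: programs that may run without asking the user."""
    auto_approve: set[str] = field(default_factory=lambda: {"ls", "cat", "pytest"})

    def needs_approval(self, command: str) -> bool:
        # Only the program name is checked here; a real gate would
        # presumably inspect arguments as well.
        parts = shlex.split(command)
        return not parts or parts[0] not in self.auto_approve


def run_with_approval(
    command: str,
    policy: ApprovalPolicy,
    ask_user: Callable[[str], bool],
    execute: Callable[[str], str],
) -> str:
    """Run `command` only if the policy allows it or the user says yes."""
    if policy.needs_approval(command) and not ask_user(command):
        return "skipped: user declined"
    return execute(command)
```

The point of the design is that the human stays the default decision-maker: widening `auto_approve` is an explicit choice, which matches the “user drives it” framing.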

Where Cursor falls over: anything that requires running for an hour without supervision. The product simply isn’t designed for that. The Background Agent gets closer but still routinely requires user intervention; the latency expectations of a synchronous IDE aren’t built for “wait 40 minutes.”

The Cursor economics

A footnote about Cursor’s business model that’s worth pulling out. Cursor doesn’t make money on the frontier model calls — those are roughly pass-through cost. Cursor makes money on the router that decides which model to use for which request. Auto mode users effectively delegate that decision to Cursor; Cursor then optimizes per-request for cost while maintaining quality.

This is why Auto requests don’t count against the user’s credit pool on Pro. From Cursor’s side, an Auto request that gets routed to a cheap model costs Cursor less than the user’s monthly subscription contributes. The router is the unit economics; the subscription is the constraint.

The implication for users: Cursor’s incentive is to maximize the fraction of requests routed to Cursor-proprietary models (Composer, Sonic) rather than frontier APIs. Composer 2.5 is priced lower than Sonnet for exactly this reason — Cursor wants users picking Composer when it works as well, because each Composer call has better unit economics for Cursor than a Sonnet call. The Composer-2.5 quality is good enough that, for many tasks, it’s a fair recommendation. For others, it’s a quality compromise the user might not notice.
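To make the unit economics concrete, here is a rough sketch of cost-aware routing. The prices are the ones quoted in this post (Composer 2.5 at $0.50/$2.50 per million tokens, Sonnet at $3/$15); everything else is an assumption: the `Backend` class, the backend labels, the request size, and especially the toy difficulty heuristic in `route`, which is not how Cursor's router actually decides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    name: str
    usd_per_mtok_in: float   # dollars per million input tokens
    usd_per_mtok_out: float  # dollars per million output tokens

    def cost(self, tokens_in: int, tokens_out: int) -> float:
        """Dollar cost of one request at this backend's prices."""
        return (tokens_in * self.usd_per_mtok_in
                + tokens_out * self.usd_per_mtok_out) / 1_000_000


# Prices quoted in this post; the names are just labels for the example.
COMPOSER = Backend("composer-2.5", 0.50, 2.50)
SONNET = Backend("sonnet", 3.00, 15.00)


def route(files_touched: int, needs_planning: bool) -> Backend:
    """Toy difficulty heuristic (an assumption, not Cursor's router)."""
    hard = needs_planning or files_touched > 5
    return SONNET if hard else COMPOSER


# Hypothetical request size: 20K input tokens, 2K output tokens.
request = dict(tokens_in=20_000, tokens_out=2_000)
cheap = route(files_touched=1, needs_planning=False).cost(**request)
pricey = route(files_touched=12, needs_planning=True).cost(**request)
print(f"cheap route: ${cheap:.3f}, expensive route: ${pricey:.3f}")
```

Because both the input and output prices differ by the same factor, the same request costs six times more on the expensive route regardless of its token mix. That ratio is the margin the router is playing with.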

Claude Code — the terminal renaissance

The dark horse that became dominant. Claude Code was Anthropic’s bet that a serious chunk of the coding-agent market wanted to live in the terminal, in their own dev environment, with access to their own tools. The product launched as a CLI and stayed there — even the recent Agent View dashboard (shipped May 12, 2026) is a thin wrapper over multiple terminal sessions, not a separate UI paradigm.

Claude Code’s distinguishing technical feature is its subagent system. A subagent is a reusable configuration (custom system prompt, scoped tool list), defined in a Markdown file with YAML frontmatter, that can be invoked by the parent session. The parent agent dispatches sub-tasks to subagents and gets back compact results, which keeps the parent’s context window from filling up with tool-call noise. It’s the orchestrator-workers pattern (from the five-patterns post) made first-class.
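The orchestrator-workers shape is easy to sketch in a few lines. The following is a hedged illustration of the pattern only, not Claude Code's API: `Subagent`, `summarize`, and `orchestrate` are hypothetical names, and the `run` callable is a stub standing in for “call the model with this configuration.”

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Subagent:
    """Hypothetical stand-in for a subagent configuration."""
    name: str
    system_prompt: str
    allowed_tools: tuple[str, ...]
    run: Callable[[str], str]  # stub for "call the model with this config"


def summarize(raw_output: str, limit: int = 80) -> str:
    """Keep only a compact result so the parent's context stays small."""
    lines = raw_output.strip().splitlines()
    return lines[0][:limit] if lines else ""


def orchestrate(tasks: dict[str, str], agents: dict[str, Subagent]) -> list[str]:
    """Dispatch each sub-task to the named subagent; collect compact summaries."""
    parent_context: list[str] = []
    for agent_name, task in tasks.items():
        raw = agents[agent_name].run(task)  # verbose tool-call noise stays here
        parent_context.append(f"{agent_name}: {summarize(raw)}")
    return parent_context


test_runner = Subagent(
    name="test-runner",
    system_prompt="Run tests and report failures.",
    allowed_tools=("bash",),
    run=lambda task: "3 failed, 41 passed\n...hundreds of lines of log...",
)
print(orchestrate({"test-runner": "run the unit tests"},
                  {"test-runner": test_runner}))
```

The parent only ever sees the one-line summary; the long log never enters its context, which is the whole benefit of the pattern.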

On SWE-bench Verified, Claude Code running on Claude Opus 4.7 hits 87.6%, with the Mythos Preview pushing to 93.9%. These are eye-watering numbers, but they come with a caveat: OpenAI has stopped reporting Verified on the grounds of training contamination, and every frontier model now scores so well on it that the benchmark has saturated. SWE-bench Pro is the more discriminating successor.

The product moves that matter:

  • Background sessions survive terminal closure. A supervisor process runs the agent independently, so closing the laptop doesn’t kill the work. This was the feature gap that previously forced users to Devin for any non-trivial run.
  • Agent View dashboard lets one developer drive multiple concurrent sessions, each on a separate task. The shift, in Anthropic’s framing, is from “one-on-one conversation” to “one-to-many dispatching.”
  • Separate Agent SDK credit pool (effective June 15, 2026) — programmatic use of Claude Code now draws from a different bucket than interactive use, which makes the cost of building Claude-Code-powered products predictable.

Where Claude Code loses: graphical context. If your work is heavily visual (Figma reviews, frontend pixel-pushing, ad debugging) the lack of an integrated IDE view is awkward. Many users run Claude Code alongside Cursor for exactly this reason.

Where Cursor’s Composer falls short

The product team has been explicit that Composer is not Devin. That framing has limits. A Composer task that should take 10 minutes can spiral into 40 minutes when the agent encounters an unfamiliar build system or a flaky test. Without the “walk away” property of a fully async product, the user is sitting there watching it grind. Many Cursor power users describe their workflow as “start Composer, switch to a different terminal, come back in 20 minutes” — which is exactly what async agents are designed for, executed through a synchronous-feeling UI.

The Composer 2.5 release tries to address this with the Background Agent — a Cursor-managed cloud runner that genuinely operates asynchronously. Early reviews suggest it works well for scoped maintenance tasks but isn’t yet as autonomous as Devin for open-ended engineering work. The competitive bet is clear: Cursor wants to own both the synchronous IDE and the async background loops, with the same workspace context across both.

What Cursor still hasn’t cracked

For all of Cursor’s success, two open problems remain. First: cross-repo work. Cursor is excellent at single-repo coding but doesn’t yet have a great story for tasks that span multiple repositories (a service change plus a client library update, say). Engineers working on microservices feel this gap routinely.

Second: long-running context. Background Agent helps, but for tasks that span days — refactor a service over the course of a sprint — Cursor doesn’t yet have memory across separate sessions in the way that Devin’s session persistence provides. Anthropic’s “Projects” feature in Claude has some of this; Cursor’s roadmap suggests it’s coming, but as of mid-2026 the multi-day workflow lives elsewhere.

These gaps are real and persistent. Cursor’s strategy seems to be “extend gradually rather than rebuild” — which has worked so far but means certain workflows live outside Cursor for longer than users would like.

Devin — the original promise, scoped down

The most-discussed and most-criticized product. Devin’s pitch was always full autonomy — give it a Linear ticket, walk away, come back to a PR. Two years of iteration later, Devin 2.0 does deliver on this for a meaningful subset of tasks, but the realistic scope is much narrower than the original demo suggested.

What Devin is good at in 2026:

  • Routine maintenance tasks: dependency upgrades, config migrations, test scaffolding.
  • Codebases it already knows — Devin has session memory across runs and gets noticeably faster on familiar repositories.
  • Async, multi-hour work that doesn’t need a human in the loop and where the acceptance criteria are mechanical (tests pass, lints clean).

What it still struggles with:

  • Anything ambiguous about the goal. Devin’s recovery loop (mode 3 in our failure modes post) is its most visible failure pattern — it can spend hours yak-shaving environment setup if it can’t reach a clean starting state.
  • Novel architectural decisions. Devin executes plans well; it does not author them.
  • The “last 10%” of any non-trivial task, which often requires judgment calls Devin can’t make and humans usually need to wrap up.

The SWE-bench Verified self-reported number for Devin 2.0 sits in the mid-40s. Cognition’s own positioning has moved away from “AI software engineer” toward “agent computing platform,” with a focus on Agent Compute Units as the billable primitive and a heavier emphasis on enterprise-fleet management. The product is still autonomous, but the marketing has matured into “autonomous for routine tasks, with a human reviewer for anything else.”

Pricing is around $500/month for the base subscription with metered ACU usage on top. For a development team doing high-volume routine work — automated migrations, scheduled refactors — this can pencil out. For solo developers, the per-task cost relative to Cursor or Claude Code rarely justifies the autonomy.

The “AI software engineer” framing problem

A meta-point about Devin’s positioning. The 2024 launch video framed Devin as “the first AI software engineer” — a discrete entity that takes over the role. This was great marketing and turned out to be a strategic mistake.

The “AI software engineer” framing implicitly creates the expectation that Devin should be able to do what a human engineer can do. When Devin produces a wrong PR or burns six hours on a doable task, the comparison is unflattering. The Cursor framing — “your AI pair programmer” — sets a fundamentally easier expectation, because pair programmers aren’t expected to operate without you.

The realistic 2026 framing for Devin (which Cognition’s own marketing has been migrating toward) is “an agent computing platform” — a runtime for autonomous coding workflows that customers compose to their needs. This is less catchy but matches what the product actually does well. The customers getting genuine value from Devin are running it on tightly-scoped, repeatable workflows, not asking it to be a fungible engineer.

Devin’s strongest use case

Strip away the hype and Devin’s strongest 2026 use case is fleets of routine work. A consulting firm that needs to migrate 200 client projects from Python 3.8 to 3.12, with similar patterns of changes across each, is exactly the kind of task Devin does well. Spin up 50 parallel Devin instances, one per project, let them work overnight, review the PRs in the morning, kick the failures back into the queue.

For this kind of work, Devin’s autonomy is the differentiator. Cursor wouldn’t help (you’d need 50 engineers driving 50 IDEs); Claude Code’s session persistence helps but still requires babysitting. Devin’s “fire and forget” property is genuinely valuable when the same task has to be done a hundred times.
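The batch workflow described above (fan out, review, requeue failures) can be sketched as a small loop. This is an illustration under stated assumptions, not Devin's API: `run_fleet` is a made-up helper, and `run_agent` is a hypothetical callable that returns True when a project's PR passes its mechanical checks.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_fleet(
    projects: list[str],
    run_agent: Callable[[str], bool],
    max_parallel: int = 50,
    max_rounds: int = 3,
) -> tuple[list[str], list[str]]:
    """Run one agent per project; requeue failures for up to `max_rounds`.

    Returns (succeeded, still_failing). `run_agent` is hypothetical: it
    should return True when the checks pass (tests pass, lints clean).
    """
    done: list[str] = []
    queue = list(projects)
    for _ in range(max_rounds):
        if not queue:
            break
        with ThreadPoolExecutor(max_workers=max_parallel) as pool:
            results = list(pool.map(run_agent, queue))
        done += [p for p, ok in zip(queue, results) if ok]
        queue = [p for p, ok in zip(queue, results) if not ok]
    return done, queue  # whatever is left in the queue needs a human
```

Capping the number of rounds matters: a project that fails three times in a row is usually one where a person should look at the PR rather than burn more agent time.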

The market segment Cognition has settled into reflects this — enterprise platform sales, with Devin pitched as the runtime for batch coding workflows, not as a one-off “personal AI engineer” for individual developers. The pricing ($500/month base plus ACU usage) makes sense for teams doing high-volume routine work; less so for solo developers.

Aider — the principled minority

Aider remains the favorite of engineers who want to see exactly what the model is doing. It’s a CLI, it edits files inside your git checkout, it auto-commits with descriptive messages, and the entire workflow is “describe, review the diff, commit.” Aider’s benchmark page reports a 52.7% combined score with notably efficient token usage: roughly a quarter of Claude Code’s token consumption for comparable tasks, per Morph’s 2026 comparison.
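The “review the diff” step is worth seeing in miniature. The sketch below renders a proposed edit as a unified diff using Python's standard `difflib`; it illustrates the review surface only and is not Aider's own edit format or code. The `proposed_diff` helper and the `geometry.py` example are invented for illustration.

```python
import difflib


def proposed_diff(path: str, before: str, after: str) -> str:
    """Render a model's proposed edit as a unified diff for human review."""
    lines = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(lines)


before = "def area(r):\n    return 3.14 * r * r\n"
after = "import math\n\ndef area(r):\n    return math.pi * r * r\n"
print(proposed_diff("geometry.py", before, after))
```

Once a change like this is accepted and committed, undoing it is an ordinary `git revert`, with no tool-specific history to untangle.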

The reasons engineers stay on Aider despite the flashier options:

  • Git-native by design. Every change is a real commit. You can git revert any AI change with standard tools, no special UI needed.
  • Model-agnostic. Aider works with GPT-5, Claude, Grok-4, Gemini, and local models with first-class support. If you’re cost-sensitive or running an on-prem model, Aider is one of the very few good options.
  • Repomap. Aider builds a compact map of the codebase that the model uses to stay oriented. The token cost is low and the orientation quality is high.
  • The diff is the product. Aider doesn’t try to be smart about hiding the model’s edits. Every change is visible as a unified diff before commit. For people whose mental model of coding is “I read diffs all day,” this fits.
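The repomap idea from the list above can be approximated in a few lines. This is a toy version, assuming Python-only sources and using the standard `ast` module; Aider's real repo map is more sophisticated, and `repo_map` here is a hypothetical helper that only conveys the general idea of giving the model signatures instead of full files.

```python
import ast


def repo_map(sources: dict[str, str]) -> str:
    """Build a compact outline: one line per top-level class or function.

    `sources` maps file path -> Python source text. Toy approximation of
    the idea only, not Aider's actual algorithm.
    """
    funcs = (ast.FunctionDef, ast.AsyncFunctionDef)
    out: list[str] = []
    for path in sorted(sources):
        out.append(f"{path}:")
        for node in ast.parse(sources[path]).body:
            if isinstance(node, funcs):
                args = ", ".join(a.arg for a in node.args.args)
                out.append(f"  def {node.name}({args})")
            elif isinstance(node, ast.ClassDef):
                methods = [n.name for n in node.body if isinstance(n, funcs)]
                out.append(f"  class {node.name}: {', '.join(methods)}")
    return "\n".join(out)


example = {
    "billing.py": (
        "class Invoice:\n"
        "    def total(self):\n"
        "        return 0\n"
        "\n"
        "def charge(user, amount):\n"
        "    pass\n"
    )
}
print(repo_map(example))
```

The outline costs a handful of tokens per file, yet it tells the model which names exist and where, which is what “staying oriented” amounts to.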

Where Aider feels its age: no IDE integration, no background sessions, no async cloud agents, no subagent orchestration. It’s deliberately a single-track CLI experience. For engineers who want exactly that, it’s still the best option. For everyone else, it’s a complement to other tools rather than a primary.

Where Claude Code is heading

The May 2026 Agent View release hints at where Anthropic is taking Claude Code: from a single-session terminal tool to a fleet-orchestration platform. The Agent View dashboard lets one developer manage multiple concurrent agent sessions, each on a separate task, with task delegation across them. This is starting to look less like a coding tool and more like an agent operating system — a layer at which developers schedule work across many AI agents.

Combined with the separate Agent SDK credit pool (which makes programmatic Claude Code usage predictable for product builders), the trajectory is clear. Claude Code is becoming the substrate for building agentic products, not just for human developers’ immediate use. The Anthropic bet is that “agents that build with other agents” will be a meaningful category, and Claude Code is positioning to be the runtime.

This is a different bet from Cursor’s (which is firmly “human developer at the keyboard, AI assistant”) and from Devin’s (which is “AI engineer replaces human”). The Claude Code framing is closer to “orchestrate AI agents the way you orchestrate processes,” and over time, the human-as-user-of-Claude-Code might be a minority workload compared to Claude-Code-as-runtime-for-other-software.

Sweep — the pivot

Sweep’s original 2023 pitch was the cleanest “AI software engineer” framing: tag a GitHub issue with sweep, the agent reads your code, plans the change, and submits a PR with tests. For routine bug fixes and small features, it worked surprisingly well. They had real production customers.

In 2025 Sweep pivoted to JetBrains, positioning itself as a coding assistant built for IntelliJ, PyCharm, and the broader JetBrains family. The new product looks less like an autonomous PR-bot and more like an IDE-integrated assistant in the Cursor mold, but optimized for the JetBrains ergonomics that Cursor (which forked VSCode) doesn’t serve.

This pivot is interesting because it confirms the lane-specialization thesis. The original ticket-to-PR workflow turned out to be a real but narrow market, and the IDE-assistant lane was both larger and underserved on JetBrains specifically. The teams that bet on “general autonomous coding agent” mostly didn’t find product-market fit; the teams that picked a workflow and over-served it did.

The benchmark saturation problem

Worth a brief detour into why SWE-bench Verified scores have stopped being a useful signal. When SWE-bench first drew attention in 2024, top scores were in the high teens; Devin’s launch claim was 13.86%, measured on the original benchmark before the human-validated Verified subset existed. Once Verified arrived later that year, scores climbed steadily: GPT-4o landed around 33%, Claude 3.5 Sonnet hit ~49%, and by mid-2025 multiple models were above 60%.

Then OpenAI publicly stepped away from Verified, noting that every frontier model showed evidence of training-data contamination on the dataset. The remediation was SWE-bench Pro, a held-out benchmark with novel issues. Scores on Pro are much lower across the board (46% being a top score where Verified pushes 90%+) and the ranking of models shifts substantially.

What this means for tool selection:

  • Take any 2026 SWE-bench Verified score above 80% with skepticism. It’s likely measuring contamination as much as capability.
  • Pro scores are more discriminating but less mature. Fewer tools have been evaluated; the leaderboard is still settling.
  • Workflow fit beats raw capability for most users. The 87% Claude Code score on Verified vs. the 65% Cursor score doesn’t translate to Claude Code being 35% more useful in an IDE workflow. Both are excellent; the difference is environment.

A practical heuristic: pick the tool that fits your daily workflow, and only switch on benchmark scores when the gap is dramatic (say, 30+ points on a fair comparison). For most engineers in 2026, the right answer is “two or three tools, each used in its lane.”

How the agents compare on token economy

A specific operational angle: token efficiency varies dramatically across these tools. For comparable tasks, Morph’s 2026 benchmark put Aider at roughly 126K tokens per task while Claude Code consumed 4-5x more on the same work. Cursor’s Composer 2.5 sits somewhere in the middle. Devin’s token usage is harder to compare directly because it operates on different task types.

For users on metered pricing, this matters. A Claude Code session that runs on Sonnet 4.6 at $3/MTok input and $15/MTok output can rack up $5-$15 per hour of intensive use. Aider’s lower token consumption makes the same model substantially cheaper to operate. For high-volume use, the per-token savings compound into meaningful annual cost differences.
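A back-of-the-envelope calculation shows how the token gap turns into dollars. The token count (126K per task), the 4-5x multiplier, and the $3/$15 per-million prices come from this post; the 10% output share is purely an assumption made for the example, since the split between input and output tokens isn't given.

```python
def task_cost_usd(
    total_tokens: int,
    output_share: float = 0.10,  # assumed split; not from the benchmark
    usd_in: float = 3.0,         # dollars per million input tokens
    usd_out: float = 15.0,       # dollars per million output tokens
) -> float:
    """Estimate the dollar cost of one task from its total token count."""
    tokens_out = total_tokens * output_share
    tokens_in = total_tokens - tokens_out
    return (tokens_in * usd_in + tokens_out * usd_out) / 1_000_000


aider = task_cost_usd(126_000)
claude_code_low = task_cost_usd(126_000 * 4)
claude_code_high = task_cost_usd(126_000 * 5)
print(f"Aider: ${aider:.2f} per task")
print(f"Claude Code: ${claude_code_low:.2f} to ${claude_code_high:.2f} per task")
```

Under these assumptions the same task comes to roughly $0.53 with Aider versus about $2.12 to $2.65 with Claude Code. The absolute numbers move with the assumed output share, but the 4-5x ratio does not.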

The token efficiency reflects design choices. Aider’s repomap is a compact representation; Claude Code’s verbose tool-call logging keeps full context for debuggability. Both are defensible choices for their respective use cases. But users should know what their tools cost per task — the variance is significant.

A footnote on Aider’s persistence

It’s worth pausing on why Aider is still relevant in 2026 despite being one of the older tools in the space. Aider has not pivoted, not raised major outside funding, and not chased the IDE or async-cloud-agent lanes. It has stayed exactly what it was: a small, focused CLI tool that does diff-driven editing well.

This is an underrated success pattern. The Linux ecosystem is full of tools that started small, did one thing well, and survived decades without changing their shape. Aider seems to be following that template in the AI-coding space. As the flashier products iterate through major UI revamps and pricing changes, Aider just keeps shipping releases that add model support and refine the diff-edit loop.

For the user, this stability is a feature. The same Aider workflow that worked in 2024 works in 2026 with newer models plugged in. That predictability has value, especially for engineers who treat AI as one tool among many in their existing CLI-driven workflow.

The economics of “AI engineer” vs. “AI assistant”

A pricing-and-positioning observation: Devin’s $500/month base subscription versus Cursor’s $20/month Pro plan reflects a real disagreement about what these products are. Cursor positions itself as a tool that augments the human engineer (priced like a SaaS tool); Devin positions itself as a substitute for engineering capacity (priced more like a contractor).

The market verdict, in 2026, is that the augmentation framing wins for individual developers and the substitution framing wins for specific high-volume enterprise workflows. Cursor’s user count is much larger; Devin’s per-customer revenue is much higher. Both can be successful businesses simultaneously by serving different parts of the market.

A useful prediction: as the agents get better, the “AI engineer” pricing will become viable for more workflows, but the framing will shift from “replace a developer” to “perform a job that didn’t justify a dedicated developer before.” That’s the larger market, and the one where autonomy actually pays off — work that’s important enough to do but not important enough to staff.

A note on the JetBrains question

Sweep’s pivot to JetBrains highlights a real market gap. Cursor forked VSCode, and the deep IDE-replacement story therefore lives in the VSCode lineage. JetBrains users (Java, Kotlin, Python in PyCharm) have had a much rougher AI integration story — JetBrains’ own AI Assistant is competent but not dramatic, and third-party plugins have limited reach into the IDE internals.

Sweep’s bet is that the JetBrains population is large enough (millions of paying users in the enterprise) and underserved enough to support a dedicated coding-agent product. Early indications suggest the bet is paying off in specific verticals — particularly Java/Kotlin enterprise shops that have standardized on IntelliJ for the last decade and aren’t going to migrate to VSCode for AI reasons.

The broader lesson: even in 2026, the “where the user sits” question is as much about the IDE as about the AI capability. Tools that match the user’s existing environment win over tools that ask the user to migrate, even when the migration target is technically superior. Cursor learned this by being a VSCode fork; Sweep learned it by leaving the GitHub-first lane for the JetBrains-first one.

What about local-model coding agents

Worth a paragraph on the local-model story. Aider is the most prominent local-model-friendly tool but not the only one. Cline (a VSCode extension), Continue, and various Ollama-integrated terminal tools all support running coding agents against local Llama, Qwen, or DeepSeek models.

Local models have improved dramatically through 2025-2026. DeepSeek-V4 and Qwen-3 produce coding output that’s competitive with mid-tier hosted models for many tasks. For teams that can’t ship code to external APIs — defense contractors, on-prem regulated industries, certain enterprise IT environments — local-model agents are a real option.

The trade is reduced capability at the high end. The best local models still trail Claude Opus and GPT-5 on complex multi-file refactors and difficult debugging tasks. For routine work — adding tests, writing documentation, simple bug fixes — local models are perfectly adequate. The local-model tools are usually invisible in the leaderboards because they’re not benchmarked uniformly, but their installed base is growing meaningfully in environments where remote APIs are off the table.

A workflow comparison

Figure: which tool for which job.
  • “fix this typo, add a docstring, tighten this loop”: Cursor / Aider
  • “refactor this 200-file directory while I’m at lunch”: Claude Code (background) / Devin
  • “convert these 30 Linear tickets to PRs overnight”: Devin
  • “on-prem model, can’t ship code to the internet”: Aider + local LLM
  • “big design discussion, need to see and edit many files”: Claude Code or Cursor
The honest 2026 answer to “which coding agent?” is “it depends on the task” — and most senior engineers now have three or four installed and use them interchangeably.

The category that didn’t happen — the universal coding agent

Worth a paragraph on the dog that didn’t bark. The 2024 prediction was that one of these products would consolidate into “the” coding agent — the universal tool that everyone uses for everything. It didn’t happen. The closest thing to a universal product is probably Cursor (largest installed base) but it’s clearly not the right tool for every workflow.

Why didn’t consolidation happen? Three reasons, roughly:

  • The workflows are genuinely different. Inline IDE editing, async cloud agents, terminal sessions, and ticket-to-PR are not the same job. A product optimized for one is sub-optimal for the others. The cross-product synergies are real but not strong enough to outweigh the workflow-specific advantages.
  • The model itself is mostly commoditized. All five products run primarily on Claude or GPT under the hood. The model isn’t the moat. The moat is the workflow integration, and workflow integration doesn’t scale across radically different workflows.
  • Distribution channels diverge. Cursor’s distribution is the IDE replacement. Claude Code’s is the terminal. Devin’s is the dashboard. Aider’s is the CLI/PyPI ecosystem. Each channel attracts different users and rewards different product shapes.

This pattern — workflow-specific dominance rather than category-wide consolidation — is what mature software markets often look like. The IDE space has had VSCode, IntelliJ, Vim, and Emacs co-existing for years. The coding-agent space appears to be settling into a similar steady state.

The 2026 pattern

What’s interesting about the current state isn’t any single product — it’s the shape of the market. Two years ago, every coding-AI pitch was “we’re building the AI software engineer.” Today, the products that found product-market fit explicitly aren’t trying to be one tool that does everything:

  • Cursor = IDE-first, agent-when-asked.
  • Claude Code = terminal-first, multi-session, agent-orchestrator-friendly.
  • Devin = async-first, autonomous-but-narrow.
  • Aider = transparent diffs, model-agnostic, CLI-native.
  • Sweep = JetBrains-native IDE assistant.

The teams that tried to build “the one true coding agent” mostly didn’t find traction. The teams that picked a workflow lane and engineered the experience deeply within it — that built for the specific texture of how that workflow actually feels — own their respective lanes.

Three takeaways for anyone evaluating coding agents in 2026:

  • There is no winner. There are five winners. Most engineers I know are using at least two of these tools in any given week, sometimes simultaneously. Cursor for inline edits, Claude Code for background refactors, Devin for batch ticket processing.
  • SWE-bench Verified is saturated. When the top scores cluster above 85% and OpenAI has formally moved on, the benchmark stops being the deciding factor for tool choice. The differentiator is workflow fit, not raw capability.
  • The IDE is not dead. It just turned out that the IDE wins for the workflow where you’re already in an IDE, and other workflows want other tools. The 2024 prediction that “agents would replace IDEs” was the wrong frame; the right frame is “agents extend each workflow surface.”

The most surprising thing about 2026 is how mundane this all looks compared to the 2024 hype cycle. The tools work, the engineers use them, the headlines have moved on. That’s what successful technology adoption looks like in the boring middle — and the coding agent space is the first slice of AI to genuinely cross into that middle.

The one thing the 2024 narrative did get right: coding work has changed substantively. The “10x engineer” debate that consumed tech Twitter for a decade has been quietly settled by tooling — engineers using these agents well are genuinely multiples more productive than engineers who aren’t. The difference isn’t model intelligence; it’s workflow fluency. The engineers who learned to drive Cursor’s Auto mode, Claude Code’s subagents, and Aider’s diff-driven loop are operating at a higher level than those still treating AI as a marginal autocomplete improvement. Two years from now, that gap will likely widen rather than narrow.


Further reading: the SWE-bench Verified leaderboard, Anthropic’s Claude Code product page, Cognition’s Devin AI guide for 2026, Aider’s benchmark documentation, and a comprehensive coding agent comparison from Morph. For the deeper benchmark debate, see Morph’s SWE-bench Pro analysis.
