datarekha
LLMs June 8, 2026

Attention is O(n²) — and Mamba's linear escape

Attention costs O(n²), so long context gets expensive fast. State-space models like Mamba scale linearly — and 2026's winning architectures are hybrids of both.

9 min read · by datarekha · llmarchitecturemambastate-space-modelsattention

Self-attention is the engine of every transformer, and its superpower is that any token can look directly at any other token in the context. That is exactly why transformers are so good at recalling a name from 50,000 tokens ago. But that superpower has a price tag written in the math, and at long context lengths the price becomes the whole story.

The quadratic wall

To compute attention, every token compares itself against every other token. For a sequence of length n, that is n × n comparisons — the cost grows as O(n²). Double the context and you do not double the work; you quadruple it. The memory for the attention scores grows the same way.

For a 2,000-token prompt nobody notices. For the million-token context windows that became a 2026 selling point, that quadratic term is a wall: the compute and memory to attend across a million tokens is astronomical, which is why so much engineering goes into approximating, compressing, or avoiding full attention.

State-space models take the other road. Instead of comparing every token to every other, a model like Mamba processes the sequence like a recurrence: it carries a fixed-size hidden state forward, updating it one token at a time. Cost grows as O(n) — linear. Double the sequence, double the work. No wall.

Cost vs sequence lengthcompute & memorysequence length, tokens →attentionO(n²)MambaO(n)the gap at1M tokens

So why didn’t Mamba just win?

Because the fixed-size hidden state is both the gift and the catch. A transformer keeps everything — the full key-value memory of the whole context — and can reach back to any exact token. Mamba compresses the past into a small running state, which is wonderfully cheap but inevitably lossy. When the task is “find the one sentence 800K tokens ago that contradicts this one,” the model needs the kind of precise, arbitrary-position lookup that attention does naturally and a compressed state does not.

That trade-off shows up on benchmarks. Pure state-space models tend to lag transformers by several points on standard reasoning and in-context retrieval tasks, precisely because so much of what those benchmarks test is exact recall. Linear scaling is a fundamental advantage for million-token sequences; it is a fundamental disadvantage when you need a perfect memory.

The 2026 answer: don’t choose

The synthesis that actually won is not “attention vs Mamba” — it is both, interleaved. Hybrid models stack many cheap linear (Mamba-style) layers and sprinkle in a few full-attention layers. The linear layers carry the bulk of the sequence processing at O(n) cost; the occasional attention layer restores the precise, look-anywhere recall the linear layers lack. The consensus that emerged is that pure SSMs were never going to replace transformers, and hybrids are the right synthesis — frequently matching or beating both parent architectures while costing far less than full attention.

The clearest production example is Jamba, which interleaves Mamba layers with transformer-attention layers and adds a mixture-of-experts for parameter efficiency — getting linear-ish scaling on long inputs without surrendering the recall that pure SSMs give up.

Why this matters for you

For everyday chat and RAG under ~128K tokens, pure transformers still win on quality, and you may never touch a state-space model directly. But the moment your workload is genuinely long — whole codebases, hour-long transcripts, genomics, agents carrying massive memory across a session — the quadratic term decides your bill, and hybrids are how the field is making those workloads affordable. The transformer is not being replaced. It is being budgeted: kept where its perfect recall earns its O(n²) price, and swapped for linear layers everywhere else. Understanding why attention costs what it does is what lets you reason about which architecture a long-context problem actually needs.

Skip to content