Diffusion language models: when AI writes text all at once

Nearly every well-known LLM, including the ones writing your emails and your code, shares one deep habit: it writes strictly left to right, one token at a time. To produce token five, the model must have already produced tokens one through four. This is autoregressive generation, and it has a hard consequence: a 500-token answer takes 500 sequential steps. You cannot parallelize your way out of it, because each token is conditioned on every token that came before it.

Diffusion language models ask a heretical question: what if you generated the whole answer at once?

Borrowing the idea that made image generation explode

Image diffusion models start from pure noise and repeatedly denoise it into a coherent picture, refining every pixel in parallel over a series of steps. Diffusion language models port that idea to text. They start from a fully masked sequence — every position is a blank — and then, over a handful of denoising steps, iteratively unmask tokens, refining all positions simultaneously rather than producing them one at a time.
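The unmasking loop described above can be sketched in a few lines. This is a toy under loud assumptions: `toy_denoiser` is a hypothetical stand-in for a trained model (it simply "predicts" the target token and attaches a random confidence), and the schedule of unmasking an equal share of the remaining blanks per pass is one simple choice, not the method of any particular system.

```python
import random

MASK = "<mask>"

def toy_denoiser(seq, target, rng):
    """Hypothetical stand-in for a trained model.

    For every position that is still masked, propose a token and a
    confidence score. Here the proposal is just the target token and the
    confidence is random; a real model would predict both from context.
    """
    return {i: (target[i], rng.random())
            for i, tok in enumerate(seq) if tok == MASK}

def diffusion_decode(target, num_steps, seed=0):
    rng = random.Random(seed)
    seq = [MASK] * len(target)          # start fully masked
    history = [list(seq)]
    for step in range(num_steps):
        proposals = toy_denoiser(seq, target, rng)  # all blanks scored in one pass
        if not proposals:
            break                       # nothing left to unmask
        # Unmask an equal share of the remaining blanks on each pass
        # (ceiling division so the last pass always finishes the job).
        remaining_steps = num_steps - step
        k = -(-len(proposals) // remaining_steps)
        most_confident = sorted(proposals, key=lambda i: proposals[i][1],
                                reverse=True)[:k]
        for i in most_confident:
            seq[i] = proposals[i][0]
        history.append(list(seq))
    return seq, history

if __name__ == "__main__":
    target = "the cat sat on the mat".split()
    _, history = diffusion_decode(target, num_steps=3)
    for snapshot in history:
        print(" ".join(snapshot))
```

Running it prints the sequence after each pass: six blanks first, then two more tokens filled in per pass, at positions scattered across the sentence rather than in left-to-right order.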

The key difference is where the sequential cost goes. An autoregressive model needs one step per token. A diffusion model needs one step per denoising pass, and the number of passes is a setting chosen up front rather than something dictated by output length. It is usually kept far below the token count, with fewer passes trading away some quality. Generate 50 tokens or 500, and the number of steps barely changes.
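To make the step counting concrete, here is a small sketch. The function name and the budget of 16 denoising passes are illustrative assumptions, not figures from any specific model.

```python
def sequential_steps(num_tokens, num_denoising_steps=None):
    """Count the passes that must run one after another.

    num_denoising_steps=None models autoregressive decoding (one step per
    token). Otherwise it models a diffusion decoder with a fixed pass
    budget; it never needs more passes than there are tokens.
    """
    if num_denoising_steps is None:
        return num_tokens
    return min(num_tokens, num_denoising_steps)

if __name__ == "__main__":
    for n in (50, 500):
        ar = sequential_steps(n)
        diff = sequential_steps(n, num_denoising_steps=16)  # assumed budget
        print(f"{n} tokens: autoregressive={ar} steps, diffusion={diff} steps")
```

Under that assumed budget, going from 50 to 500 tokens multiplies the autoregressive step count by ten and leaves the diffusion step count at 16.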

The 2026 payoff: speed

For years this was a research curiosity: text is discrete, and diffusion was born for continuous data like pixels, so quality lagged. That changed. Commercial diffusion LLMs arrived and the headline was raw throughput: Inception Labs’ Mercury 2 reached over 1,000 tokens per second, roughly 10x the fastest autoregressive models. And the speed did not come at the cost of usefulness: a 7B Mercury model hit 1,109 tokens/sec at 71.9 on MMLU, versus a comparable autoregressive model at 240 tokens/sec and 73.1. That is about 4.6x the throughput for a 1.2-point drop on MMLU.
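As a rough feel for what those throughput figures mean for a single response, the arithmetic below converts them into wall-clock time for a 500-token answer. It assumes the quoted tokens/sec apply directly to one request and ignores prompt processing and network overhead, so treat it as a back-of-the-envelope sketch.

```python
def seconds_to_generate(num_tokens, tokens_per_second):
    """Time to emit num_tokens at a steady decoding throughput."""
    return num_tokens / tokens_per_second

# Throughput figures quoted in the text above.
diffusion_s = seconds_to_generate(500, 1109)
autoregressive_s = seconds_to_generate(500, 240)

if __name__ == "__main__":
    print(f"diffusion:      {diffusion_s:.2f} s")
    print(f"autoregressive: {autoregressive_s:.2f} s")
    print(f"speedup:        {autoregressive_s / diffusion_s:.1f}x")
```

That works out to roughly 0.45 seconds versus 2.08 seconds, the same 4.6x ratio seen in the raw throughput numbers.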

There is an architectural twist worth noting. Because a diffusion model attends to the whole sequence bidirectionally at every step and does not generate left to right, it does not use a KV cache the way autoregressive models do — sidestepping one of the biggest cost centers of long-context serving, though it pays in its own way by reprocessing the full sequence each step.
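Both sides of that trade can be put into numbers. The sketch below uses the standard size formula for a transformer KV cache (keys plus values, per layer, per KV head, per position) and a simple count of how many token positions the network has to process during generation. The model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values) and the 16-pass budget are hypothetical, chosen only for illustration.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Memory for one sequence's KV cache.

    The factor of 2 covers keys and values; bytes_per_value=2 assumes
    16-bit storage.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

def positions_processed_autoregressive(num_tokens):
    # With a KV cache, each decoding step runs the network on one new token.
    return num_tokens

def positions_processed_diffusion(num_tokens, num_steps):
    # Without a cache, every denoising pass re-runs the network on all positions.
    return num_tokens * num_steps

if __name__ == "__main__":
    # Hypothetical model shape, for illustration only.
    cache = kv_cache_bytes(seq_len=32768, n_layers=32, n_kv_heads=8, head_dim=128)
    print(f"KV cache at 32k context: {cache / 1024**3:.1f} GiB per sequence")
    print("autoregressive positions:", positions_processed_autoregressive(500))
    print("diffusion positions:", positions_processed_diffusion(500, num_steps=16))
```

For this assumed shape the cache costs 4 GiB per 32k-token sequence, which is the memory a cache-free decoder avoids. In exchange, the diffusion decoder processes 8,000 token positions to produce 500 tokens instead of 500, although it does that work in a few wide parallel passes rather than many narrow ones.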

The takeaway

The left-to-right habit was never a law of nature — it was a choice, and a costly one for latency. Diffusion language models show there is another way to put words on the page: all at once, refined over a few parallel passes. Whether they become the default or stay the fast specialist, they are the clearest reminder of 2026 that the architecture we think of as “how LLMs work” is just one point in a much larger design space — the same lesson as Mamba’s challenge to attention.