datarekha
Deep Learning | Medium | Asked at Google, OpenAI, Meta, Anthropic

What are the concrete reasons transformers outperform RNNs on most sequence tasks?

The short answer

Transformers win on three axes: parallelism (with no sequential dependency between positions, all positions train simultaneously on GPUs), path length (any two tokens interact within O(1) layers rather than O(n) recurrent steps), and scalability (quality keeps improving with more compute and longer contexts, whereas RNN quality degrades as sequences grow, however much training compute is spent).

How to think about it

The comparison across five dimensions:

| Dimension | RNN / LSTM | Transformer |
| --- | --- | --- |
| Training parallelism | Sequential: step t waits for step t-1 | Fully parallel across all positions |
| Long-range dependency path | O(n) multiplicative steps | O(1) attention steps |
| Gradient flow | Vanishes / explodes over distance | Additive residuals; stable |
| Memory at inference | Fixed-size hidden state | Full KV cache (grows with context) |
| Context length scaling | Practically capped at ~1k tokens | Scales to 128k+ with engineering |
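
The memory row can be made concrete with back-of-the-envelope arithmetic. The sketch below assumes a hypothetical decoder with 32 layers, 32 key/value heads of dimension 128, and 2-byte (fp16) values, alongside an LSTM-style state with hidden size 4096. None of these figures come from the comparison above, and the function names are illustrative.

```python
def kv_cache_bytes(seq_len, layers=32, kv_heads=32, head_dim=128, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence (hypothetical config)."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # 2 = keys + values
    return per_token * seq_len

def recurrent_state_bytes(layers=32, hidden_size=4096, states_per_layer=2, bytes_per_value=2):
    """Bytes for an LSTM-style state (h and c per layer); independent of sequence length."""
    return layers * hidden_size * states_per_layer * bytes_per_value

GIB = 1024 ** 3
for n in (1_024, 131_072):
    print(f"{n:>7} tokens: KV cache {kv_cache_bytes(n) / GIB:6.2f} GiB, "
          f"recurrent state {recurrent_state_bytes() / GIB:.6f} GiB")
```

Under these assumed sizes the cache costs 0.5 MiB per token, so it reaches 64 GiB at 131,072 tokens, while the recurrent state stays at 512 KiB regardless of length.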

Parallelism is the biggest practical win. Modern hardware (GPUs, TPUs) thrives on matrix multiplications that can be batched across all sequence positions at once. An RNN over length 1024 requires 1024 sequential matrix-vector products, each waiting on the previous one; a transformer layer processes all 1024 positions with a few large batched matrix-matrix multiplications, which can run 10–100× faster on such hardware.
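
The dependency structure behind this can be sketched in a few lines. This is a toy with scalar inputs in which the queries, keys, and values are all the raw inputs; the function names are illustrative, not from any library. The point is only that each attention output reads the inputs and never another position's output, so positions can be computed in any order (and therefore in parallel), whereas the recurrent loop cannot be reordered.

```python
import math

def rnn_states(xs, w_h=0.5, w_x=1.0):
    """Scalar RNN: each state needs the previous one, so the loop is inherently sequential."""
    h, states = 0.0, []
    for x in xs:                                  # step t waits for step t-1
        h = math.tanh(w_h * h + w_x * x)
        states.append(h)
    return states

def attend(t, xs):
    """Causal self-attention output at position t (toy: query = key = value = input)."""
    scores = [xs[t] * xs[j] for j in range(t + 1)]
    m = max(scores)                               # subtract the max for a stable softmax
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return sum(w * xs[j] for j, w in enumerate(weights)) / total

def attention_outputs(xs, order):
    """Fill in every position in the given order; no position reads another's output."""
    out = [None] * len(xs)
    for t in order:
        out[t] = attend(t, xs)
    return out

xs = [0.1, -0.4, 0.9, 0.3, -0.7]
forward = attention_outputs(xs, range(len(xs)))
backward = attention_outputs(xs, reversed(range(len(xs))))
order_independent = forward == backward           # same result in either order
```

On real hardware the per-position loop in `attention_outputs` is replaced by batched matrix multiplications over all positions at once; the loop here only makes the independence explicit.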

Path length determines learnability of long-range dependencies. In a transformer, token 1 and token 1000 interact directly within a single attention layer, and the gradient between them passes through at most 2N sub-layers (N layers, each with an attention and a feed-forward sub-layer, both wrapped in add-and-norm shortcuts). In an RNN, it passes through 999 multiplicative Jacobian products.
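
A toy scalar calculation shows why the number and form of those factors matter. It assumes a one-unit recurrence h_t = tanh(w * h_{t-1}) and residual sub-layers of the form x + tanh(w * x), with an arbitrary weight w = 0.9; attention and layer normalisation are left out, so this illustrates multiplicative versus additive paths rather than modelling a real network.

```python
import math

def rnn_gradient(steps, w=0.9, h0=0.5):
    """d h_steps / d h_0 for h_t = tanh(w * h_{t-1}): a product of per-step Jacobians."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w * h)
        grad *= w * (1.0 - h * h)          # each factor has magnitude at most |w|
    return grad

def residual_gradient(sublayers, w=0.9, x0=0.5):
    """d x_out / d x_0 through sub-layers x + tanh(w * x) (toy; no normalisation)."""
    x, grad = x0, 1.0
    for _ in range(sublayers):
        f = math.tanh(w * x)
        grad *= 1.0 + w * (1.0 - f * f)    # identity term keeps every factor >= 1
        x = x + f
    return grad

rnn_grad = rnn_gradient(999)               # token 1 to token 1000 in the recurrence
res_grad = residual_gradient(12)           # 2N sub-layers with N = 6
```

With these assumed values the recurrent gradient is bounded by 0.9^999, which is below 1e-40, while the residual-path gradient cannot drop below 1 because every factor includes the identity term.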

Scalability. Transformers follow smooth scaling laws (as in the Chinchilla study): doubling compute yields a predictable improvement in loss. RNNs plateau earlier because their quality degrades with sequence length before added compute can pay off.
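
What "predictable" means here can be sketched with a generic power law. The constants below are made up for illustration and are not fitted values from Chinchilla or any other study; only the functional shape (an irreducible floor plus a term that decays as a power of compute) is the point.

```python
# Made-up constants, for illustration only (not fitted to any real model family).
IRREDUCIBLE, SCALE, ALPHA = 1.7, 10.0, 0.05

def reducible_loss(compute):
    """The part of the loss that more compute can still remove."""
    return SCALE * compute ** (-ALPHA)

def loss(compute):
    """Total loss: irreducible floor plus a power-law term in compute."""
    return IRREDUCIBLE + reducible_loss(compute)

# Doubling compute shrinks the reducible loss by the same factor at every scale.
ratios = [reducible_loss(2 * c) / reducible_loss(c) for c in (1e18, 1e20, 1e22)]
expected_ratio = 2 ** (-ALPHA)
```

Under a power law of this form, every doubling of compute multiplies the reducible loss by the same factor, 2^(-ALPHA), which is what lets the outcome of a larger training run be extrapolated from smaller ones.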
