datarekha
Deep Learning · Medium · Asked at Google, Meta, Amazon

Why do vanilla RNNs struggle with long sequences?

The short answer

Vanilla RNNs suffer from vanishing (and exploding) gradients when backpropagating through many time steps, which makes it hard for them to learn dependencies that span more than a short window of tokens. They are also inherently sequential: each step depends on the previous hidden state, so computation cannot be parallelised across time steps during training.

How to think about it

A vanilla RNN updates its hidden state as h_t = tanh(W_h * h_{t-1} + W_x * x_t + b). At training time, gradients are propagated back through every time step via backpropagation through time (BPTT). The gradient with respect to an early hidden state involves a product of Jacobians across all subsequent steps:

∂L/∂h_1 = (∂L/∂h_T) * ∏_{t=2}^{T} (∂h_t/∂h_{t-1})

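Each Jacobian factor is W_h^T * diag(1 - tanh²(·)), with tanh²(·) evaluated at the pre-activation of step t. Because 1 - tanh²(·) never exceeds 1, a spectral norm of W_h below 1 guarantees that this product shrinks exponentially with sequence length: the vanishing gradient problem. When the spectral norm exceeds 1, gradients can explode, although tanh saturation may still shrink them, so a large norm is a necessary condition for explosion rather than a sufficient one.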
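The shrinkage can be checked numerically. The following is a minimal NumPy sketch, not a training setup: the function name `backprop_norms`, the hidden size, the random inputs (which stand in for W_x * x_t + b), and the unit-norm starting gradient are all assumptions made for illustration. It runs a toy tanh RNN forward, then applies the factor W_h^T * diag(1 - h_t²) repeatedly and records the gradient norm at each step.

```python
import numpy as np

def backprop_norms(spectral_norm, T=50, n=16, seed=0):
    """Norms of dL/dh_t for t = T, T-1, ..., 1 in a toy tanh RNN (illustrative only)."""
    rng = np.random.default_rng(seed)

    # Random recurrent matrix, rescaled so its largest singular value is `spectral_norm`.
    W_h = rng.standard_normal((n, n))
    W_h *= spectral_norm / np.linalg.norm(W_h, 2)

    # Forward pass: h_t = tanh(W_h h_{t-1} + input_t).
    # `input_t` is a hypothetical stand-in for W_x x_t + b.
    h = 0.1 * rng.standard_normal(n)
    hidden_states = []
    for _ in range(T):
        input_t = 0.1 * rng.standard_normal(n)
        h = np.tanh(W_h @ h + input_t)
        hidden_states.append(h)

    # Backward pass: dL/dh_{t-1} = W_h^T diag(1 - h_t^2) dL/dh_t.
    grad = np.ones(n) / np.sqrt(n)  # assumed unit-norm dL/dh_T
    norms = [np.linalg.norm(grad)]
    for h_t in reversed(hidden_states[1:]):
        grad = W_h.T @ ((1.0 - h_t ** 2) * grad)
        norms.append(np.linalg.norm(grad))
    return norms

norms = backprop_norms(spectral_norm=0.5)
print(f"||dL/dh_T|| = {norms[0]:.3e}, ||dL/dh_1|| = {norms[-1]:.3e}")
```

With a spectral norm of 0.5, each backward step at least halves the gradient norm, so after 49 steps almost nothing reaches h_1. Calling the function with a spectral norm above 1 is worth trying too: the result depends on how saturated the tanh units are, so growth is possible but not guaranteed.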

Practical consequences:

  • The network forgets information from early positions; it learns only short-range patterns.
  • Gradient clipping is a common but imperfect fix for explosion; vanishing is structurally harder to address.
  • Because h_t depends on h_{t-1}, every step must be computed in order: the 1000 steps of a length-1000 sequence cannot be processed in parallel, so RNNs make poor use of parallel hardware such as GPUs.
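Gradient clipping, mentioned in the list above, is simple enough to sketch. This is an illustrative NumPy version of clipping by global norm; the function name `clip_by_global_norm` and the small epsilon in the denominator are assumptions of this sketch, not a specific library's API.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # Scale is 1.0 when the gradients are already small enough.
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads], total_norm

# An "exploded" gradient is scaled down; its direction is preserved.
clipped, norm_before = clip_by_global_norm([np.array([3.0, 4.0])], max_norm=1.0)
print(norm_before, np.linalg.norm(clipped[0]))
```

Clipping only caps the size of a gradient that is too large. It does nothing for a gradient that has already shrunk towards zero, which is one reason vanishing is the harder problem.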

LSTM and GRU gates mitigate vanishing gradients by providing an additive (rather than purely multiplicative) update path for the state carried across time: the cell state in an LSTM, the hidden state in a GRU. Transformers replace recurrence entirely with self-attention, which connects any two positions in a constant number of sequential operations and can process all positions in parallel during training.
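To see why the additive path helps, consider the standard LSTM cell-state update. The symbols c_t (cell state), f_t (forget gate), i_t (input gate) and g_t (candidate update) do not appear in the text above and are introduced here for illustration. The sketch follows only the direct path from c_{t-1} to c_t and ignores the indirect dependence of the gates on the previous hidden state.

```latex
\begin{aligned}
c_t &= f_t \odot c_{t-1} + i_t \odot g_t, \\
\left.\frac{\partial c_t}{\partial c_{t-1}}\right|_{\text{direct}} &= \operatorname{diag}(f_t), \\
\left.\frac{\partial c_T}{\partial c_1}\right|_{\text{direct}} &= \prod_{t=2}^{T} \operatorname{diag}(f_t).
\end{aligned}
```

Along this path there is no repeated multiplication by W_h and no tanh derivative. If the network learns forget-gate values close to 1, the product stays close to 1 over many steps, so gradient can flow back to early positions. If the forget gates are small, the gradient still decays, which is why the gates mitigate the problem rather than remove it.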
