datarekha
Deep Learning · Medium · Asked at Google, Meta, and OpenAI

What causes exploding gradients and how is gradient clipping a fix?

The short answer

Exploding gradients happen when the layer Jacobians multiplied together during backpropagation have singular values that are consistently greater than 1, so the gradient norm can grow exponentially with depth (or with sequence length in a recurrent network). Gradient clipping rescales the gradient down to a maximum norm whenever it exceeds that threshold, before the weight update. This prevents divergence while preserving the gradient's direction.

How to think about it

Exploding gradients are the mirror image of vanishing gradients. When weight matrices have large singular values, the Jacobian products amplify rather than shrink the gradient signal. The resulting weight updates can be many orders of magnitude too large (10^6 or more is not unusual), which overshoots any minimum and eventually overflows into inf or NaN loss values.
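To make the Jacobian-product argument concrete, here is one way to write it down. The notation is an assumption introduced for illustration: h_l is the activation of layer l, J_l is the Jacobian of h_l with respect to h_{l-1}, \mathcal{L} is the loss, and L is the number of layers.

```latex
\frac{\partial \mathcal{L}}{\partial h_k}
  = J_{k+1}^{\top} J_{k+2}^{\top} \cdots J_{L}^{\top}
    \frac{\partial \mathcal{L}}{\partial h_L},
\qquad
J_l = \frac{\partial h_l}{\partial h_{l-1}}

\left\lVert \frac{\partial \mathcal{L}}{\partial h_k} \right\rVert_2
  \le \Bigl( \prod_{l=k+1}^{L} \sigma_{\max}(J_l) \Bigr)
      \left\lVert \frac{\partial \mathcal{L}}{\partial h_L} \right\rVert_2
```

If every \sigma_{\max}(J_l) is below 1, the bound forces the gradient to vanish. Values above 1 make explosion possible rather than guaranteed: when the gradient lines up with directions that each layer stretches by a factor of roughly \gamma > 1, its norm grows like \gamma^{L-k}.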

Most common context: recurrent networks (RNNs, LSTMs) processing long sequences. The same weight matrix W is reused at every time step, so its transpose W^T is applied hundreds of times in backprop-through-time (BPTT).
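The effect of reusing W is easy to see numerically. The sketch below assumes a purely linear recurrence h_t = W h_{t-1} (no activation function) and two made-up 2x2 matrices, so it illustrates the mechanism rather than modelling a real RNN. The helper names are hypothetical.

```python
import math

def matvec_transpose(W, g):
    """Return W^T g for a square matrix W stored as a list of rows."""
    n = len(W)
    return [sum(W[i][j] * g[i] for i in range(n)) for j in range(n)]

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def backprop_norms(W, steps):
    """Gradient norm after each backward step through h_t = W h_{t-1}."""
    g = [1.0] * len(W)          # gradient arriving at the last time step
    norms = []
    for _ in range(steps):
        g = matvec_transpose(W, g)   # one step of BPTT multiplies by W^T
        norms.append(l2_norm(g))
    return norms

# Illustrative matrices (assumptions): eigenvalues above 1 vs. below 1.
W_large = [[1.2, 0.1], [0.0, 1.1]]
W_small = [[0.6, 0.05], [0.0, 0.55]]

exploding = backprop_norms(W_large, 100)
vanishing = backprop_norms(W_small, 100)

print(f"after 100 steps, large W: {exploding[-1]:.3e}")
print(f"after 100 steps, small W: {vanishing[-1]:.3e}")
```

With the larger matrix the norm grows by roughly a factor of 1.2 per step, while the smaller matrix drives it toward zero: the same mechanism produces both exploding and vanishing gradients.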

Diagnosis:

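# Run after loss.backward(): accumulate the squared L2 norm of every parameter gradient
total_norm = 0.0
for p in model.parameters():
    if p.grad is not None:
        total_norm += p.grad.detach().norm(2).item() ** 2
total_norm = total_norm ** 0.5  # global L2 norm across all parameters
print(f"Gradient norm: {total_norm:.4f}")
# Rough heuristic (model-dependent): norms that regularly exceed 10–100, or that
# spike far above their usual level, suggest an exploding gradient problem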

Gradient clipping by norm (preferred):

# After loss.backward(), before optimizer.step()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
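For a self-contained view of where the call sits in a training step, here is a minimal sketch. The toy model, the random data, and the deliberately large input scale are assumptions chosen only to provoke a large gradient. It also uses the fact that clip_grad_norm_ returns the total norm measured before clipping, which can be logged for monitoring.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical toy setup, only here to make the sketch runnable.
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x = torch.randn(8, 4) * 100.0   # large inputs to provoke a large gradient
y = torch.randn(8, 1)

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# Clip in place; the return value is the total norm before clipping.
pre_clip_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Recompute the global norm to confirm it has been capped.
post_clip_norm = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))

optimizer.step()
print(f"before clipping: {float(pre_clip_norm):.2f}, after: {float(post_clip_norm):.4f}")
```

Logging the returned norm at every step is a cheap way to see how often clipping actually triggers, which helps when choosing max_norm.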

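Norm clipping scales the entire gradient vector so that its L2 norm does not exceed max_norm; gradients already below the threshold are left unchanged. The direction of the gradient is preserved and only its magnitude is capped. This is usually preferred over value clipping (torch.nn.utils.clip_grad_value_), which clamps each element independently and can therefore change the gradient's direction.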
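The difference between the two clipping styles can be checked on a single vector. The following is a simplified pure-Python sketch with hypothetical helper names and a made-up gradient; PyTorch's own implementation differs in details, but the geometry is the same.

```python
import math

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def clip_by_norm(grad, max_norm):
    """Rescale the whole vector if its L2 norm exceeds max_norm."""
    norm = l2_norm(grad)
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]

def clip_by_value(grad, clip_value):
    """Clamp each element independently to [-clip_value, clip_value]."""
    return [max(-clip_value, min(clip_value, g)) for g in grad]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (l2_norm(a) * l2_norm(b))

grad = [30.0, 0.4]   # illustrative gradient dominated by one component

norm_clipped = clip_by_norm(grad, 1.0)
value_clipped = clip_by_value(grad, 1.0)

print("norm clipping: ", norm_clipped, "cosine:", cosine(grad, norm_clipped))
print("value clipping:", value_clipped, "cosine:", cosine(grad, value_clipped))
```

Norm clipping returns a vector that is parallel to the original, while value clipping shrinks only the large component and so rotates the update toward the small one.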

Other mitigations:

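  • He / orthogonal initialization controls the initial scale: He initialization keeps activation and gradient variance roughly constant across ReLU layers, and orthogonal initialization starts every singular value of the weight matrix at 1.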
  • Smaller learning rate buys stability but does not fix the root cause.
  • For RNNs: LSTM / GRU gates provide a learned mechanism to regulate how much signal flows back in time. They mainly help with vanishing gradients, so clipping is still commonly used alongside them.