What is gradient clipping and when would you use it?

Gradient clipping caps the magnitude of gradients (by value or by global norm) before the optimizer step, preventing exploding gradients that cause unstable or diverging training. It is especially useful in RNNs and transformers, where a single large update can destabilize learning.
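
The difference between the two variants can be sketched in a few lines of NumPy. This is an illustrative example, not a library implementation: the helper names `clip_by_value`, `clip_by_norm`, and `cosine`, and the sample gradient, are assumptions made up for the demonstration.

```python
import numpy as np

def clip_by_value(grad, limit):
    # Clamp each component independently into [-limit, limit].
    return np.clip(grad, -limit, limit)

def clip_by_norm(grad, max_norm):
    # Rescale the whole vector only if its L2 norm exceeds max_norm.
    norm = np.linalg.norm(grad)
    if norm <= max_norm:
        return grad
    return grad * (max_norm / norm)

def cosine(a, b):
    # Cosine similarity: 1.0 means the two vectors point the same way.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

grad = np.array([30.0, 4.0])          # made-up gradient with one very large component
by_value = clip_by_value(grad, 5.0)
by_norm = clip_by_norm(grad, 5.0)

print("by value:", by_value, "cosine with original:", round(cosine(grad, by_value), 3))
print("by norm: ", by_norm, "cosine with original:", round(cosine(grad, by_norm), 3))
```

Clipping by value can rotate the gradient because each component is clamped on its own, while clipping by norm only shortens it.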

What is the vanishing gradient problem, and how do you address it?

Vanishing gradients occur when gradients shrink toward zero as they propagate back through many layers, so early layers learn extremely slowly or not at all; it is common with sigmoid or tanh activations in deep networks. Mitigations include ReLU-family activations, residual/skip connections, batch or layer normalization, careful initialization, and gated architectures like LSTMs.

What causes exploding gradients and how is gradient clipping a fix?

Exploding gradients happen when the layer Jacobians multiplied together during backpropagation consistently have spectral norm greater than 1, so the gradient can grow exponentially with depth. Gradient clipping rescales the gradient so that its norm does not exceed a maximum threshold before the weight update, preventing divergence without changing the gradient's direction.

What is the vanishing gradient problem and how do you fix it?

Vanishing gradients occur when repeated multiplication of small derivatives during backpropagation drives gradients toward zero, starving early layers of learning signal. The main fixes are better activations (ReLU/GELU), residual connections, batch normalization, and careful weight initialization.

Vanishing & exploding gradients — Deep Learning

Backprop computes a node’s gradient by multiplying the downstream gradient by a local derivative — over and over, once per layer, all the way back to the input. That repeated multiplication is the whole danger. Multiply twenty numbers that are each 0.5 and you get 0.5²⁰ ≈ 0.000001. Multiply twenty numbers that are each 1.5 and you get 1.5²⁰ ≈ 3325. The chain rule turns depth into an exponential, and the gradient either vanishes to nothing or explodes to NaN.
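
The arithmetic above is easy to reproduce. The sketch below treats every layer as a single scalar factor, which is a deliberate simplification; the function name `backprop_scale` is hypothetical.

```python
def backprop_scale(factor, depth):
    """Multiply a unit gradient by the same local derivative once per layer."""
    grad = 1.0                      # gradient arriving at the output
    for _ in range(depth):
        grad *= factor              # one chain-rule multiply per layer
    return grad

for factor in (0.5, 1.0, 1.5):
    result = backprop_scale(factor, depth=20)
    print(f"factor {factor}: gradient after 20 layers = {result:.3g}")
```

Only a factor of exactly 1 leaves the gradient unchanged; anything else compounds with depth.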

Try: gradient flow, backward through 15 layers

Where do gradients go to die?

A real backward pass. Each bar is the gradient norm at that layer; gradients enter at the output (right) and flow to the input (left). With sigmoid, every layer multiplies by a factor below one, so by the time gradients reach the early layers they've nearly vanished. Crank the weight scale to make them explode instead — then clip.

[Interactive figure: gradient norm (log scale) at each layer, input layer on the left, output layer on the right; weight scale 0.60.]

early/late ratio = 0.002. Vanishing: early layers get almost no gradient, so they barely learn. This is why deep sigmoid/tanh nets needed ReLU, good init, and residual connections.

See it happen, layer by layer

Watch the gradient norm at each of 15 layers. Gradients enter at the output and flow left toward the input. With sigmoid, whose derivative never exceeds 0.25, the early (input-side) layers are starved of gradient, so they barely learn. A too-large weight scale makes the gradient explode instead; only a well-scaled setup keeps the norm close to 1 across layers.

The key reading: the input-side layers (left) are where gradients are weakest. That is why, before modern tricks, deep networks “couldn’t train their early layers” — the learning signal evaporated before it reached them.
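
A backward pass like this can be approximated offline. The following is a minimal sketch under stated assumptions: a stack of 15 fully connected sigmoid layers of width 64, Gaussian weights scaled by `weight_scale / sqrt(width)`, and a unit-norm gradient injected at the output. The function name and all of these choices are illustrative, so its numbers are not expected to match the interactive figure.

```python
import numpy as np

def layer_grad_norms(weight_scale, depth=15, width=64, seed=0):
    """Return the gradient norm at each layer's input, ordered input -> output."""
    rng = np.random.default_rng(seed)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    # Forward pass through `depth` sigmoid layers, caching what backward needs.
    weights, activations = [], []
    a = rng.standard_normal(width)
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * weight_scale / np.sqrt(width)
        a = sigmoid(W @ a)
        weights.append(W)
        activations.append(a)

    # Backward pass: start with a unit-norm gradient at the output.
    grad = rng.standard_normal(width)
    grad /= np.linalg.norm(grad)
    norms = []
    for W, a in zip(reversed(weights), reversed(activations)):
        grad = W.T @ (grad * a * (1.0 - a))   # chain rule: sigmoid'(z) = a * (1 - a)
        norms.append(np.linalg.norm(grad))
    return norms[::-1]                        # index 0 is the input-side layer

norms = layer_grad_norms(weight_scale=0.6)
print(f"input-side norm  = {norms[0]:.2e}")
print(f"output-side norm = {norms[-1]:.2e}")
print(f"early/late ratio = {norms[0] / norms[-1]:.2e}")
```

Changing `weight_scale` lets you explore other regimes with the same code.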

Vanishing: why it happens, how to fix it

The per-layer multiplier is roughly ‖W‖ × (activation derivative). Two things make it shrink below 1:

Saturating activations. Sigmoid and tanh flatten out for large inputs; their derivative goes to ~0 there. ReLU’s derivative is exactly 1 for positive inputs — it doesn’t shrink the gradient. This is the single biggest reason ReLU replaced sigmoid in deep nets.
Too-small weights. Covered in weight init — the wrong scale decays the signal (and its gradient) every layer.
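
To make the saturation point concrete, here is a short sketch that evaluates the three activation derivatives at a few sample inputs. The helper names are hypothetical, and taking ReLU's derivative to be 0 at exactly 0 is a convention assumed here.

```python
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)              # peaks at 0.25 when z = 0

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2      # peaks at 1.0 when z = 0

def relu_grad(z):
    return (z > 0).astype(float)      # 1 for positive inputs, 0 otherwise

z = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])
for name, fn in [("sigmoid", sigmoid_grad), ("tanh", tanh_grad), ("relu", relu_grad)]:
    print(f"{name:8s}", np.round(fn(z), 4))
```

Sigmoid and tanh shrink toward 0 at both ends of the range, while ReLU stays at 1 on the whole positive side.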

The modern toolkit that keeps gradients alive through hundreds of layers:

ReLU-family activations (ReLU, GELU, SiLU) — non-saturating.
Good initialization — He/Xavier so the chain starts near 1.
Residual connections — x + f(x) gives the gradient a direct path that skips the multiplications (the reason ResNets and transformers go deep).
Normalization (BatchNorm, LayerNorm, RMSNorm) — rescales activations each layer so the chain can’t drift far from 1.
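
The effect of the residual path can be seen in a scalar toy model. For y = f(x) the backward factor per layer is f'(x); for y = x + f(x) it is 1 + f'(x). The sketch below assumes 50 layers and draws small, made-up local derivatives at random, so it illustrates the mechanism rather than any real network.

```python
import numpy as np

rng = np.random.default_rng(0)
depth = 50
# Hypothetical local derivatives f'(x) for each block, all small in magnitude.
local_derivs = rng.uniform(-0.2, 0.2, size=depth)

# Plain stack y = f(x): the backward factor per layer is f'(x).
plain = np.prod(local_derivs)

# Residual stack y = x + f(x): the backward factor per layer is 1 + f'(x).
residual = np.prod(1.0 + local_derivs)

print(f"plain chain    |gradient| = {abs(plain):.3e}")
print(f"residual chain |gradient| = {abs(residual):.3e}")
```

Because every residual factor stays close to 1, the product cannot collapse the way the plain chain does.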

Exploding: clip the gradient

Exploding gradients are the opposite — common in RNNs and with too-large weights or learning rates. The standard fix is gradient clipping: before the optimizer step, if the total gradient norm exceeds a threshold, rescale the whole gradient down to that threshold. Direction is preserved; only the magnitude is capped.

import numpy as np

# Pretend these are the gradients of three parameter tensors after backward().
grads = [np.array([3.0, 4.0]), np.array([12.0, 0.0]), np.array([0.0, 5.0])]

# Global L2 norm across ALL parameters (this is what PyTorch clips).
global_norm = np.sqrt(sum((g**2).sum() for g in grads))
print(f"global grad norm = {global_norm:.2f}")

max_norm = 5.0
if global_norm > max_norm:
    scale = max_norm / (global_norm + 1e-6)
    grads = [g * scale for g in grads]      # rescale EVERY gradient by the same factor
    print(f"clipped: scaled all grads by {scale:.3f}")

new_norm = np.sqrt(sum((g**2).sum() for g in grads))
print(f"new global norm = {new_norm:.2f}  (direction unchanged)")

global grad norm = 13.93
clipped: scaled all grads by 0.359
new global norm = 5.00  (direction unchanged)

In PyTorch this is one line, placed after backward() and before step():

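import torch

# Assumes `model`, `optimizer`, and `loss` already exist in the training loop.
loss.backward()                                                   # populate .grad on every parameter
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # rescale in place if the global L2 norm exceeds 1.0
optimizer.step()                                                  # update using the clipped gradients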

In one breath

Backprop multiplies a per-layer factor at every layer, so depth turns the chain rule into an exponential: a factor below 1 vanishes the gradient, above 1 explodes it.
The input-side (early) layers are hit hardest by vanishing — the signal evaporates before it reaches them.
Vanishing fixes: non-saturating activations (ReLU/GELU/SiLU), good init (He/Xavier), residual connections (a multiply-free path), and normalization.
Exploding fix: clip the global gradient norm — rescale all gradients by one factor when the norm exceeds a threshold, placed between backward() and step().
The single most useful number to watch is the global gradient norm: climbing → exploding (clip, lower lr); decaying while loss plateaus → vanishing (check activations, init, residuals).
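
One way to watch that number is to log it every step and compare the latest value with the first. The sketch below is a crude heuristic under assumptions: the function names and the factor-of-10 threshold are hypothetical choices, not an established rule.

```python
import numpy as np

def global_grad_norm(grads):
    """L2 norm over all gradient arrays taken together."""
    return float(np.sqrt(sum((g ** 2).sum() for g in grads)))

def classify_trend(norm_history, factor=10.0):
    """Hypothetical heuristic: compare the latest logged norm to the first one."""
    first, last = norm_history[0], norm_history[-1]
    if not np.isfinite(last) or last > factor * first:
        return "climbing"       # suspect exploding: clip, lower the learning rate
    if last < first / factor:
        return "decaying"       # suspect vanishing if the loss has also plateaued
    return "steady"

print(global_grad_norm([np.array([3.0, 4.0])]))
print(classify_trend([1.0, 3.0, 40.0]))
print(classify_trend([1.0, 0.3, 0.02]))
print(classify_trend([1.0, 1.4, 0.9]))
```

In a real training loop the history would hold one norm per optimizer step, and the threshold would need tuning for the model at hand.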

Quick check

Q1. Why does depth cause vanishing or exploding gradients?

Q2. Why did ReLU help so much with vanishing gradients compared to sigmoid?

Q3. What does gradient clipping (clip_grad_norm_) do, and when do you call it?

You can now keep a deep net’s gradients alive and bounded. Next, the knobs that set the step size those gradients drive: optimizers, learning-rate schedules, and how batch size and learning rate interact.
