What is the vanishing gradient problem, and how do you address it?

For ML Engineer research-engineer Data Scientist

The short answer

Vanishing gradients occur when gradients shrink toward zero as they propagate back through many layers, so early layers learn extremely slowly or not at all; it is common with sigmoid or tanh activations in deep networks. Mitigations include ReLU-family activations, residual/skip connections, batch or layer normalization, careful initialization, and gated architectures like LSTMs.

How to think about it

Vanishing gradients occur when gradients shrink toward zero as they propagate back through many layers, so early layers learn extremely slowly or not at all; it is common with sigmoid or tanh activations in deep networks. Mitigations include ReLU-family activations, residual/skip connections, batch or layer normalization, careful initialization, and gated architectures like LSTMs.

Learn it properly Vanishing & exploding gradients

Keep practising

What is the vanishing gradient problem and how do you fix it? Why does sigmoid saturation cause vanishing gradients, and why is tanh only a partial fix? How do LSTM gates solve the vanishing gradient problem? What is the dying ReLU problem and how do you prevent it? What causes exploding gradients and how is gradient clipping a fix?

All Deep Learning questions

Explore further

Gradient descent The training loop Gradient Descent (One Step)

Vanishing Gradient Exploding Gradient Gradient Clipping Residual Connection