Deep Learning Medium Asked at GoogleAsked at AmazonAsked at Meta

Why does sigmoid saturation cause vanishing gradients, and why is tanh only a partial fix?

For Data Scientist ML Engineer AI / LLM Engineer

The short answer

Sigmoid's derivative peaks at 0.25 and approaches zero in both tails, so the chain of gradient multiplications collapses exponentially in deep networks. Tanh's derivative peaks at 1 and is zero-centered, which helps weight update symmetry, but it still saturates at large magnitudes and the gradient still shrinks to near-zero in both tails.

How to think about it

Sigmoid saturation:

σ(z) = 1 / (1 + e^{-z})
σ'(z) = σ(z) · (1 - σ(z))

At z = 0: σ'(0) = 0.25 (maximum). At z = 3: σ'(3) ≈ 0.045. At z = 6: σ'(6) ≈ 0.002.

In a 20-layer network all initialized with moderate weights, many neurons will have |z| > 2. Multiplying twenty derivatives each around 0.05 gives 0.05^{20} ≈ 10^{-26} — the gradient signal reaching the first layer is numerically zero.

Why tanh is better but not solved:

tanh'(z) = 1 - tanh(z)²

At z = 0: tanh'(0) = 1 (four times larger than sigmoid). At z = 2: tanh'(2) ≈ 0.07. At z = 3: tanh'(3) ≈ 0.01.

So tanh gives a better gradient near the origin — helpful for early training — but still collapses in the tails. In a 20-layer net with saturated units, the problem remains.

Additionally, tanh’s zero-centering does help one known pathology of sigmoid: because sigmoid outputs are always positive, the gradients w.r.t. a layer’s weights all share the same sign, causing correlated zig-zag updates. Tanh removes this bias.

import torch

z_vals = torch.tensor([-3.0, -1.0, 0.0, 1.0, 3.0])
sig = torch.sigmoid(z_vals)
sig_grad = sig * (1 - sig)
# tensor([0.0452, 0.1966, 0.2500, 0.1966, 0.0452])

tanh_val = torch.tanh(z_vals)
tanh_grad = 1 - tanh_val ** 2
# tensor([0.0099, 0.4200, 1.0000, 0.4200, 0.0099])

The pattern is clear: at ±3, sigmoid gradient is 0.045 and tanh gradient is 0.01. Both are small, just from different starting points.

Learn it properly Backpropagation foundations

Why does sigmoid saturation cause vanishing gradients, and why is tanh only a partial fix?

Keep practising

Explore further