Why does sigmoid saturation cause vanishing gradients, and why is tanh only a partial fix?
Sigmoid's derivative peaks at 0.25 and approaches zero in both tails, so the product of per-layer derivatives in backpropagation shrinks exponentially with depth. Tanh's derivative peaks at 1 and its output is zero-centered, which helps weight-update symmetry, but tanh still saturates at large |z|, where its gradient shrinks to near zero in both tails.
How to think about it
Sigmoid saturation:
σ(z) = 1 / (1 + e^{-z})
σ'(z) = σ(z) · (1 - σ(z))
At z = 0: σ'(0) = 0.25 (maximum).
At z = 3: σ'(3) ≈ 0.045.
At z = 6: σ'(6) ≈ 0.0025.
In a 20-layer sigmoid network, even moderately sized weights leave many neurons with |z| around 3, where σ' ≈ 0.05. Ignoring the weight terms in the chain, multiplying twenty derivatives of that size gives 0.05^{20} ≈ 10^{-26}, so the gradient signal reaching the first layer is effectively zero.
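To make the arithmetic above concrete, here is a minimal sketch in plain Python. It assumes every layer contributes the same sigmoid derivative and ignores the weight terms that also appear in the real chain rule, so it illustrates the scaling rather than modeling an actual network. The helper names `sigmoid_grad` and `chained_factor` are made up for this example.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z: float) -> float:
    # sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)


def chained_factor(z: float, depth: int) -> float:
    # Product of `depth` identical sigmoid derivatives (weight terms ignored).
    return sigmoid_grad(z) ** depth


for z in (0.0, 2.0, 3.0):
    per_layer = sigmoid_grad(z)
    total = chained_factor(z, 20)
    print(f"z={z}: per-layer {per_layer:.4f}, 20 layers {total:.3e}")
```

Even at z = 0, where the derivative is at its maximum of 0.25, twenty layers multiply to roughly 10^{-12}; saturation makes an already bad situation worse.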
Why tanh is better but not a full fix:
tanh'(z) = 1 - tanh(z)²
At z = 0: tanh'(0) = 1 (four times sigmoid's maximum of 0.25).
At z = 2: tanh'(2) ≈ 0.07.
At z = 3: tanh'(3) ≈ 0.01.
So tanh gives a larger gradient near the origin, which helps early training, but its gradient still collapses in the tails. In a 20-layer net with saturated units, the problem remains.
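A quick way to see how tanh's tails compare with sigmoid's is the identity tanh(z) = 2σ(2z) - 1, which gives tanh'(z) = 4σ'(2z): tanh is a rescaled sigmoid that saturates twice as fast in z. The sketch below, in plain Python with hypothetical helper names, checks that identity numerically and uses bisection to locate the point where tanh's derivative drops below sigmoid's.

```python
import math


def sigmoid_grad(z: float) -> float:
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def tanh_grad(z: float) -> float:
    return 1.0 - math.tanh(z) ** 2


def crossover(lo: float = 0.0, hi: float = 5.0, iters: int = 60) -> float:
    # Bisection for the z > 0 at which tanh_grad(z) equals sigmoid_grad(z).
    # tanh_grad is larger at lo and smaller at hi, so the root lies between.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if tanh_grad(mid) > sigmoid_grad(mid):
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Largest deviation from tanh'(z) = 4 * sigma'(2z) over a few sample points.
identity_error = max(
    abs(tanh_grad(z) - 4.0 * sigmoid_grad(2.0 * z)) for z in (-3.0, -1.0, 0.0, 1.0, 3.0)
)
z_cross = crossover()
print(f"identity error: {identity_error:.2e}, crossover at |z| = {z_cross:.3f}")
```

Under these definitions the crossover lands at about |z| ≈ 1.66; beyond that point tanh's gradient is the smaller of the two, consistent with the values at z = 3 above.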
Additionally, tanh's zero-centering does fix one known pathology of sigmoid: because sigmoid outputs are always positive, the gradients with respect to any one neuron's incoming weights all share the same sign, which forces correlated zig-zag updates. Tanh outputs take both signs, which removes this bias.
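The sign argument can be checked with a small sketch. For a neuron with pre-activation z = Σ w_i x_i, the gradient of the loss with respect to w_i is δ · x_i, where δ = ∂L/∂z. The pre-activation values and the value of δ below are arbitrary numbers chosen for illustration, and `weight_grads` is a hypothetical helper.

```python
import math


def weight_grads(inputs, delta):
    # dL/dw_i = delta * x_i for a neuron with z = sum(w_i * x_i).
    return [delta * x for x in inputs]


# Hypothetical pre-activations of the previous layer (mixed signs).
pre_acts = [-2.0, -0.5, 0.5, 2.0]
sig_inputs = [1.0 / (1.0 + math.exp(-z)) for z in pre_acts]  # all in (0, 1)
tanh_inputs = [math.tanh(z) for z in pre_acts]               # in (-1, 1)

delta = -0.3  # hypothetical upstream gradient dL/dz for this neuron

sig_signs = {math.copysign(1.0, g) for g in weight_grads(sig_inputs, delta)}
tanh_signs = {math.copysign(1.0, g) for g in weight_grads(tanh_inputs, delta)}

print("sigmoid-fed weight gradient signs:", sorted(sig_signs))
print("tanh-fed weight gradient signs:   ", sorted(tanh_signs))
```

With sigmoid inputs every weight gradient takes the sign of δ, so all of the neuron's weights must move in the same direction on a given step. With tanh inputs the signs can differ from weight to weight.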
import torch
z_vals = torch.tensor([-3.0, -1.0, 0.0, 1.0, 3.0])
# Sigmoid derivative: sigma(z) * (1 - sigma(z))
sig = torch.sigmoid(z_vals)
sig_grad = sig * (1 - sig)
print(sig_grad)
# tensor([0.0452, 0.1966, 0.2500, 0.1966, 0.0452])
# Tanh derivative: 1 - tanh(z)^2
tanh_val = torch.tanh(z_vals)
tanh_grad = 1 - tanh_val ** 2
print(tanh_grad)
# tensor([0.0099, 0.4200, 1.0000, 0.4200, 0.0099])
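The pattern is clear: at ±3, the sigmoid gradient is about 0.045 and the tanh gradient is about 0.01. Both are small. Tanh starts from a higher peak at z = 0 but falls off faster, so in the tails it is actually the smaller of the two.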