Compare sigmoid, tanh, ReLU, leaky ReLU, and GELU — when would you pick each?
The short answer
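Sigmoid squashes inputs to (0, 1) and saturates at both extremes, which causes vanishing gradients in deep stacks. Tanh is zero-centered but still saturates. ReLU does not saturate for positive inputs and is cheap to compute, but a unit whose pre-activation stays negative receives zero gradient and can die. Leaky ReLU keeps a small slope on the negative side so such units can still recover. GELU is a smooth gate that weights each input by the Gaussian CDF Φ(z); it is the activation used in BERT and the GPT family and a common default in transformer feed-forward blocks.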
How to think about it
| Activation | Formula | Range | Zero-centered | Saturates | Typical use |
|---|---|---|---|---|---|
| Sigmoid | 1 / (1 + e^{-z}) | (0, 1) | No | Both tails | Binary output head |
| Tanh | (e^z - e^{-z}) / (e^z + e^{-z}) | (-1, 1) | Yes | Both tails | RNNs, some hidden layers |
| ReLU | max(0, z) | [0, ∞) | No | Negative half | Default for CNNs/MLPs |
| Leaky ReLU | max(αz, z), α≈0.01 | (-∞, ∞) | Approx | Never | When dead neurons observed |
| GELU | z · Φ(z), Φ = standard normal CDF | ≈ [-0.17, ∞) | No | Negative tail (smoothly) | Transformers (BERT, GPT) |
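To make the table concrete, here is a minimal sketch of the five formulas in plain Python, using only the standard library. The function names and the sample inputs are illustrative choices, not part of any library API, and GELU is written with the exact Gaussian CDF via `math.erf` rather than a tanh approximation.

```python
import math

def sigmoid(z: float) -> float:
    # 1 / (1 + e^{-z}); output lies in (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z: float) -> float:
    # (e^z - e^{-z}) / (e^z + e^{-z}); output lies in (-1, 1)
    ez, enz = math.exp(z), math.exp(-z)
    return (ez - enz) / (ez + enz)

def relu(z: float) -> float:
    # max(0, z)
    return max(0.0, z)

def leaky_relu(z: float, alpha: float = 0.01) -> float:
    # max(alpha * z, z); alpha is the small negative-side slope
    return max(alpha * z, z)

def gelu(z: float) -> float:
    # z * Phi(z), with Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))
    return z * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

if __name__ == "__main__":
    # Illustrative sample points; any moderate values work here.
    for z in (-3.0, -1.0, 0.0, 1.0, 3.0):
        print(
            f"z={z:+.1f}  sigmoid={sigmoid(z):.4f}  tanh={tanh(z):+.4f}  "
            f"relu={relu(z):.4f}  leaky={leaky_relu(z):+.4f}  gelu={gelu(z):+.4f}"
        )
```

Evaluating a few negative inputs side by side is a quick way to see how differently ReLU, leaky ReLU, and GELU treat the negative half.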
Gradient perspective:
- Sigmoid derivative peaks at 0.25 (at z = 0). Through 10 layers the activation derivatives alone contribute at most 0.25^10 ≈ 0.000001, which is vanishing gradients in practice.
- Tanh derivative peaks at 1.0 (at z = 0): better, but it still saturates in both tails.
- ReLU derivative is 1 for z > 0 and 0 otherwise: no shrinkage on the positive side, but no gradient at all on the negative side.
- GELU is smooth everywhere. It scales the input by Φ(z), the probability that a standard normal draw falls below z, so it can be read as the expected value of a stochastic on/off gate. Empirically it has worked well in attention-based models.
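The gradient figures above can be checked with a short sketch using the closed-form derivatives. The helper names (`d_sigmoid`, `best_case_gradient_scale`, and so on) are hypothetical, and the depth calculation assumes the best case, where every layer sits at the derivative's peak and weight matrices are ignored.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def d_sigmoid(z: float) -> float:
    # sigma'(z) = sigma(z) * (1 - sigma(z)); maximum 0.25 at z = 0
    s = sigmoid(z)
    return s * (1.0 - s)

def d_tanh(z: float) -> float:
    # tanh'(z) = 1 - tanh(z)^2; maximum 1.0 at z = 0
    return 1.0 - math.tanh(z) ** 2

def d_relu(z: float) -> float:
    # 1 for z > 0, 0 otherwise (the value at z = 0 is a convention)
    return 1.0 if z > 0 else 0.0

def d_leaky_relu(z: float, alpha: float = 0.01) -> float:
    # 1 for z > 0, alpha otherwise, so the gradient is never exactly zero
    return 1.0 if z > 0 else alpha

def d_gelu(z: float) -> float:
    # d/dz [z * Phi(z)] = Phi(z) + z * phi(z), with phi the standard normal PDF
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return cdf + z * pdf

def best_case_gradient_scale(peak_derivative: float, depth: int) -> float:
    # Product of `depth` activation derivatives, each at its peak value.
    return peak_derivative ** depth

if __name__ == "__main__":
    print(d_sigmoid(0.0))                        # 0.25
    print(best_case_gradient_scale(0.25, 10))    # 9.5367431640625e-07
    print(best_case_gradient_scale(d_tanh(0.0), 10))
    # Negative input: ReLU passes no gradient, leaky ReLU passes alpha.
    print(d_relu(-2.0), d_leaky_relu(-2.0))
    # Far from zero, sigmoid and tanh gradients collapse (saturation).
    print(d_sigmoid(6.0), d_tanh(6.0), d_gelu(-6.0))
```

The ReLU versus leaky ReLU line shows the dead-neuron mechanism directly: a unit stuck at negative pre-activations gets a gradient of exactly 0 under ReLU but α under leaky ReLU.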
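import torch
import torch.nn.functional as F

z = torch.linspace(-3.0, 3.0, steps=7)  # example pre-activations

torch.sigmoid(z)       # output head for binary classification; LSTM/GRU gates
torch.tanh(z)          # candidate / cell-state nonlinearity in LSTM/GRU
F.relu(z)              # hidden layers in MLP / CNN
F.leaky_relu(z, 0.01)  # negative_slope=0.01; use when ReLU units die
F.gelu(z)              # transformer feed-forward blocks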
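Rule of thumb: start with ReLU for CNNs/MLPs; switch to leaky ReLU if you observe dead units; use GELU for transformer-style architectures; outside of gating, use sigmoid only on the output for binary tasks; keep the sigmoid gates and tanh state nonlinearities inside LSTM/GRU cells, since those architectures were designed around them.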