datarekha
Deep Learning · Medium · Asked at Google, Meta, OpenAI, NVIDIA

Compare sigmoid, tanh, ReLU, leaky ReLU, and GELU — when would you pick each?

The short answer

Sigmoid squashes inputs to (0, 1) and saturates at both extremes, which causes vanishing gradients. Tanh is zero-centered but still saturates. ReLU avoids saturation for positive inputs and trains fast, but can produce dead neurons. Leaky ReLU keeps a small negative slope so those neurons still receive gradient. GELU is smooth, weights each input by a Gaussian CDF, and is the default in many transformer architectures such as BERT and GPT.

How to think about it

| Activation | Formula | Range | Zero-centered | Saturates | Typical use |
|---|---|---|---|---|---|
| Sigmoid | 1 / (1 + e^{-z}) | (0, 1) | No | Both tails | Binary output head |
| Tanh | (e^z - e^{-z}) / (e^z + e^{-z}) | (-1, 1) | Yes | Both tails | RNNs, some hidden layers |
| ReLU | max(0, z) | [0, ∞) | No | Negative half | Default for CNNs/MLPs |
| Leaky ReLU | max(αz, z), α≈0.01 | (-∞, ∞) | Approx | Never | When dead neurons observed |
| GELU | z · Φ(z) (Φ = standard Gaussian CDF) | ≈ [-0.17, ∞) | Approx | Negative tail (output → 0) | Transformers (BERT, GPT) |
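
To make the formulas in the table concrete, here is a minimal scalar sketch in plain Python. The function names are illustrative choices, not taken from any library, and GELU is written in its exact form using `math.erf` for Φ(z). It is meant for checking values by hand, not as a replacement for a framework implementation.

```python
import math

def sigmoid(z: float) -> float:
    """1 / (1 + e^{-z}); output lies in (0, 1).
    Note: math.exp overflows for very negative z (below about -709)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z: float) -> float:
    """(e^z - e^{-z}) / (e^z + e^{-z}); output lies in (-1, 1)."""
    return math.tanh(z)

def relu(z: float) -> float:
    """max(0, z); zero for all negative inputs."""
    return max(0.0, z)

def leaky_relu(z: float, alpha: float = 0.01) -> float:
    """max(alpha * z, z); keeps a small slope alpha for z < 0 (assumes 0 < alpha < 1)."""
    return max(alpha * z, z)

def gelu(z: float) -> float:
    """z * Phi(z), where Phi is the standard Gaussian CDF."""
    phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return z * phi

# Print one row per input so the five curves can be compared side by side.
for z in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"z={z:+.1f}  sigmoid={sigmoid(z):.4f}  tanh={tanh(z):+.4f}  "
          f"relu={relu(z):.4f}  leaky={leaky_relu(z):+.4f}  gelu={gelu(z):+.4f}")
```

Running it shows the qualitative differences directly: sigmoid and tanh flatten out at both ends, ReLU is exactly zero for negative inputs, leaky ReLU stays slightly negative, and GELU dips a little below zero before approaching zero again.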

Gradient perspective:

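  • Sigmoid derivative peaks at 0.25 (at z = 0). Through 10 layers the activation derivatives alone contribute at most 0.25^10 ≈ 0.000001, which is why deep sigmoid stacks suffer vanishing gradients in practice.
  • Tanh derivative peaks at 1.0 (at z = 0): better, but it still decays toward 0 as |z| grows, so tanh saturates too.
  • ReLU derivative is 1 for z > 0 and 0 otherwise: no shrinkage on the positive side, but no gradient at all on the negative side.
  • GELU is smooth everywhere; it scales the input by Φ(z), the probability that a standard normal variable falls below z, and this soft gating has empirically worked well in attention-based models.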
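
The gradient figures above can be checked numerically. The sketch below writes out the closed-form derivatives in plain Python; the helper names, including `best_case_chain`, are hypothetical and exist only for this illustration. The chain calculation deliberately ignores weight matrices and looks only at the activation-derivative factor, which is the best case for sigmoid.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def d_sigmoid(z: float) -> float:
    """sigma'(z) = sigma(z) * (1 - sigma(z)); maximal at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

def d_tanh(z: float) -> float:
    """tanh'(z) = 1 - tanh(z)^2; maximal at z = 0."""
    return 1.0 - math.tanh(z) ** 2

def d_relu(z: float) -> float:
    """1 for z > 0, else 0 (the value at exactly z = 0 is a convention)."""
    return 1.0 if z > 0 else 0.0

def d_leaky_relu(z: float, alpha: float = 0.01) -> float:
    """1 for z > 0, else alpha, so negative inputs still pass some gradient."""
    return 1.0 if z > 0 else alpha

def d_gelu(z: float) -> float:
    """d/dz [z * Phi(z)] = Phi(z) + z * phi(z), with phi the Gaussian density."""
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return cdf + z * pdf

def best_case_chain(derivative_peak: float, layers: int) -> float:
    """Upper bound on the product of activation derivatives across `layers` layers."""
    return derivative_peak ** layers

print("sigmoid peak derivative:", d_sigmoid(0.0))
print("tanh peak derivative:   ", d_tanh(0.0))
print("10 sigmoid layers, best case:", best_case_chain(d_sigmoid(0.0), 10))
# A unit stuck at a negative pre-activation: ReLU passes no gradient, leaky ReLU passes alpha.
print("gradient at z = -2:", d_relu(-2.0), "(ReLU) vs", d_leaky_relu(-2.0), "(leaky ReLU)")
```

The last line is the dead-neuron problem in miniature: once a ReLU unit's pre-activation is negative for every input, its gradient is exactly zero and it cannot recover, whereas leaky ReLU still lets a small gradient through.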
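
import torch
import torch.nn.functional as F

z = torch.linspace(-3.0, 3.0, steps=7)  # example pre-activations

torch.sigmoid(z)                       # output head for binary classification
torch.tanh(z)                          # candidate / hidden state in recurrent cells
F.relu(z)                              # hidden layers in MLP / CNN
F.leaky_relu(z, negative_slope=0.01)   # ReLU variant with a small negative slope
F.gelu(z)                              # transformer feed-forward blocks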

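Rule of thumb: start with ReLU for CNNs/MLPs; switch to leaky ReLU if you observe dead neurons; use GELU for transformer-style architectures; reserve sigmoid for the output of binary tasks (and for gates in recurrent cells); keep tanh inside LSTM/GRU cells, where it shapes the candidate and hidden states and the architecture was designed around it.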
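
As a compact way to remember the rule of thumb, it can be written as a small lookup. This is only a sketch: `suggest_activation`, the role labels, and the `dead_neurons_observed` flag are all hypothetical names invented for this example, and a real project would treat the result as a starting point to validate empirically.

```python
# Hypothetical role labels mapped to the starting choice from the rule of thumb.
RULE_OF_THUMB = {
    "cnn_hidden": "relu",
    "mlp_hidden": "relu",
    "transformer_ffn": "gelu",
    "binary_output": "sigmoid",
    "lstm_gru_cell": "tanh",
}

def suggest_activation(role: str, dead_neurons_observed: bool = False) -> str:
    """Return a starting activation for the given role.

    If ReLU would be suggested but dead neurons have been observed,
    fall back to leaky ReLU instead.
    """
    name = RULE_OF_THUMB[role]  # raises KeyError for an unknown role
    if name == "relu" and dead_neurons_observed:
        return "leaky_relu"
    return name

print(suggest_activation("transformer_ffn"))
print(suggest_activation("cnn_hidden", dead_neurons_observed=True))
```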

Figure: f(z) plotted against z from -3 to 3 for sigmoid (σ, purple), tanh (green), and ReLU (amber). Note that sigmoid never reaches 0 or 1, and that tanh is already within about 0.5% of ±1 at z = ±3.