Deep Learning · Medium · Asked at Google, Meta, OpenAI, NVIDIA

Compare sigmoid, tanh, ReLU, leaky ReLU, and GELU — when would you pick each?

For: Data Scientist, ML Engineer, AI / LLM Engineer

The short answer

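Sigmoid squashes inputs to (0, 1) and saturates at both extremes, which causes vanishing gradients in deep stacks. Tanh is zero-centered but still saturates. ReLU does not saturate for positive inputs and trains fast, but units whose pre-activation stays negative receive zero gradient and can die. Leaky ReLU mitigates dying neurons by keeping a small slope for negative inputs. GELU is a smooth gate that scales each input by Φ(z), the standard normal CDF, and is the default in many transformer architectures such as BERT and GPT.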

How to think about it

Activation	Formula	Range	Zero-centered	Saturates	Typical use
Sigmoid	`1 / (1 + e^{-z})`	(0, 1)	No	Both tails	Binary output head
Tanh	`(e^z - e^{-z}) / (e^z + e^{-z})`	(-1, 1)	Yes	Both tails	RNNs, some hidden layers
ReLU	`max(0, z)`	[0, ∞)	No	Negative half	Default for CNNs/MLPs
Leaky ReLU	`max(αz, z)`, α≈0.01	(-∞, ∞)	Approx	Never	When dead neurons observed
GELU	`z · Φ(z)` (Φ = standard normal CDF)	≈ [-0.17, ∞)	Approx	Negative tail (smoothly)	Transformers (BERT, GPT)
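
To make the formulas in the table concrete, here is a minimal sketch in plain Python (standard library only, so it runs without PyTorch). The function names are illustrative, and the GELU shown is the exact `z · Φ(z)` form computed with `math.erf`, not the tanh approximation some libraries also offer.

```python
import math

def sigmoid(z):
    # 1 / (1 + e^{-z}), output in (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    # max(0, z)
    return max(0.0, z)

def leaky_relu(z, alpha=0.01):
    # max(alpha * z, z) for 0 < alpha < 1
    return z if z > 0 else alpha * z

def gelu(z):
    # z * Phi(z), where Phi is the standard normal CDF written via erf
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

for z in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(
        f"z={z:+.1f}  sigmoid={sigmoid(z):.4f}  tanh={math.tanh(z):.4f}  "
        f"relu={relu(z):.4f}  leaky={leaky_relu(z):.4f}  gelu={gelu(z):.4f}"
    )
```

Evaluating `gelu` on a fine grid of negative inputs shows that it dips slightly below zero (to roughly -0.17 near z ≈ -0.75) before returning toward 0, which is what distinguishes it from ReLU on the negative side.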

Gradient perspective:

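Sigmoid derivative peaks at 0.25 (at z = 0). Through 10 layers the activation terms alone contribute at most 0.25^10 ≈ 0.000001 to the backpropagated gradient, and that is the best case: away from z = 0 the derivative is smaller still. This is the vanishing-gradient problem in practice.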
Tanh derivative peaks at 1.0 — better but still saturates.
ReLU derivative is 1 for z > 0, 0 otherwise — no shrinkage on the positive side.
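GELU is smooth everywhere. It can be read as the expected value of a stochastic gate: keep the input with probability Φ(z), the chance that a standard normal variable falls below z, and zero it otherwise. This soft gating is widely used in attention-based models, where it is often reported to work well, though the benefit is empirical rather than guaranteed.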
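
The gradient figures above can be checked numerically. The sketch below uses closed-form derivatives in plain Python; the helper names (`d_sigmoid`, `d_tanh`, `d_relu`, `d_gelu`) are illustrative, and the depth-10 product deliberately ignores the weight matrices, so it isolates the activation's contribution only.

```python
import math

def d_sigmoid(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z)), maximal at z = 0
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def d_tanh(z):
    # tanh'(z) = 1 - tanh(z)^2, maximal at z = 0
    return 1.0 - math.tanh(z) ** 2

def d_relu(z):
    # 1 on the positive side, 0 otherwise (the value at z = 0 is a convention)
    return 1.0 if z > 0 else 0.0

def d_gelu(z):
    # d/dz [z * Phi(z)] = Phi(z) + z * phi(z)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return cdf + z * pdf

depth = 10
# Best-case product of activation derivatives through `depth` layers.
print(f"sigmoid: {d_sigmoid(0.0) ** depth:.3e}")
print(f"tanh:    {d_tanh(0.0) ** depth:.3e}")
print(f"relu:    {d_relu(1.0) ** depth:.3e}")

# Saturation: derivatives at z = 3 are far below their peaks.
print(f"sigmoid'(3) = {d_sigmoid(3.0):.4f}, tanh'(3) = {d_tanh(3.0):.4f}")
print(f"gelu'(0) = {d_gelu(0.0):.2f}, gelu'(3) = {d_gelu(3.0):.4f}")
```

Only the sigmoid product collapses at its peak; tanh and ReLU keep a product of 1 in the best case, but tanh loses that as soon as its units drift away from z = 0.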

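import torch
import torch.nn.functional as F

z = torch.linspace(-3.0, 3.0, 7)  # example pre-activations

torch.sigmoid(z)       # output head for binary classification; LSTM/GRU gates
torch.tanh(z)          # candidate / hidden state in recurrent cells
F.relu(z)              # hidden layers in MLP / CNN
F.leaky_relu(z, 0.01)  # negative_slope=0.01 keeps a small gradient for z < 0
F.gelu(z)              # transformer feed-forward blocks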

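Rule of thumb: start with ReLU for CNNs/MLPs; try leaky ReLU if you observe many dead units; switch to GELU for transformer-style architectures; use sigmoid on the output for binary tasks; and keep sigmoid and tanh inside LSTM/GRU cells, where the architecture was designed around them (sigmoid for the gates, tanh for the candidate and hidden state).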
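
One way to decide whether dead neurons are actually a problem is to measure them. The sketch below is a plain-Python illustration, not a library routine: `dead_fraction` is a hypothetical helper that counts ReLU units whose pre-activation is never positive on a batch, and the weights, inputs, and the large negative bias shift are made-up values chosen to show both extremes.

```python
import random

def dead_fraction(weights, biases, inputs):
    """Fraction of ReLU units that output 0 for every input in the batch."""
    dead = 0
    for w, b in zip(weights, biases):
        # A unit is dead on this batch if w.x + b <= 0 for all x,
        # so ReLU outputs 0 and passes back zero gradient every time.
        if all(sum(wi * xi for wi, xi in zip(w, x)) + b <= 0.0 for x in inputs):
            dead += 1
    return dead / len(weights)

random.seed(0)
inputs = [[random.gauss(0, 1) for _ in range(4)] for _ in range(256)]
weights = [[random.gauss(0, 1) for _ in range(4)] for _ in range(8)]

healthy_biases = [0.0] * 8
# Hypothetical failure case: a bad update pushed every bias far negative.
shifted_biases = [-100.0] * 8

print("healthy layer:", dead_fraction(weights, healthy_biases, inputs))
print("shifted layer:", dead_fraction(weights, shifted_biases, inputs))
```

With leaky ReLU the same units would still receive a gradient of α on the negative side, which is why it is the usual first remedy when this fraction is high.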

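Figure: sigmoid (purple), tanh (green), and ReLU (amber) from z = -3 to 3. Note that sigmoid never reaches 0 or 1, and that tanh saturates harder than it looks near ±3: tanh(±3) is already within 0.005 of ±1, and its derivative there is about 0.01.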
