What is GELU and why does it outperform ReLU in transformer models?
GELU (Gaussian Error Linear Unit) multiplies the input x by Φ(x), the probability that a standard Gaussian random variable is smaller than x. The result is a smooth, non-monotonic curve that tracks ReLU for large |x| but has a stochastic-regularization flavor. Transformers favor GELU largely because its smooth gradient near zero tends to help optimization in deep attention-based architectures.
How to think about it
Definition:
GELU(x) = x · Φ(x)
where Φ(x) is the CDF of the standard normal distribution. In practice a fast approximation is used:
GELU(x) ≈ 0.5 · x · (1 + tanh(√(2/π) · (x + 0.044715 · x³)))
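To get a feel for how close the two forms are, here is a minimal sketch in plain Python that evaluates both on a grid. The function names `gelu_exact` and `gelu_tanh` and the grid range are illustrative choices, not part of any library.

```python
import math


def gelu_exact(x: float) -> float:
    """GELU(x) = x * Phi(x), with Phi written via the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """Tanh approximation of GELU, using the constants from the formula above."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


# Largest absolute gap between the two forms on a grid over [-6, 6], step 0.01.
grid = [i / 100.0 for i in range(-600, 601)]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in grid)
print(f"max |exact - tanh| on [-6, 6]: {max_gap:.2e}")
```

On this grid the gap stays below 1e-3, which is why the approximation is usually treated as interchangeable with the exact form.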
Properties that matter:
- Smooth everywhere: there is no hard hinge at 0 as in ReLU, which tends to give a smoother loss landscape.
- Non-monotonic: the curve dips slightly below zero, reaching a minimum of about -0.17 near x ≈ -0.75 before recovering, which provides a soft gate effect.
- Non-zero gradient for most negative inputs: moderately negative pre-activations still pass gradient, which mitigates the dying-neuron problem (the gradient does still vanish for very negative x).
- Stochastic interpretation: GELU(x) is the expected value of x · Bernoulli(Φ(x)). The neuron probabilistically gates its own input based on its value. Inputs far above zero almost always pass through; inputs far below zero are almost always zeroed.
Why transformers prefer GELU:
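In attention-based architectures the pre-activations entering the feed-forward blocks often have approximately Gaussian statistics. GELU's Gaussian-CDF gate is therefore well-matched to the actual input distribution, providing a softer threshold than ReLU's hard zero. Empirically, Hendrycks and Gimpel (2016) report that GELU matches or beats ReLU and ELU across their vision, NLP, and speech benchmarks, and both BERT and GPT-2 adopted GELU as their feed-forward activation. The gain over ReLU varies with task and scale, so the often-quoted 0.5–1.5% accuracy improvement is best read as a rough range, not a consistent result.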
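import torch
import torch.nn.functional as F

z = torch.randn(4, 8)  # example pre-activations (placeholder shape)

# PyTorch native GELU (exact, erf-based)
out = F.gelu(z)
# Tanh approximation (used in older HuggingFace code); needs PyTorch >= 1.12
out_tanh = F.gelu(z, approximate='tanh')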
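SwiGLU, used in LLaMA and PaLM, is a related gated activation rather than a variant of GELU itself. It is a GLU variant that gates with Swish (SiLU): SwiGLU(x, W, V) = Swish(xW) ⊙ (xV). The GELU-gated counterpart is GEGLU. Both push the gating idea further by replacing the fixed gate Φ(x) with a learned gate matrix.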
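The SwiGLU formula can be sketched in a few lines of dependency-free Python. This is only an illustration of the arithmetic, assuming Swish(u) = u · sigmoid(u) and no bias terms. The helpers `swish`, `matmul`, and `swiglu` are hypothetical names, and a real model would use a tensor library and follow the gated product with an output projection.

```python
import math


def swish(u: float) -> float:
    """Swish / SiLU: u * sigmoid(u)."""
    return u / (1.0 + math.exp(-u))


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def swiglu(x, W, V):
    """SwiGLU(x, W, V) = Swish(xW) * (xV), multiplied elementwise."""
    gate = [[swish(u) for u in row] for row in matmul(x, W)]
    value = matmul(x, V)
    return [[g * v for g, v in zip(g_row, v_row)]
            for g_row, v_row in zip(gate, value)]


# Toy example: one token with two features and identity projections,
# so the output reduces to swish(x_i) * x_i for each feature.
x = [[1.0, 2.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
out = swiglu(x, W, V)
print(out)
```

With identity projections the gate and value paths see the same input, which makes the elementwise gating easy to verify by hand. In practice W and V are separate learned matrices.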