datarekha
Deep Learning · Medium · Asked at Google, OpenAI, NVIDIA, Meta

What is GELU and why does it outperform ReLU in transformer models?

The short answer

GELU (Gaussian Error Linear Unit) multiplies each input by the probability that a standard Gaussian random variable is smaller than it. The result is a smooth, non-monotonic curve that behaves like ReLU for large |x| but acts as a soft, probabilistic gate near zero. Transformers favor GELU because its smooth gradient near zero tends to make deep attention-based architectures easier to optimize.

How to think about it

Definition:

GELU(x) = x · Φ(x)

where Φ(x) is the CDF of the standard normal distribution. In practice a fast approximation is used:

GELU(x) ≈ 0.5 · x · (1 + tanh(√(2/π) · (x + 0.044715 · x³)))
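
To get a feel for how close the tanh form is to the exact definition, the two can be compared numerically. The sketch below uses only Python's standard library; the function names `gelu_exact` and `gelu_tanh` are illustrative choices, not part of any library.

```python
import math

def gelu_exact(x: float) -> float:
    """GELU(x) = x * Phi(x), with the normal CDF Phi written via erf."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    """Tanh approximation of GELU, using the constants from the formula above."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

# Largest absolute gap between the two forms on a grid over [-5, 5].
grid = [i / 100.0 for i in range(-500, 501)]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in grid)

print(f"max |exact - tanh| on [-5, 5]: {max_gap:.2e}")
```

On this grid the gap stays well below 1e-3, which is why the two forms are usually treated as interchangeable for training, although they are not bit-identical.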

Properties that matter:

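  • Smooth everywhere: unlike ReLU, which has a hard hinge at 0, GELU is differentiable at every point, so gradients change continuously as pre-activations cross zero.
  • Non-monotonic: the curve dips slightly below zero, reaching a minimum of about -0.17 near x ≈ -0.75 before rising back toward zero, which gives a soft gating effect.
  • Non-zero gradient for most negative inputs: slightly negative pre-activations still receive a learning signal, which mitigates ReLU’s dying-neuron problem. The gradient does still decay toward zero for strongly negative inputs.
  • Stochastic interpretation: GELU(x) is the expected value of x · m, where m ~ Bernoulli(Φ(x)). The neuron probabilistically gates its own input based on its value. Inputs far above zero almost always pass through; inputs far below zero are almost always zeroed.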
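
Two of these properties are easy to check numerically. The sketch below, again standard-library Python only, locates the dip with a simple grid search and estimates the Bernoulli-gate expectation by Monte Carlo. The helper names (`phi_cdf`, `gelu`, `gelu_monte_carlo`), the grid resolution, and the sample count are illustrative assumptions.

```python
import math
import random

def phi_cdf(x: float) -> float:
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu(x: float) -> float:
    return x * phi_cdf(x)

# 1) Locate the dip: grid search over [-3, 0] in steps of 0.001.
grid = [-i / 1000.0 for i in range(0, 3001)]
x_min = min(grid, key=gelu)

# 2) Stochastic view: average x * m with m ~ Bernoulli(Phi(x)).
def gelu_monte_carlo(x: float, n_samples: int = 200_000, seed: int = 0) -> float:
    rng = random.Random(seed)
    p = phi_cdf(x)
    kept = sum(1 for _ in range(n_samples) if rng.random() < p)
    return x * kept / n_samples

print(f"minimum near x = {x_min:.3f}, GELU = {gelu(x_min):.4f}")
print(f"GELU(1.0) = {gelu(1.0):.4f}, Monte Carlo estimate = {gelu_monte_carlo(1.0):.4f}")
```

The Monte Carlo estimate converges to x · Φ(x) as the sample count grows, which is the sense in which GELU is the deterministic expectation of a random gate.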

Why transformers prefer GELU:

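In attention-based architectures the pre-activations entering the feed-forward blocks often have roughly Gaussian statistics. GELU’s Gaussian-CDF gate is therefore well-matched to the actual input distribution, providing a softer threshold than ReLU’s hard zero. Empirically, Hendrycks and Gimpel (2016) reported GELU matching or beating ReLU and ELU on the vision, language, and speech tasks they tested, and BERT and GPT-2 adopted it in their feed-forward blocks. The gain over ReLU is usually modest and task-dependent; a figure such as 0.5–1.5% accuracy is a rough rule of thumb, not a guaranteed result.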

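import torch
import torch.nn.functional as F

# Example pre-activations (illustrative shape: batch of 4, hidden size 8)
z = torch.randn(4, 8)

# PyTorch native GELU (exact, erf-based; this is the default)
out = F.gelu(z)

# Tanh approximation (used in older HuggingFace code)
out_tanh = F.gelu(z, approximate='tanh')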

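SwiGLU, used in LLaMA and PaLM, is a gated linear unit (GLU) variant that uses Swish (SiLU) rather than GELU as the gate activation: SwiGLU(x, W, V) = Swish(xW) ⊙ (xV). Its GELU-based counterpart is GEGLU. Both push the gating idea further: instead of the input gating itself through a fixed function, one learned projection gates another.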
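
The SwiGLU formula can be sketched in a few lines of plain Python. This is a minimal illustration, assuming no bias terms and small hand-written matrices in place of learned weights; `swish`, `matvec`, and `swiglu` are hypothetical helper names, not a library API.

```python
import math

def swish(v: float) -> float:
    """Swish / SiLU: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def matvec(x, M):
    """Row vector x (length d_in) times matrix M (d_in x d_out, nested lists)."""
    return [sum(x[i] * M[i][j] for i in range(len(x))) for j in range(len(M[0]))]

def swiglu(x, W, V):
    """SwiGLU(x, W, V) = Swish(xW) ⊙ (xV), computed elementwise."""
    gate = [swish(g) for g in matvec(x, W)]   # activated gate projection
    value = matvec(x, V)                      # linear value projection
    return [g * v for g, v in zip(gate, value)]

# Toy example: in a real model W and V would be learned parameters.
x = [1.0, -2.0]
W = [[1.0, 0.0], [0.0, 1.0]]
V = [[2.0, 0.0], [0.0, 3.0]]
print(swiglu(x, W, V))
```

Replacing `swish` with a GELU function in the gate would give the GEGLU form of the same layer.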
