Why does weight initialization matter and how do Xavier and He initialization work?
Poor initialization makes the variance of activations explode or collapse across layers, so gradients vanish or explode from the very first update. Xavier initialization targets variance preservation for saturating activations; He initialization corrects for the halved variance caused by ReLU zeroing negative inputs.
How to think about it
The variance propagation problem:
If a layer has n_in inputs and zero-mean weights drawn independently from a distribution with variance σ², each output has variance n_in · σ² · Var[x], where Var[x] is the variance of the (zero-mean) inputs. For the output variance to equal the input variance (so signals neither blow up nor die out), we need σ² = 1 / n_in.
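The scaling argument can be checked numerically. The sketch below is illustrative only: it assumes a stack of purely linear square layers with Gaussian weights, and the width, depth, and batch size are arbitrary choices, not values from any reference implementation.

```python
import numpy as np

def final_variance(scale, n=256, depth=20, batch=1024, seed=0):
    """Push unit-variance inputs through `depth` linear layers with Var[W] = scale / n."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((batch, n))  # zero-mean, unit-variance inputs
    for _ in range(depth):
        w = rng.standard_normal((n, n)) * np.sqrt(scale / n)
        x = x @ w  # no activation: isolates the effect of the weight variance
    return x.var()

# scale = 1.0 is the sigma^2 = 1 / n_in rule; the others under- and over-shoot it.
results = {scale: final_variance(scale) for scale in (0.5, 1.0, 2.0)}
for scale, var in results.items():
    print(f"Var[W] = {scale}/n -> output variance {var:.3g}")
```

With scale = 1.0 the output variance should stay on the order of 1, while the other two settings shrink or grow it by roughly a factor of scale per layer.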
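Xavier / Glorot initialization (2010): targets Var[W] = 2 / (n_in + n_out). The forward pass wants Var[W] = 1 / n_in and the backward pass wants Var[W] = 1 / n_out; Xavier takes the harmonic mean of the two, which keeps variance roughly stable in both directions:
from torch import nn
layer = nn.Linear(512, 256)  # example layer; the sizes are arbitrary
# Xavier is NOT the nn.Linear default: PyTorch initializes nn.Linear weights with
# kaiming_uniform_(a=sqrt(5)), so call Xavier explicitly when you want it.
nn.init.xavier_uniform_(layer.weight)  # uniform variant
# or
nn.init.xavier_normal_(layer.weight)  # normal variant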
Xavier was derived assuming linear or tanh activations (symmetric, derivative ≈ 1 near zero).
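As a sanity check on the Xavier target, the following sketch compares the empirical weight variance after each PyTorch Xavier initializer with 2 / (n_in + n_out). The layer sizes are arbitrary assumptions chosen for illustration.

```python
import math
import torch
from torch import nn

torch.manual_seed(0)
layer = nn.Linear(512, 256)        # weight shape is (n_out, n_in) = (256, 512)
n_out, n_in = layer.weight.shape

target_var = 2.0 / (n_in + n_out)  # the Xavier target variance

# Uniform variant: U(-a, a) with a = sqrt(6 / (n_in + n_out)), whose variance is a^2 / 3.
nn.init.xavier_uniform_(layer.weight)
bound = math.sqrt(6.0 / (n_in + n_out))
uniform_var = layer.weight.var().item()
uniform_max = layer.weight.abs().max().item()

# Normal variant: N(0, std^2) with std = sqrt(2 / (n_in + n_out)).
nn.init.xavier_normal_(layer.weight)
normal_var = layer.weight.var().item()

print(f"target {target_var:.5f}  uniform {uniform_var:.5f}  normal {normal_var:.5f}")
```

Both variants aim at the same variance; they differ only in the shape of the distribution the weights are drawn from.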
He / Kaiming initialization (2015): ReLU zeroes the negative half of its inputs, so for zero-mean, symmetric pre-activations it passes on only half of their second moment. He initialization sets Var[W] = 2 / n_in to compensate:
nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')
# PyTorch nn.Linear + ReLU stacks should use this
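To see the factor of 2 at work, this sketch pushes random inputs through a deep ReLU stack under both schemes and reports the mean squared activation at the output. It assumes square layers without biases; the helper name `relu_stack_second_moment` and the width, depth, and batch size are hypothetical choices for illustration.

```python
import torch
from torch import nn

def relu_stack_second_moment(init_fn, n=512, depth=20, batch=1024, seed=0):
    """Mean squared activation after `depth` Linear+ReLU layers initialized by `init_fn`."""
    torch.manual_seed(seed)
    x = torch.randn(batch, n)
    for _ in range(depth):
        w = torch.empty(n, n)  # (n_out, n_in), same layout as nn.Linear.weight
        init_fn(w)             # in-place initializer from torch.nn.init
        x = torch.relu(x @ w.T)
    return x.pow(2).mean().item()

xavier = relu_stack_second_moment(nn.init.xavier_normal_)
he = relu_stack_second_moment(
    lambda w: nn.init.kaiming_normal_(w, mode="fan_in", nonlinearity="relu")
)
print(f"Xavier + ReLU: {xavier:.3g}   He + ReLU: {he:.3g}")
```

With square layers Xavier gives Var[W] = 1 / n, so each ReLU layer roughly halves the signal and it collapses toward zero with depth, whereas the He-initialized stack should keep it on the order of 1.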
Practical defaults:
| Activation | Recommended init |
|---|---|
| Linear / tanh / sigmoid | Xavier normal |
| ReLU / leaky ReLU | He (kaiming) normal |
| SELU | LeCun normal |
| Transformer attention | Xavier uniform (common convention) |
Biases are almost always initialized to zero. Non-zero bias init rarely helps, and it shifts pre-activations away from the zero mean that the variance arguments above assume.
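One way to apply these defaults is sketched below. The helper `init_linear_layers`, its `activation` strings, and the `negative_slope` default are hypothetical names introduced for illustration, not a PyTorch API. It expresses LeCun normal as `kaiming_normal_` with `nonlinearity="linear"`, which has gain 1 and therefore gives Var[W] = 1 / n_in.

```python
import torch
from torch import nn

def init_linear_layers(model: nn.Module, activation: str = "relu",
                       negative_slope: float = 0.01) -> None:
    """Initialize every nn.Linear in `model` according to the activation that follows it."""
    for module in model.modules():
        if not isinstance(module, nn.Linear):
            continue
        if activation in ("relu", "leaky_relu"):
            # He init; `a` is only used when nonlinearity="leaky_relu".
            nn.init.kaiming_normal_(module.weight, a=negative_slope,
                                    mode="fan_in", nonlinearity=activation)
        elif activation == "selu":
            # LeCun normal: gain 1, fan_in scaling.
            nn.init.kaiming_normal_(module.weight, mode="fan_in",
                                    nonlinearity="linear")
        else:
            # linear / tanh / sigmoid
            nn.init.xavier_normal_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10))
init_linear_layers(model, activation="relu")
print(model[0].weight.var().item())  # should be close to 2 / 256
```

This treats the whole model as using a single activation, which is a simplification; a model that mixes activations would need per-layer choices.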