Deep Learning · Medium · Asked at Google, NVIDIA, Microsoft, Meta

Why does weight initialization matter and how do Xavier and He initialization work?

For: Data Scientist, ML Engineer, AI / LLM Engineer

The short answer

Poor initialization causes the variance of activations to either explode or collapse across layers, triggering vanishing or exploding gradients before training even begins. Xavier initialization targets variance preservation for saturating activations; He initialization corrects for the halved variance caused by ReLU zeroing negative inputs.

How to think about it

The variance propagation problem:

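If a layer has n_in inputs, and the inputs and weights are independent and zero-mean with the weights drawn from a distribution with variance σ², each output has variance n_in · σ² · Var[x]. For the output variance to equal the input variance (so signals neither blow up nor die out), we need n_in · σ² = 1, i.e. σ² = 1 / n_in.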
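The scaling rule above can be checked numerically. The following is a minimal NumPy sketch, not a definitive implementation: the helper name `output_variance`, the layer sizes, and the sample count are all illustrative assumptions, and the inputs are assumed to be standard normal.

```python
import numpy as np

def output_variance(n_in, n_out, weight_var, n_samples=2000, seed=0):
    """Empirical variance of y = x @ W for zero-mean, unit-variance inputs."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_samples, n_in))                    # Var[x] = 1
    W = rng.standard_normal((n_in, n_out)) * np.sqrt(weight_var)  # Var[W] = weight_var
    return float((x @ W).var())

n_in = 512
var_unscaled = output_variance(n_in, 256, weight_var=1.0)        # expected near n_in
var_scaled = output_variance(n_in, 256, weight_var=1.0 / n_in)   # expected near 1
print(f"sigma^2 = 1: {var_unscaled:.1f}   sigma^2 = 1/n_in: {var_scaled:.3f}")
```

With unit weight variance the output variance grows by a factor of about n_in in a single layer, which is why stacking several such layers overflows quickly.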

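Xavier / Glorot initialization (2010): targets Var[W] = 2 / (n_in + n_out). The forward pass wants Var[W] = 1 / n_in and the backward pass wants Var[W] = 1 / n_out; Xavier compromises by putting the average of the two fan sizes in the denominator (equivalently, the harmonic mean of 1 / n_in and 1 / n_out), which keeps variance roughly stable in both directions: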

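import torch.nn as nn

layer = nn.Linear(256, 128)  # example layer; the sizes are arbitrary
# nn.Linear does not default to Xavier (its reset_parameters uses kaiming_uniform_
# with a=sqrt(5)), so apply Xavier explicitly when you want it. Pick one variant:
nn.init.xavier_uniform_(layer.weight)   # uniform variant
nn.init.xavier_normal_(layer.weight)    # normal variant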

Xavier was derived assuming linear or tanh activations (symmetric, derivative ≈ 1 near zero).
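To see the effect with tanh, here is a small NumPy sketch that pushes random inputs through a stack of equal-width tanh layers and compares the Xavier scale against a much smaller one. The helper `tanh_stack_std`, the width, the depth, and the comparison value 0.01 are illustrative assumptions. Because tanh compresses larger values, the signal still shrinks somewhat under Xavier; the point is only that it stays at a usable scale.

```python
import numpy as np

def tanh_stack_std(weight_std, width=256, depth=10, n_samples=1000, seed=0):
    """Std of the activations after `depth` tanh layers of equal width."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((n_samples, width))   # unit-variance inputs
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * weight_std
        h = np.tanh(h @ W)
    return float(h.std())

width = 256
xavier_weight_std = np.sqrt(2.0 / (width + width))   # Var[W] = 2 / (n_in + n_out)
std_xavier = tanh_stack_std(xavier_weight_std, width)
std_tiny = tanh_stack_std(0.01, width)                # far below the Xavier scale
print(f"Xavier: {std_xavier:.3f}   tiny init: {std_tiny:.2e}")
```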

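He / Kaiming initialization (2015): ReLU zeroes the negative half of its inputs, so for zero-mean, symmetric pre-activations it halves the second moment of the signal passed to the next layer. He initialization compensates by doubling the weight variance to Var[W] = 2 / n_in: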

# Use this for nn.Linear + ReLU stacks; the nn.Linear default init is not the ReLU-gain variant
nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')
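The factor of 2 can be checked with a similar NumPy sketch, which records the pre-activation variance at each layer of a ReLU stack under He scaling and under Xavier scaling. The helper `relu_stack_variances` and the width, depth, and sample count are assumptions chosen for illustration. With equal fan-in and fan-out, Xavier gives Var[W] = 1 / n_in, so the variance should roughly halve at every ReLU layer, while He scaling should keep it roughly constant.

```python
import numpy as np

def relu_stack_variances(weight_var_fn, width=512, depth=10, n_samples=500, seed=0):
    """Pre-activation variance at each layer of an equal-width ReLU stack."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((n_samples, width))   # unit-variance inputs
    variances = []
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * np.sqrt(weight_var_fn(width, width))
        z = h @ W                  # pre-activation
        variances.append(float(z.var()))
        h = np.maximum(z, 0.0)     # ReLU zeroes the negative half
    return variances

he_vars = relu_stack_variances(lambda n_in, n_out: 2.0 / n_in)
xavier_vars = relu_stack_variances(lambda n_in, n_out: 2.0 / (n_in + n_out))
he_ratio = he_vars[-1] / he_vars[0]
xavier_ratio = xavier_vars[-1] / xavier_vars[0]
print(f"last/first variance  He: {he_ratio:.2f}   Xavier: {xavier_ratio:.2e}")
```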

Practical defaults:

Activation	Recommended init
Linear / tanh / sigmoid	Xavier normal
ReLU / leaky ReLU	He (Kaiming) normal
SELU	LeCun normal
Transformer attention	Xavier uniform (common convention)

Biases are almost always initialized to zero; a non-zero bias rarely helps, and it shifts pre-activations away from the zero-mean assumption that the Xavier and He derivations rely on.
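One way to apply these defaults in PyTorch is a small helper that walks a model and initializes each nn.Linear according to the activation that follows it. This is a sketch under assumptions: `init_linear_layers` is a hypothetical name rather than a PyTorch API, it assumes a single activation type for the whole model, and it writes LeCun normal by hand as a normal distribution with Var[W] = 1 / fan_in.

```python
import torch
import torch.nn as nn

def init_linear_layers(model: nn.Module, activation: str = "relu") -> None:
    """Hypothetical helper: initialize every nn.Linear in `model` for `activation`."""
    for module in model.modules():
        if not isinstance(module, nn.Linear):
            continue
        if activation in ("relu", "leaky_relu"):
            # He / Kaiming normal: Var[W] = 2 / fan_in
            nn.init.kaiming_normal_(module.weight, mode="fan_in", nonlinearity="relu")
        elif activation in ("linear", "tanh", "sigmoid"):
            # Xavier normal: Var[W] = 2 / (fan_in + fan_out)
            nn.init.xavier_normal_(module.weight)
        elif activation == "selu":
            # LeCun normal: Var[W] = 1 / fan_in
            nn.init.normal_(module.weight, mean=0.0, std=module.in_features ** -0.5)
        else:
            raise ValueError(f"no default init for activation {activation!r}")
        if module.bias is not None:
            nn.init.zeros_(module.bias)   # biases start at zero

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
init_linear_layers(model, activation="relu")
print(model[0].weight.var().item())   # should be close to 2 / 512
```

In a real model, different layers may feed different activations, so a per-layer mapping is often a better fit than a single `activation` argument.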
