datarekha
Deep Learning · Medium · Asked at Google, NVIDIA, Microsoft, Meta

Why does weight initialization matter and how do Xavier and He initialization work?

The short answer

Poor initialization causes the variance of activations to either explode or collapse across layers, triggering vanishing or exploding gradients before training even begins. Xavier initialization targets variance preservation for saturating activations; He initialization corrects for the halved variance caused by ReLU zeroing negative inputs.

How to think about it

The variance propagation problem:

If a layer has n_in inputs and its weights are drawn independently from a zero-mean distribution with variance σ², each output has variance n_in · σ² · Var[x], where x is a zero-mean input. For the output variance to equal the input variance (so signals neither blow up nor die out), we need σ² = 1 / n_in.
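The effect of the scale compounds with depth, which a small simulation makes visible. This is a minimal sketch using only the standard library; the function name `final_variance`, the width of 64, and the depth of 10 are illustrative assumptions, and the stack is purely linear (no activation).

```python
import math
import random

def final_variance(sigma2, n=64, depth=10, seed=0):
    """Push one random vector through `depth` linear layers; return the output variance."""
    rng = random.Random(seed)
    std = math.sqrt(sigma2)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]  # unit-variance input
    for _ in range(depth):
        # Each output unit is a weighted sum of n inputs with weights ~ N(0, sigma2)
        x = [sum(rng.gauss(0.0, std) * xi for xi in x) for _ in range(n)]
    mean = sum(x) / n
    return sum((v - mean) ** 2 for v in x) / n

n = 64
for label, sigma2 in [("too small", 0.5 / n), ("1 / n_in", 1.0 / n), ("too large", 2.0 / n)]:
    print(f"{label:>9}: output variance ~ {final_variance(sigma2, n):.3g}")
```

Halving or doubling σ² relative to 1 / n_in changes the variance by roughly a factor of 2 per layer, so ten layers are enough to shrink or grow the signal by about three orders of magnitude.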

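Xavier / Glorot initialization (2010): targets Var[W] = 2 / (n_in + n_out). The forward pass wants Var[W] = 1 / n_in and the backward pass wants Var[W] = 1 / n_out; 2 / (n_in + n_out) is the harmonic mean of those two targets, a compromise that keeps variance roughly stable in both directions: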

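import torch.nn as nn

layer = nn.Linear(256, 128)  # example layer; the sizes are arbitrary

# nn.Linear does NOT use Xavier by default (its default is a kaiming_uniform_ variant),
# so apply Xavier explicitly. Pick one of the two variants:
nn.init.xavier_uniform_(layer.weight)   # uniform variant
nn.init.xavier_normal_(layer.weight)    # normal variant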

Xavier was derived assuming linear or tanh activations (symmetric, derivative ≈ 1 near zero).
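The uniform and normal variants differ only in the distribution, not in the variance they target. A minimal sketch of the two scales, using hypothetical helper names (`xavier_std`, `xavier_uniform_bound`) and a `gain` argument that defaults to 1:

```python
import math

def xavier_std(n_in, n_out, gain=1.0):
    """Standard deviation for the normal variant: Var[W] = 2 / (n_in + n_out)."""
    return gain * math.sqrt(2.0 / (n_in + n_out))

def xavier_uniform_bound(n_in, n_out, gain=1.0):
    """Bound a for U(-a, a); a uniform distribution on (-a, a) has variance a**2 / 3."""
    return gain * math.sqrt(6.0 / (n_in + n_out))

n_in, n_out = 256, 128
std = xavier_std(n_in, n_out)
bound = xavier_uniform_bound(n_in, n_out)

# Both variants target the same weight variance
assert math.isclose(std ** 2, bound ** 2 / 3)
print(f"normal std = {std:.4f}, uniform bound = {bound:.4f}")
```

The factor of 6 in the uniform bound comes from the a² / 3 variance of a uniform distribution: a² / 3 = 2 / (n_in + n_out) gives a² = 6 / (n_in + n_out).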

He / Kaiming initialization (2015): ReLU zeroes roughly half its inputs (the negative ones), which halves the mean squared signal passed to the next layer. He doubles the weight variance to compensate, giving Var[W] = 2 / n_in:

# Suits nn.Linear + ReLU stacks (assumes torch.nn is imported as nn and `layer` exists)
nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')
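To see why the factor of 2 matters, compare a Linear + ReLU stack under Var[W] = 1 / n_in and Var[W] = 2 / n_in. This is a rough standard-library sketch; `relu_stack_mean_square`, the width of 64, and the depth of 10 are illustrative choices rather than anything prescribed by the He paper or PyTorch.

```python
import math
import random

def relu_stack_mean_square(var_scale, n=64, depth=10, seed=0):
    """Mean squared activation after `depth` Linear+ReLU layers with Var[W] = var_scale / n."""
    rng = random.Random(seed)
    std = math.sqrt(var_scale / n)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]  # unit-variance input
    for _ in range(depth):
        pre = [sum(rng.gauss(0.0, std) * xi for xi in x) for _ in range(n)]
        x = [max(0.0, z) for z in pre]  # ReLU zeroes the negative half
    return sum(v * v for v in x) / n

print("Var[W] = 1 / n_in:", relu_stack_mean_square(1.0))
print("Var[W] = 2 / n_in:", relu_stack_mean_square(2.0))
```

With Var[W] = 1 / n_in the mean squared activation halves at every layer in expectation, so it decays geometrically with depth; with Var[W] = 2 / n_in it stays on the order of the input scale.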

Practical defaults:

| Activation | Recommended init |
| --- | --- |
| Linear / tanh / sigmoid | Xavier normal |
| ReLU / leaky ReLU | He (kaiming) normal |
| SELU | LeCun normal |
| Transformer attention | Xavier uniform (common convention) |
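The table can be turned into a small lookup for the target standard deviation. This is a sketch under stated assumptions: `init_std` is a hypothetical helper, it covers only the normal variants, it ignores activation-specific gains for tanh and sigmoid, and the leaky ReLU slope of 0.01 is just a common default.

```python
import math

def init_std(activation, n_in, n_out=None, negative_slope=0.01):
    """Target weight std for a layer, following the table above (hypothetical helper)."""
    if activation in ("linear", "tanh", "sigmoid"):
        if n_out is None:
            raise ValueError("Xavier needs both n_in and n_out")
        return math.sqrt(2.0 / (n_in + n_out))   # Xavier: Var[W] = 2 / (n_in + n_out)
    if activation == "relu":
        return math.sqrt(2.0 / n_in)             # He: Var[W] = 2 / n_in
    if activation == "leaky_relu":
        # He with the leaky-ReLU correction: Var[W] = 2 / ((1 + slope^2) * n_in)
        return math.sqrt(2.0 / ((1.0 + negative_slope ** 2) * n_in))
    if activation == "selu":
        return math.sqrt(1.0 / n_in)             # LeCun: Var[W] = 1 / n_in
    raise ValueError(f"no rule for activation {activation!r}")

for name in ("tanh", "relu", "leaky_relu", "selu"):
    print(f"{name:>10}: std = {init_std(name, 512, 512):.4f}")
```

Every rule is a variation on 1 / n_in; the differences are whether the backward pass is averaged in (Xavier) and how much signal the activation discards (He).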

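Biases are almost always initialized to zero: the random weights already break the symmetry between units, and a non-zero bias shifts pre-activations away from the zero mean that the variance arguments above assume, so it rarely helps.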
