What is the difference between Xavier (Glorot) and He initialization, and when do you use each?

Both scale the initial weight variance by layer size (fan-in, and in Xavier's averaged form fan-out as well) to keep activation and gradient variance stable across layers. Xavier (Glorot) assumes an activation that is roughly linear around zero, as tanh is, and is commonly used with sigmoid too; He initialization doubles the variance to suit ReLU-family activations, which zero out (or, for LeakyReLU, strongly shrink) the negative half of their inputs. Use Xavier with tanh or sigmoid and He with ReLU or LeakyReLU.

Why does weight initialization matter and how do Xavier and He initialization work?

Poor initialization causes the variance of activations to either explode or collapse across layers, triggering vanishing or exploding gradients before training even begins. Xavier initialization targets variance preservation for saturating activations; He initialization corrects for the halved variance caused by ReLU zeroing negative inputs.

What is the vanishing gradient problem and how do you fix it?

Vanishing gradients occur when repeated multiplication of small derivatives during backpropagation drives gradients toward zero, starving early layers of learning signal. The main fixes are better activations (ReLU/GELU), residual connections, batch normalization, and careful weight initialization.

How does batch size affect training — speed, convergence, and generalisation?

Larger batches give more accurate gradient estimates and enable higher GPU utilisation, but they tend to converge to sharper minima that generalise worse. Smaller batches introduce gradient noise that acts as implicit regularisation, helping the optimiser escape sharp minima and often finding flatter, better-generalising solutions — at the cost of slower wall-clock training per epoch.

Weight initialization — Deep Learning

In the training-loop demo we started one neuron at w = 0, b = 0 and it learned fine. Try that in a fifty-layer network and it will never train — not slowly, never. Initialization is the least glamorous choice in deep learning and one of the most decisive: get the scale wrong and your gradients are dead before the first step.


Watch the signal survive — or die — with depth

A unit-variance signal is pushed through 12 layers. Each bar is the activation std at that layer (log scale). The right init keeps it near the dashed line; the wrong scale makes it vanish to zero or explode. Toggle the activation and watch which init matches.

[Interactive chart: activation std (log scale) from layer 1 to layer 12. Example reading: weight std ≈ 1 gives final std = 2.8e+12, which is exploding: the signal blows up, and the gradients will too.]

The problem: variance compounds with depth

Picture a signal flowing forward through layers. Each layer multiplies it by a weight matrix. A rough rule for the variance of the output of one layer:

var(output) ≈ fan_in × var(weights) × var(input)

where fan_in is the number of inputs to the layer (often hundreds or thousands). Now stack many layers. That multiplier applies every layer, so the variance grows or shrinks exponentially with depth:

Weights too big → fan_in × var(weights) > 1 → variance explodes → activations saturate, gradients blow up to NaN.
Weights too small → multiplier < 1 → variance decays toward zero → activations vanish, gradients vanish, nothing updates.
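The rule and its compounding can be checked numerically. The sketch below is illustrative only: the sizes (fan_in = 512, 20 layers), the three weight scales, and the helper name layer_multiplier are assumptions chosen for this example, and the stack is treated as purely linear (no activation) so the rule applies directly.

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 512, 512   # illustrative sizes, not taken from the lesson

def layer_multiplier(weight_std):
    """Per-layer variance multiplier predicted by the rule: fan_in * var(weights)."""
    return fan_in * weight_std ** 2

# Empirical check of the single-layer rule on a zero-mean, unit-variance input.
x = rng.standard_normal(fan_in)
weight_std = 0.05
W = rng.standard_normal((fan_out, fan_in)) * weight_std
y = W @ x
predicted = layer_multiplier(weight_std) * x.var()
print(f"predicted var = {predicted:.3f}, measured var = {y.var():.3f}")

# Compounding: the same multiplier applied at every layer of a linear stack.
depth = 20
for std in (0.02, 1 / np.sqrt(fan_in), 0.1):
    m = layer_multiplier(std)
    print(f"weight std {std:.4f}: multiplier {m:.2f}, after {depth} layers {m ** depth:.2e}")
```

The predicted and measured variances should agree to within sampling noise, and only the middle scale, where fan_in × var(weights) is 1, avoids an exponential drift over the 20 layers.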

The signal has to thread a needle: each layer must roughly preserve the variance. Naive init explodes it, too-small init shrinks it to nothing, and only the right scale holds it near 1 all the way down.

The fix: scale to fan_in

The whole game is choosing var(weights) so the per-layer multiplier is ~1. Two schemes dominate, and which one is correct depends on the activation:

Xavier / Glorot — var(weights) = 1 / fan_in (often 2 / (fan_in + fan_out)). Designed for tanh / sigmoid, whose linear regime preserves variance.
Kaiming / He — var(weights) = 2 / fan_in. Designed for ReLU. The factor of 2 compensates for ReLU zeroing out half its inputs, which otherwise halves the variance every layer.
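As a minimal sketch, the two formulas can be written as plain NumPy helpers. The function names (xavier_std, he_std, init_weights) and the 512 × 512 layer size are hypothetical, chosen for illustration; they are not a library API. The Xavier helper uses the averaged 2 / (fan_in + fan_out) form.

```python
import numpy as np

def xavier_std(fan_in, fan_out):
    """Xavier/Glorot normal: var(weights) = 2 / (fan_in + fan_out)."""
    return np.sqrt(2.0 / (fan_in + fan_out))

def he_std(fan_in):
    """Kaiming/He normal for ReLU: var(weights) = 2 / fan_in."""
    return np.sqrt(2.0 / fan_in)

def init_weights(fan_in, fan_out, scheme, rng):
    """Sample a (fan_out, fan_in) weight matrix; scheme is 'xavier' or 'he'."""
    if scheme == "xavier":
        std = xavier_std(fan_in, fan_out)
    elif scheme == "he":
        std = he_std(fan_in)
    else:
        raise ValueError(f"unknown scheme: {scheme!r}")
    return rng.standard_normal((fan_out, fan_in)) * std

rng = np.random.default_rng(0)
W_tanh = init_weights(512, 512, "xavier", rng)   # for a tanh/sigmoid layer
W_relu = init_weights(512, 512, "he", rng)       # for a ReLU layer
print(f"xavier std: target {xavier_std(512, 512):.4f}, sampled {W_tanh.std():.4f}")
print(f"he     std: target {he_std(512):.4f}, sampled {W_relu.std():.4f}")
```

For a square layer (fan_in equal to fan_out) the averaged Xavier form reduces to 1 / fan_in, so the two Xavier variants coincide, and the He variance is exactly twice the Xavier variance.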

The rule of thumb: ReLU → He, tanh/sigmoid → Xavier. PyTorch’s defaults are reasonable, but for ReLU nets you often set it explicitly:

import torch.nn as nn
layer = nn.Linear(512, 512)
nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")  # He init
nn.init.zeros_(layer.bias)                                   # bias starts at 0

See the variance survive (or not)

The same thing with real numbers: propagate a signal through 15 ReLU layers and print the activation std at each. Naive init explodes; He init holds steady near 1.

import numpy as np

rng = np.random.default_rng(0)
n, layers = 256, 15
x0 = rng.standard_normal(n)                 # unit-variance input

def propagate(weight_std_fn):
    a = x0.copy()
    out = []
    for _ in range(layers):
        W = rng.standard_normal((n, n)) * weight_std_fn(n)
        a = np.maximum(0, W @ a)            # ReLU
        out.append(a.std())
    return out

naive = propagate(lambda fan_in: 1.0)              # std = 1  → explodes
he    = propagate(lambda fan_in: np.sqrt(2/fan_in))# He      → preserved

print("layer :   naive       He")
for i, (nv, h) in enumerate(zip(naive, he), 1):
    print(f"  {i:2d}  : {nv:9.2e}  {h:7.3f}")

layer :   naive       He
   1  :  9.66e+00    0.807
   2  :  1.02e+02    0.850
   3  :  1.15e+03    0.849
   4  :  1.42e+04    0.872
   5  :  1.66e+05    0.801
   6  :  1.60e+06    0.687
   7  :  1.70e+07    0.691
   8  :  2.29e+08    0.704
   9  :  2.82e+09    0.724
  10  :  3.18e+10    0.804
  11  :  3.63e+11    0.772
  12  :  3.90e+12    0.826
  13  :  4.46e+13    0.930
  14  :  4.80e+14    0.963
  15  :  5.30e+15    1.126

The naive column races off to enormous numbers; the He column stays near 1 all the way down. That difference is the difference between a network that trains and one that returns NaN on step one.
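The same experiment can be repeated for tanh, where Xavier is the matching scheme. This is a sketch under stated assumptions: the "too small" scale of 0.1/√n and the 0.99 saturation threshold are arbitrary illustrative choices, and propagate_tanh is a hypothetical helper mirroring the ReLU version above. Because the layers are square, Xavier is used in its 1 / fan_in form.

```python
import numpy as np

rng = np.random.default_rng(0)
n, layers = 256, 15
x0 = rng.standard_normal(n)                 # unit-variance input

def propagate_tanh(weight_std):
    """Return (activation std, fraction of saturated units) after each tanh layer."""
    a = x0.copy()
    stats = []
    for _ in range(layers):
        W = rng.standard_normal((n, n)) * weight_std
        a = np.tanh(W @ a)
        # A unit counts as saturated when |a| > 0.99 (arbitrary threshold).
        stats.append((a.std(), np.mean(np.abs(a) > 0.99)))
    return stats

runs = {
    "too small": propagate_tanh(0.1 / np.sqrt(n)),   # multiplier well below 1
    "xavier": propagate_tanh(1.0 / np.sqrt(n)),      # var(weights) = 1 / fan_in
    "naive": propagate_tanh(1.0),                    # std = 1
}

for name, stats in runs.items():
    final_std, final_sat = stats[-1]
    print(f"{name:9s}: final std = {final_std:.2e}, saturated = {final_sat:.0%}")
```

Running it should show the small scale collapsing toward zero and the naive scale pinning most units near ±1, where the derivative of tanh is close to zero and gradients cannot pass. The Xavier scale should avoid both failures, although its std drifts below 1 because tanh compresses its input slightly at every layer.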

In one breath

Each layer multiplies activation variance by ~fan_in × var(weights), so over depth the signal explodes or vanishes exponentially unless each layer roughly preserves it.
All-zero (or all-identical) weights fail to break symmetry: every neuron computes the same thing and updates identically forever, so start random and asymmetric (biases can be zero).
Scale var(weights) to fan_in: Xavier/Glorot (1/fan_in) for tanh/sigmoid, Kaiming/He (2/fan_in) for ReLU — the factor of 2 offsets ReLU zeroing half its inputs.
Verified above: naive N(0,1) init explodes to ~1e15 over 15 ReLU layers while He init holds the std near 1.
Very deep transformers add a 1/√(2·n_layers) scale on residual-writing layers to keep the residual stream’s variance controlled.
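The symmetry point in the summary can be seen directly in a tiny example. The sketch below assumes a made-up regression problem and a one-hidden-layer tanh network without biases; hidden_weight_grad is a hypothetical helper written for this illustration. It compares the hidden-layer gradient under a constant init (every weight 0.5) with the gradient under a random init.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((32, 4))     # 32 samples, 4 features (made-up data)
y = rng.standard_normal((32, 1))

def hidden_weight_grad(W1, W2):
    """Gradient of mean squared error w.r.t. W1 for a 1-hidden-layer tanh net (no biases)."""
    h = np.tanh(X @ W1)                  # hidden activations, shape (32, hidden)
    pred = h @ W2                        # predictions, shape (32, 1)
    d_pred = 2 * (pred - y) / len(X)     # dL/dpred for mean squared error
    d_h = d_pred @ W2.T                  # backprop into the hidden layer
    d_z = d_h * (1 - h ** 2)             # through tanh: tanh'(z) = 1 - tanh(z)^2
    return X.T @ d_z                     # shape (4, hidden): one column per neuron

hidden = 3
g_const = hidden_weight_grad(np.full((4, hidden), 0.5), np.full((hidden, 1), 0.5))
g_rand = hidden_weight_grad(rng.standard_normal((4, hidden)) * 0.5,
                            rng.standard_normal((hidden, 1)) * 0.5)

# Each column is one hidden neuron's gradient.
print("constant init, columns identical:", np.allclose(g_const, g_const[:, :1]))
print("random init,   columns identical:", np.allclose(g_rand, g_rand[:, :1]))
```

With the constant init every hidden neuron receives the same gradient, so a gradient step leaves the neurons identical to each other; with the random init the columns differ and the neurons can specialise.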

Quick check


Q1. Why does naive N(0,1) initialization fail in a deep network?

Q2. Why can't you initialize all the weights in a layer to the same value (e.g. zero)?

Q3. You're building a ReLU network. Which initialization should you reach for?

Good init gives the gradients a fighting start. But depth can still strangle them mid-training — vanishing & exploding gradients covers how to diagnose that with gradient norms and fix it with clipping and normalization.
