What is the difference between Xavier (Glorot) and He initialization, and when do you use each?

The short answer

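Both schemes set the variance of the initial weights from the layer's size so that activations and gradients keep a stable scale across layers. Xavier (Glorot) uses both fan-in and fan-out and assumes an activation that is approximately linear and zero-centered around the origin, such as tanh. He (Kaiming) initialization uses only fan-in and, for layers of similar width, about twice the variance, which compensates for ReLU zeroing out negative pre-activations (roughly half of them at initialization). Use Xavier with tanh or sigmoid and He with ReLU or LeakyReLU.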

How to think about it

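Consider one layer with fan_in inputs and zero-mean weights. The variance of each pre-activation is fan_in * Var(W) * (mean squared input), so the forward signal keeps its scale when Var(W) = 1 / fan_in; the same argument applied to backpropagated gradients gives Var(W) = 1 / fan_out. Xavier compromises between the two with Var(W) = 2 / (fan_in + fan_out), which works as long as the activation behaves like the identity near zero, as tanh does. ReLU breaks that assumption: it discards the negative half of a zero-centered pre-activation, halving the mean squared output, so He initialization doubles the variance to Var(W) = 2 / fan_in. Sigmoid is not zero-centered, so Xavier is a conventional default for it rather than an exact fit. For LeakyReLU with negative slope a, the factor of 2 in the He rule becomes 2 / (1 + a^2).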
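
As a rough illustration, the NumPy sketch below implements the normal-distribution variants of both rules, Var(W) = 2 / (fan_in + fan_out) for Xavier and Var(W) = 2 / fan_in for He, and pushes a random batch through a stack of ReLU layers. The function names, depth, width, and batch size are assumptions chosen for this demo and do not come from any library.

```python
import numpy as np


def xavier_normal(fan_in, fan_out, rng):
    # Xavier/Glorot: Var(W) = 2 / (fan_in + fan_out)
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))


def he_normal(fan_in, fan_out, rng):
    # He/Kaiming, fan-in mode: Var(W) = 2 / fan_in
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))


def final_activation_rms(init_fn, depth=20, width=512, batch=256, seed=0):
    # Hypothetical helper: forward a unit-variance batch through `depth`
    # bias-free ReLU layers and report the RMS of the last activations.
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(batch, width))
    for _ in range(depth):
        w = init_fn(width, width, rng)
        x = np.maximum(x @ w, 0.0)  # ReLU
    return float(np.sqrt(np.mean(x ** 2)))


xavier_rms = final_activation_rms(xavier_normal)
he_rms = final_activation_rms(he_normal)
print(f"Xavier + ReLU, final RMS: {xavier_rms:.6f}")
print(f"He + ReLU, final RMS:     {he_rms:.6f}")
```

With equal layer widths, Xavier combined with ReLU halves the mean squared activation at every layer in expectation, so the signal shrinks steadily with depth, while He keeps it near its starting scale. In PyTorch the corresponding initializers live in `torch.nn.init` (for example `xavier_uniform_` and `kaiming_normal_`).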
