What is the difference between Xavier (Glorot) and He initialization, and when do you use each?
Both draw initial weights from a zero-mean distribution whose variance is scaled by the layer's size, so that activation and gradient variance stay roughly stable across layers. Xavier (Glorot) sets the variance to 2 / (fan_in + fan_out) and assumes an activation that is roughly linear and symmetric around zero, such as tanh. He initialization uses a larger variance, 2 / fan_in, to compensate for ReLU zeroing out negative pre-activations (about half of them at initialization). Use Xavier with tanh or sigmoid, and He with ReLU or LeakyReLU.
How to think about it
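Start from a single linear layer. If the weights are zero-mean and independent of the inputs, each pre-activation has variance fan_in × Var(W) × E[x²], so keeping the forward signal at the same scale requires Var(W) = 1 / fan_in. The same argument applied to the backpropagated gradients gives Var(W) = 1 / fan_out. Xavier (Glorot) compromises between the two with Var(W) = 2 / (fan_in + fan_out), which works when the activation behaves like the identity near zero, as tanh does. ReLU breaks that assumption: it zeroes negative pre-activations, so only about half of the signal's second moment survives each layer. He initialization doubles the forward-pass variance to Var(W) = 2 / fan_in to make up for the loss. In practice, use Xavier with tanh or sigmoid and He with ReLU or LeakyReLU.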
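To make the variance argument concrete, here is a minimal sketch in plain Python (standard library only). The helper names (`xavier_std`, `he_std`, `signal_after_relu_stack`) are hypothetical and chosen for illustration. The code assumes the normal-distribution variants, fan-in mode for He, square layers, and no biases. It pushes a random vector through a stack of ReLU layers and reports the mean squared activation at the end.

```python
import math
import random


def xavier_std(fan_in: int, fan_out: int) -> float:
    """Std for Xavier/Glorot normal init: Var(W) = 2 / (fan_in + fan_out)."""
    return math.sqrt(2.0 / (fan_in + fan_out))


def he_std(fan_in: int) -> float:
    """Std for He normal init in fan-in mode: Var(W) = 2 / fan_in."""
    return math.sqrt(2.0 / fan_in)


def relu_layer(x, weights):
    """Apply y = relu(W x) for a weight matrix given as a list of rows."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]


def signal_after_relu_stack(std: float, width: int = 64, depth: int = 10,
                            seed: int = 0) -> float:
    """Mean squared activation after `depth` square ReLU layers.

    Illustrative assumptions: weights ~ N(0, std^2), no biases,
    and a standard-normal input vector.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        weights = [[rng.gauss(0.0, std) for _ in range(width)]
                   for _ in range(width)]
        x = relu_layer(x, weights)
    return sum(a * a for a in x) / len(x)


if __name__ == "__main__":
    width = 64
    print("He     :", signal_after_relu_stack(he_std(width), width))
    print("Xavier :", signal_after_relu_stack(xavier_std(width, width), width))
```

Under these assumptions the He-initialized stack should keep the signal at roughly its input scale, while the Xavier-initialized stack loses about half of it per ReLU layer. With tanh in place of ReLU the situation reverses: Xavier's smaller variance is the better match.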