Deep Learning Medium
How are batch size and learning rate related, and what is learning-rate warmup?
The short answer
Larger batches give lower-variance gradient estimates, so they typically tolerate, and often need, a proportionally larger learning rate. A high learning rate is most likely to destabilize training in the first steps, when the weights are still far from a good solution. Warmup ramps the learning rate up from a small value over those first steps to avoid early divergence, after which a decay schedule takes over.
How to think about it
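A mini-batch gradient is an average over B examples, so its variance falls roughly as 1/B. With less noise per step, a larger step is usually safe. A larger batch also means fewer updates per epoch, so a larger step is often needed to make the same progress over the same amount of data. A common heuristic is linear scaling: when the batch size is multiplied by k, multiply the learning rate by k. Square-root scaling is a more conservative alternative, and either rule tends to break down at very large batch sizes.
The scaled-up learning rate is riskiest at the start of training, when the weights are near their random initialization and the gradients are large and changing quickly. Warmup handles this by raising the learning rate from a small value to its target over the first steps, often linearly. Once the target is reached, a decay schedule such as cosine or step decay lowers the rate for the rest of training.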
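The two ideas can be sketched together in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names (`scaled_lr`, `lr_at_step`), the linear warmup shape, the cosine decay, and the example numbers (base rate 0.1 at batch size 256) are all assumptions chosen for the example.

```python
import math


def scaled_lr(base_lr: float, base_batch: int, batch: int) -> float:
    """Proportional (linear) scaling heuristic: LR grows with batch size."""
    return base_lr * batch / base_batch


def lr_at_step(step: int, peak_lr: float, warmup_steps: int,
               total_steps: int, min_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_lr (assumed shapes)."""
    if step < warmup_steps:
        # Warmup: ramp linearly so the last warmup step reaches peak_lr.
        # Using (step + 1) keeps the very first step small but nonzero.
        return peak_lr * (step + 1) / warmup_steps
    # Decay: fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical setup: a rate tuned at batch size 256, reused at batch size 1024.
peak = scaled_lr(base_lr=0.1, base_batch=256, batch=1024)
schedule = [lr_at_step(s, peak, warmup_steps=500, total_steps=10_000)
            for s in range(10_000)]
```

In a training loop, the value for the current step would be written into the optimizer's learning rate before each update. Most frameworks ship their own scheduler utilities that do the same job, so a hand-rolled function like this is mainly useful for seeing the shape of the schedule.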