datarekha

How are batch size and learning rate related, and what is learning-rate warmup?

The short answer

Larger batches give lower-variance gradient estimates, so they typically tolerate, and often need, a proportionally larger learning rate to train efficiently. A high learning rate at the very start of training, however, can destabilize it. Warmup ramps the learning rate up from a small value over the first steps to avoid early divergence, after which the rate follows a decay schedule.

How to think about it

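Treat the mini-batch gradient as an average of per-example gradients: averaging over more examples reduces its variance, so each step points in a more reliable direction and can afford to be larger. A common heuristic, the linear scaling rule, is to multiply the learning rate by k when the batch size is multiplied by k. It is a rule of thumb and tends to break down at very large batch sizes. Early in training, though, the weights are still close to their random initialization and the gradients can be large and change quickly, so starting at the full learning rate risks divergence. Warmup starts from a small learning rate, raises it to the target value over the first steps, and then hands over to a decay schedule.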
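The two ideas can be sketched together in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names `scaled_lr` and `lr_at_step`, the base values (learning rate 0.1 at batch size 256), and the choice of cosine decay after warmup are all assumptions made for the example. Any decay schedule could follow the warmup phase.

```python
import math


def scaled_lr(base_lr, base_batch_size, batch_size):
    """Linear scaling heuristic: grow the learning rate in proportion to batch size."""
    return base_lr * batch_size / base_batch_size


def lr_at_step(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay towards min_lr (assumed schedule)."""
    if step < warmup_steps:
        # Ramp up linearly; (step + 1) keeps the very first step small but nonzero.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical setup: a rate tuned at batch size 256, reused at batch size 1024.
peak = scaled_lr(base_lr=0.1, base_batch_size=256, batch_size=1024)

# Learning rate for each of 20 training steps, with 5 warmup steps.
schedule = [lr_at_step(s, peak, warmup_steps=5, total_steps=20) for s in range(20)]
```

In this sketch the rate rises over the first five steps, reaches its peak at the end of warmup, and decreases from there. In practice the warmup length and the peak rate are hyperparameters to tune, and the linear scaling factor is a starting point rather than a guarantee.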
