datarekha

What is gradient accumulation and why is it useful?

The short answer

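Gradient accumulation runs several forward and backward passes without zeroing the gradients in between, so the gradients from each micro-batch add up, and steps the optimizer only once every N micro-batches. This simulates an effective batch size of N times the micro-batch size, which can be larger than what fits in memory at once. To make the summed gradient match the average that a single large batch would give, each micro-batch loss is typically divided by N. It lets you train with large effective batches on limited GPU memory, and the cost is time: the micro-batches run one after another, so each optimizer update takes N forward and backward passes instead of one.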
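The accumulate-then-step loop can be sketched in plain Python on a toy linear regression. This is a minimal illustration, not a framework recipe: the names `grad_mse`, `train`, `micro_batch_size`, and `accum_steps` are hypothetical, the gradients are computed by hand rather than by autograd, and each micro-batch gradient is scaled by 1/N on the assumption that you want the accumulated result to equal the mean gradient of one large batch.

```python
# Toy data: y = 3x + 1 on 8 evenly spaced points in [-1, 1] (illustrative only).
XS = [i / 7 * 2 - 1 for i in range(8)]
YS = [3 * x + 1 for x in XS]


def grad_mse(w, b, xs, ys):
    """Gradient of the mean squared error of y ~ w*x + b over one batch."""
    n = len(xs)
    gw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    gb = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return gw, gb


def train(xs, ys, micro_batch_size, accum_steps, lr=0.1, epochs=500):
    """Gradient descent that steps only once every `accum_steps` micro-batches."""
    w, b = 0.0, 0.0
    acc_w, acc_b = 0.0, 0.0  # gradient buffers, NOT zeroed between micro-batches
    micro_seen = 0
    for _ in range(epochs):
        for lo in range(0, len(xs), micro_batch_size):
            hi = lo + micro_batch_size
            gw, gb = grad_mse(w, b, xs[lo:hi], ys[lo:hi])
            # Scale by 1/N so the accumulated sum is the mean over all N micro-batches.
            acc_w += gw / accum_steps
            acc_b += gb / accum_steps
            micro_seen += 1
            if micro_seen % accum_steps == 0:
                # The "optimizer step": update the parameters, then zero the buffers.
                w -= lr * acc_w
                b -= lr * acc_b
                acc_w, acc_b = 0.0, 0.0
    return w, b


# Micro-batches of 2 accumulated 4 times give an effective batch of 8,
# the same as one full batch of 8 with no accumulation.
w_acc, b_acc = train(XS, YS, micro_batch_size=2, accum_steps=4)
w_full, b_full = train(XS, YS, micro_batch_size=8, accum_steps=1)
print(w_acc, b_acc)
print(w_full, b_full)
```

Both runs take the same number of optimizer steps and should land on essentially the same parameters, differing only by floating-point rounding. The accumulated run never needs more than 2 examples in flight at a time, but it performs 4 gradient computations per update.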

