What is gradient accumulation and why is it useful?

For: ML Engineer, MLOps Engineer, Research Engineer

The short answer

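Gradient accumulation runs several forward and backward passes without zeroing the gradients, so they sum across micro-batches, and steps the optimizer only once every N micro-batches. This simulates an effective batch size of N times the micro-batch size, which can be larger than what fits in memory at once. To match the gradient of a mean loss over the full batch, each micro-batch loss is typically divided by N. It lets you train with large effective batches on limited GPU memory. The cost is that the micro-batches run sequentially, so each optimizer update takes roughly N times as long as a single micro-batch step.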

How to think about it
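
One way to make the short answer concrete is a toy example. The sketch below is not taken from any particular framework: it uses plain Python and a hypothetical one-parameter model y ≈ w * x, and the helper names (grad_mse, accumulated_step) are made up for illustration. It assumes equally sized micro-batches and a mean-squared-error loss, and it checks that summing micro-batch gradients scaled by 1/N gives the same update as one step on the full batch.

```python
# Minimal sketch of gradient accumulation on a 1-D linear model y ≈ w * x.
# Plain Python only; the function names here are illustrative, not a library API.

def grad_mse(w, xs, ys):
    """Gradient of the mean squared error (1/n) * sum((w*x - y)^2) w.r.t. w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_step(w, xs, ys, micro_batch_size, lr):
    """One optimizer step built from several equally sized micro-batches."""
    assert len(xs) % micro_batch_size == 0, "sketch assumes equal micro-batches"
    accum_steps = len(xs) // micro_batch_size
    grad_sum = 0.0  # plays the role of the gradient buffer that is not zeroed
    for i in range(accum_steps):
        lo, hi = i * micro_batch_size, (i + 1) * micro_batch_size
        # Scale each micro-batch gradient by 1/accum_steps so that the sum
        # equals the mean gradient over the whole effective batch.
        grad_sum += grad_mse(w, xs[lo:hi], ys[lo:hi]) / accum_steps
    # Single parameter update after all micro-batches have been processed.
    return w - lr * grad_sum


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]  # toy data, true weight is 2.0
w0, lr = 0.0, 0.01

w_full = w0 - lr * grad_mse(w0, xs, ys)        # one step on the full batch of 8
w_accum = accumulated_step(w0, xs, ys, 2, lr)  # same step from 4 micro-batches of 2

print(abs(w_full - w_accum) < 1e-12)
```

Only one micro-batch is ever processed at a time, which is where the memory saving comes from in a real model. In PyTorch the same pattern is usually written by calling backward() on the loss divided by N for each micro-batch and calling optimizer.step() followed by optimizer.zero_grad() once every N micro-batches. The equivalence shown here can break down for layers whose behaviour depends on the batch itself, such as batch normalisation.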
