What is gradient accumulation and when do you need it?
Gradient accumulation sums gradients over multiple small forward-backward passes before calling the optimizer, simulating a larger effective batch size without requiring the memory to hold it all at once. It is the standard workaround when the desired batch size does not fit in GPU memory.
How to think about it
Larger batches reduce the variance of the gradient estimate, which lets you raise the learning rate and train in fewer steps, a well-established empirical finding (Goyal et al., 2017). But training on a batch of 512 ImageNet images at float32 can exceed 24 GB, depending on the model, once the activations stored for the backward pass are counted. Gradient accumulation gives you the large-batch gradient without the VRAM.
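To see where the memory goes, it helps to size the input tensor alone. The sketch below assumes the common 224x224 RGB crop, which the text does not state, and `input_batch_bytes` is a hypothetical helper rather than a library function. The inputs come to well under 1 GiB, so a figure in the tens of gigabytes has to come from the activations kept for the backward pass, plus weights, gradients and optimizer state.

```python
BYTES_PER_FLOAT32 = 4

def input_batch_bytes(batch_size, channels=3, height=224, width=224):
    """Bytes for one float32 input batch of shape (batch, channels, height, width).

    The 3x224x224 default is an assumed crop size, not a value from the text.
    """
    return batch_size * channels * height * width * BYTES_PER_FLOAT32

inputs_512 = input_batch_bytes(512)
# Inputs only: activations, weights and optimizer state are not counted here.
print(f"input tensor: {inputs_512 / 2**30:.2f} GiB")
```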
How it works
Instead of one forward-backward-update cycle per batch, you run N forward-backward passes, letting the gradients accumulate, and then take one optimizer step.
accumulation_steps = 8  # effective batch = 8 * per_device_batch

optimizer.zero_grad()
for step, batch in enumerate(dataloader):
    output = model(batch["input"])
    loss = criterion(output, batch["label"])
    # Normalise loss so the accumulated gradient matches a true large batch
    loss = loss / accumulation_steps
    loss.backward()  # gradients accumulate in .grad buffers
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
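The division by accumulation_steps is not optional when the criterion averages over the micro-batch (reduction="mean", the default for PyTorch's built-in losses): without it the accumulated gradient is N times larger, which for plain SGD is equivalent to multiplying the learning rate by N and can destabilise training.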
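The effect of the division can be checked by hand on a toy problem. The following is a minimal pure-Python sketch, not PyTorch code: it uses a one-parameter model y = w * x with a mean squared error, and every name in it is illustrative. It assumes equal-sized micro-batches and a mean-reduced loss; with unequal micro-batches the match is only approximate.

```python
import math

def mean_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]

accumulation_steps = 4
micro = len(xs) // accumulation_steps  # 2 samples per micro-batch

# Reference: gradient of the loss over the whole batch at once.
full_batch_grad = mean_grad(w, xs, ys)

scaled_sum = 0.0    # mimics .grad when the loss is divided by accumulation_steps
unscaled_sum = 0.0  # mimics .grad when the division is forgotten
for i in range(accumulation_steps):
    lo, hi = i * micro, (i + 1) * micro
    g = mean_grad(w, xs[lo:hi], ys[lo:hi])
    scaled_sum += g / accumulation_steps
    unscaled_sum += g

print(math.isclose(scaled_sum, full_batch_grad))
print(math.isclose(unscaled_sum, accumulation_steps * full_batch_grad))
```

Both checks hold: the normalised sum reproduces the full-batch gradient, and the unnormalised sum is accumulation_steps times too large.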
With mixed precision (GradScaler)
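# Inside the training loop, in place of loss.backward() and the step block
scaler.scale(loss / accumulation_steps).backward()
if (step + 1) % accumulation_steps == 0:
    scaler.step(optimizer)  # unscales gradients; skips the step on inf/NaN
    scaler.update()         # adjust the scale factor once per optimizer step
    optimizer.zero_grad()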
With Hugging Face Trainer
from transformers import TrainingArguments

args = TrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,  # effective batch = 32
    ...
)
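The figure of 32 holds for a single device. Under data-parallel training each device processes its own micro-batches, so the device count multiplies in as well. A small sketch of the arithmetic, where `effective_batch_size` is a hypothetical helper and not part of transformers:

```python
def effective_batch_size(per_device_batch, accumulation_steps, num_devices=1):
    """Number of samples contributing to one optimizer update."""
    return per_device_batch * accumulation_steps * num_devices

single_gpu = effective_batch_size(4, 8)                # the configuration above
four_gpus = effective_batch_size(4, 8, num_devices=4)  # same settings, 4 devices
print(single_gpu, four_gpus)
```

If the learning rate was tuned for one effective batch size, it is worth recomputing this number whenever the device count changes.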
Trade-off
Gradient accumulation does not reduce compute: you still run a forward and a backward pass over every sample. It only reduces peak memory. Wall-clock time per optimizer update increases because the N micro-batches run sequentially. If you have multiple GPUs, data parallelism is faster, for example PyTorch DistributedDataParallel (which the PyTorch documentation recommends over the older DataParallel) or DeepSpeed ZeRO.
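A rough way to see the trade-off is to count passes and updates for one epoch. This is a sketch under stated assumptions: `epoch_cost` is a hypothetical helper, the dataset size of 1024 is made up for illustration, and a trailing partial group of micro-batches is assumed to still trigger an update.

```python
import math

def epoch_cost(num_samples, micro_batch, accumulation_steps):
    """Return (forward_backward_passes, optimizer_updates) for one epoch."""
    passes = math.ceil(num_samples / micro_batch)
    # Assumption: a leftover partial group at the end still gets an update.
    updates = math.ceil(passes / accumulation_steps)
    return passes, updates

no_accum = epoch_cost(1024, micro_batch=4, accumulation_steps=1)
with_accum = epoch_cost(1024, micro_batch=4, accumulation_steps=8)
print(no_accum, with_accum)
```

The number of forward-backward passes is the same in both cases. Only the number of optimizer updates drops, which is why accumulation saves memory but not compute.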