Deep Learning | Medium | Asked at Google, Meta, Hugging Face, Microsoft

What is gradient accumulation and when do you need it?

The short answer

Gradient accumulation sums gradients over multiple small forward-backward passes before calling the optimizer, simulating a larger effective batch size without requiring the memory to hold it all at once. It is the standard workaround when the desired batch size does not fit in GPU memory.

How to think about it

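Large batch sizes stabilise gradient estimates, which lets you use higher learning rates and train faster, a well-established empirical finding (Goyal et al., 2017). But training on a batch of 512 ImageNet images at float32 can exceed 24 GB of GPU memory, largely because the activations of every layer must be kept for the backward pass. Gradient accumulation gives you the optimisation benefits of the large batch while only ever holding one small micro-batch in VRAM.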

How it works

Instead of one forward-backward-update cycle per batch, you do N forward-backward passes accumulating gradients, then one optimizer step.

accumulation_steps = 8          # effective batch = 8 * per_device_batch
optimizer.zero_grad()

for step, batch in enumerate(dataloader):
    output = model(batch["input"])
    loss = criterion(output, batch["label"])

    # Normalise loss so the accumulated gradient matches a true large batch
    loss = loss / accumulation_steps
    loss.backward()             # gradients accumulate in .grad buffers

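    # Step every N micro-batches, and once more on the final batch so a
    # trailing partial group is not left sitting in the .grad buffers.
    # (That last group is still divided by N, so it is slightly under-weighted.)
    is_last = (step + 1) == len(dataloader)   # assumes a sized dataloader
    if (step + 1) % accumulation_steps == 0 or is_last: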
        optimizer.step()
        optimizer.zero_grad()

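The division by accumulation_steps is not optional when the loss is averaged over each micro-batch (the default reduction for most PyTorch losses): without it the accumulated gradient is N times larger than the gradient of a true large batch. For plain SGD that is equivalent to multiplying the learning rate by N, which can destabilise training.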
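To see why the division matters, here is a minimal sketch in plain Python (no PyTorch), assuming a hypothetical one-parameter model y = w * x with a mean squared error loss and equal-sized micro-batches. The helper `mse_grad` and the toy data are illustrative only and do not come from any library.

```python
# Toy check: accumulated micro-batch gradients vs. one full-batch gradient.
# Model: y_hat = w * x, loss = mean((w * x - y) ** 2). All names are illustrative.

def mse_grad(w, xs, ys):
    """Gradient of the mean squared error with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]

accumulation_steps = 4
micro_batch_size = len(xs) // accumulation_steps   # 2 samples per micro-batch

# Reference: gradient computed on the whole batch at once.
full_grad = mse_grad(w, xs, ys)

scaled_sum = 0.0     # mimics .grad when each micro-batch loss is divided by N
unscaled_sum = 0.0   # mimics .grad when the division is forgotten
for i in range(accumulation_steps):
    lo, hi = i * micro_batch_size, (i + 1) * micro_batch_size
    g = mse_grad(w, xs[lo:hi], ys[lo:hi])
    scaled_sum += g / accumulation_steps
    unscaled_sum += g

print("full batch:      ", full_grad)
print("accumulated / N: ", scaled_sum)
print("accumulated raw: ", unscaled_sum)
```

With equal-sized micro-batches the scaled sum agrees with the full-batch gradient up to floating-point rounding, while the unscaled sum is N times larger.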

With mixed precision (GradScaler)

scaler.scale(loss / accumulation_steps).backward()

if (step + 1) % accumulation_steps == 0:
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad()

With Hugging Face Trainer

from transformers import TrainingArguments

args = TrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,   # effective batch = 4 * 8 = 32 per device
    ...
)
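The effective batch size follows from simple arithmetic: per-device batch size times accumulation steps times the number of devices. A small sketch of that calculation is below; the helper names `effective_batch_size` and `steps_for_target` are hypothetical and are not part of transformers.

```python
import math

def effective_batch_size(per_device_batch, accumulation_steps, num_devices=1):
    """Number of samples that contribute to each optimizer update."""
    return per_device_batch * accumulation_steps * num_devices

def steps_for_target(target_batch, per_device_batch, num_devices=1):
    """Smallest accumulation step count that reaches at least target_batch."""
    return math.ceil(target_batch / (per_device_batch * num_devices))

single_gpu = effective_batch_size(4, 8)        # settings from the config, 1 device
four_gpus = effective_batch_size(4, 8, 4)      # same settings on 4 devices
needed = steps_for_target(256, 4, 2)           # target 256 with 2 devices

print(single_gpu, four_gpus, needed)
```

Working backwards from a target batch size, as `steps_for_target` does, is often the more practical direction: pick the largest per-device batch that fits in memory, then derive the accumulation steps.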

Trade-off

Gradient accumulation does not reduce compute: every sample still goes through a forward and a backward pass. It only reduces peak memory. Wall-clock time per update is higher than for a true large batch, because the N micro-batches run sequentially rather than in a single pass. If you have multiple GPUs, data parallelism (for example PyTorch DistributedDataParallel or DeepSpeed ZeRO) is faster.
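The trade-off can be made concrete by counting work per epoch. The sketch below is a rough accounting model, not a benchmark: `epoch_cost` is a hypothetical helper, the dataset size of 3200 is made up for illustration, and it assumes a trailing partial group of micro-batches still triggers an update.

```python
import math

def epoch_cost(num_samples, micro_batch, accumulation_steps):
    """Count sequential forward-backward passes and optimizer updates per epoch."""
    passes = math.ceil(num_samples / micro_batch)
    # Assumes the last, possibly partial, group is also stepped.
    updates = math.ceil(passes / accumulation_steps)
    return {"samples": num_samples, "passes": passes, "updates": updates}

# Micro-batches of 4 accumulated 8 times vs. a true batch of 32.
accumulated = epoch_cost(num_samples=3200, micro_batch=4, accumulation_steps=8)
large_batch = epoch_cost(num_samples=3200, micro_batch=32, accumulation_steps=1)

print("accumulated:", accumulated)
print("large batch:", large_batch)
```

Both configurations process the same samples and take the same number of optimizer updates, but the accumulated run needs eight sequential passes per update where the large-batch run needs one.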
