datarekha
Deep Learning · Medium · Asked at Google, Meta, NVIDIA, OpenAI

How does batch size affect training — speed, convergence, and generalisation?

The short answer

Larger batches give more accurate gradient estimates and enable higher GPU utilisation, but they tend to converge to sharper minima that generalise worse. Smaller batches introduce gradient noise that acts as implicit regularisation, helping the optimiser escape sharp minima and often finding flatter, better-generalising solutions — at the cost of slower wall-clock training per epoch.

How to think about it

Batch size is one of the most practically important hyperparameters because it couples hardware utilisation, convergence speed, and final model quality.

Hardware efficiency

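Larger batches fill GPU tensor cores more efficiently. Increasing the batch size raises throughput (samples processed per second) until the GPU's compute is saturated, which reduces time per epoch. Once a batch no longer fits in memory, gradient accumulation lets you reach a larger effective batch by summing gradients over several smaller micro-batches before each optimiser step.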

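# Gradient accumulation: effective batch = batch_size × accumulation_steps
# Assumes model, loss_fn, optimizer and dataloader are already defined;
# loss_fn is any mean-reduced loss (for example torch.nn.CrossEntropyLoss()).
accumulation_steps = 4
optimizer.zero_grad()

for i, (x, y) in enumerate(dataloader):
    # Divide by accumulation_steps so the accumulated gradient is an average, not a sum
    loss = loss_fn(model(x), y) / accumulation_steps
    loss.backward()  # gradients add up in .grad across micro-batches
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()       # one update per effective batch
        optimizer.zero_grad()  # reset before the next accumulation window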
Gradient noise and generalisation

With batch size B, the gradient is an average over B samples, so the standard deviation of the gradient noise falls like 1/√B as B grows. This noise is not pure harm: it helps SGD escape sharp minima. The linear scaling rule (Goyal et al., Facebook) says that when you multiply the batch size by k, you multiply the learning rate (LR) by k to compensate, and you reach that scaled LR through a warmup that ramps up gradually from the original LR over the first few epochs. Goyal et al. report that this works well up to a batch size of about 8k for image classification.
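The 1/√B behaviour can be checked empirically. The sketch below is a simulation under stated assumptions: the per-sample "gradients" are simply synthetic numbers drawn around a true value of 1.0 with unit-variance noise, and `minibatch_grad_std` is a hypothetical helper. Quadrupling the batch size should roughly halve the spread of the mini-batch mean.

```python
import random
import statistics

random.seed(0)
# Hypothetical per-sample gradients: true gradient 1.0 plus unit-variance noise
population = [1.0 + random.gauss(0.0, 1.0) for _ in range(20_000)]

def minibatch_grad_std(batch_size, trials=2_000):
    """Empirical std of the mini-batch mean gradient over many random batches."""
    means = [
        statistics.fmean(random.sample(population, batch_size))
        for _ in range(trials)
    ]
    return statistics.stdev(means)

std_16 = minibatch_grad_std(16)
std_64 = minibatch_grad_std(64)
ratio = std_16 / std_64  # theory predicts about sqrt(64 / 16) = 2

print(f"B=16: {std_16:.3f}  B=64: {std_64:.3f}  ratio: {ratio:.2f}")
```

Because the estimate is itself noisy, the printed ratio will sit near 2 rather than exactly on it.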

Sharpness of minima

Large-batch training tends to converge to sharper minima, regions where the Hessian of the loss has larger eigenvalues. The usual argument is that sharp minima generalise worse because the loss surface at inference is slightly shifted relative to the training loss surface, and in a sharp basin even a small shift in weight space causes a large jump in loss. Small batches tend to land in flatter basins, where the same shift costs little.
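A one-dimensional toy makes the sharpness argument concrete. This is only a sketch under simplifying assumptions: the loss near each minimum is modelled as a pure quadratic, and the curvature values and perturbation size are made up for illustration. The curvature stands in for a Hessian eigenvalue.

```python
def quadratic_loss(w, curvature):
    """Loss near a minimum at w = 0; curvature plays the role of a Hessian eigenvalue."""
    return 0.5 * curvature * w ** 2

perturbation = 0.1       # hypothetical small shift in weight space
sharp_curvature = 100.0  # illustrative "sharp" minimum
flat_curvature = 1.0     # illustrative "flat" minimum

# Loss increase caused by the same weight shift in each basin
sharp_increase = quadratic_loss(perturbation, sharp_curvature)
flat_increase = quadratic_loss(perturbation, flat_curvature)

print(f"sharp: {sharp_increase:.4f}  flat: {flat_increase:.4f}")
```

The same shift costs 100 times more loss in the sharp basin, because the increase is proportional to the curvature.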

Batch size     | Gradient quality | Generalisation       | Hardware use
Small (32–256) | Noisy            | Better               | Lower
Large (2k+)    | Clean            | Worse (needs tricks) | Higher

Rules of thumb

  • Default to 256–512 for vision; 16–64 per GPU for language fine-tuning.
  • If increasing batch size: scale LR linearly and use warmup.
  • For transformers at scale, large batches with AdamW + cosine schedule are standard.
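The "scale LR linearly and use warmup" rule above can be sketched as a small schedule function. The function name `scaled_lr_with_warmup`, the linear ramp shape, and the numbers used (base LR 0.1 at batch 256, 500 warmup steps) are illustrative assumptions rather than a prescribed recipe.

```python
def scaled_lr_with_warmup(step, base_lr, base_batch, batch, warmup_steps):
    """Linear scaling rule with a linear warmup from base_lr to the scaled LR."""
    k = batch / base_batch   # batch-size multiplier
    target_lr = base_lr * k  # linear scaling rule
    if step >= warmup_steps:
        return target_lr
    # Interpolate linearly from base_lr (step 0) to target_lr (step warmup_steps)
    return base_lr + (target_lr - base_lr) * step / warmup_steps

# Illustrative values: base LR 0.1 at batch 256, scaled up to batch 2048 (k = 8)
for step in (0, 250, 500, 1000):
    lr = scaled_lr_with_warmup(step, 0.1, 256, 2048, warmup_steps=500)
    print(f"step {step:4d}: lr = {lr:.3f}")
```

After warmup the function holds the scaled LR constant. In practice a decay schedule such as cosine would usually take over from that point.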