How does batch size affect training — speed, convergence, and generalisation?
Larger batches give more accurate gradient estimates and enable higher GPU utilisation, but they tend to converge to sharper minima that generalise worse. Smaller batches introduce gradient noise that acts as implicit regularisation, helping the optimiser escape sharp minima and often finding flatter, better-generalising solutions — at the cost of slower wall-clock training per epoch.
How to think about it
Batch size is one of the most practically important hyperparameters because it couples hardware utilisation, convergence speed, and final model quality.
Hardware efficiency
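Larger batches keep GPU tensor cores busier. While the GPU is under-utilised, doubling the batch size barely changes the time per step, so throughput (samples per second) roughly doubles and time per epoch falls. Once compute is saturated, step time grows roughly in proportion to batch size and the gain flattens. If the batch you want does not fit in memory, use gradient accumulation to reach it.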
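```python
# Gradient accumulation: effective batch = batch_size * accumulation_steps
# Assumes `model`, `optimizer` and `dataloader` already exist and that
# `model(x, y)` returns a scalar loss averaged over the micro-batch.
accumulation_steps = 4
optimizer.zero_grad()
for i, (x, y) in enumerate(dataloader):
    # Divide so the accumulated gradient is an average, not a sum
    loss = model(x, y) / accumulation_steps
    loss.backward()  # gradients add up in .grad across iterations
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()       # one update per effective batch
        optimizer.zero_grad()  # reset before the next accumulation window
```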
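To see why dividing the loss by the number of accumulation steps is the right scaling, here is a minimal pure-Python sketch. It assumes a toy one-parameter model `y_hat = w * x` with a mean-squared-error loss; the data, the helper `mse_grad`, and all variable names are made up for illustration and do not come from any library.

```python
# Toy check: accumulating scaled micro-batch gradients reproduces the
# full-batch gradient. Model: y_hat = w * x, loss = mean squared error.
# All names and data here are illustrative assumptions.

def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
w = 0.5

# One large batch of 8 samples.
full_grad = mse_grad(w, xs, ys)

# Four micro-batches of 2 samples. Each micro-batch gradient is divided by
# the number of accumulation steps before being summed, which mirrors
# `loss / accumulation_steps` followed by `loss.backward()`.
micro_batch_size = 2
accumulation_steps = len(xs) // micro_batch_size
accumulated_grad = 0.0
for step in range(accumulation_steps):
    lo = step * micro_batch_size
    hi = lo + micro_batch_size
    accumulated_grad += mse_grad(w, xs[lo:hi], ys[lo:hi]) / accumulation_steps

print(full_grad, accumulated_grad)
```

The two numbers agree up to floating-point rounding. The match is exact only when the micro-batches are equal in size and the model has no batch-dependent statistics; batch normalisation, for example, still computes its statistics over each micro-batch.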
Gradient noise and generalisation
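With batch size B, the gradient is an average over B samples, so the standard deviation of the gradient noise falls like 1/√B as B grows (the variance falls like 1/B). This noise is not purely harmful: it helps SGD escape sharp minima. The linear scaling rule (Goyal et al., Facebook) says: when you multiply the batch size by k, multiply the LR by k to compensate for taking k times fewer steps per epoch, and warm up gradually from the original LR to the scaled LR over the first few epochs. In their ImageNet ResNet-50 experiments this held up to a batch size of about 8k, with accuracy degrading beyond that.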
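The 1/√B behaviour can be checked with a small simulation. The sketch below assumes per-sample gradients are independent draws from a Gaussian with mean 1.0 and standard deviation 2.0, which is a simplification chosen for illustration; `minibatch_gradient_std` is a hypothetical helper, not a library function.

```python
import random
import statistics

PER_SAMPLE_STD = 2.0  # assumed spread of per-sample gradients (illustrative)

def minibatch_gradient_std(batch_size, trials=2000, seed=0):
    """Empirical standard deviation of the mini-batch mean gradient.

    Per-sample gradients are modelled as i.i.d. Gaussian draws with mean 1.0
    and standard deviation PER_SAMPLE_STD (an assumption for illustration).
    """
    rng = random.Random(seed)
    batch_means = []
    for _ in range(trials):
        batch = [rng.gauss(1.0, PER_SAMPLE_STD) for _ in range(batch_size)]
        batch_means.append(sum(batch) / batch_size)
    return statistics.stdev(batch_means)

for batch_size in (4, 16, 64):
    measured = minibatch_gradient_std(batch_size)
    predicted = PER_SAMPLE_STD / batch_size ** 0.5
    print(f"B={batch_size:3d}  measured={measured:.3f}  predicted={predicted:.3f}")
```

Each fourfold increase in batch size should roughly halve the measured noise, which is why very large batches buy progressively less noise reduction per extra sample.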
Sharpness of minima
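Large-batch training tends to converge to sharper minima: regions where the Hessian of the loss has larger eigenvalues, so the loss rises steeply in some directions. The usual argument for why sharp minima generalise worse is that the loss surface seen at inference is a slightly shifted version of the training one, so a small displacement in weight space causes a large loss jump at a sharp minimum but only a small one in a flat basin. Small batches tend to land in flatter basins.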
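A one-dimensional toy example makes the sharpness argument concrete. It assumes a quadratic loss `0.5 * curvature * (w - minimum)^2`, where `curvature` plays the role of the Hessian eigenvalue; the curvature and perturbation values are illustrative and not measured from any real model.

```python
def quadratic_loss(w, minimum, curvature):
    """Loss 0.5 * curvature * (w - minimum)^2; curvature is the 1-D Hessian."""
    return 0.5 * curvature * (w - minimum) ** 2

def loss_increase(minimum, curvature, perturbation):
    """Rise in loss when the weight is nudged away from the minimum."""
    at_minimum = quadratic_loss(minimum, minimum, curvature)
    perturbed = quadratic_loss(minimum + perturbation, minimum, curvature)
    return perturbed - at_minimum

SHARP_CURVATURE = 100.0  # illustrative value for a sharp minimum
FLAT_CURVATURE = 1.0     # illustrative value for a flat minimum
PERTURBATION = 0.1       # same weight displacement applied to both

sharp_jump = loss_increase(0.0, SHARP_CURVATURE, PERTURBATION)
flat_jump = loss_increase(0.0, FLAT_CURVATURE, PERTURBATION)
print(f"sharp: {sharp_jump:.4f}  flat: {flat_jump:.4f}")
```

For the same displacement, the loss increase scales linearly with curvature, so in this sketch the sharp minimum is penalised about a hundred times more than the flat one.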
| Batch size | Gradient quality | Generalisation | Hardware use |
|---|---|---|---|
| Small (32–256) | Noisy | Better | Lower |
| Large (2k+) | Clean | Worse without LR scaling and warmup | Higher |
Rules of thumb
- Default to 256–512 for vision; 16–64 per GPU for language fine-tuning.
- If increasing batch size: scale LR linearly and use warmup.
- For transformers at scale, large batches with AdamW + cosine schedule are standard.
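The second rule of thumb above (scale the LR linearly and use warmup) can be sketched as two small functions. The names `scaled_lr` and `warmup_lr`, the base values, and the linear ramp shape are assumptions for illustration; real training code would normally use the scheduler utilities of its framework.

```python
def scaled_lr(base_lr, base_batch_size, batch_size):
    """Linear scaling rule: LR grows in proportion to batch size."""
    return base_lr * batch_size / base_batch_size

def warmup_lr(step, warmup_steps, base_lr, target_lr):
    """Ramp linearly from base_lr to target_lr over warmup_steps, then hold."""
    if step >= warmup_steps:
        return target_lr
    return base_lr + (target_lr - base_lr) * step / warmup_steps

# Illustrative settings, not recommendations for any particular model.
BASE_LR = 0.1
BASE_BATCH = 256
BATCH = 2048
WARMUP_STEPS = 500

target = scaled_lr(BASE_LR, BASE_BATCH, BATCH)  # batch is 8x larger, so LR is 8x
for step in (0, 250, 500, 1000):
    print(step, warmup_lr(step, WARMUP_STEPS, BASE_LR, target))
```

Starting the ramp at the original LR avoids taking very large steps while the weights are still changing quickly early in training, which is the failure mode warmup is meant to prevent.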