How are batch size and learning rate related, and what is learning-rate warmup?

Larger batches give lower-variance gradient estimates, so they typically allow, and often need, a proportionally larger learning rate. A very high learning rate early in training can destabilize it, so warmup ramps the learning rate up from a small value over the first steps to avoid early divergence, after which the rate follows a decay schedule.

What is gradient accumulation and why is it useful?

Gradient accumulation runs several forward and backward passes without zeroing gradients, sums them, and only steps the optimizer after N micro-batches, simulating a larger effective batch size than fits in memory. It lets you train with large effective batches on limited GPU memory, at the cost of N sequential passes (and so more wall-clock time) per optimizer update.

How does batch size affect training — speed, convergence, and generalisation?

Larger batches give more accurate gradient estimates and enable higher GPU utilisation, but they tend to converge to sharper minima that generalise worse. Smaller batches introduce gradient noise that acts as implicit regularisation, helping the optimiser escape sharp minima and often finding flatter, better-generalising solutions — at the cost of slower wall-clock training per epoch.

What is a learning rate schedule, and why is warmup important?

A learning rate schedule changes the learning rate during training rather than keeping it fixed. Warmup starts with a very small LR and ramps it up over the first few hundred or thousand steps, preventing large early gradient updates from destabilising freshly initialised weights. After warmup, the LR is typically decayed (via cosine annealing, step decay, or linear decay) so the optimiser can settle into a minimum instead of bouncing around it.
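The warmup-then-decay shape described above can be sketched in a few lines. This is a minimal illustration, not any library's scheduler: the function name `lr_at_step` and all the default values (`base_lr`, `warmup_steps`, `total_steps`, `min_lr`) are assumptions chosen for the example, and cosine decay is just one of the decay options mentioned.

```python
import math

def lr_at_step(step, base_lr=0.1, warmup_steps=500, total_steps=10_000, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to min_lr (illustrative values)."""
    if step < warmup_steps:
        # ramp linearly from base_lr / warmup_steps up to base_lr
        return base_lr * (step + 1) / warmup_steps
    # fraction of the decay phase completed, clamped to [0, 1]
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# the LR starts tiny, peaks at the end of warmup, then decays toward min_lr
for s in (0, 249, 499, 500, 5_250, 10_000):
    print(f"step {s:>6}: lr = {lr_at_step(s):.5f}")
```

In a real training loop you would normally use your framework's built-in schedulers; the point here is only the shape of the curve.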

Batch size ↔ learning rate — Deep Learning

A beginner tunes batch size and learning rate as if they were independent. They are not. Change one and the other’s best value shifts with it — and missing that coupling is why “I increased the batch size and training got worse” is such a common complaint. The reason is simple: the batch size sets how noisy each gradient is, and the learning rate sets how far you step on that gradient.

Bigger batch = less noisy gradient

Each batch estimates the true gradient from a sample. A small batch is a noisy estimate; a large batch averages more examples, so its gradient is smoother: the noise (the standard deviation of the estimate) shrinks roughly as 1/√(batch size). A small batch takes a jagged path to the minimum; a large batch’s smoother gradient lets it head there more directly.
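The 1/√(batch size) claim is easy to check numerically. The sketch below is a toy simulation, assuming a one-dimensional gradient with independent Gaussian per-example noise; `true_grad`, `per_example_std`, and `trials` are hypothetical values picked for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
true_grad = 1.0          # hypothetical 1-D "true" gradient
per_example_std = 2.0    # hypothetical per-example noise level
trials = 5_000           # number of simulated batches per batch size

def batch_grad_std(batch_size):
    # each row is one batch of per-example gradients; the batch gradient is the row mean
    per_example = true_grad + per_example_std * rng.standard_normal((trials, batch_size))
    return per_example.mean(axis=1).std()

stds = {b: batch_grad_std(b) for b in (16, 64, 256)}
for b, s in stds.items():
    predicted = per_example_std / np.sqrt(b)
    print(f"batch {b:>3}: measured std = {s:.3f}, predicted = {predicted:.3f}")
```

Each 4× increase in batch size should roughly halve the measured standard deviation, which is the √ relationship in action.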

A smoother gradient is more trustworthy, so you can afford a bigger step. That’s the intuition behind the rule everyone uses.

The linear scaling rule

When you multiply the batch size by k, multiply the learning rate by k.

Double the batch, double the learning rate. It’s an approximation, but it’s the standard starting point — it keeps the total distance traveled per epoch roughly constant as you scale up. (Some recipes use a square-root rule instead; linear is the common default for SGD-style training.)
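As a worked example of the rule, here is a small helper that applies either the linear or the square-root variant. The name `scale_lr`, the base learning rate of 0.1, and the dataset size of 51,200 are all assumptions made for illustration, not values from any particular recipe.

```python
import math

def scale_lr(base_lr, base_batch, new_batch, rule="linear"):
    """Scale a learning rate when the batch size changes by a factor k."""
    k = new_batch / base_batch
    if rule == "linear":
        return base_lr * k
    if rule == "sqrt":
        return base_lr * math.sqrt(k)
    raise ValueError(f"unknown rule: {rule!r}")

base_lr, base_batch, new_batch = 0.1, 64, 256   # k = 4
linear_lr = scale_lr(base_lr, base_batch, new_batch)
sqrt_lr = scale_lr(base_lr, base_batch, new_batch, rule="sqrt")
print(f"linear rule: {linear_lr:.3f}   sqrt rule: {sqrt_lr:.3f}")

# "distance per epoch" = lr * steps per epoch, for a hypothetical dataset size
dataset_size = 51_200
before = base_lr * (dataset_size // base_batch)
after = linear_lr * (dataset_size // new_batch)
print(f"lr x steps per epoch: before = {before:.1f}, after = {after:.1f}")
```

Under the linear rule the product of learning rate and steps per epoch stays the same, which is the "total distance per epoch" argument in numbers. Treat the result as a starting point to tune from, not a guarantee.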

Gradient accumulation: big batch on a small GPU

Want a batch of 512 but only 64 fit in memory? Accumulate gradients over several forward/backward passes before stepping. Because backprop adds to .grad (the same reason you call zero_grad()), running backward() 8 times without stepping sums 8 micro-batches’ gradients. Divide that sum by 8 and you have the gradient of a single batch of 8 × 64 = 512.

import numpy as np

# Simulate accumulating gradients over micro-batches, then one big step.
rng = np.random.default_rng(0)
true_grad = np.array([1.0, -2.0])      # the gradient if we used the full batch

accum = np.zeros(2)
micro_batches = 8
for i in range(micro_batches):
    # each micro-batch gradient = true grad + sampling noise
    g = true_grad + rng.standard_normal(2) * 0.5
    accum += g                          # ADD, don't replace (like backward())

# average over the micro-batches → a low-noise estimate of the true gradient
effective_grad = accum / micro_batches
print("per-micro-batch noise is large; the average is close to the truth:")
print(f"  effective grad = {effective_grad.round(3)}   (true = {true_grad})")
print(f"\nThis is a batch of {micro_batches} micro-batches, stepped once.")

per-micro-batch noise is large; the average is close to the truth:
  effective grad = [ 0.79  -2.056]   (true = [ 1. -2.])

This is a batch of 8 micro-batches, stepped once.

In PyTorch, the pattern is: don’t zero_grad() until after the accumulation, and scale the loss so the sum behaves like an average:

opt.zero_grad()
for i, (xb, yb) in enumerate(loader):
    loss = loss_fn(model(xb), yb) / accum_steps   # scale so grads average
    loss.backward()                               # ADDS to .grad
    if (i + 1) % accum_steps == 0:
        opt.step()                                # step once per big batch
        opt.zero_grad()
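To see why dividing the loss by accum_steps matters, the following NumPy sketch compares the accumulated gradient with the gradient of one full batch. It assumes a mean-squared-error loss on a linear model and equal-sized micro-batches; the data, the weights, and the helper `mse_grad` are all made up for the check.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((512, 3))     # hypothetical batch of 512 examples
y = rng.standard_normal(512)
w = rng.standard_normal(3)            # current weights of a linear model

def mse_grad(Xb, yb, w):
    # gradient of mean((Xb @ w - yb) ** 2) with respect to w
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

full = mse_grad(X, y, w)              # one pass over all 512 examples

accum_steps = 8
accum = np.zeros_like(w)
for Xb, yb in zip(np.split(X, accum_steps), np.split(y, accum_steps)):
    # dividing by accum_steps plays the role of loss / accum_steps
    accum += mse_grad(Xb, yb, w) / accum_steps

print("accumulated == full batch:", np.allclose(accum, full))
print("unscaled sum == full batch:", np.allclose(accum * accum_steps, full))
```

Without the scaling, the accumulated gradient is accum_steps times too large, which acts like an unintended learning-rate increase.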

In one breath

Batch size and learning rate are coupled: batch size sets how noisy each gradient is, lr sets how far you step on it.
A bigger batch averages more examples, so the gradient noise shrinks ~1/√(batch) — smoother and more trustworthy, so you can take a bigger step.
Linear scaling rule: multiply the batch by k, multiply the lr by k (the common SGD-style default; some recipes use √k).
Large-batch training needs warmup — the big scaled-up lr overshoots on the random early weights without it.
Gradient accumulation fakes a big batch on a small GPU: run backward() over N micro-batches (grads sum), scale the loss by 1/N, then step once.

Quick check

Q1. According to the linear scaling rule, if you increase batch size from 64 to 256, what should you do to the learning rate?

Q2. Why does large-batch training typically need learning-rate warmup?

Q3. How does gradient accumulation let you train with a batch larger than fits in GPU memory?

You can now train efficiently on the hardware you have. To go beyond one GPU, distributed training (DDP & FSDP) shows how the batch — and the model itself — gets split across devices.
