What is the difference between an epoch, an iteration, and a step in deep learning training?
An epoch is one complete pass through the entire training dataset. An iteration (or step) is one forward-backward pass on a single mini-batch, followed by one optimiser update. The number of iterations per epoch equals the dataset size divided by the batch size, rounded up when the final batch is only partly filled. These distinctions matter when comparing runs with different batch sizes, reporting training progress, and configuring learning rate schedules.
How to think about it
These three terms are often conflated in tutorials, which causes confusion when comparing training runs or debugging schedules.
Precise definitions
Iteration / Step: one forward pass, one backward pass, one optimiser update, using a single mini-batch of size B.
Epoch: enough iterations to have processed every training sample once. For a dataset of N samples and batch size B:
iterations per epoch = ⌈N / B⌉ (or ⌊N / B⌋ when the final partial batch is dropped)
Total steps = epochs × iterations per epoch.
Why the distinction matters
Schedules and checkpoints are often configured in steps, not epochs — because the meaningful unit of compute is the gradient update, not how many times you looped through the data. If you halve the batch size, you double the iterations per epoch and make twice as many gradient updates per epoch. Comparing two runs by epoch count while batch sizes differ is misleading.
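import torch
from torch.utils.data import DataLoader, TensorDataset

dataset_size = 50_000
batch_size = 256
epochs = 90

steps_per_epoch = -(-dataset_size // batch_size)  # ceiling division = 196
total_steps = epochs * steps_per_epoch  # 17,640
print(f"Steps per epoch: {steps_per_epoch}")
print(f"Total steps: {total_steps}")

# PyTorch's DataLoader reports the same count through len(loader).
# Placeholder dataset so the snippet runs on its own; substitute your own Dataset.
dataset = TensorDataset(torch.zeros(dataset_size, 1))
loader = DataLoader(dataset, batch_size=batch_size, shuffle=True, drop_last=False)
assert len(loader) == steps_per_epoch  # 196; the final batch holds only 80 samples
# With drop_last=True the partial batch is discarded, so
# len(loader) == dataset_size // batch_size == 195 instead.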
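To make the point about step-based configuration concrete, here is a minimal sketch in plain Python. It counts the total optimiser steps for the same 90-epoch budget at three batch sizes, and defines a learning rate schedule indexed by step rather than by epoch. The helper names (steps_per_epoch, lr_at_step), the linear-warmup-plus-cosine shape, and the values base_lr=0.1 and warmup_steps=500 are illustrative assumptions, not settings taken from the text above.

```python
import math


def steps_per_epoch(num_samples: int, batch_size: int, drop_last: bool = False) -> int:
    """Optimiser steps needed to see every sample once (hypothetical helper)."""
    if drop_last:
        return num_samples // batch_size          # partial final batch discarded
    return math.ceil(num_samples / batch_size)    # partial final batch kept


def lr_at_step(step: int, total_steps: int,
               base_lr: float = 0.1, warmup_steps: int = 500) -> float:
    """Linear warmup then cosine decay, indexed by optimiser step.

    base_lr and warmup_steps are assumed example values.
    """
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


num_samples, epochs = 50_000, 90
for batch_size in (128, 256, 512):
    per_epoch = steps_per_epoch(num_samples, batch_size)
    total = epochs * per_epoch
    mid_lr = lr_at_step(total // 2, total)
    print(f"batch={batch_size:>3}  steps/epoch={per_epoch:>3}  "
          f"total steps={total:>6}  lr at midpoint={mid_lr:.4f}")
```

The same 90 epochs correspond to 35,190 updates at batch size 128 but only 8,820 at batch size 512, so a fixed 500-step warmup covers a very different fraction of each run. That is why the schedule takes total_steps as an explicit argument.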
Practical rules
- Report validation metrics per epoch for human readability, but configure LR schedules per step for precision.
- Use drop_last=True in DataLoader when batch normalisation requires a full batch; the final batch might otherwise contain only one or two samples.
- When using gradient accumulation, distinguish the accumulation step (a forward/backward pass on one micro-batch, with no weight update) from the optimiser step (when optimizer.step() is actually called).
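The gradient accumulation rule can be illustrated with a small counting sketch. It uses no framework: the comments mark where the forward/backward pass and the optimizer.step() call would sit in a real loop. The function name count_steps and the choice to flush a trailing partial accumulation group at the end of the epoch are assumptions; some training loops carry the leftover gradients into the next epoch or drop them instead.

```python
import math


def count_steps(num_samples: int, micro_batch_size: int, accum_steps: int):
    """Return (micro_batches, optimiser_steps) for one epoch (hypothetical helper)."""
    micro_batches = math.ceil(num_samples / micro_batch_size)
    optimiser_steps = 0
    for i in range(micro_batches):
        # Accumulation step: forward + backward on one micro-batch would run
        # here; gradients are summed into .grad but weights do not change.
        is_last = (i + 1 == micro_batches)
        if (i + 1) % accum_steps == 0 or is_last:
            # Optimiser step: optimizer.step() and optimizer.zero_grad()
            # would run here. Assumed convention: flush the final partial group.
            optimiser_steps += 1
    return micro_batches, optimiser_steps


micro, opt = count_steps(num_samples=50_000, micro_batch_size=64, accum_steps=4)
print(f"micro-batches per epoch:   {micro}")   # 782
print(f"optimiser steps per epoch: {opt}")     # 196
```

With a micro-batch of 64 and 4 accumulation steps the effective batch size is 256, so the optimiser step count (196) matches a plain run at batch size 256. Step-based schedules and logging should count these optimiser steps, not the 782 micro-batches.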