datarekha

What is the difference between an epoch, an iteration, and a step in deep learning training?

The short answer

An epoch is one complete pass through the entire training dataset. An iteration (or step) is one forward-backward pass plus one optimiser update on a single mini-batch. The number of iterations per epoch equals the dataset size divided by the batch size, rounded up (or rounded down if the final partial batch is dropped). These distinctions matter when comparing runs with different batch sizes, reporting training progress, and configuring learning rate schedules.

How to think about it

These three terms are often conflated in tutorials, which causes confusion when comparing training runs or debugging schedules.

Precise definitions

Iteration / Step: one forward pass, one backward pass, one optimiser update, using a single mini-batch of size B.

Epoch: enough iterations to have processed every training sample once. For a dataset of N samples and batch size B:

iterations per epoch = ⌈N / B⌉

Total steps = epochs × iterations per epoch.

Why the distinction matters

Schedules and checkpoints are often configured in steps, not epochs — because the meaningful unit of compute is the gradient update, not how many times you looped through the data. If you halve the batch size, you double the iterations per epoch and make twice as many gradient updates per epoch. Comparing two runs by epoch count while batch sizes differ is misleading.

dataset_size  = 50_000
batch_size    = 256
epochs        = 90

steps_per_epoch = -(-dataset_size // batch_size)  # ceiling division = 196
total_steps     = epochs * steps_per_epoch          # 17,640

print(f"Steps per epoch: {steps_per_epoch}")
print(f"Total steps:     {total_steps}")

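# PyTorch's DataLoader exposes the same number as len(loader)
import torch
from torch.utils.data import DataLoader, TensorDataset
dataset = TensorDataset(torch.zeros(dataset_size, 1))  # placeholder data, for illustration only
loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)  # drop_last defaults to False
# len(loader) == steps_per_epoch == 196 (the final batch holds the 80 leftover samples)
# With drop_last=True that partial batch is discarded: len(loader) == 50_000 // 256 == 195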
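
To make the batch-size point concrete, here is a minimal sketch that compares two runs with the same epoch count but different batch sizes. The helper name steps_per_epoch and the second batch size (128) are assumptions chosen for illustration; they do not come from any library.

```python
import math


def steps_per_epoch(num_samples: int, batch_size: int, drop_last: bool = False) -> int:
    """Number of optimiser updates in one pass over the data (hypothetical helper)."""
    if drop_last:
        # The partial final batch is discarded, so round down.
        return num_samples // batch_size
    # The partial final batch is kept, so round up.
    return math.ceil(num_samples / batch_size)


dataset_size = 50_000
epochs = 90

for batch_size in (256, 128):
    per_epoch = steps_per_epoch(dataset_size, batch_size)
    total = epochs * per_epoch
    print(f"batch_size={batch_size}: {per_epoch} steps/epoch, {total} total steps")
```

Both runs last 90 epochs, yet the run with the smaller batch makes roughly twice as many gradient updates, which is why step counts are the safer basis for comparison.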

Practical rules

  • Report validation metrics per epoch for human readability, but configure LR schedules per step for precision.
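  • Use drop_last=True in DataLoader when the model uses batch normalisation: otherwise the final batch may contain only one or two samples, which gives noisy batch statistics (and a batch of one can raise an error in training mode). Note that this lowers iterations per epoch to ⌊N / B⌋.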
  • When using gradient accumulation, distinguish the accumulation step (a forward/backward pass on one micro-batch, with gradients summed) from the optimiser step (when optimizer.step() is actually called). Count steps for schedules and checkpoints in optimiser steps.
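
The last rule can be sketched as a counting exercise. The code below runs no real model; it only tracks how many accumulation steps and optimiser steps occur in one epoch. The function name count_steps and the choice to flush leftover gradients at the end of the epoch are assumptions for illustration, and some training loops handle the remainder differently.

```python
from typing import Tuple


def count_steps(num_micro_batches: int, accumulation_steps: int) -> Tuple[int, int]:
    """Return (accumulation steps, optimiser steps) for one epoch (hypothetical helper)."""
    micro_steps = 0
    optimiser_steps = 0
    for i in range(num_micro_batches):
        # A forward and backward pass on one micro-batch would run here,
        # adding its gradients to those already accumulated.
        micro_steps += 1
        is_last = (i + 1) == num_micro_batches
        if (i + 1) % accumulation_steps == 0 or is_last:
            # This is where optimizer.step() and optimizer.zero_grad() would be called.
            # Assumption: leftover gradients are applied at the end of the epoch.
            optimiser_steps += 1
    return micro_steps, optimiser_steps


micro, updates = count_steps(num_micro_batches=196, accumulation_steps=4)
print(f"accumulation steps: {micro}, optimiser steps: {updates}")
```

With 196 micro-batches and 4 accumulation steps, the loop performs 196 forward/backward passes but only 49 weight updates, so a schedule keyed to optimiser steps advances 49 times per epoch.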
