In a PyTorch training loop, why do you need to call optimizer.zero_grad() before backpropagation?

For: ML Engineer, Research Engineer, AI / LLM Engineer

The short answer

PyTorch accumulates gradients by default: each call to loss.backward() adds the new gradients to whatever is already stored in each parameter's .grad instead of overwriting it. If you do not clear them each iteration, gradients from previous batches are summed with the current batch's, and the optimizer updates the weights from that stale mix. optimizer.zero_grad() clears the stored gradients so each step uses only the current batch's signal.
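The accumulation is easy to see on a toy problem. The sketch below is only an illustration, not a real training loop: the single parameter w, the fixed input x, and the helper loss_fn are all made up for this example, chosen so the gradient can be checked by hand.

```python
import torch

# Hypothetical one-parameter "model": w is the only trainable value.
w = torch.tensor([1.0], requires_grad=True)
x = torch.tensor([3.0])
optimizer = torch.optim.SGD([w], lr=0.1)


def loss_fn():
    # loss = w * x, so d(loss)/dw equals x (3.0) on every call.
    return (w * x).sum()


# Without clearing: the second backward() adds to the stored gradient.
loss_fn().backward()
grad_after_first = w.grad.clone()   # gradient from one backward pass

loss_fn().backward()
grad_accumulated = w.grad.clone()   # first gradient + second gradient

# With clearing: the stored gradient is discarded before the next backward pass.
optimizer.zero_grad()
loss_fn().backward()
grad_after_reset = w.grad.clone()   # gradient from this backward pass only

print(grad_after_first.item(), grad_accumulated.item(), grad_after_reset.item())
```

Here the second backward pass doubles the stored gradient even though the loss has not changed, and clearing brings it back to the single-pass value. In a typical loop the order per batch is optimizer.zero_grad(), then loss.backward(), then optimizer.step(); the clearing call only has to happen somewhere between the previous step() and the next backward().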
