datarekha

In a PyTorch training loop, why do you need to call optimizer.zero_grad() before backpropagation?

The short answer

PyTorch accumulates gradients by default: every call to loss.backward() adds the new gradients to whatever is already stored in each parameter's .grad instead of overwriting it. If you do not clear them each iteration, gradients from previous batches are summed with the current batch's, and optimizer.step() updates the weights from that stale sum. Calling optimizer.zero_grad() clears the stored gradients so each step uses only the current batch's signal.
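To show where the call sits in practice, here is a minimal sketch of a training loop. The model, the random batches, and the hyperparameters are placeholders chosen for illustration, not part of the original question.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical toy setup: a linear model and three random batches
# standing in for a real DataLoader.
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
batches = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(3)]

for inputs, targets in batches:
    optimizer.zero_grad()                    # clear gradients left from the previous batch
    loss = loss_fn(model(inputs), targets)   # forward pass on the current batch
    loss.backward()                          # fill .grad with the current batch's gradients
    optimizer.step()                         # update the weights from .grad
```

The reset only has to happen before loss.backward(); placing it at the top of the loop body is one common convention.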

How to think about it

Think of each parameter's .grad as a running total. loss.backward() never overwrites it; it adds the new gradient on top. optimizer.step() then updates the weights from whatever .grad currently holds. Without a reset, step N uses the sum of the gradients from batches 1 through N rather than the gradient of batch N alone. The accumulation is deliberate: it lets you sum gradients over several small batches before taking a single step, which is why PyTorch leaves the reset to you.
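The accumulation is easy to observe on a single scalar. The sketch below uses a made-up parameter w and the loss 3 * w, whose gradient with respect to w is 3 on every backward pass.

```python
import torch

# Hypothetical one-parameter example.
w = torch.tensor(2.0, requires_grad=True)

# First backward pass: .grad goes from None to 3.
(3 * w).backward()
first = w.grad.item()

# Second backward pass without a reset: the new 3 is added to the stored 3.
(3 * w).backward()
accumulated = w.grad.item()

# Reset the stored gradient in place, then run backward again.
w.grad.zero_()
(3 * w).backward()
after_reset = w.grad.item()

print(first, accumulated, after_reset)  # 3.0 6.0 3.0
```

Here the reset is done directly on the tensor; optimizer.zero_grad() does the same job for every parameter the optimizer manages.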
