What is early stopping, and how does it prevent overfitting?
Early stopping monitors validation loss after each epoch and halts training when it has not improved for a set number of consecutive epochs (the patience). It keeps the model from memorising training data past the point of best generalisation, acting as a cheap regulariser that requires no change to the model or loss function, only a held-out validation set.
How to think about it
Early stopping is one of the simplest regularisation techniques: keep the checkpoint where held-out performance is best, rather than training for a fixed number of epochs and using whatever weights come out last.
Why training and validation loss diverge
After enough epochs, the model starts to learn patterns specific to the training set that do not generalise. Training loss continues to fall, but validation loss plateaus and then rises. The widening gap between the two curves is a practical indicator of overfitting.
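To make the divergence concrete, here is a small sketch that takes per-epoch loss curves, finds the epoch with the lowest validation loss, and computes the validation-minus-training gap. The function name `best_epoch_and_gaps` and the loss values are invented for illustration; they are not measurements from a real run.

```python
def best_epoch_and_gaps(train_losses, val_losses):
    """Return (epoch index with the lowest validation loss, per-epoch gaps).

    The gap at each epoch is validation loss minus training loss.
    """
    if not val_losses or len(train_losses) != len(val_losses):
        raise ValueError("need two non-empty loss lists of equal length")
    gaps = [v - t for t, v in zip(train_losses, val_losses)]
    best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__)
    return best_epoch, gaps


# Made-up curves: training loss keeps falling, validation loss turns upward.
train_curve = [1.00, 0.70, 0.50, 0.35, 0.25, 0.18]
val_curve = [1.05, 0.80, 0.65, 0.60, 0.66, 0.75]

best_epoch, gaps = best_epoch_and_gaps(train_curve, val_curve)
print(f"best epoch: {best_epoch}")
print("gaps:", [round(g, 2) for g in gaps])
```

In this toy data the gap grows every epoch, but validation loss itself only starts rising after epoch 3, which is the checkpoint early stopping would aim to keep.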
Implementation
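    import torch

    # Assumes model, optimizer, train_loader, val_loader, max_epochs,
    # train_one_epoch, and evaluate are defined elsewhere.
    best_val_loss = float("inf")
    patience = 5  # epochs to wait without improvement before stopping
    strikes = 0   # consecutive epochs without improvement so far

    for epoch in range(max_epochs):
        train_one_epoch(model, train_loader, optimizer)
        val_loss = evaluate(model, val_loader)

        if val_loss < best_val_loss:
            # New best: record it, checkpoint the weights, reset the counter
            best_val_loss = val_loss
            torch.save(model.state_dict(), "best_checkpoint.pt")
            strikes = 0
        else:
            strikes += 1
            if strikes >= patience:
                print(f"Early stopping at epoch {epoch}")
                break

    # Restore the best weights
    model.load_state_dict(torch.load("best_checkpoint.pt"))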
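The same counting logic can be pulled into a small framework-free helper, which makes it easy to check against a list of simulated losses before wiring it into a real training loop. This is a minimal sketch: the class name `EarlyStopper` and its `step` method are hypothetical, not part of PyTorch, and the loss values are made up.

```python
class EarlyStopper:
    """Track the best validation loss and count epochs without improvement."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best_loss = float("inf")
        self.best_epoch = None
        self.strikes = 0

    def step(self, val_loss, epoch):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best_loss:
            # Improvement: a real loop would also save a checkpoint here.
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.strikes = 0
        else:
            self.strikes += 1
        return self.strikes >= self.patience


# Simulated validation losses standing in for evaluate(model, val_loader).
simulated_val_losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.65, 0.70, 0.80]

stopper = EarlyStopper(patience=3)
stopped_at = None
for epoch, val_loss in enumerate(simulated_val_losses):
    if stopper.step(val_loss, epoch):
        stopped_at = epoch
        break

print(f"stopped at epoch {stopped_at}, best epoch was {stopper.best_epoch}")
```

Note that the stopping epoch and the best epoch differ by the patience, which is why the weights from the best checkpoint are restored after the loop rather than keeping the final ones.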
Patience tuning
Too small a patience stops training on a transient spike in validation loss. Too large a patience wastes compute on epochs that no longer help. A patience of 5–20 epochs is typical; with learning-rate schedules that use warmup, start counting patience only after the warmup period ends.
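The effect of patience and warmup can be tried out on a simulated loss curve. The sketch below takes one possible reading of the warmup advice: the best loss is tracked from the first epoch, but epochs without improvement only count as strikes once warmup is over. The function `stop_epoch`, its `warmup_epochs` parameter, and the loss values are assumptions made for illustration.

```python
def stop_epoch(val_losses, patience, warmup_epochs=0):
    """Return the epoch at which early stopping triggers, or None if it never does.

    Epochs before warmup_epochs still update the best loss but never add strikes.
    """
    best_loss = float("inf")
    strikes = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            strikes = 0
        elif epoch >= warmup_epochs:
            strikes += 1
            if strikes >= patience:
                return epoch
    return None


# Made-up curve: noisy during warmup (epochs 0-3), a one-epoch spike at
# epoch 6, and a genuine upward trend from epoch 9 onward.
curve = [1.00, 1.30, 1.20, 1.10, 0.90, 0.80, 0.85, 0.70, 0.60, 0.62, 0.64, 0.66]

print(stop_epoch(curve, patience=2))                   # counts warmup noise
print(stop_epoch(curve, patience=1, warmup_epochs=4))  # trips on the spike
print(stop_epoch(curve, patience=2, warmup_epochs=4))  # rides out the spike
```

On this curve, counting strikes during warmup halts training while the loss is still far from its minimum, and a patience of 1 halts on the isolated spike at epoch 6 even though the loss improves again afterwards.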