Why does training loss keep falling while validation loss rises?
This divergence is the signature of overfitting: the model has enough capacity to memorise training-set specifics — noise, label errors, dataset-specific patterns — that do not generalise. Training loss measures fit to what has already been seen; validation loss measures generalisation to held-out data. As the model memorises rather than learns structure, it scores better on training data and worse on everything else.
How to think about it
Training and validation loss usually fall together at first, then diverge. Knowing why this happens, and what to do about it, is a core practical skill.
The mechanism
A neural network with enough parameters can fit almost any finite training set perfectly. Once it has picked up the patterns that generalise, continued training starts fitting residual noise: random label variation, measurement error, and distributional quirks unique to the training split. These quirks are absent from the validation set, so validation loss climbs even as training loss falls.
The gap between the two curves is a practical proxy for variance: the model's sensitivity to which specific samples it happened to train on. It is an indicator rather than an exact measurement, because the validation set is itself a finite sample.
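The memorisation effect can be reproduced without a neural network. The sketch below is a toy illustration rather than a real experiment: a 1-nearest-neighbour classifier stands in for a high-capacity model, and the data, the flipped-label pattern, and every function name are assumptions made up for the example.

```python
# Toy data: the true rule is "label 1 when x >= 0.5".
def true_label(x):
    return 1 if x >= 0.5 else 0

# 50 training points on a grid; every 5th label is flipped to simulate label noise.
train_x = [i / 50 for i in range(50)]
train_y = [1 - true_label(x) if i % 5 == 0 else true_label(x)
           for i, x in enumerate(train_x)]

# 50 held-out points, offset from the training grid, with clean labels.
val_x = [(j + 0.4) / 50 for j in range(50)]
val_y = [true_label(x) for x in val_x]

def memoriser(x):
    """1-nearest-neighbour: return the stored label of the closest training point."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

def simple_rule(x):
    """A low-capacity model that only knows the threshold."""
    return true_label(x)

def accuracy(model, xs, ys):
    return sum(model(x) == y for x, y in zip(xs, ys)) / len(xs)

memoriser_train = accuracy(memoriser, train_x, train_y)
memoriser_val = accuracy(memoriser, val_x, val_y)
simple_train = accuracy(simple_rule, train_x, train_y)
simple_val = accuracy(simple_rule, val_x, val_y)

print(f"memoriser   train={memoriser_train:.2f}  val={memoriser_val:.2f}")
print(f"simple rule train={simple_train:.2f}  val={simple_val:.2f}")
```

On this toy data the memoriser scores 1.00 on the training set and 0.80 on the held-out points, while the simple rule scores 0.80 and 1.00: the model that reproduces the flipped labels looks better on training data and worse everywhere else.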
Diagnostic questions
Is it definitely overfitting? Check whether training accuracy is near 100% while validation accuracy is much lower. If training accuracy is also low, the problem is underfitting (high bias), not overfitting.
How fast is the gap growing? A slowly widening gap is normal early in training. A validation curve that turns sharply upward suggests the model has exhausted generalisable signal and is now memorising.
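The two checks above can be sketched as small helper functions. This is a rough illustration only: the function names, the `high_train` and `max_gap` thresholds, and the loss curves are hypothetical values chosen for the example, and sensible thresholds depend on the task.

```python
def diagnose(train_acc, val_acc, high_train=0.95, max_gap=0.10):
    """Rough triage from final accuracies. Both thresholds are illustrative."""
    gap = train_acc - val_acc
    if gap > max_gap:
        return "overfitting"    # fits training data far better than held-out data
    if train_acc < high_train:
        return "underfitting"   # cannot fit even the training data well
    return "ok"

def gap_growth(train_losses, val_losses):
    """Per-epoch change in the gap (validation loss minus training loss)."""
    gaps = [v - t for t, v in zip(train_losses, val_losses)]
    return [later - earlier for earlier, later in zip(gaps, gaps[1:])]

def best_epoch(val_losses):
    """Index of the epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__)

# Made-up loss curves: training loss keeps falling, validation loss turns upward.
train_losses = [1.00, 0.70, 0.50, 0.35, 0.25, 0.15]
val_losses = [1.05, 0.80, 0.65, 0.60, 0.70, 0.90]

growth = gap_growth(train_losses, val_losses)
turning_point = best_epoch(val_losses)
print("gap growth per epoch:", [round(g, 2) for g in growth])
print("validation loss is lowest at epoch", turning_point)
```

An accelerating sequence of gap increases after the turning point is the pattern described above: the model keeps improving on the training set while its held-out performance degrades.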
Fixes ranked by ease
| Fix | Mechanism |
|---|---|
| Early stopping | Stop before memorisation takes over |
| Dropout | Randomly deactivating units discourages co-adaptation |
| Weight decay (L2) | Penalises large weights, pulling them towards zero |
| Data augmentation | Expands effective training set size |
| More training data | Reduces variance directly |
| Simpler model | Reduces capacity available for memorisation |
| Label smoothing | Discourages overconfident predictions, limiting fit to noisy labels |
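
Early stopping, the first row of the table, can be sketched as a small patience-based helper. This is a minimal sketch under assumptions: the `EarlyStopping` class, its `patience` and `min_delta` parameters, and the validation losses are all hypothetical, and a real loop would also save a checkpoint whenever the validation loss improves.

```python
class EarlyStopping:
    """Signals a stop once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses standing in for a real training loop.
val_losses = [1.05, 0.80, 0.65, 0.60, 0.70, 0.90, 1.10, 1.30]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    # In real training: run one epoch, evaluate, and checkpoint on improvement.
    if stopper.step(epoch, loss):
        stopped_at = epoch
        break

print(f"best epoch {stopper.best_epoch}, stopped at epoch {stopped_at}")
```

With these values the best epoch is 3 and training stops at epoch 6, after three epochs without improvement. The weights to keep are those from the best epoch, not the last one.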
```python
import torch

# Label smoothing in PyTorch: softens the one-hot target distribution
loss_fn = torch.nn.CrossEntropyLoss(label_smoothing=0.1)
```
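The weight-decay row can be illustrated on a single parameter. The sketch below uses plain gradient descent in pure Python so the penalty term is visible. The `fit_slope` function, the data, and the hyperparameters are assumptions made up for the example. In PyTorch the same idea is usually applied through the `weight_decay` argument of an optimiser such as `torch.optim.SGD`.

```python
def fit_slope(xs, ys, weight_decay=0.0, lr=0.1, steps=500):
    """Gradient descent for y = w * x (approximately), with an optional L2 penalty on w."""
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of the mean squared error 1/(2n) * sum((w*x - y)^2).
        grad = sum((w * x - y) * x for x, y in zip(xs, ys)) / n
        # The penalty (weight_decay / 2) * w^2 adds weight_decay * w to the gradient.
        grad += weight_decay * w
        w -= lr * grad
    return w

# Made-up data that roughly follows y = x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 0.9, 2.2, 2.8]

w_plain = fit_slope(xs, ys)
w_decayed = fit_slope(xs, ys, weight_decay=0.5)
print(f"without decay: w={w_plain:.4f}, with decay: w={w_decayed:.4f}")
```

The penalised fit settles on a smaller weight than the unpenalised one. In a large network the same pull towards zero leaves less room for the large weights that fitting individual noisy samples tends to require.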
When a gap is acceptable
Some gap is almost always expected: the model is optimised directly on the training samples, so it fits them at least slightly better than held-out data, even when both splits come from the same distribution. The question is whether the gap is large enough to matter for the deployment use case. For production models, track the gap across training runs and set a threshold beyond which you apply stronger regularisation.
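Tracking the gap across runs can be as simple as the check below. It is a sketch under stated assumptions: the `flag_runs` function, the run names, the loss values, and the `max_gap` threshold are all hypothetical, and an appropriate threshold depends on the loss scale and the deployment use case.

```python
def flag_runs(runs, max_gap):
    """Return names of runs whose final gap (val loss - train loss) exceeds max_gap."""
    flagged = []
    for name, (train_loss, val_loss) in runs.items():
        if val_loss - train_loss > max_gap:
            flagged.append(name)
    return flagged

# Hypothetical final (train loss, validation loss) pairs from three training runs.
runs = {
    "baseline": (0.30, 0.38),
    "wider-model": (0.12, 0.55),
    "with-dropout": (0.28, 0.35),
}

needs_regularisation = flag_runs(runs, max_gap=0.15)
print(needs_regularisation)
```

Here only the `wider-model` run exceeds the threshold, so it would be the candidate for stronger regularisation.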