datarekha
Deep Learning · Easy · Asked at Google, Meta, Amazon, Apple

Why does training loss keep falling while validation loss rises?

The short answer

This divergence is the signature of overfitting: the model has enough capacity to memorise training-set specifics — noise, label errors, dataset-specific patterns — that do not generalise. Training loss measures fit to what has already been seen; validation loss measures generalisation to held-out data. As the model memorises rather than learns structure, it scores better on training data and worse on everything else.

How to think about it

Training loss and validation loss usually fall together at first, then diverge. Understanding why that happens, and what to do about it, is a core practical skill.

The mechanism

A neural network with enough parameters can fit almost any finite training set perfectly, even one whose labels are partly wrong. Once it has learned the true underlying patterns, continued training starts fitting residual noise: random label variation, measurement error, and distributional quirks unique to the training split. These quirks do not recur in the validation set, so validation loss climbs even as training loss falls.
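The memorisation effect can be reproduced without a neural network. The sketch below is a toy stand-in, not a training run: a 1-nearest-neighbour classifier plays the role of a model with enough capacity to memorise every training label, and a 5-nearest-neighbour classifier plays the role of a smoother model. The data, the two flipped labels, and the helper names `knn_predict` and `error_count` are all assumptions made for illustration.

```python
def knn_predict(train_x, train_y, x, k):
    """Majority vote over the k training points nearest to x."""
    nearest = sorted(range(len(train_x)), key=lambda j: abs(x - train_x[j]))[:k]
    votes = sum(train_y[j] for j in nearest)
    return 1 if 2 * votes > k else 0


def error_count(train_x, train_y, xs, ys, k):
    """Number of points in (xs, ys) that the k-NN model gets wrong."""
    return sum(knn_predict(train_x, train_y, x, k) != y for x, y in zip(xs, ys))


# Toy data (an assumption for illustration): the true label is 1 when x >= 10.
train_x = [float(i) for i in range(20)]
train_y = [1 if x >= 10 else 0 for x in train_x]
train_y[3] = 1    # label noise: two training labels are flipped
train_y[15] = 0

# Held-out points drawn from the same rule, with clean labels.
val_x = [i + 0.4 for i in range(20)]
val_y = [1 if x >= 10 else 0 for x in val_x]

results = {}
for k in (1, 5):
    results[k] = (
        error_count(train_x, train_y, train_x, train_y, k),  # training errors
        error_count(train_x, train_y, val_x, val_y, k),      # validation errors
    )
    print(f"k={k}: train errors={results[k][0]}, validation errors={results[k][1]}")
```

With k=1 the model reproduces every training label, including the two flipped ones, so it makes no training errors but copies both mistakes onto the held-out points. With k=5 the vote smooths the flipped labels away: it "misses" those two noisy training labels yet classifies every held-out point correctly.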

The gap between the two curves is a practical proxy for variance: the model's sensitivity to which specific samples it happened to train on.

Diagnostic questions

Is it definitely overfitting? Check whether training accuracy is near 100% while validation accuracy is much lower. If training accuracy is also low, the problem is underfitting (high bias), not overfitting.
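This check can be written as a rough triage function. The sketch below is only a heuristic: the function name `diagnose` and both thresholds (`good_acc`, `max_gap`) are hypothetical values chosen for illustration and would need tuning for any real task.

```python
def diagnose(train_acc, val_acc, good_acc=0.95, max_gap=0.05):
    """Rough triage from final accuracies; both thresholds are hypothetical."""
    if train_acc < good_acc:
        # The model cannot even fit the data it has seen: high bias.
        return "underfitting"
    if train_acc - val_acc > max_gap:
        # Fits the training set well but generalises poorly: high variance.
        return "overfitting"
    return "ok"


print(diagnose(train_acc=0.99, val_acc=0.80))
print(diagnose(train_acc=0.70, val_acc=0.68))
print(diagnose(train_acc=0.97, val_acc=0.95))
```

The three calls cover an overfitting case, an underfitting case, and a healthy one, in that order.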

How fast is the gap growing? A slowly widening gap is normal early in training. A suddenly steepening validation curve suggests the model has exhausted generalisable signal and is now memorising.
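One way to put a number on this is to compute the gap at each epoch and then its epoch-to-epoch change. The sketch below assumes per-epoch losses have already been logged into two lists; the function name `gap_report` and the loss values are invented for illustration.

```python
def gap_report(train_losses, val_losses):
    """Return per-epoch gaps, their epoch-to-epoch growth, and the best epoch."""
    gaps = [v - t for t, v in zip(train_losses, val_losses)]
    growth = [later - earlier for earlier, later in zip(gaps, gaps[1:])]
    # Epoch (0-based) at which validation loss was lowest.
    best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__)
    return gaps, growth, best_epoch


# Illustrative curves: training loss keeps falling, validation loss turns upward.
train_losses = [1.00, 0.70, 0.50, 0.35, 0.25, 0.15]
val_losses = [1.05, 0.80, 0.65, 0.60, 0.70, 0.90]

gaps, growth, best_epoch = gap_report(train_losses, val_losses)
print("best epoch:", best_epoch)
print("gap growth per epoch:", [round(g, 2) for g in growth])
```

A roughly constant growth value corresponds to the slowly widening gap described above; growth that keeps increasing after `best_epoch` is the steepening pattern that suggests memorisation.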

Fixes ranked by ease

Fix                  Mechanism
Early stopping       Stop before memorisation takes over
Dropout              Stochastic deactivation prevents co-adaptation
Weight decay (L2)    Penalises large, specialised weights
Data augmentation    Expands effective training set size
More training data   Reduces variance directly
Simpler model        Reduces capacity available for memorisation
Label smoothing      Discourages over-confident predictions on noisy labels

import torch

# Label smoothing in PyTorch: softens the target distribution
loss_fn = torch.nn.CrossEntropyLoss(label_smoothing=0.1)
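Early stopping, the first fix in the table, needs no framework support. The following is a minimal sketch rather than a library API: the class name `EarlyStopping`, its `patience` and `min_delta` parameters, and the validation losses are all assumptions made for illustration.

```python
class EarlyStopping:
    """Signals a stop once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta      # smallest decrease that counts as progress
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Illustrative validation losses: improving at first, then rising.
val_losses = [0.90, 0.70, 0.60, 0.58, 0.61, 0.66, 0.74, 0.85]

stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, val_loss in enumerate(val_losses):
    if stopper.step(epoch, val_loss):
        stopped_at = epoch
        break

print(f"stopped at epoch {stopped_at}, best epoch was {stopper.best_epoch}")
```

In a real training loop you would typically also save a checkpoint whenever `best_epoch` updates and restore it after stopping, since the weights at the stopping epoch are already past the validation minimum.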

When a gap is acceptable

Some gap is almost always expected: the model is optimised directly on the training set, so training loss is an optimistic estimate, and finite training and validation samples never match exactly. The question is whether the gap is large enough to matter for the deployment use case. For production models, track the gap across training runs and set a threshold beyond which you apply stronger regularisation.
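Tracking the gap across runs can be as simple as the check below. It is a sketch under stated assumptions: the function name `runs_over_threshold`, the run names, the loss values, and the `max_gap` threshold are all hypothetical, and a suitable threshold depends on the loss scale and the deployment requirements.

```python
def runs_over_threshold(runs, max_gap):
    """Return the names of runs whose final validation-minus-training loss exceeds max_gap."""
    flagged = []
    for name, (train_loss, val_loss) in runs.items():
        if val_loss - train_loss > max_gap:
            flagged.append(name)
    return flagged


# Final (training loss, validation loss) per run; values are illustrative.
runs = {
    "run_a": (0.30, 0.34),
    "run_b": (0.12, 0.41),
    "run_c": (0.25, 0.33),
}

flagged = runs_over_threshold(runs, max_gap=0.10)
print("runs needing stronger regularisation:", flagged)
```

Only `run_b` is flagged here: it has the lowest training loss of the three but the highest validation loss.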
