What is log loss and why does it penalise confident wrong predictions more than uncertain ones?
Log loss (cross-entropy loss) measures how well a model's predicted probabilities match the true labels: it is the average negative log-likelihood of the correct class. It penalises confident wrong predictions severely because -log(p) grows without bound as the probability p assigned to the true class approaches zero. Predicting 0.99 for the wrong class leaves 0.01 for the true class and costs -ln(0.01) ≈ 4.61, about 5x the cost of predicting 0.6 for the wrong class (-ln(0.4) ≈ 0.92), and the gap keeps widening as confidence grows. A perfect model achieves 0; a binary classifier that always predicts 0.5 achieves ln(2) ≈ 0.693.
How to think about it
Give the formula, the asymmetric-penalty intuition, then practical guidance on interpreting and improving log loss.
The formula
For N samples with binary labels:
Log Loss = -(1/N) * sum_i [ y_i * log(p_i) + (1 - y_i) * log(1 - p_i) ]
Where y_i is the true label (0 or 1) of sample i, p_i is the predicted probability of the positive class for that sample, and log is the natural logarithm.
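As a minimal sketch, the binary formula can be written in plain Python. The function name `binary_log_loss`, the `eps` clipping value, and the sample data are illustrative assumptions, not part of the definition; clipping is only there to keep log(0) from occurring.

```python
import math

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary log loss using the natural logarithm.

    eps is an assumed clipping constant that keeps p strictly inside (0, 1).
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        # Only one of the two terms is non-zero for a 0/1 label.
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

# Hypothetical labels and predicted positive-class probabilities.
labels = [1, 0, 1, 1]
probs = [0.9, 0.2, 0.7, 0.4]
print(round(binary_log_loss(labels, probs), 4))
```

The last sample (true label 1, predicted 0.4) contributes more than the other three combined, which is the asymmetry the formula is designed to produce.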
For multi-class problems with K classes, a one-hot true label y_ik, and predicted probability p_ik of class k for sample i:
Log Loss = -(1/N) * sum_i sum_k [ y_ik * log(p_ik) ]
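Because the true label is one-hot, the inner sum over k keeps only the log-probability of the true class. The sketch below uses that shortcut; it assumes labels are given as class indices rather than one-hot vectors, and the function name and data are hypothetical.

```python
import math

def multiclass_log_loss(true_classes, prob_rows, eps=1e-15):
    """Mean multi-class log loss.

    true_classes[i] is the index of the correct class for sample i;
    prob_rows[i] is the predicted distribution over the K classes.
    eps is an assumed floor that avoids log(0).
    """
    total = 0.0
    for k, row in zip(true_classes, prob_rows):
        # One-hot y_ik zeroes every term except the true class k.
        total += math.log(max(row[k], eps))
    return -total / len(true_classes)

# Hypothetical 3-class example.
true_classes = [0, 2, 1]
prob_rows = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
]
print(round(multiclass_log_loss(true_classes, prob_rows), 4))
```

With a uniform prediction over K classes the result is ln(K), the multi-class counterpart of the ln(2) reference value for binary problems.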
Why confident wrong predictions hurt so much
The contribution per sample when the true label is 1 is -log(p):
| Predicted p | -log(p) |
|---|---|
| 0.99 | 0.01 |
| 0.70 | 0.36 |
| 0.50 | 0.69 |
| 0.10 | 2.30 |
| 0.01 | 4.61 |
Predicting 0.01 when the true label is 1 contributes 4.61, roughly 460x the penalty of a confident correct prediction of 0.99 (0.01). Because log loss is a proper scoring rule, this non-linear punishment pushes the model towards well-calibrated probabilities rather than unjustified extreme scores.
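The table values can be reproduced with a few lines of Python, which also gives the ratio between the two extremes without rounding. This is only a check of the arithmetic; the variable names are illustrative.

```python
import math

# Per-sample loss when the true label is 1 and the model predicts p.
penalties = {p: -math.log(p) for p in (0.99, 0.70, 0.50, 0.10, 0.01)}

for p, loss in penalties.items():
    print(f"p = {p:.2f}  ->  -log(p) = {loss:.2f}")

# Confident miss versus confident hit, using unrounded values.
ratio = penalties[0.01] / penalties[0.99]
print(f"ratio = {ratio:.0f}")
```

With unrounded values the ratio comes out near 458; dividing the rounded table entries (4.61 / 0.01) gives a slightly higher figure.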
Interpreting log loss values
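- 0.0: perfect predictions (never achievable with real noise).
- 0.693 (≈ ln 2): what a binary classifier scores by always predicting 0.5, the no-information baseline on a balanced problem.
- Values near the base-rate entropy, -[r * log(r) + (1 - r) * log(1 - r)] for positive-class rate r: the model is barely better than always predicting the base rate.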
On Kaggle and industry benchmarks, improvement in the third decimal place often corresponds to meaningful improvements in downstream decisions.
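A quick way to get a reference point for a given dataset is to compute the log loss of a constant predictor that always outputs the positive-class rate. The sketch below assumes a binary problem; `base_rate_log_loss` is a hypothetical helper name.

```python
import math

def base_rate_log_loss(pos_rate):
    """Log loss of always predicting pos_rate for the positive class.

    This equals the entropy (in nats) of a Bernoulli(pos_rate) label.
    """
    return -(pos_rate * math.log(pos_rate)
             + (1 - pos_rate) * math.log(1 - pos_rate))

for rate in (0.5, 0.2, 0.05):
    print(f"positive rate {rate:.2f}: baseline log loss {base_rate_log_loss(rate):.4f}")
```

On imbalanced data the baseline sits well below 0.693, so a model scoring, say, 0.45 on a problem with a 20% positive rate is only modestly better than the constant predictor.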
Log loss vs. accuracy
Accuracy uses hard predictions; log loss uses probabilities. Two models that predict 0.51 and 0.99 for the same correct class score identically on accuracy, so a model can achieve high accuracy while being poorly calibrated. Log loss rewards calibration, which is essential for any downstream use of predicted probabilities such as risk scoring, expected-value computation, or threshold selection.
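To make the contrast concrete, the sketch below scores two hypothetical sets of predictions that produce identical hard labels at a 0.5 threshold. The data, the threshold, and the helper names are assumptions chosen for illustration.

```python
import math

def log_loss(y_true, p_pred):
    """Mean binary log loss (natural log); assumes 0 < p < 1."""
    return -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(y_true, p_pred)
    ) / len(y_true)

def accuracy(y_true, p_pred, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    hits = sum((p >= threshold) == bool(y) for y, p in zip(y_true, p_pred))
    return hits / len(y_true)

labels = [1, 1, 0, 0]
hesitant = [0.51, 0.51, 0.49, 0.49]   # barely on the right side
confident = [0.99, 0.99, 0.01, 0.01]  # strongly on the right side

for name, preds in (("hesitant", hesitant), ("confident", confident)):
    print(name, accuracy(labels, preds), round(log_loss(labels, preds), 4))
```

Both sets reach the same accuracy, while their log loss differs by well over an order of magnitude. Accuracy cannot tell the two apart; log loss can.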
How to improve log loss
- Calibrate probabilities post hoc using Platt scaling or isotonic regression.
- Avoid letting models output extreme probabilities (exactly 0 or 1), since a single such miss makes the loss infinite; regularisation and larger training sets help.
- Use temperature scaling on neural-network softmax outputs.
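As one illustration of the last point, temperature scaling divides the logits by a single scalar T before the softmax and picks the T that minimises log loss on held-out data. The sketch below is a simplification under stated assumptions: the validation logits and labels are made up, and a coarse grid search stands in for the gradient-based fit that is normally used.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def mean_nll(logit_rows, true_classes, temperature):
    """Mean log loss of the temperature-scaled softmax outputs."""
    losses = [
        -math.log(softmax(row, temperature)[k])
        for row, k in zip(logit_rows, true_classes)
    ]
    return sum(losses) / len(losses)

# Hypothetical validation set from an overconfident two-class model;
# the fourth sample is a confident miss.
val_logits = [[6.0, 0.0], [5.0, 0.0], [0.0, 6.0],
              [4.0, 0.0], [0.0, 5.0], [5.0, 0.0]]
val_labels = [0, 0, 1, 1, 1, 0]

# Assumed search grid: T from 0.5 to 5.0 in steps of 0.1.
candidates = [0.5 + 0.1 * i for i in range(46)]
best_t = min(candidates, key=lambda t: mean_nll(val_logits, val_labels, t))

print(f"log loss at T=1: {mean_nll(val_logits, val_labels, 1.0):.4f}")
print(f"best T: {best_t:.1f}, log loss: {mean_nll(val_logits, val_labels, best_t):.4f}")
```

A fitted T above 1 softens the probabilities. Because dividing by a positive scalar does not change which logit is largest, the hard predictions and therefore accuracy stay the same while log loss drops.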