What is model calibration and how do you measure and fix a poorly calibrated classifier?
A calibrated classifier outputs predicted probabilities that match observed frequencies: if it predicts 0.8 for 1,000 events, roughly 800 should actually occur. Calibration is measured with reliability diagrams (calibration curves) and the Expected Calibration Error (ECE). Poorly calibrated outputs are fixed with Platt scaling (logistic regression on scores) or isotonic regression (non-parametric), applied on a held-out calibration set after the main model is trained.
How to think about it
Define calibration, explain why common models are systematically miscalibrated, show how to measure and fix it.
What calibration means
A model is calibrated if, across all samples where it predicts probability p, the actual positive rate is also p. Formally, P(Y=1 | f(x) = p) = p for all p in [0,1].
Calibration is separate from discrimination, which is what AUC measures. A model can rank every positive above every negative (AUC = 1.0) and still be badly miscalibrated, for example if all of its predicted probabilities fall between 0.4 and 0.6 regardless of true risk.
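To make the distinction concrete, here is a minimal sketch on hypothetical toy data (the labels, scores, and the `auc` helper are made up for illustration). The scores are perfectly ordered but squeezed into the 0.4 to 0.6 range, so ranking is perfect while the probabilities themselves are far from the observed rates.

```python
import numpy as np

def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive with every negative; ties count as half.
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

# Hypothetical toy data: 5 negatives followed by 5 positives.
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
# Scores are perfectly ordered but squeezed into [0.40, 0.60].
squeezed = np.array([0.40, 0.42, 0.44, 0.46, 0.48,
                     0.52, 0.54, 0.56, 0.58, 0.60])

auc_value = auc(y, squeezed)                   # ranking quality
mean_pred_pos = squeezed[y == 1].mean()        # what the model says for positives
observed_pos_rate = y[squeezed > 0.5].mean()   # what actually happens there
```

On this toy set the ranking is perfect, yet the model reports about 0.56 on average for samples that are positive every time.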
Why models are miscalibrated
- Random forests tend to pull probabilities away from 0 and 1 (toward 0.2–0.8): the prediction is an average of tree votes, and it only reaches 0 or 1 when every tree agrees. They are typically under-confident.
- Gradient boosted trees (XGBoost, LightGBM) can become overconfident, pushing predictions too close to 0 and 1, particularly when they overfit (many boosting rounds, deep trees).
- SVMs produce decision values (signed distances to the margin), not probabilities, so a calibration map is required whenever you need probabilities. Platt scaling is the standard choice.
- Neural networks often output overconfident softmax probabilities, especially large networks trained until the training loss is close to zero.
Measuring calibration
Reliability diagram (calibration curve): bin predictions by predicted probability, compute actual positive rate per bin, plot observed vs. predicted. A perfectly calibrated model lies on the diagonal y = x.
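As a sketch of the binning step, the snippet below uses scikit-learn's `calibration_curve` on synthetic data. The data, the variable names, and the "overconfident" transformation are assumptions made for illustration, not part of any real model.

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)

# Hypothetical data: draw a true probability per sample, then a label from it.
p_true = rng.uniform(0.0, 1.0, size=20000)
y = (rng.uniform(size=p_true.size) < p_true).astype(int)

# A calibrated model would report p_true itself. This made-up overconfident
# model doubles the log-odds, pushing predictions toward 0 and 1.
p_over = p_true**2 / (p_true**2 + (1.0 - p_true) ** 2)

# calibration_curve returns (observed positive rate, mean predicted prob) per bin.
obs_cal, pred_cal = calibration_curve(y, p_true, n_bins=10, strategy="uniform")
obs_over, pred_over = calibration_curve(y, p_over, n_bins=10, strategy="uniform")

# Largest vertical distance from the diagonal y = x for each model.
gap_cal = float(np.max(np.abs(obs_cal - pred_cal)))
gap_over = float(np.max(np.abs(obs_over - pred_over)))
```

Plotting `pred_*` on the x-axis against `obs_*` on the y-axis, together with the diagonal, gives the reliability diagram. `strategy="quantile"` switches to equal-frequency bins.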
Expected Calibration Error (ECE):
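ECE = sum_b (|B_b| / N) * |acc(B_b) - conf(B_b)|
where the bins B_b are equal-width or equal-frequency intervals of predicted probability, |B_b| is the number of samples in bin b, N is the total number of samples, acc(B_b) is the observed fraction of positives in the bin, and conf(B_b) is the mean predicted probability in the bin. ECE is 0 for a perfectly calibrated model, and its value depends on the number of bins and the binning scheme, so report both.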
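The formula translates fairly directly into NumPy. The function below is a minimal sketch for the binary case with equal-width bins. The function name and the toy inputs are assumptions for illustration.

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Binary ECE with equal-width bins on the predicted positive probability."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges map each prediction to a bin index in 0..n_bins-1;
    # a prediction of exactly 1.0 lands in the last bin.
    bin_ids = np.digitize(y_prob, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue  # empty bins contribute nothing
        acc = y_true[mask].mean()    # observed fraction of positives
        conf = y_prob[mask].mean()   # mean predicted probability
        ece += mask.sum() / len(y_prob) * abs(acc - conf)
    return float(ece)

# Hypothetical toy example: two groups of four predictions each.
y_prob = np.array([0.85, 0.85, 0.85, 0.85, 0.15, 0.15, 0.15, 0.15])
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
ece = expected_calibration_error(y_true, y_prob)
```

In the toy example each group is off by 0.10 (0.75 observed against 0.85 predicted, and 0.25 against 0.15), and each holds half the samples, so the ECE is 0.10.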
Fixing calibration
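Platt scaling fits a one-dimensional logistic regression, P(y=1 | f) = 1 / (1 + exp(A*f + B)), on a held-out calibration set, with the raw model score f as the only input and A and B as the learned parameters. It is fast, has only two parameters to fit, and works well when the miscalibration is roughly sigmoid-shaped.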
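A minimal sketch of Platt scaling with scikit-learn's `LogisticRegression` follows. The calibration scores are synthetic, and a large `C` is used here as an assumption to make the default L2 penalty negligible. Note the sign convention: scikit-learn fits sigmoid(w*f + b), so A = -w and B = -b in the form above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical calibration set: raw scores f (e.g. SVM margins) and labels,
# generated so that the true relationship is P(y=1 | f) = sigmoid(2f - 1).
f_cal = rng.normal(0.0, 2.0, size=5000)
p_cal = 1.0 / (1.0 + np.exp(-(2.0 * f_cal - 1.0)))
y_cal = (rng.uniform(size=f_cal.size) < p_cal).astype(int)

# Large C makes the default L2 penalty negligible, so the fit is close to
# plain maximum likelihood on the two parameters.
platt = LogisticRegression(C=1e6)
platt.fit(f_cal.reshape(-1, 1), y_cal)

# Convert scikit-learn's sigmoid(w*f + b) into the 1 / (1 + exp(A*f + B)) form.
A = -platt.coef_[0, 0]
B = -platt.intercept_[0]

def platt_predict(f):
    """Calibrated probability for raw score(s) f."""
    return 1.0 / (1.0 + np.exp(A * np.asarray(f, dtype=float) + B))
```

With this synthetic data the fit should recover roughly A = -2 and B = 1. In practice `f_cal` would be the base model's scores on the held-out calibration set.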
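Isotonic regression fits a non-parametric, monotonically non-decreasing step function from scores to observed positive rates. It can correct any monotonic distortion, not just sigmoid-shaped ones, but it overfits on small calibration sets. A common rule of thumb is to prefer it only when roughly 1,000 or more calibration samples are available.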
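The sketch below fits scikit-learn's `IsotonicRegression` on synthetic scores whose distortion is deliberately not sigmoid-shaped (the true positive rate is the square of the score). The data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

# Hypothetical calibration set: scores in [0, 1] whose true positive rate
# is score**2, a monotonic distortion that is not sigmoid-shaped.
s_cal = rng.uniform(0.0, 1.0, size=20000)
y_cal = (rng.uniform(size=s_cal.size) < s_cal**2).astype(int)

# Clip predictions for scores outside the range seen during fitting,
# and keep outputs inside [0, 1].
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(s_cal, y_cal)

# Calibrated probabilities on a grid of raw scores.
grid = np.linspace(0.05, 0.95, 19)
calibrated = iso.predict(grid)
```

The fitted map is a step function, so distinct raw scores can receive the same calibrated probability. That can introduce ties in the ranking even though the order is never reversed.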
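Temperature scaling divides neural-network logits by a single learned scalar T > 0 before the softmax, with T fitted by minimizing negative log-likelihood on the calibration set. It is effective for overconfident networks and leaves accuracy unchanged, because dividing every logit by the same positive constant does not change the argmax.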
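Here is a minimal NumPy sketch of temperature scaling. For simplicity it picks T by a grid search over negative log-likelihood. A gradient-based optimizer is the more usual choice, and the grid, the helper names, and the synthetic logits are all assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels, T):
    """Mean negative log-likelihood of the labels at temperature T."""
    probs = softmax(logits / T)
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked + 1e-12)))

def fit_temperature(logits, labels, grid=None):
    """Pick the T on a grid that minimizes NLL on the calibration set."""
    if grid is None:
        grid = np.linspace(0.05, 10.0, 200)
    losses = [nll(logits, labels, T) for T in grid]
    return float(grid[int(np.argmin(losses))])

rng = np.random.default_rng(0)
# Hypothetical setup: labels are sampled from softmax(true_logits) using the
# Gumbel-max trick, and the "model" reports logits that are 3x too large.
true_logits = rng.normal(size=(5000, 3))
labels = np.argmax(true_logits + rng.gumbel(size=true_logits.shape), axis=1)
model_logits = 3.0 * true_logits

T = fit_temperature(model_logits, labels)
calibrated_probs = softmax(model_logits / T)
```

Because the synthetic model inflates its logits by a factor of 3, the fitted T should land near 3, and the predicted class for every sample is the same before and after scaling.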
Critical protocol
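Always fit the calibrator on a separate held-out set that was not used for training or for validation. Calibrating on the training data overfits: the model's scores on data it has already seen are more extreme than its scores on new data, so the learned mapping does not transfer. Reusing the validation set is also risky, because hyperparameters and early stopping were already tuned on it, so both the calibration map and any calibration metric computed there will look better than they are. Measure the final ECE on a test set that neither the model nor the calibrator has seen.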
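One way to enforce this protocol is to carve out all the splits up front. The helper below is a hypothetical sketch: the function name and the split fractions are assumptions, not a recommendation. It returns disjoint index sets for training, validation, calibration, and final testing.

```python
import numpy as np

def four_way_split(n_samples, fractions=(0.6, 0.15, 0.15, 0.10), seed=0):
    """Return disjoint index arrays: train, validation, calibration, test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    # Cut points for the first three parts; the remainder is the test set.
    cuts = np.round(np.cumsum(fractions[:-1]) * n_samples).astype(int)
    return np.split(idx, cuts)

train_idx, val_idx, cal_idx, test_idx = four_way_split(1000)

# Intended use of each split:
#   train_idx: fit the base model
#   val_idx:   choose hyperparameters / early stopping
#   cal_idx:   fit Platt, isotonic, or temperature scaling on model scores
#   test_idx:  report ECE and other metrics once, at the end
```

For imbalanced data a stratified split would be the safer choice, so that the calibration set still contains enough positives.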