Model Calibration
When your model says 70%, does the thing happen 70% of the time? A model can rank perfectly (great AUC) and still lie about its probabilities. Calibration is the fix.
What you'll learn
- Why a model with high AUC can have completely wrong probability outputs
- How to read a reliability diagram and spot overconfidence
- Expected Calibration Error (ECE) — how to put a number on miscalibration
- Platt scaling and isotonic regression — the two standard fixes
Before you start
A spam filter flags 100 emails as 90% likely spam. Your team ships it. A month later, a compliance audit shows that only 70 of those 100 were actually spam. The model ranked them correctly — they really were the most spam-like emails — but the probability it reported was wrong. You built decisions on a number that lied.
That gap between “ranks well” and “probabilities are truthful” is the calibration problem.
Two different questions, one model
Before fixing anything, separate the two things a classifier can be measured on:
Discrimination — can the model tell the junk apart from the good stuff? Does it put real spam near the top and real ham near the bottom? This is what AUC (Area Under the ROC Curve) measures. A model can discriminate perfectly (AUC = 1.0) without its probabilities meaning anything.
Calibration — when the model says “70% chance of spam,” does spam actually show up 70% of the time among all the emails it scores near 0.70? This is about the truthfulness of the probability number, not the ranking.
A model is calibrated when, among all the predictions it makes near probability p,
the event actually occurs a fraction p of the time. Said differently: if you
collected every email where the model said “probability between 0.65 and 0.75,” roughly
65–75% of them should be spam.
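To make the split between discrimination and calibration concrete, here is a minimal sketch on synthetic data. It assumes a hypothetical overconfident model whose scores are a sharpened version of honest probabilities (the factor 3 on the logit is an arbitrary choice for illustration). Because the sharpening is monotone, the ranking and therefore the AUC stay the same, while the reported probabilities drift away from the observed rates.

```python
import numpy as np

def auc(y_true, scores):
    """AUC as the chance that a random positive outranks a random negative."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive with every negative; ties count as half a win.
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

rng = np.random.default_rng(0)
p_true = rng.uniform(0.05, 0.95, size=2000)          # honest probabilities
y = (rng.uniform(size=2000) < p_true).astype(int)    # outcomes drawn from them

# Hypothetical overconfident model: same ordering, sharper probabilities.
logit = np.log(p_true / (1 - p_true))
p_sharp = 1 / (1 + np.exp(-3 * logit))

auc_honest = auc(y, p_true)
auc_sharp = auc(y, p_sharp)

# Among predictions the sharp model places above 0.9, how often is y = 1?
mask = p_sharp > 0.9
mean_pred = p_sharp[mask].mean()
obs_freq = y[mask].mean()

print(f"AUC honest={auc_honest:.3f}  AUC sharp={auc_sharp:.3f}")
print(f"sharp model, scores > 0.9: says {mean_pred:.2f}, observed {obs_freq:.2f}")
```

The two AUC values should agree, while the high-confidence slice of the sharpened model should show a stated probability well above the observed frequency.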
Why calibration matters
If you only care about rank order (top-10 results, AUC on a leaderboard), calibration is irrelevant. But probabilities are load-bearing in three real scenarios:
- Threshold decisions: “flag if probability exceeds 0.5.” If the model systematically overstates the positive class, too many borderline cases clear 0.5 and precision drops.
- Expected-value decisions: “approve this loan if P(repay) times the profit exceeds P(default) times the loss.” Wrong probabilities produce wrong expected values, which produce wrong approvals.
- Risk stacking: combining two models’ outputs assumes each probability is honest. Miscalibrated inputs produce compounded errors downstream.
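The expected-value case is easy to check with arithmetic. The sketch below uses made-up payoffs (a profit of 1,000 and a loss of 5,000 are assumptions, not figures from any real lender) and compares the decision under a reported probability of 0.90 with the decision under a true rate of 0.70.

```python
def expected_profit(p_repay, profit, loss):
    """Expected value of approving: P(repay) * profit - P(default) * loss."""
    return p_repay * profit - (1.0 - p_repay) * loss

# Hypothetical payoffs, chosen only for illustration.
profit, loss = 1_000.0, 5_000.0

reported = expected_profit(0.90, profit, loss)  # what the model claims
actual = expected_profit(0.70, profit, loss)    # what the data delivers

approve_on_reported = reported > 0
approve_on_actual = actual > 0

print(f"reported EV: {reported:.0f}, approve: {approve_on_reported}")
print(f"actual EV:   {actual:.0f}, approve: {approve_on_actual}")
```

With these numbers the reported probability gives an expected value of about +400 and an approval, while the true rate gives about -800 and a rejection. The ranking of applicants is untouched; only the probability changed, and the decision flipped.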
The reliability diagram
The standard visual for calibration is a reliability diagram (also called a calibration plot). Here is how to read one:
- Take all predictions on a held-out set.
- Sort them into bins by predicted probability — say, ten equal-width bins from 0 to 1.
- In each bin, compute two numbers: the mean predicted probability (what the model said, on average) and the observed frequency (what fraction of samples in that bin actually had the positive outcome).
- Plot mean-predicted on the x-axis, observed-frequency on the y-axis.
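The steps above can be sketched in a few lines of NumPy. This is one possible implementation, not a reference one: the function name `reliability_table` and the tiny six-sample input are illustrative assumptions, and empty bins are simply skipped.

```python
import numpy as np

def reliability_table(y_true, y_prob, n_bins=10):
    """Return (mean_predicted, observed_frequency, count) per non-empty bin."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Only interior edges are passed, so ids run from 0 to n_bins - 1
    # and a prediction of exactly 1.0 falls in the last bin.
    bin_ids = np.digitize(y_prob, edges[1:-1])
    rows = []
    for b in range(n_bins):
        in_bin = bin_ids == b
        if in_bin.any():
            rows.append((y_prob[in_bin].mean(),
                         y_true[in_bin].mean(),
                         int(in_bin.sum())))
    return rows

# Toy data: two low-score negatives, four high-score emails of which three are spam.
y_prob = [0.12, 0.18, 0.92, 0.94, 0.96, 0.98]
y_true = [0, 0, 1, 1, 0, 1]

rows = reliability_table(y_true, y_prob)
for mean_pred, obs_freq, count in rows:
    print(f"mean predicted {mean_pred:.2f}  observed {obs_freq:.2f}  n={count}")
```

Each row is one point on the diagram: the first value is the x-coordinate and the second is the y-coordinate. In the toy data the high-score bin says 0.95 on average but delivers 0.75, so its point falls below the diagonal.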
A perfectly calibrated model lies on the diagonal — every point sits at (p, p)
because the model said p and the true rate was p. Overconfidence pushes the
curve below the diagonal: the model claims 0.80 but the true rate is only 0.50.
Reliability diagram. The dashed green diagonal is perfect calibration. The red curve bows below it — this model is overconfident: it says 0.75 but the true rate is only 0.50.
A worked example: 100 spam predictions
You have a batch of 100 emails the model scored between 0.85 and 0.95 (call the bin “near 90% confidence”). Mean predicted probability in that bin: 0.90. Among those 100 emails, 70 were actually spam. Observed frequency: 0.70.
Gap for that bin: |0.90 - 0.70| = 0.20. The model is 20 percentage points overconfident
in this range. If you told a compliance officer “these are 90% spam,” you were lying
by 20 points.
Expected Calibration Error (ECE)
ECE aggregates the per-bin gap into one number. For each bin b, you multiply the bin’s
gap by the fraction of samples in that bin, then sum:
ECE = sum over bins of (n_b / n_total) * |mean_predicted_b - obs_freq_b|
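ECE of 0 means perfect calibration. ECE of 0.20 means the model is off by 0.20 on average, weighted by how many predictions fall in each bin. Lower is better, but the value depends on the number and placement of bins, so compare ECE only between models that were binned the same way.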
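As a rough sketch, the formula translates directly into NumPy. The example reuses the worked bin (100 predictions at 0.90, 70 positives) and adds a second, well-calibrated group at 0.10 that is an assumption made only so the weighting is visible. Ten equal-width bins is also an assumption; other binning choices give different values.

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Weighted average of |mean predicted - observed frequency| over bins."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(y_prob, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if not in_bin.any():
            continue  # empty bins contribute nothing to the sum
        gap = abs(y_prob[in_bin].mean() - y_true[in_bin].mean())
        ece += in_bin.mean() * gap  # in_bin.mean() equals n_b / n_total
    return ece

# 100 predictions at 0.90 (70 positive) and 100 at 0.10 (10 positive).
y_prob = np.concatenate([np.full(100, 0.90), np.full(100, 0.10)])
y_true = np.concatenate([
    np.r_[np.ones(70), np.zeros(30)],
    np.r_[np.ones(10), np.zeros(90)],
])

ece = expected_calibration_error(y_true, y_prob)
print(f"ECE = {ece:.3f}")
```

Half the samples sit in a bin with a gap of 0.20 and half in a bin with a gap of 0, so the result is 0.5 × 0.20 + 0.5 × 0 = 0.10.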
Which models miscalibrate, and how?
- Modern neural networks are systematically overconfident. The probabilities they output after a softmax are sharper than the data warrants — they peak too high.
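- Random forests tend to pull probabilities away from the extremes: averaging many trees rarely produces a score near 0 or 1, so they often look underconfident. Gradient boosting can err in either direction, depending on the loss function and the number of boosting rounds.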
- SVMs without Platt scaling output scores, not probabilities at all — the raw output has no calibration guarantee by definition.
- Logistic regression is usually well-calibrated on the training distribution but can drift if the test distribution shifts.
The two standard fixes
Platt scaling fits a logistic regression on top of the raw model scores using a small held-out calibration set (separate from both training and test). The logistic applies a monotone S-curve that maps overconfident scores into honest probabilities. It works well when miscalibration is a smooth sigmoid-shaped distortion.
Isotonic regression fits a piecewise-constant monotone function instead of the smooth logistic. It is more flexible and handles asymmetric or multi-modal distortions that Platt scaling cannot, but it needs more calibration data (rule of thumb: at least 1000 samples) to avoid overfitting.
Both methods are available in scikit-learn as CalibratedClassifierCV with
method='sigmoid' (Platt) or method='isotonic'.
The workflow is always:
- Train your base model on the training set.
- On a separate held-out calibration set, fit Platt or isotonic on top.
- Evaluate the final calibrated probabilities on your test set.
Never calibrate on the training set: the model's scores look better there than they will on new data, so the calibrator learns a correction that does not carry over.
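The workflow can be sketched end to end with scikit-learn's building blocks. This is an illustration under stated assumptions, not the definitive recipe: the base model is replaced by a synthetic score generator (`make_split` is a hypothetical helper whose raw logit is three times the honest one), Platt scaling is written out as a one-feature `LogisticRegression`, and the `ece` helper is a compact ten-bin version used only for the comparison.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_split(n):
    """Synthetic stand-in for a trained model: overconfident raw logits."""
    p_true = rng.uniform(0.05, 0.95, size=n)
    y = (rng.uniform(size=n) < p_true).astype(int)
    raw_logit = 3.0 * np.log(p_true / (1 - p_true))
    return raw_logit, y

def ece(y, p, n_bins=10):
    """Compact ECE with equal-width bins; empty bins are skipped."""
    ids = np.minimum((p * n_bins).astype(int), n_bins - 1)
    return sum(np.mean(ids == b) * abs(p[ids == b].mean() - y[ids == b].mean())
               for b in range(n_bins) if np.any(ids == b))

s_cal, y_cal = make_split(5000)     # held-out calibration set
s_test, y_test = make_split(5000)   # test set, never used for fitting

# Platt scaling: logistic regression with the raw score as its only feature.
platt = LogisticRegression(C=1e6)   # large C, so regularisation is negligible
platt.fit(s_cal.reshape(-1, 1), y_cal)
p_platt = platt.predict_proba(s_test.reshape(-1, 1))[:, 1]

# Isotonic regression: monotone step function from score to probability.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(s_cal, y_cal)
p_iso = iso.predict(s_test)

# Evaluate all three sets of probabilities on the untouched test set.
p_raw = 1 / (1 + np.exp(-s_test))
ece_raw, ece_platt, ece_iso = ece(y_test, p_raw), ece(y_test, p_platt), ece(y_test, p_iso)
print(f"ECE raw={ece_raw:.3f}  Platt={ece_platt:.3f}  isotonic={ece_iso:.3f}")
```

On this synthetic distortion, which is exactly sigmoid-shaped, both calibrators should cut the ECE substantially. Fitting the two calibrators by hand keeps the three-way split explicit; `CalibratedClassifierCV` packages the same idea around a real estimator.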
CodePlayground: bin predictions and compute ECE
The playground builds a synthetic overconfident classifier — 300 samples, four bins — and computes the reliability table and ECE from scratch using only numpy.
Every row shows the model claiming a higher number than the data delivers — this is textbook overconfidence. The ECE of 0.257 means the model is wrong by roughly 26 percentage points on average across bins. Platt scaling on a held-out set would bring this number close to zero without retraining the base model.
Next
Calibrated probabilities are the foundation for decision-curve analysis, which shows whether acting on a model’s predictions is actually better than a fixed policy — the natural next step once you trust the numbers.
Practice this in an interview
A calibrated classifier outputs predicted probabilities that match observed frequencies: if it predicts 0.8 for 1,000 events, roughly 800 should actually occur. Calibration is measured with reliability diagrams (calibration curves) and the Expected Calibration Error (ECE). Poorly calibrated outputs are fixed with Platt scaling (logistic regression on scores) or isotonic regression (non-parametric), applied on a held-out calibration set after the main model is trained.