datarekha

Model Calibration

When your model says 70%, does the thing happen 70% of the time? A model can rank perfectly (great AUC) and still lie about its probabilities. Calibration is the fix.

8 min read Advanced Machine Learning Lesson 16 of 17

What you'll learn

  • Why a model with high AUC can have completely wrong probability outputs
  • How to read a reliability diagram and spot overconfidence
  • Expected Calibration Error (ECE) — how to put a number on miscalibration
  • Platt scaling and isotonic regression — the two standard fixes

Before you start

A spam filter flags 100 emails as 90% likely spam. Your team ships it. A month later, a compliance audit shows that only 70 of those 100 were actually spam. The model ranked them correctly — they really were the most spam-like emails — but the probability it reported was wrong. You built decisions on a number that lied.

That gap between “ranks well” and “probabilities are truthful” is the calibration problem.

Two different questions, one model

Before fixing anything, separate the two things a classifier can be measured on:

Discrimination — can the model tell the junk apart from the good stuff? Does it put real spam near the top and real ham near the bottom? This is what AUC (Area Under the ROC Curve) measures. A model can discriminate perfectly (AUC = 1.0) without its probabilities meaning anything.

Calibration — when the model says “70% chance of spam,” does spam actually show up 70% of the time among all the emails it scores near 0.70? This is about the truthfulness of the probability number, not the ranking.

A model is calibrated when, among all the predictions it makes near probability p, the event actually occurs a fraction p of the time. Said differently: if you collected every email where the model said “probability between 0.65 and 0.75,” roughly 70% of them should be spam. More precisely, the spam rate in that group should match the group's mean predicted probability.
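To make the split between discrimination and calibration concrete, here is a minimal sketch on ten made-up emails. The data, the helper `auc`, and the fourth-root distortion are all illustrative assumptions. The point is only that a monotone reshaping of the scores leaves the ranking (and so the AUC) untouched while moving the probabilities far from the observed rate.

```python
def auc(scores, labels):
    """Chance that a random positive outranks a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: 1 = spam, 0 = ham.
labels = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]
original = [0.1, 0.2, 0.3, 0.4, 0.5, 0.5, 0.6, 0.7, 0.8, 0.9]

# A monotone distortion: the ordering is identical, the values are inflated.
inflated = [p ** 0.25 for p in original]

auc_original = auc(original, labels)
auc_inflated = auc(inflated, labels)

base_rate = sum(labels) / len(labels)           # observed spam rate
mean_original = sum(original) / len(original)   # average claimed probability
mean_inflated = sum(inflated) / len(inflated)

print(f"AUC original={auc_original:.2f}  inflated={auc_inflated:.2f}")
print(f"base rate={base_rate:.2f}  mean original={mean_original:.2f}  "
      f"mean inflated={mean_inflated:.2f}")
```

Both score lists give the same AUC, but the inflated scores average well above the actual spam rate. Matching the base rate on average is only a coarse check; the reliability diagram looks at the same comparison bin by bin.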

Why calibration matters

If you only care about rank order (top-10 results, AUC on a leaderboard), calibration is irrelevant. But probabilities are load-bearing in three real scenarios:

  1. Threshold decisions: “flag if probability exceeds 0.5.” If the model is systematically overconfident, too many borderline cases clear the threshold and precision drops.
  2. Expected-value decisions: “approve this loan if P(repay) times the profit exceeds P(default) times the loss.” Wrong probabilities produce wrong expected values, which produce wrong approvals.
  3. Risk stacking: combining two models’ outputs assumes each probability is honest. Miscalibrated inputs produce compounded errors downstream.
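The expected-value case can be sketched in a few lines. The profit and loss amounts and the two probabilities below are hypothetical, chosen only to show how the same rule flips its answer when the probability is overstated.

```python
def approve(p_repay, profit, loss):
    """Approve when P(repay) * profit exceeds P(default) * loss."""
    p_default = 1.0 - p_repay
    return p_repay * profit > p_default * loss

# Hypothetical numbers: earn 200 if the loan is repaid, lose 1000 on default.
profit, loss = 200.0, 1000.0

claimed = 0.90  # what an overconfident model reports
actual = 0.70   # the true repayment rate for applicants scored like this

decision_claimed = approve(claimed, profit, loss)  # 180 vs 100
decision_actual = approve(actual, profit, loss)    # 140 vs 300

# Probability at which the two sides of the rule are equal.
break_even = loss / (profit + loss)

print(decision_claimed, decision_actual, round(break_even, 3))
```

With these numbers the break-even probability is about 0.833, so a model that reports 0.90 where the truth is 0.70 approves a loan that loses money in expectation.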

The reliability diagram

The standard visual for calibration is a reliability diagram (also called a calibration plot). Here is how to read one:

  1. Take all predictions on a held-out set.
  2. Sort them into bins by predicted probability — say, ten equal-width bins from 0 to 1.
  3. In each bin, compute two numbers: the mean predicted probability (what the model said, on average) and the observed frequency (what fraction of samples in that bin actually had the positive outcome).
  4. Plot mean-predicted on the x-axis, observed-frequency on the y-axis.

A perfectly calibrated model lies on the diagonal: every point sits at (p, p) because the model said p and the true rate was p. Overconfidence at the high end pushes the curve below the diagonal: the model claims 0.75 but the true rate is only 0.50.

[Figure: reliability diagram. x-axis: mean predicted probability; y-axis: observed frequency; curves: perfect (diagonal) and model. Annotation: model claims 0.75, true rate 0.50.]

Reliability diagram. The dashed green diagonal is perfect calibration. The red curve bows below it — this model is overconfident: it says 0.75 but the true rate is only 0.50.
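The binning procedure behind the diagram can be sketched in plain Python. The function name `reliability_table` and the twelve predictions are illustrative assumptions; the example uses four equal-width bins to keep the output short.

```python
def reliability_table(probs, labels, n_bins=10):
    """One row per non-empty equal-width bin:
    (bin_low, bin_high, count, mean_predicted, observed_frequency)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))

    rows = []
    for i, members in enumerate(bins):
        if not members:
            continue  # empty bins contribute no point to the plot
        n = len(members)
        mean_pred = sum(p for p, _ in members) / n
        obs_freq = sum(y for _, y in members) / n
        rows.append((i / n_bins, (i + 1) / n_bins, n, mean_pred, obs_freq))
    return rows

# Hypothetical held-out predictions and outcomes (1 = positive).
probs = [0.05, 0.15, 0.12, 0.55, 0.58, 0.52, 0.56, 0.91, 0.95, 0.93, 0.97, 0.92]
labels = [0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0]

table = reliability_table(probs, labels, n_bins=4)
for low, high, n, mean_pred, obs_freq in table:
    print(f"[{low:.2f}, {high:.2f})  n={n}  predicted={mean_pred:.3f}  observed={obs_freq:.3f}")
```

Plotting the `mean_pred` column against the `obs_freq` column gives the reliability curve. In this toy data the top bin claims about 0.94 while only 0.60 of its samples are positive, so its point sits below the diagonal.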

A worked example: 100 spam predictions

You have a batch of 100 emails the model scored between 0.85 and 0.95 (call the bin “near 90% confidence”). Mean predicted probability in that bin: 0.90. Among those 100 emails, 70 were actually spam. Observed frequency: 0.70.

Gap for that bin: |0.90 - 0.70| = 0.20. The model is 20 percentage points overconfident in this range. If you told a compliance officer “these are 90% spam,” you were lying by 20 points.

Expected Calibration Error (ECE)

ECE aggregates the per-bin gap into one number. For each bin b, you multiply the bin’s gap by the fraction of samples in that bin, then sum:

ECE = sum over bins of  (n_b / n_total) * |mean_predicted_b - obs_freq_b|

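ECE of 0 means perfect calibration on the chosen bins. ECE of 0.20 means predicted probability and observed frequency differ by 0.20 on average, weighted by how many predictions fall in each bin. Lower is better, but the value depends on the binning scheme, and ECE says nothing about discrimination: a model that always predicts the base rate scores close to 0.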
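As a sketch of the formula, the function below takes per-bin summaries and returns the weighted gap. The first bin is the one from the worked example; the other two bins are hypothetical and are there only to show how the weighting behaves.

```python
def expected_calibration_error(bin_rows):
    """bin_rows: iterable of (count, mean_predicted, observed_frequency)."""
    rows = list(bin_rows)
    total = sum(n for n, _, _ in rows)
    return sum((n / total) * abs(mean_pred - obs) for n, mean_pred, obs in rows)

bins = [
    (100, 0.90, 0.70),  # worked example: 100 emails, claimed 0.90, observed 0.70
    (300, 0.50, 0.45),  # hypothetical bin
    (600, 0.10, 0.08),  # hypothetical bin
]

ece = expected_calibration_error(bins)
# 0.1 * 0.20 + 0.3 * 0.05 + 0.6 * 0.02 = 0.047
print(round(ece, 3))

# With only the worked-example bin, ECE equals that bin's gap.
ece_single = expected_calibration_error([(100, 0.90, 0.70)])
```

The badly calibrated bin has a gap of 0.20 but holds only a tenth of the samples, so it contributes 0.02 to the total. A single summary number can hide a bin like that, which is one reason to look at the per-bin table as well.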

Which models miscalibrate, and how?

  • Modern neural networks are systematically overconfident. The probabilities they output after a softmax are sharper than the data warrants — they peak too high.
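  • Random forests tend to pull probabilities away from 0 and 1, because averaging many trees rarely produces a unanimous vote. Boosted trees are also often miscalibrated, and the direction depends on the loss function and how long they are trained.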
  • SVMs without Platt scaling output scores, not probabilities at all — the raw output has no calibration guarantee by definition.
  • Logistic regression is usually well-calibrated on the training distribution but can drift if the test distribution shifts.

The two standard fixes

Platt scaling fits a logistic regression on top of the raw model scores using a small held-out calibration set (separate from both training and test). The logistic applies a monotone S-curve that maps overconfident scores into honest probabilities. It works well when miscalibration is a smooth sigmoid-shaped distortion.
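A from-scratch sketch of the idea follows. It assumes the raw scores are logits (pre-sigmoid values) and fits the two parameters with plain gradient descent, which is a simplification of what a library would do. The function names and the 30-sample calibration set are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_platt(scores, labels, lr=0.1, steps=2000):
    """Fit p = sigmoid(a * score + b) by gradient descent on mean log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = sigmoid(a * s + b)
            grad_a += (p - y) * s
            grad_b += (p - y)
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def apply_platt(score, a, b):
    return sigmoid(a * score + b)

# Hypothetical calibration set: three score levels, ten samples each.
# At score +3 the raw sigmoid claims about 0.95, but only 7 of 10 are positive.
scores = [3.0] * 10 + [0.0] * 10 + [-3.0] * 10
labels = [1] * 7 + [0] * 3 + [1] * 5 + [0] * 5 + [1] * 3 + [0] * 7

a, b = fit_platt(scores, labels)
raw_high = sigmoid(3.0)
calibrated_high = apply_platt(3.0, a, b)
calibrated_low = apply_platt(-3.0, a, b)
print(f"a={a:.3f} b={b:.3f} raw={raw_high:.3f} calibrated={calibrated_high:.3f}")
```

The fitted slope comes out well below 1, which flattens the S-curve and pulls the 0.95 claim down toward the 0.70 that the calibration data supports. The ordering of the scores does not change, so the AUC is unaffected.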

Isotonic regression fits a piecewise-constant monotone function instead of the smooth logistic. It is more flexible and handles asymmetric or multi-modal distortions that Platt scaling cannot, but it needs more calibration data (rule of thumb: at least 1000 samples) to avoid overfitting.
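The piecewise-constant fit can be sketched with the pool-adjacent-violators approach: walk through the samples in score order and merge neighbouring groups whenever a later group has a lower positive rate than the one before it. The function name and the nine samples are hypothetical, and the sketch assumes all scores are distinct.

```python
def isotonic_fit(scores, labels):
    """Pool adjacent violators. Returns (sorted_scores, fitted_values),
    where fitted_values is non-decreasing. Assumes distinct scores."""
    pairs = sorted(zip(scores, labels))
    blocks = []  # each block is [sum_of_labels, count]
    for _, y in pairs:
        blocks.append([float(y), 1])
        # Merge while the previous block's mean exceeds the newest block's mean.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c

    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)  # every sample in a block gets the block mean
    return [x for x, _ in pairs], fitted

# Hypothetical calibration data.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 1, 0, 0, 1, 1, 0, 1]

xs, fitted = isotonic_fit(scores, labels)
for x, f in zip(xs, fitted):
    print(f"score={x:.1f}  calibrated={f:.3f}")
```

The result is a staircase: flat steps at 0, 1/3, 2/3 and 1 for this data. With so few samples each step rests on only a handful of outcomes, which illustrates why isotonic regression needs a larger calibration set than Platt scaling.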

Both methods are available in scikit-learn as CalibratedClassifierCV with method='sigmoid' (Platt) or method='isotonic'.

The workflow is always:

  1. Train your base model on the training set.
  2. On a separate held-out calibration set, fit Platt or isotonic on top.
  3. Evaluate the final calibrated probabilities on your test set.

Never calibrate on the training set: the base model's scores there are more confident and more accurate than they will be on new data, so the calibrator learns a mapping that does not transfer.

CodePlayground: bin predictions and compute ECE

The playground builds a synthetic overconfident classifier — 300 samples, four bins — and computes the reliability table and ECE from scratch using only numpy.

Every row shows the model claiming a higher number than the data delivers, which is textbook overconfidence. The ECE of 0.257 means the predicted probability is off by roughly 26 percentage points on average, weighted by bin size. Platt scaling fitted on a held-out set would typically shrink this number substantially without retraining the base model.

Quick check

Q1. A logistic regression outputs probabilities. AUC on the test set is 0.91. You then notice ECE is 0.28. What does this tell you?
Q2. You bin 500 test predictions and find that among the 80 predictions in the [0.70, 0.80) bin, only 48 events actually occurred. What is the observed frequency and the bin's calibration gap?
Q3. A hospital uses a sepsis-risk model to decide when to escalate care. The model has AUC 0.88 but ECE 0.31. A clinician asks: 'If the model says 80% sepsis risk, should I treat it as an 80% probability?' You deploy Platt scaling using a separate 1,200-patient calibration set and ECE drops to 0.04. Which statement best describes what changed, and what did NOT change?

Next

Calibrated probabilities are the foundation for decision-curve analysis, which shows whether acting on a model’s predictions is actually better than a fixed policy — the natural next step once you trust the numbers.

Practice this in an interview

What is model calibration and how do you measure and fix a poorly calibrated classifier?

A calibrated classifier outputs predicted probabilities that match observed frequencies: if it predicts 0.8 for 1,000 events, roughly 800 should actually occur. Calibration is measured with reliability diagrams (calibration curves) and the Expected Calibration Error (ECE). Poorly calibrated outputs are fixed with Platt scaling (logistic regression on scores) or isotonic regression (non-parametric), applied on a held-out calibration set after the main model is trained.

What is the accuracy paradox and how does it expose the failure of accuracy as a metric?

The accuracy paradox occurs when a trivial model — one that always predicts the majority class — achieves high accuracy on an imbalanced dataset despite having zero predictive power for the minority class. A model that predicts 'not fraud' on every transaction achieves 99.9% accuracy if fraud is 0.1% of the data, but its recall for fraud is zero. Accuracy is only meaningful when classes are roughly balanced.

What are the key regression metrics — MAE, RMSE, MAPE, R² — and what are their failure modes?

MAE, RMSE, MAPE, and R² each measure a different aspect of regression quality and each has a regime where it misleads. RMSE is dominated by outliers; MAE is robust but hides large-error tails; MAPE is undefined at zero and asymmetrically penalises under-prediction; R² can appear high even when absolute errors are large, and can be negative, yet is still commonly misread as a percentage-correct. Choosing the right metric requires knowing the cost structure of the prediction task.

How does CI/CD for ML differ from standard software CI/CD, and what stages should an ML pipeline include?

ML CI/CD must validate not just code correctness but also model quality — automated retraining triggers, data validation, model evaluation gates, and canary deployment checks that standard software pipelines have no equivalent for. A regression in model AUC is as much a deployment failure as a 500 error.
