What is model calibration and how do you measure and fix a poorly calibrated classifier?

A calibrated classifier outputs predicted probabilities that match observed frequencies: if it predicts 0.8 for 1,000 events, roughly 800 should actually occur. Calibration is measured with reliability diagrams (calibration curves) and the Expected Calibration Error (ECE). Poorly calibrated outputs are fixed with Platt scaling (logistic regression on scores) or isotonic regression (non-parametric), applied on a held-out calibration set after the main model is trained.

What is the accuracy paradox and how does it expose the failure of accuracy as a metric?

The accuracy paradox occurs when a trivial model — one that always predicts the majority class — achieves high accuracy on an imbalanced dataset despite having zero predictive power for the minority class. A model that predicts 'not fraud' on every transaction achieves 99.9% accuracy if fraud is 0.1% of the data, but its recall for fraud is zero. Accuracy is only meaningful when classes are roughly balanced.

What are the key regression metrics — MAE, RMSE, MAPE, R² — and what are their failure modes?

MAE, RMSE, MAPE, and R² each measure a different aspect of regression quality and each has a regime where it misleads. RMSE is dominated by outliers; MAE is robust but hides large-error tails; MAPE is undefined at zero and asymmetrically penalises under-prediction; R² can appear high even when absolute errors are large, and can be negative, yet is still commonly misread as a percentage-correct. Choosing the right metric requires knowing the cost structure of the prediction task.

How do you approach anomaly detection, and why is accuracy a bad metric for it?

Anomaly detection finds rare points that deviate from normal patterns, using statistical, distance, density, or model-based methods like isolation forest and one-class SVM, often trained mostly on normal data. Accuracy is misleading because anomalies are extremely rare, so a model that predicts 'normal' for everything scores high accuracy while catching nothing. Use precision, recall, F1, PR-AUC, or ROC-AUC instead, chosen by the cost of false positives vs false negatives.

Model Calibration — Machine Learning

A spam filter flags 100 emails as 90% likely spam. Your team ships it. A month later, a compliance audit shows that only 70 of those 100 were actually spam. The model ranked them correctly — they really were the most spam-like emails — but the probability it reported was wrong. You built decisions on a number that lied.

That gap between “ranks well” and “probabilities are truthful” is the calibration problem.

Two different questions, one model

Before fixing anything, separate the two things a classifier can be measured on:

Discrimination — can the model tell the junk apart from the good stuff? Does it put real spam near the top and real ham near the bottom? This is what AUC (Area Under the ROC Curve) measures. A model can discriminate perfectly (AUC = 1.0) without its probabilities meaning anything.

Calibration — when the model says “70% chance of spam,” does spam actually show up 70% of the time among all the emails it scores near 0.70? This is about the truthfulness of the probability number, not the ranking.

A model is calibrated when, among all the predictions it makes near probability p, the event actually occurs a fraction p of the time. Said differently: if you collected every email where the model said “probability between 0.65 and 0.75,” roughly 70% of them (the bin's mean predicted probability) should be spam.
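To see the two questions come apart, here is a minimal sketch on ten made-up emails (the scores and labels are hypothetical, chosen only for illustration). It applies a monotone distortion to the scores: the ranking, and therefore AUC, stays the same, while the average predicted probability drifts far from the observed spam rate. Comparing the mean prediction to the base rate is only a coarse calibration check, not a full reliability analysis.

```python
def auc(scores, labels):
    """Chance that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: 1 = spam, 0 = ham.
labels = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]
honest = [0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.65, 0.7, 0.8, 0.9]

# Monotone distortion: the fourth root pushes every score toward 1
# but never changes which of two scores is larger.
inflated = [s ** 0.25 for s in honest]

auc_honest = auc(honest, labels)
auc_inflated = auc(inflated, labels)
base_rate = sum(labels) / len(labels)
mean_honest = sum(honest) / len(honest)
mean_inflated = sum(inflated) / len(inflated)

print("AUC honest / inflated:", auc_honest, auc_inflated)
print("base rate:", base_rate)
print("mean predicted honest / inflated:",
      round(mean_honest, 2), round(mean_inflated, 2))
```

Both versions rank 22 of the 25 spam/ham pairs correctly, so AUC is 0.88 either way. Only the honest scores average out to the 0.5 spam rate; the inflated ones average roughly 0.81.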

Why calibration matters

If you only care about rank order (top-10 results, AUC on a leaderboard), calibration is irrelevant. But probabilities are load-bearing in three real scenarios:

Threshold decisions: “flag if probability exceeds 0.5.” If the model systematically overstates its probabilities, too many borderline cases clear the 0.5 threshold and precision drops.
Expected-value decisions — “approve this loan if P(repay) times the profit exceeds P(default) times the loss.” Wrong probabilities produce wrong expected values, which produce wrong approvals.
Risk stacking — combining two models’ outputs assumes each probability is honest. Miscalibrated inputs produce compounded errors downstream.
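The expected-value case is easy to check with arithmetic. The sketch below uses made-up numbers (a profit of 1,000 on a repaid loan, a loss of 5,000 on a default) and the same 0.90-versus-0.70 gap as the spam story; the function name and constants are assumptions for illustration only.

```python
def expected_value(p_repay, profit, loss):
    """Expected value of approving: earn `profit` if repaid, lose `loss` on default."""
    return p_repay * profit - (1.0 - p_repay) * loss

PROFIT = 1_000.0  # hypothetical profit on a repaid loan
LOSS = 5_000.0    # hypothetical loss on a default

# Probability of repayment above which approving has positive expected value.
break_even = LOSS / (PROFIT + LOSS)

ev_reported = expected_value(0.90, PROFIT, LOSS)  # what the model claims
ev_actual = expected_value(0.70, PROFIT, LOSS)    # what the true rate implies

print("break-even P(repay):", round(break_even, 3))
print("EV at reported 0.90:", round(ev_reported, 2))
print("EV at actual   0.70:", round(ev_actual, 2))
```

With these numbers the break-even probability is about 0.833. The reported 0.90 sits above it (expected value around +400 per loan), while the true 0.70 sits below it (around -800 per loan), so the miscalibrated probability flips the decision.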

The reliability diagram

The standard visual for calibration is a reliability diagram (also called a calibration plot). Here is how to read one:

Take all predictions on a held-out set.
Sort them into bins by predicted probability — say, ten equal-width bins from 0 to 1.
In each bin, compute two numbers: the mean predicted probability (what the model said, on average) and the observed frequency (what fraction of samples in that bin actually had the positive outcome).
Plot mean-predicted on the x-axis, observed-frequency on the y-axis.

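A perfectly calibrated model lies on the diagonal: every point sits at (p, p) because the model said p and the true rate was p. A curve below the diagonal means the model overstates the probability (overconfidence in the positive class): it claims 0.80 but the true rate is only 0.50. A curve above the diagonal means the opposite: the event happens more often than the model says.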

Reliability diagram. The dashed green diagonal is perfect calibration. The red curve bows below it — this model is overconfident: it says 0.75 but the true rate is only 0.50.
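The binning steps described above can be sketched in plain Python. This is a minimal illustration, assuming equal-width bins and a hypothetical helper named reliability_table; the ten predictions are made up so the result can be checked by hand.

```python
def reliability_table(probs, labels, n_bins=10):
    """Return (bin_lo, bin_hi, count, mean_predicted, observed_freq) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # Equal-width bins; a prediction of exactly 1.0 goes in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    rows = []
    for i, items in enumerate(bins):
        if not items:
            continue  # skip empty bins, they contribute no point to the plot
        n = len(items)
        mean_pred = sum(p for p, _ in items) / n
        obs_freq = sum(y for _, y in items) / n
        rows.append((i / n_bins, (i + 1) / n_bins, n, mean_pred, obs_freq))
    return rows

# Hypothetical held-out predictions: five near 0.25, five near 0.85.
probs = [0.25] * 5 + [0.85] * 5
labels = [1, 0, 0, 0, 0] + [1, 1, 1, 0, 0]

rows = reliability_table(probs, labels)
for lo, hi, n, mean_pred, obs_freq in rows:
    print(f"[{lo:.1f}, {hi:.1f})  n={n}  x={mean_pred:.2f}  y={obs_freq:.2f}")
```

Each row is one point on the diagram: x is the mean prediction, y is the observed frequency. Here both points fall below the diagonal (0.25 predicted against 0.20 observed, 0.85 against 0.60).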

A worked example: 100 spam predictions

You have a batch of 100 emails the model scored between 0.85 and 0.95 (call the bin “near 90% confidence”). Mean predicted probability in that bin: 0.90. Among those 100 emails, 70 were actually spam. Observed frequency: 0.70.

Gap for that bin: |0.90 - 0.70| = 0.20. The model is 20 percentage points overconfident in this range. If you told a compliance officer “these are 90% spam,” you were lying by 20 points.

Expected Calibration Error (ECE)

ECE aggregates the per-bin gap into one number. For each bin b, you multiply the bin’s gap by the fraction of samples in that bin, then sum:

ECE = sum over bins of  (n_b / n_total) * |mean_predicted_b - obs_freq_b|

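ECE of 0 means the predicted and observed rates agree in every bin. ECE of 0.20 means the model is off by 0.20 on average, weighted by how many predictions fall in each bin. Lower is better, with two caveats: the value depends on how you choose the bins, and ECE says nothing about discrimination, so read it alongside AUC rather than instead of it.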
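The weighting in the formula is easiest to see on per-bin summaries. The sketch below assumes a hypothetical helper, ece_from_bins, that takes (count, mean predicted, observed frequency) tuples. It reuses the spam bin from the worked example and adds one invented, well-calibrated bin; the second case is an invented model that outputs the base rate for everyone.

```python
def ece_from_bins(bins):
    """bins: list of (count, mean_predicted, observed_freq), one tuple per bin."""
    n_total = sum(count for count, _, _ in bins)
    return sum((count / n_total) * abs(mean_pred - obs)
               for count, mean_pred, obs in bins)

# The 100-email spam bin (0.90 predicted, 0.70 observed) plus a
# hypothetical bin of 300 emails that is perfectly calibrated.
two_bins = [(100, 0.90, 0.70), (300, 0.10, 0.10)]
ece_two = ece_from_bins(two_bins)

# A hypothetical model that outputs 0.30 for every sample
# on data where 30% of samples are positive.
constant = [(1000, 0.30, 0.30)]
ece_constant = ece_from_bins(constant)

print("ECE, two bins:", round(ece_two, 3))
print("ECE, constant predictor:", round(ece_constant, 3))
```

In the first case the 0.20 gap contributes only 0.05, because that bin holds a quarter of the samples. The second case scores an ECE of 0 while ranking nothing at all, which is why ECE is usually reported next to a discrimination metric.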

Which models miscalibrate, and how?

Modern neural networks are systematically overconfident. The probabilities they output after a softmax are sharper than the data warrants — they peak too high.
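Random forests average many trees, which makes scores very close to 0 or 1 rare, so their probabilities tend to be pulled toward the center. Gradient boosting varies with the loss and the number of boosting rounds, and heavily boosted models can instead push probabilities toward the extremes.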
SVMs without Platt scaling output scores, not probabilities at all — the raw output has no calibration guarantee by definition.
Logistic regression is usually well-calibrated on the training distribution but can drift if the test distribution shifts.

The two standard fixes

Platt scaling fits a logistic regression on top of the raw model scores using a small held-out calibration set (separate from both training and test). The logistic applies a monotone S-curve that maps overconfident scores into honest probabilities. It works well when miscalibration is a smooth sigmoid-shaped distortion.
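As a rough from-scratch sketch of the idea, the code below fits the two Platt parameters (a slope a and an intercept b) by plain gradient descent on log-loss. The calibration data is invented: the base model emits a logit of +3 (raw probability about 0.95) where the true rate is 0.70, and -3 where the true rate is 0.30. The function names are hypothetical, and the sketch leaves out the target smoothing used in Platt's original formulation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_platt(scores, labels, lr=0.1, n_iter=2000):
    """Fit P(y=1 | s) = sigmoid(a*s + b) by gradient descent on mean log-loss."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # gradient of log-loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

# Hypothetical held-out calibration set: 10 samples at logit +3 (7 positive)
# and 10 samples at logit -3 (3 positive).
cal_scores = [3.0] * 10 + [-3.0] * 10
cal_labels = [1] * 7 + [0] * 3 + [1] * 3 + [0] * 7

a, b = fit_platt(cal_scores, cal_labels)
raw_high = sigmoid(3.0)           # what the uncalibrated model reports
cal_high = sigmoid(a * 3.0 + b)   # after Platt scaling
cal_low = sigmoid(a * -3.0 + b)

print("raw:", round(raw_high, 3))
print("calibrated high / low:", round(cal_high, 3), round(cal_low, 3))
```

The fitted slope comes out well below 1, which flattens the S-curve: the raw 0.95 is mapped to about 0.70 and its mirror image to about 0.30, matching the observed rates in this toy set.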

Isotonic regression fits a piecewise-constant monotone function instead of the smooth logistic. It is more flexible and handles asymmetric or multi-modal distortions that Platt scaling cannot, but it needs more calibration data (rule of thumb: at least 1000 samples) to avoid overfitting.
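A piecewise-constant monotone fit can be computed with the pool-adjacent-violators algorithm. The sketch below is a minimal plain-Python version with hypothetical function names and ten made-up calibration points; library implementations differ in details such as how they interpolate between steps.

```python
import bisect

def isotonic_fit(scores, labels):
    """Pool adjacent violators: return sorted scores and non-decreasing fitted values."""
    pairs = sorted(zip(scores, labels))
    blocks = []  # each block is [sum_of_labels, count]
    for _, y in pairs:
        blocks.append([float(y), 1])
        # Merge backwards while the previous block's mean exceeds the current one.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return [s for s, _ in pairs], fitted

def isotonic_predict(xs, fitted, score):
    """Step-function lookup: use the fitted value of the nearest score at or below."""
    i = bisect.bisect_right(xs, score) - 1
    return fitted[max(i, 0)]

# Hypothetical calibration set (far too small for real use).
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
labels = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1]

xs, fitted = isotonic_fit(scores, labels)
print([round(v, 2) for v in fitted])
print("calibrated 0.65 ->", isotonic_predict(xs, fitted, 0.65))
```

The fitted steps are 0, 0, 1/3, 1/3, 1/3, 0.5, 0.5, 1, 1, 1. The hard 0 and 1 at the ends come from just two or three samples each, which is the overfitting risk that makes a larger calibration set necessary.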

Both methods are available in scikit-learn as CalibratedClassifierCV with method='sigmoid' (Platt) or method='isotonic'.

The workflow is always:

Train your base model on the training set.
On a separate held-out calibration set, fit Platt or isotonic on top.
Evaluate the final calibrated probabilities on your test set.
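The three steps can be sketched with scikit-learn as follows. This is one possible arrangement, not the only one: it uses synthetic data, a LinearSVC as the base model, and a LogisticRegression fit on the SVM scores as an approximate Platt scaler (scikit-learn's default L2 regularization makes it slightly different from an unregularized fit). The split sizes and variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic binary classification data, split three ways.
X, y = make_classification(n_samples=3000, n_features=20, n_informative=5,
                           n_clusters_per_class=1, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=0)
X_cal, X_test, y_cal, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)

# Step 1: train the base model on the training set only.
base = LinearSVC(dual=False).fit(X_train, y_train)

# Step 2: fit the calibrator on held-out scores the base model never saw.
cal_scores = base.decision_function(X_cal).reshape(-1, 1)
platt = LogisticRegression().fit(cal_scores, y_cal)

# Step 3: evaluate calibrated probabilities on the untouched test set.
test_scores = base.decision_function(X_test).reshape(-1, 1)
probs = platt.predict_proba(test_scores)[:, 1]
brier = brier_score_loss(y_test, probs)

print("Brier score on test:", round(brier, 3))
```

CalibratedClassifierCV packages the same pattern, with cross-validation folds playing the role of the held-out calibration set. The Brier score used here is one convenient summary of probability quality on the test set.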

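Never calibrate on the training set: the base model's scores on data it has already seen look better than they will on new data, so a calibrator fit there learns a mapping that does not transfer.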

Worked example: bin predictions and compute ECE

The code below builds a synthetic overconfident classifier — 300 samples, four bins — and computes the reliability table and ECE from scratch using only numpy.

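import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

def make_bin(lo, hi, count, n_pos, rng):
    # Draw `count` predicted probabilities uniformly from [lo, hi) and
    # assign exactly `n_pos` positive labels in random order.
    preds  = rng.uniform(lo, hi, count)
    labels = np.zeros(count, dtype=int)
    labels[:n_pos] = 1
    rng.shuffle(labels)
    return preds, labels

# Four groups whose true positive rate (10%, 20%, 30%, 50%) is well below
# the range of probabilities the synthetic model assigns to them.
p1, l1 = make_bin(0.10, 0.30,  40,  4, rng)
p2, l2 = make_bin(0.30, 0.50,  80, 16, rng)
p3, l3 = make_bin(0.50, 0.70, 100, 30, rng)
p4, l4 = make_bin(0.70, 0.90,  80, 40, rng)

predicted = np.concatenate([p1, p2, p3, p4])
y         = np.concatenate([l1, l2, l3, l4])
n_total   = len(predicted)

# Bin edges; the last edge is 1.01 so a prediction of exactly 1.0 would be included.
edges      = [0.0, 0.30, 0.50, 0.70, 1.01]
bin_names  = ["[0.00-0.30)", "[0.30-0.50)", "[0.50-0.70)", "[0.70-0.90)"]
ece        = 0.0

print("Bin          | count | mean pred | obs freq | |gap|")
print("-" * 54)

for i in range(len(edges) - 1):
    lo, hi = edges[i], edges[i + 1]
    mask   = (predicted >= lo) & (predicted < hi)  # samples falling in this bin
    n      = int(mask.sum())
    if n == 0:
        continue
    mp  = round(float(predicted[mask].mean()), 2)  # mean predicted probability
    of  = round(float(y[mask].mean()),         2)  # observed positive frequency
    gap = round(abs(mp - of),                  2)
    # Weight the gap by the bin's share of samples. The rounded values are
    # used here, so the ECE matches the printed table.
    ece += (n / n_total) * abs(mp - of)
    print(bin_names[i] + " |  " + str(n) + "   |      " + str(mp) + " |     " + str(of) + "  |  " + str(gap))

print()
print("ECE = " + str(round(ece, 3)) + "  (0 = perfectly calibrated)")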

Bin          | count | mean pred | obs freq | |gap|
------------------------------------------------------
[0.00-0.30) |  40   |      0.21 |     0.1  |  0.11
[0.30-0.50) |  80   |      0.41 |     0.2  |  0.21
[0.50-0.70) |  100   |      0.61 |     0.3  |  0.31
[0.70-0.90) |  80   |      0.81 |     0.5  |  0.31

ECE = 0.257  (0 = perfectly calibrated)

Every row shows the model claiming a higher number than the data delivers: textbook overconfidence. The ECE of 0.257 means the predicted probability is off by roughly 26 percentage points on average, weighted by bin size. Fitting Platt scaling on a held-out set would typically shrink this gap substantially without retraining the base model.

In one breath

A model can rank perfectly (high AUC) and still lie about its probabilities — ranking and truthfulness are two different questions.
Calibrated means: among all predictions near p, the event really happens a fraction p of the time.
Read it on a reliability diagram — points below the diagonal mean overconfidence — and score it with ECE, the per-bin gap weighted by bin size.
Probabilities are load-bearing for threshold, expected-value, and risk-stacking decisions; if they are wrong, those decisions are wrong.
Fix it with Platt scaling (smooth sigmoid) or isotonic regression (flexible, needs ~1000+ samples), fit on a separate held-out set — never on the training data.

Quick check

Q1. A logistic regression outputs probabilities. AUC on the test set is 0.91. You then notice ECE is 0.28. What does this tell you?

Q2. You bin 500 test predictions and find that among the 80 predictions in the [0.70, 0.80) bin, only 48 events actually occurred. What is the observed frequency and the bin's calibration gap?

Q3. A hospital uses a sepsis-risk model to decide when to escalate care. The model has AUC 0.88 but ECE 0.31. A clinician asks: 'If the model says 80% sepsis risk, should I treat it as an 80% probability?' You deploy Platt scaling using a separate 1,200-patient calibration set and ECE drops to 0.04. Which statement best describes what changed, and what did NOT change?

Calibrated probabilities are the foundation for decision-curve analysis, which shows whether acting on a model’s predictions is actually better than a fixed policy — the natural next step once you trust the numbers.
