What metrics should you monitor for a production ML model, and at what layer?
The short answer
Production ML monitoring spans four layers: data quality (schema, distributions, null rates), model behaviour (prediction drift, confidence calibration), operational health (latency, error rate, throughput), and business KPIs (conversion, revenue impact). Each layer has different owners and different alert thresholds.
How to think about it
Monitoring a single accuracy number is not enough. Accuracy can only be computed once labels arrive, so by the time it visibly falls, business damage is already done. Layer your metrics so that checks on inputs and outputs catch problems before they cascade.
Layer 1: Data quality
Validate every batch and streaming record before it touches the model.
- Schema checks: expected columns present, correct dtypes, no new unexpected nullables.
- Distribution checks: per-feature mean, standard deviation, p50, p95; flag values that move more than 3 standard deviations away from a rolling baseline.
- Population Stability Index (PSI), computed per feature against a baseline sample: by a common rule of thumb, PSI below 0.1 is stable; 0.1 to 0.2 is moderate shift; above 0.2 warrants investigation.
- Null rate and zero rate: a feature whose null rate jumps from 2% to 25% signals an upstream pipeline break.
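The PSI and null-rate checks above can be sketched in a few lines. This is a minimal illustration using only the standard library, not a reference implementation: the function names (`psi`, `psi_status`, `null_rate`), the equal-width binning, the bin count, and the `eps` floor for empty bins are all assumptions made for the example. Quantile-based bins are an equally common choice.

```python
import math
from typing import List, Optional, Sequence


def psi(baseline: Sequence[float], current: Sequence[float],
        n_bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index of `current` measured against `baseline`.

    Bins are equal-width over the baseline's range (an assumption).
    Values outside that range are clamped into the first or last bin.
    """
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for constant features

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # eps keeps empty bins from causing log(0) or division by zero
        return [max(c / len(values), eps) for c in counts]

    p = proportions(baseline)
    q = proportions(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


def psi_status(value: float) -> str:
    """Map a PSI value onto the rule-of-thumb bands from the list above."""
    if value < 0.1:
        return "stable"
    if value <= 0.2:
        return "moderate shift"
    return "investigate"


def null_rate(values: Sequence[Optional[float]]) -> float:
    """Fraction of records in which the feature is missing (None)."""
    return sum(v is None for v in values) / len(values)


if __name__ == "__main__":
    baseline = [i / 100 for i in range(100)]        # roughly uniform on [0, 1)
    shifted = [0.5 + i / 200 for i in range(100)]   # mass moved to upper half
    print(psi_status(psi(baseline, baseline)))
    print(psi_status(psi(baseline, shifted)))
    print(null_rate([1.0, None, 3.0, None]))
```

In practice the baseline would be a training sample or a rolling window of recent production data, and the result would feed whichever alerting system the team already uses.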
Layer 2: Model behaviour
- Prediction drift: histogram distance (Jensen-Shannon divergence or PSI) between the current output score distribution and the training-time distribution. A flattening or bimodal spike is a red flag.
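- Confidence calibration: do predicted probabilities match observed outcome frequencies, or are scores collapsing toward 0 and 1? Measuring calibration needs labels, but a collapse toward the extremes is visible from the score distribution alone.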
- Class-level metrics: when labels do arrive, disaggregate by segment — a model can hold aggregate AUC while degrading sharply for a minority segment.
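As a rough sketch of the first two checks, the code below computes Jensen-Shannon divergence between two score histograms and a binned calibration error for a binary classifier. The function names, the 10-bin default, and the assumption that scores lie in [0, 1] are illustrative choices, not part of any particular library. The calibration function is one common binary variant (mean predicted probability versus observed positive rate per bin) and only applies once labels are available.

```python
import math
from typing import List, Sequence


def histogram(scores: Sequence[float], n_bins: int = 10) -> List[float]:
    """Normalised histogram of scores assumed to lie in [0, 1]."""
    counts = [0] * n_bins
    for s in scores:
        counts[min(int(s * n_bins), n_bins - 1)] += 1
    return [c / len(scores) for c in counts]


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    def kl(a: Sequence[float], b: Sequence[float]) -> float:
        # Bins where a_i == 0 contribute nothing, by convention.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def calibration_error(probs: Sequence[float], labels: Sequence[int],
                      n_bins: int = 10) -> float:
    """Weighted mean gap between predicted probability and positive rate."""
    bins: List[List[tuple]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))

    total = len(probs)
    error = 0.0
    for b in bins:
        if b:  # skip empty bins
            mean_p = sum(p for p, _ in b) / len(b)
            pos_rate = sum(y for _, y in b) / len(b)
            error += len(b) / total * abs(mean_p - pos_rate)
    return error


if __name__ == "__main__":
    train_scores = [0.1, 0.3, 0.5, 0.7, 0.9]
    live_scores = [0.01, 0.02, 0.98, 0.99, 0.99]  # collapsing toward 0 and 1
    print(js_divergence(histogram(train_scores), histogram(live_scores)))
    print(calibration_error([0.99] * 10, [1] * 5 + [0] * 5))
```

What counts as too much divergence depends on the model and traffic, so any alert threshold would need to be tuned against historical windows rather than taken from this sketch.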
Layer 3: Operational health
- p50 / p95 / p99 latency on feature retrieval and inference.
- Error rate and timeout rate on the serving endpoint.
- Throughput vs. expected query volume — a sudden drop often means an upstream client is falling back to a default.
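The three operational checks above reduce to simple aggregations over a window of request logs. The sketch below assumes latencies in milliseconds, HTTP status codes as the error signal, and a hypothetical `tolerance` parameter for the throughput comparison; most teams would get these numbers from their metrics system rather than compute them by hand.

```python
import statistics
from typing import Dict, Sequence


def latency_percentiles(latencies_ms: Sequence[float]) -> Dict[str, float]:
    """p50, p95 and p99 over a window of request latencies."""
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def error_rate(status_codes: Sequence[int]) -> float:
    """Share of requests that returned a 5xx status."""
    return sum(500 <= c < 600 for c in status_codes) / len(status_codes)


def throughput_dropped(observed_qps: float, expected_qps: float,
                       tolerance: float = 0.5) -> bool:
    """True when observed volume is below tolerance * expected volume.

    The 0.5 default is an arbitrary illustration, not a recommendation.
    """
    return observed_qps < tolerance * expected_qps


if __name__ == "__main__":
    window = list(range(1, 102))  # 1 ms .. 101 ms, one request each
    print(latency_percentiles(window))
    print(error_rate([200, 200, 500, 503]))
    print(throughput_dropped(observed_qps=40, expected_qps=100))
```

Tracking feature retrieval and inference latency as separate series, as the first bullet suggests, makes it easier to tell a slow feature store from a slow model.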
Layer 4: Business KPIs
- Click-through rate, conversion rate, churn, or revenue-per-session: whichever business metric the model was deployed to improve.
- These are the ground truth of whether the model is helping; all technical metrics are proxies.