datarekha
MLOps · Hard · Asked at Stripe, Affirm, Uber, Pinterest

A model is live and you cannot get labels quickly. How do you set up alerting to catch performance problems early?

The short answer

Without labels, alerting relies on three proxy signal layers: input distribution tests, output score distribution tests, and business proxy metrics. You define thresholds on each layer pre-deployment and set up automated alerts so that degradation triggers investigation before it compounds.

How to think about it

Label-free alerting is standard for fraud, credit, and recommendation systems, where outcomes arrive weeks or months after prediction. The key is to design the layered alert stack before deployment, not after an incident.

Step 1: Establish baselines at training time

Before deploying, compute and store reference statistics on the training or recent validation set:

  • Per-feature mean, standard deviation, p5, p25, p50, p75, p95, null rate.
  • Output score histogram (binned into 20 equal-width buckets).
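  • PSI reference bins per feature (for example, bin edges and bin fractions from the training set). By construction, every feature scores a PSI of 0 against its own baseline at deployment time.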

These baselines are the ground against which all future distributions are compared.
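A minimal sketch of how these reference statistics could be computed with NumPy. The function names `feature_baseline` and `score_histogram` are illustrative, not from any library. The sketch assumes nulls are encoded as NaN and that model scores are probabilities in [0, 1].

```python
import numpy as np

def feature_baseline(values):
    """Reference statistics for one numeric feature (assumes NaN means null)."""
    values = np.asarray(values, dtype=float)
    present = values[~np.isnan(values)]
    p5, p25, p50, p75, p95 = np.percentile(present, [5, 25, 50, 75, 95])
    return {
        "mean": float(present.mean()),
        "std": float(present.std(ddof=1)),  # sample standard deviation
        "p5": float(p5), "p25": float(p25), "p50": float(p50),
        "p75": float(p75), "p95": float(p95),
        "null_rate": float(np.isnan(values).mean()),
    }

def score_histogram(scores, n_bins=20):
    """Fraction of scores in each of n_bins equal-width buckets on [0, 1]."""
    counts, edges = np.histogram(scores, bins=n_bins, range=(0.0, 1.0))
    return counts / counts.sum(), edges
```

In practice the resulting dictionaries and histograms would be saved alongside the model artifact so that every monitoring job compares against the same reference.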

Step 2: Input distribution alerts

On a rolling window (hourly or daily depending on traffic volume):

  • Compute PSI for each feature against its baseline. Alert at PSI greater than 0.2.
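  • Run a two-sample KS test on continuous features; alert if the p-value falls below the Bonferroni-corrected level (0.01 divided by the number of features tested). On large windows the KS test flags even trivially small shifts, so read it alongside an effect-size measure such as PSI.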
  • Track null rate per feature; alert if null rate changes by more than 5 absolute percentage points.
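The three input checks above can be sketched as follows, assuming NumPy and SciPy are available. The helper names `psi` and `input_alerts` are hypothetical. The choice of 10 bins at baseline quantiles and the small `eps` floor for empty bins are common conventions, not requirements from the text.

```python
import numpy as np
from scipy.stats import ks_2samp

def psi(baseline, current, n_bins=10, eps=1e-6):
    """Population Stability Index with bins cut at baseline quantiles."""
    baseline = np.asarray(baseline, dtype=float)
    current = np.asarray(current, dtype=float)
    # Interior cut points only, so the two outer bins are open-ended.
    cuts = np.quantile(baseline, np.linspace(0, 1, n_bins + 1)[1:-1])
    base_frac = np.bincount(np.searchsorted(cuts, baseline), minlength=n_bins) / baseline.size
    curr_frac = np.bincount(np.searchsorted(cuts, current), minlength=n_bins) / current.size
    base_frac = np.clip(base_frac, eps, None)  # avoid log(0) on empty bins
    curr_frac = np.clip(curr_frac, eps, None)
    return float(np.sum((curr_frac - base_frac) * np.log(curr_frac / base_frac)))

def input_alerts(baseline, current, psi_limit=0.2, alpha=0.01, null_limit=0.05):
    """baseline/current: dict of feature name -> 1-D array (NaN means null)."""
    per_test_alpha = alpha / len(baseline)  # Bonferroni correction
    alerts = []
    for name, base in baseline.items():
        base = np.asarray(base, dtype=float)
        curr = np.asarray(current[name], dtype=float)
        base_ok, curr_ok = base[~np.isnan(base)], curr[~np.isnan(curr)]
        if psi(base_ok, curr_ok) > psi_limit:
            alerts.append((name, "psi"))
        if ks_2samp(base_ok, curr_ok).pvalue < per_test_alpha:
            alerts.append((name, "ks"))
        # Absolute change in null rate, in fractional terms (0.05 = 5 points).
        if abs(np.isnan(curr).mean() - np.isnan(base).mean()) > null_limit:
            alerts.append((name, "null_rate"))
    return alerts
```

Each alert is returned as a `(feature, check)` pair so that the paging logic can decide how many simultaneous breaches justify waking someone up.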

Step 3: Output distribution alerts

  • Compute Jensen-Shannon divergence between current and baseline score histograms. Alert if it exceeds a pre-calibrated threshold (set empirically on historical holdout windows).
  • Monitor mean predicted probability on a daily rolling window. A sustained directional trend over 3 or more days is worth investigating even if each day’s divergence is within threshold.
  • Alert if the fraction of predictions above the decision threshold changes by more than a predefined amount. This can signal a shift in the underlying positive rate or a decision threshold that is no longer well calibrated.
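A rough sketch of the three output checks, using NumPy only. The function names are illustrative. Two interpretations here are assumptions: the divergence is computed in base 2 (so it lies between 0 and 1), and a "sustained trend" is read as three consecutive day-over-day moves in the same direction.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (base 2) between two histograms."""
    p = np.asarray(p, dtype=float) + eps  # eps keeps empty buckets finite
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        return np.sum(a * np.log2(a / b))

    return float(0.5 * kl(p, m) + 0.5 * kl(q, m))

def sustained_trend(daily_means, min_days=3):
    """True if the last min_days day-over-day changes all share one sign."""
    diffs = np.diff(np.asarray(daily_means, dtype=float))[-min_days:]
    return len(diffs) == min_days and bool(np.all(diffs > 0) or np.all(diffs < 0))

def positive_rate_shift(baseline_scores, current_scores, decision_threshold=0.5):
    """Change in the fraction of scores above the decision threshold."""
    base = np.mean(np.asarray(baseline_scores, dtype=float) > decision_threshold)
    curr = np.mean(np.asarray(current_scores, dtype=float) > decision_threshold)
    return float(curr - base)
```

Note that `scipy.spatial.distance.jensenshannon` returns the Jensen-Shannon distance, which is the square root of the divergence, so thresholds calibrated on one are not interchangeable with the other.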

Step 4: Business proxy metric alerts

Identify 1 to 3 downstream metrics that react within hours of a change in model quality: dispute initiation rate, early delinquency, click-through rate, manual review queue depth. Set anomaly detection on each (e.g., alert when a value deviates by more than 3 standard deviations from its rolling 7-day mean).
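One way to read the 3-standard-deviation rule is to compare each day's value against the mean and standard deviation of the preceding 7 days. The sketch below takes that reading as an assumption. The function name `breaches` is hypothetical, and a 7-point window gives a noisy standard deviation, so a longer reference window may be preferable for low-volume metrics.

```python
import numpy as np

def breaches(daily_values, window=7, n_sigmas=3.0):
    """Indices of days that deviate from the previous `window` days by more than n_sigmas."""
    x = np.asarray(daily_values, dtype=float)
    flagged = []
    for i in range(window, len(x)):
        ref = x[i - window:i]  # the trailing window excludes the day being tested
        mu, sigma = ref.mean(), ref.std(ddof=1)
        if sigma > 0 and abs(x[i] - mu) > n_sigmas * sigma:
            flagged.append(i)
    return flagged
```

For example, a dispute rate that hovers around 1.0 for a week and then jumps to 2.0 would be flagged on the day of the jump.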

Step 5: Inter-model disagreement (if a shadow model is running)

Route live traffic to both the current model and a challenger (or a simpler baseline). Alert when the fraction of inputs where the two models disagree by more than a confidence margin exceeds a threshold. High disagreement without any upstream change is a signal that something in the data has shifted.
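A minimal sketch of the disagreement check, assuming both models emit a score for every request and that "disagree" means the absolute score gap exceeds a margin. The names `disagreement_rate`, `margin`, and `rate_limit` are illustrative, and the default values are placeholders to be calibrated on historical traffic.

```python
import numpy as np

def disagreement_rate(champion_scores, challenger_scores, margin=0.1):
    """Fraction of inputs where the two models' scores differ by more than margin."""
    a = np.asarray(champion_scores, dtype=float)
    b = np.asarray(challenger_scores, dtype=float)
    return float(np.mean(np.abs(a - b) > margin))

def disagreement_alert(champion_scores, challenger_scores, margin=0.1, rate_limit=0.05):
    """True when the disagreement rate exceeds the (assumed) alerting limit."""
    return disagreement_rate(champion_scores, challenger_scores, margin) > rate_limit
```

Because two models trained on the same data tend to disagree at a stable rate, the useful signal is a change in that rate relative to its own history rather than its absolute level.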
