MLOps | Hard | Asked at Stripe, Affirm, Uber, Pinterest

A model is live and you cannot get labels quickly. How do you set up alerting to catch performance problems early?

For: MLOps Engineer, ML Engineer, AI / LLM Engineer

The short answer

Without labels, alerting relies on three proxy signal layers: input distribution tests, output score distribution tests, and business proxy metrics. Define thresholds on each layer before deployment and automate the alerts so that degradation triggers an investigation before it compounds.

How to think about it

Label-free alerting is standard for fraud, credit, and recommendation systems where outcomes arrive weeks or months after prediction. The key is designing a layered alert stack before deployment — not after an incident.

Step 1: Establish baselines at training time

Before deploying, compute and store reference statistics on the training or recent validation set:

Per-feature mean, standard deviation, p5, p25, p50, p75, p95, null rate.
Output score histogram (binned into 20 equal-width buckets).
PSI reference bins per feature (bin edges and expected proportions). By construction, every feature scores a PSI of 0 against its own baseline at deployment time.

These baselines are the reference against which all future distributions are compared.
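As a rough illustration of the baseline step, the sketch below computes the per-feature statistics and the 20-bucket score histogram using only the Python standard library. The function names (`percentile`, `feature_baseline`, `score_histogram`) are hypothetical, missing values are assumed to be encoded as `None`, and scores are assumed to lie in [0, 1].

```python
import statistics

def percentile(sorted_vals, q):
    """Linearly interpolated percentile (q in [0, 100]) of a pre-sorted list."""
    if not sorted_vals:
        raise ValueError("no values to summarize")
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac

def feature_baseline(values):
    """Reference statistics for one feature; None marks a missing value (assumption)."""
    present = sorted(v for v in values if v is not None)
    stats = {
        "null_rate": 1 - len(present) / len(values),
        "mean": statistics.fmean(present),
        "std": statistics.pstdev(present),
    }
    for q in (5, 25, 50, 75, 95):
        stats[f"p{q}"] = percentile(present, q)
    return stats

def score_histogram(scores, n_bins=20):
    """Fraction of scores in each of n_bins equal-width buckets over [0, 1]."""
    counts = [0] * n_bins
    for s in scores:
        # Clamp so that 1.0 (or a slightly out-of-range score) lands in an edge bucket.
        idx = max(0, min(int(s * n_bins), n_bins - 1))
        counts[idx] += 1
    return [c / len(scores) for c in counts]

# Example: store these dictionaries and lists alongside the model artifact.
baseline = feature_baseline([1.0, 2.0, 3.0, 4.0, None])
hist = score_histogram([0.0, 0.5, 0.99, 1.0])
```

Persisting the output next to the model version keeps the reference fixed, so later comparisons are always made against the distribution the model was actually validated on.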

Step 2: Input distribution alerts

On a rolling window (hourly or daily depending on traffic volume):

Compute PSI for each feature against its baseline. Alert at PSI greater than 0.2.
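Run a two-sample KS test on continuous features; alert if the p-value is below 0.01 after Bonferroni correction (that is, below 0.01 divided by the number of features tested). At high traffic volumes the KS test flags even negligible shifts, so also check the size of the KS statistic before paging anyone.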
Track null rate per feature; alert if null rate changes by more than 5 absolute percentage points.
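The PSI and null-rate checks above could be sketched as follows. This is a minimal standard-library version under stated assumptions: the bin `edges` are taken to be fixed from the baseline (for example, baseline quantiles), missing values are encoded as `None`, and empty bins are floored at a small `eps` so the logarithm stays finite. All function names are hypothetical, and the KS test is left out because it would normally come from a statistics library.

```python
import math

def bin_fractions(values, edges, eps=1e-6):
    """Fraction of values per bin, given sorted interior bin edges.
    Fractions are floored at eps so the log in PSI stays finite."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        idx = sum(1 for e in edges if v > e)  # number of edges strictly below v
        counts[idx] += 1
    return [max(c / len(values), eps) for c in counts]

def psi(baseline_values, current_values, edges):
    """Population Stability Index: sum over bins of (a - e) * ln(a / e)."""
    expected = bin_fractions(baseline_values, edges)
    actual = bin_fractions(current_values, edges)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))

def null_rate(values):
    """Share of missing values, with None as the missing marker (assumption)."""
    return sum(v is None for v in values) / len(values)

def input_alerts(baseline, current, edges, psi_limit=0.2, null_limit=0.05):
    """Names of the input checks that fired for one feature."""
    alerts = []
    base_present = [v for v in baseline if v is not None]
    cur_present = [v for v in current if v is not None]
    if psi(base_present, cur_present, edges) > psi_limit:
        alerts.append("psi")
    # null_limit is in absolute terms: 0.05 means 5 percentage points.
    if abs(null_rate(current) - null_rate(baseline)) > null_limit:
        alerts.append("null_rate")
    return alerts

# Example: a window shifted upward and with 20 new missing values.
baseline = [i / 100 for i in range(100)]
current = [0.5 + i / 200 for i in range(100)] + [None] * 20
edges = [0.25, 0.5, 0.75]
fired = input_alerts(baseline, current, edges)
```

Reusing the baseline bin edges for every window matters: re-binning on the current data would hide exactly the shift the check is meant to detect.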

Step 3: Output distribution alerts

Compute Jensen-Shannon divergence between current and baseline score histograms. Alert if it exceeds a pre-calibrated threshold (set empirically on historical holdout windows).
Monitor mean predicted probability on a daily rolling window. A sustained directional trend over 3 or more days is worth investigating even if each day’s divergence is within threshold.
Alert if the fraction of predictions above the decision threshold changes by more than a defined percentage; this can signal a shift in the underlying positive rate (prior shift) or score miscalibration.
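To make the first two output checks concrete, here is a small sketch of Jensen-Shannon divergence between two normalized histograms, plus one possible reading of "sustained directional trend": the last `min_days` day-over-day changes in the mean score all share the same sign. Both function names and that interpretation are assumptions made for illustration; base-2 logarithms are used so the divergence falls between 0 and 1.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence between two histograms that each sum to 1."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    # m_i > 0 whenever p_i > 0 or q_i > 0, so no division by zero occurs.
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def sustained_trend(daily_means, min_days=3):
    """True if the last min_days day-over-day changes all have the same sign."""
    if len(daily_means) < min_days + 1:
        return False
    recent = daily_means[-(min_days + 1):]
    diffs = [b - a for a, b in zip(recent, recent[1:])]
    return all(d > 0 for d in diffs) or all(d < 0 for d in diffs)

# Example: compare a baseline histogram with the current window.
baseline_hist = [0.7, 0.2, 0.1]
current_hist = [0.5, 0.3, 0.2]
divergence = js_divergence(baseline_hist, current_hist)
drifting = sustained_trend([0.10, 0.11, 0.12, 0.13])
```

The alert threshold for `divergence` is deliberately not hard-coded here, since the text recommends calibrating it on historical holdout windows.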

Step 4: Business proxy metric alerts

Identify 1 to 3 downstream metrics that react within hours of a change in model quality: dispute initiation rate, early delinquency, click-through rate, manual review queue depth. Set anomaly detection on each (e.g., alert when a value deviates by more than 3 standard deviations from its rolling 7-day mean).
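One way to read the 3-standard-deviation rule is to compare each day's value with the mean and sample standard deviation of the preceding 7 days. The sketch below implements that reading; the function name `sigma_breaches` and the choice to exclude the current day from its own window are assumptions.

```python
import statistics

def sigma_breaches(daily_values, window=7, n_sigma=3.0):
    """Indices of days that deviate more than n_sigma standard deviations
    from the mean of the preceding `window` days."""
    flagged = []
    for i in range(window, len(daily_values)):
        history = daily_values[i - window:i]  # trailing window, excludes day i
        mean = statistics.fmean(history)
        std = statistics.stdev(history)
        # A flat history (std == 0) is skipped rather than treated as a breach.
        if std > 0 and abs(daily_values[i] - mean) > n_sigma * std:
            flagged.append(i)
    return flagged

# Example: daily manual-review queue depth with a spike on day index 8.
queue_depth = [10, 11, 9, 10, 12, 10, 11, 10, 25, 10]
breach_days = sigma_breaches(queue_depth)
```

Excluding the current day from the window keeps a spike from inflating its own baseline, although the spike will still widen the window for the days that follow it.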

Step 5: Inter-model disagreement (if a shadow model is running)

Route live traffic to both the current model and a challenger (or a simpler baseline). Alert when the fraction of inputs on which the two models' scores differ by more than a chosen margin exceeds a threshold. High disagreement without any known upstream change is a signal that something in the data has shifted.
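A minimal sketch of the disagreement check, assuming both models emit a score in [0, 1] for the same requests in the same order. The names `disagreement_rate` and `disagreement_alert`, the 0.2 margin, and the 10% rate limit are illustrative placeholders, not recommended values.

```python
def disagreement_rate(champion_scores, challenger_scores, margin=0.2):
    """Fraction of inputs where the two models' scores differ by more than margin."""
    if len(champion_scores) != len(challenger_scores):
        raise ValueError("score lists must cover the same requests")
    if not champion_scores:
        return 0.0
    disagreements = sum(
        abs(a - b) > margin for a, b in zip(champion_scores, challenger_scores)
    )
    return disagreements / len(champion_scores)

def disagreement_alert(champion_scores, challenger_scores,
                       margin=0.2, rate_limit=0.1):
    """True when the disagreement rate exceeds rate_limit (placeholder value)."""
    return disagreement_rate(champion_scores, challenger_scores, margin) > rate_limit

# Example: two of the four requests differ by 0.5, so the rate is 0.5.
champion = [0.1, 0.9, 0.5, 0.3]
challenger = [0.15, 0.4, 0.55, 0.8]
rate = disagreement_rate(champion, challenger)
```

Because two models always disagree somewhat, the rate limit is best set relative to the disagreement rate observed on a stable historical window.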