A model is live and you cannot get labels quickly. How do you set up alerting to catch performance problems early?
Without labels, alerting relies on three proxy signal layers: input distribution tests, output score distribution tests, and business proxy metrics. You define thresholds on each layer pre-deployment and set up automated alerts so that degradation triggers investigation before it compounds.
How to think about it
Label-free alerting is standard for fraud, credit, and recommendation systems, where outcomes arrive weeks or months after prediction. The key is to design a layered alert stack before deployment, not after an incident.
Step 1: Establish baselines at training time
Before deploying, compute and store reference statistics on the training or recent validation set:
- Per-feature mean, standard deviation, p5, p25, p50, p75, p95, null rate.
- Output score histogram (binned into 20 equal-width buckets).
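- PSI reference bins: the bin edges and per-bin proportions for each feature, against which later windows are scored (by construction, every feature's PSI against its own baseline is 0 at deployment time).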
These baselines are the reference against which all future distributions are compared.
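As a rough illustration of Step 1, the sketch below computes the per-feature summary and the 20-bucket score histogram using only the Python standard library. The function names `feature_baseline` and `score_histogram` are hypothetical, missing values are assumed to be encoded as `None`, and scores are assumed to be probabilities in [0, 1]; a production pipeline would more likely use NumPy or pandas.

```python
import statistics

def feature_baseline(values):
    """Summarize one feature. Assumption: None marks a missing value."""
    present = sorted(v for v in values if v is not None)
    n = len(values)
    # With n=100, quantiles() returns 99 cut points; index k-1 holds the k-th percentile.
    pct = statistics.quantiles(present, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(present),
        "std": statistics.pstdev(present),
        "p5": pct[4],
        "p25": pct[24],
        "p50": pct[49],
        "p75": pct[74],
        "p95": pct[94],
        "null_rate": (n - len(present)) / n,
    }

def score_histogram(scores, n_bins=20):
    """Share of scores in each of n_bins equal-width buckets on [0, 1]."""
    counts = [0] * n_bins
    for s in scores:
        # Clamp so that a score of exactly 1.0 lands in the last bucket.
        idx = min(max(int(s * n_bins), 0), n_bins - 1)
        counts[idx] += 1
    total = len(scores)
    return [c / total for c in counts]
```

Storing the output of both functions alongside the model artifact keeps the reference statistics versioned with the model they describe.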
Step 2: Input distribution alerts
On a rolling window (hourly or daily depending on traffic volume):
- Compute PSI for each feature against its baseline. Alert at PSI greater than 0.2.
- Run a two-sample KS test on continuous features; alert if the p-value is below 0.01 divided by the number of features tested (Bonferroni correction).
- Track null rate per feature; alert if null rate changes by more than 5 absolute percentage points.
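The PSI and null-rate checks above could be sketched as follows for a single feature. This is a minimal standard-library version under stated assumptions: `edges` are interior cut points saved with the baseline, `None` marks a missing value, empty bins are floored at a small `eps` to avoid taking the log of zero, and all function names are hypothetical. The KS test is left out here; SciPy's `scipy.stats.ks_2samp` is one common way to run it.

```python
import bisect
import math

def bin_proportions(values, edges):
    """Share of values in each bin; edges are sorted interior cut points."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect.bisect_right(edges, v)] += 1
    total = len(values)
    return [c / total for c in counts]

def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two proportion vectors."""
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # floor empty bins (an assumption)
        total += (a - e) * math.log(a / e)
    return total

def input_alerts(baseline, current, edges, psi_limit=0.2, null_limit=0.05):
    """Return the names of the checks that fired for one feature."""
    base_present = [v for v in baseline if v is not None]
    cur_present = [v for v in current if v is not None]
    alerts = []
    score = psi(bin_proportions(base_present, edges),
                bin_proportions(cur_present, edges))
    if score > psi_limit:
        alerts.append("psi")
    # Null-rate change is measured in absolute terms (0.05 = 5 percentage points).
    base_null = 1 - len(base_present) / len(baseline)
    cur_null = 1 - len(cur_present) / len(current)
    if abs(cur_null - base_null) > null_limit:
        alerts.append("null_rate")
    return alerts
```

Running `input_alerts` once per feature per window and collecting the non-empty results gives a compact payload for an alerting system.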
Step 3: Output distribution alerts
- Compute Jensen-Shannon divergence between current and baseline score histograms. Alert if it exceeds a pre-calibrated threshold (set empirically on historical holdout windows).
- Monitor mean predicted probability on a daily rolling window. A sustained directional trend over 3 or more days is worth investigating even if each day’s divergence is within threshold.
- Alert if the fraction of predictions above the decision threshold changes by more than a predefined percentage; this can signal a shift in the positive-class rate (label drift) or a miscalibrated threshold.
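The three output checks could look roughly like this. The sketch assumes histograms are proportion vectors of equal length and that a "sustained directional trend" means the last `days` day-over-day changes in the mean score all share a sign; that reading, and the function names, are illustrative choices rather than fixed definitions.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logs, so the result lies in [0, 1]."""
    def kl(a, b):
        # Terms with a zero numerator contribute nothing to the sum.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def sustained_trend(daily_means, days=3):
    """True if the last `days` day-over-day changes all point the same way."""
    if len(daily_means) < days + 1:
        return False
    diffs = [b - a for a, b in zip(daily_means[-days - 1:-1], daily_means[-days:])]
    return all(d > 0 for d in diffs) or all(d < 0 for d in diffs)

def positive_rate(scores, decision_threshold):
    """Fraction of predictions at or above the decision threshold."""
    return sum(s >= decision_threshold for s in scores) / len(scores)
```

The alert threshold for `js_divergence` is deliberately not hard-coded: it would come from the spread of values observed across historical holdout windows.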
Step 4: Business proxy metric alerts
Identify 1 to 3 downstream metrics that react within hours of a change in model quality: dispute initiation rate, early delinquency, click-through rate, manual review queue depth. Set anomaly detection (e.g., a 3-standard-deviation breach on a rolling 7-day mean) on each.
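One simple way to implement the 3-standard-deviation rule is sketched below. It assumes one metric value per day and compares today's value with the mean and sample standard deviation of the previous 7 days; the function name `breaches` and the handling of a flat history are assumptions made for illustration.

```python
import statistics

def breaches(history, today, window=7, n_sigma=3.0):
    """True if today's value is more than n_sigma std devs from the trailing mean."""
    recent = history[-window:]
    if len(recent) < window:
        return False  # not enough history to judge
    mean = statistics.fmean(recent)
    std = statistics.stdev(recent)
    if std == 0:
        # Flat history: treat any departure from the constant value as a breach.
        return today != mean
    return abs(today - mean) > n_sigma * std
```

Metrics with strong weekly seasonality may need a longer window or a same-weekday comparison, since a plain 7-day mean will otherwise flag routine weekend dips.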
Step 5: Inter-model disagreement (if a shadow model is running)
Route live traffic to both the current model and a challenger (or a simpler baseline). Alert when the fraction of inputs where the two models disagree by more than a confidence margin exceeds a threshold. High disagreement without any upstream change is a signal that something in the data has shifted.
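A minimal sketch of the disagreement check, assuming both models emit a probability for the same inputs in the same order. The name `disagreement_rate` and the default margin of 0.2 are placeholders; the margin and the alert threshold on the resulting rate would be tuned on historical traffic.

```python
def disagreement_rate(champion_scores, challenger_scores, margin=0.2):
    """Fraction of inputs where the two models' scores differ by more than margin."""
    if len(champion_scores) != len(challenger_scores):
        raise ValueError("score lists must cover the same inputs")
    pairs = list(zip(champion_scores, challenger_scores))
    return sum(abs(a - b) > margin for a, b in pairs) / len(pairs)
```

Comparing this rate against its own baseline, measured during a stable period, matters more than its absolute level, since two different models always disagree to some extent.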