datarekha
MLOps · Hard · Asked at Stripe, Affirm, Klarna, Uber

How do you monitor a model when ground-truth labels are delayed or never arrive?

The short answer

When true labels are unavailable or arrive weeks late, you monitor leading indicators instead: input distribution drift, output score distribution shift, proxy business metrics, and inter-model disagreement. These act as early-warning signals before any labelled evaluation becomes possible.

How to think about it

Label delay is the norm, not the exception. For a fraud model, the true label for a transaction is known only after chargebacks settle, 30 to 90 days later. A loan default model waits months or years for outcomes. You need monitoring that works without labels.

Tier 1: Input distribution monitoring (no labels needed)

Monitor P(X) per feature with a drift measure or statistical test: Population Stability Index (PSI) on binned values, the Kolmogorov-Smirnov test for continuous features, and the chi-squared test for categorical features. Significant input drift is a leading indicator that the model’s assumptions may no longer hold, and it shows up before any outcome is observable.
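As a rough illustration of the PSI check, here is a minimal standard-library sketch for one numeric feature. The function name `psi`, the quantile binning, and the `eps` floor for empty bins are assumptions made for this example, not a fixed standard; production systems typically use a monitoring library instead.

```python
import math
from bisect import bisect_right


def psi(baseline, current, n_bins=10, eps=1e-6):
    """Population Stability Index between two samples of one numeric feature."""
    # Bin edges are baseline quantiles, so each bin holds roughly 1/n_bins
    # of the baseline sample.
    ordered = sorted(baseline)
    edges = [ordered[int(len(ordered) * i / n_bins)] for i in range(1, n_bins)]

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            counts[bisect_right(edges, v)] += 1
        # eps (an assumption of this sketch) keeps empty bins from
        # producing log(0) or a division by zero.
        return [max(c / len(values), eps) for c in counts]

    expected = bin_fractions(baseline)
    actual = bin_fractions(current)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))
```

A commonly quoted rule of thumb treats PSI below 0.1 as stable, 0.1 to 0.25 as worth watching, and above 0.25 as significant drift. Treat those cut-offs as a starting point to tune per feature rather than as hard limits.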

Tier 2: Prediction drift monitoring (no labels needed)

The model’s output distribution P(ŷ) is always available. Track:

  • Score histogram shape — is the distribution bimodal, uniform, or shifted vs. training time?
  • Jensen-Shannon divergence between current and baseline prediction distributions.
  • Mean predicted probability on a rolling window — a sustained upward or downward trend can indicate a change in the underlying base rate (label drift) before any true labels are seen, though it can also reflect input drift alone.

These are proxy signals: output drift does not guarantee accuracy loss, but it is a reliable trigger for investigation.
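The Jensen-Shannon and rolling-mean checks listed above can be sketched with the standard library alone. The helper names (`histogram`, `js_divergence`, `RollingMean`), the 20-bin default, and the assumption that scores are probabilities in [0, 1] are all choices made for this illustration.

```python
import math
from collections import deque


def histogram(scores, n_bins=20):
    """Fraction of scores in each equal-width bin; assumes scores lie in [0, 1]."""
    counts = [0] * n_bins
    for s in scores:
        # A score of exactly 1.0 falls into the last bin.
        counts[min(int(s * n_bins), n_bins - 1)] += 1
    return [c / len(scores) for c in counts]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    def kl(a, b):
        # Terms with a == 0 contribute nothing; b is the mixture, so b > 0
        # whenever a > 0.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    mixture = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)


class RollingMean:
    """Mean of the most recent `window` scores."""

    def __init__(self, window):
        self.values = deque(maxlen=window)

    def update(self, score):
        self.values.append(score)
        return sum(self.values) / len(self.values)
```

In use, you would compute `histogram` once on a baseline window of scores (for example, validation-time predictions), recompute it on each monitoring window, and alert when `js_divergence` between the two stays above a threshold chosen from historical variation.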

Tier 3: Proxy business metrics (no labels needed)

Find metrics correlated with model quality that update faster than labels. For a fraud model: dispute rate, manual review queue growth, customer complaint volume. For a recommendation model: immediate click-through rate. For a credit model: early delinquency (30-day late), which precedes true default by months.

Tier 4: Delayed label evaluation

When labels do arrive, even as a small sample, join them to the predictions logged at inference time and compute the usual evaluation metrics. Maintain an inference log with timestamps and feature snapshots so every prediction can be evaluated retrospectively.
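A minimal sketch of that retrospective join might look like the following. The `Prediction` record, its field names, and the `evaluate_matured` helper are hypothetical; a real system would read the log from a warehouse table rather than a Python list.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Prediction:
    prediction_id: str
    scored_at: datetime   # when the model produced the score
    score: float          # predicted probability of the positive class
    features: dict        # snapshot of the inputs exactly as the model saw them
    model_version: str


def evaluate_matured(log, labels, threshold=0.5):
    """Precision and recall over logged predictions whose labels have arrived.

    log: iterable of Prediction; labels: dict of prediction_id -> bool.
    """
    tp = fp = fn = matched = 0
    for p in log:
        if p.prediction_id not in labels:
            continue  # label has not arrived yet; skip for now
        matched += 1
        predicted = p.score >= threshold
        actual = labels[p.prediction_id]
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif actual:
            fn += 1
    return {
        "n_labelled": matched,
        "precision": tp / (tp + fp) if tp + fp else None,
        "recall": tp / (tp + fn) if tp + fn else None,
    }
```

Keying the join on a prediction ID, rather than recomputing scores from current features, is what makes the evaluation reflect what the model actually saw at inference time.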

Tier 5: Human-in-the-loop sampling

Route a random sample of predictions to a human review queue regardless of model confidence. This yields a labelled ground-truth sample that is neither cherry-picked nor dependent on waiting for outcomes, which is vital for calibration audits.
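One way to keep the sample independent of model confidence is to decide from a hash of the prediction ID instead of from the score. The sketch below assumes a string ID per prediction; the function name `in_review_sample`, the 1% default rate, and the `salt` value are illustrative.

```python
import hashlib


def in_review_sample(prediction_id, rate=0.01, salt="review-v1"):
    """Deterministically route roughly `rate` of predictions to human review.

    The decision depends only on the ID and the salt, never on the model
    score, so the reviewed sample is not biased by confidence.
    """
    digest = hashlib.sha256(f"{salt}:{prediction_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big")   # uniform in [0, 2**64)
    return bucket < int(rate * 2**64)
```

Because the decision is deterministic, replaying the inference log reproduces the same review sample, and changing the salt draws a fresh one.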
