The silent revenue drop: how drift actually breaks production models

In the third quarter of 2021, Zillow took a $304 million inventory write-down, said it had overvalued the homes it bought by more than half a billion dollars, shut down its iBuying division, and laid off roughly a quarter of its staff. Per The Batch’s reconstruction, the company bought about 9,680 homes and sold 3,032 in that quarter, losing on the order of $80,000 a home. CEO Rich Barton’s public explanation was blunt: “we have been unable to predict future pricing of homes to a level of accuracy that makes this a safe business to be in.”

There was no outage. No pager fired. The valuation model returned well-formed predictions at normal latency the entire time. What broke was the relationship between the features it had learned on and the world it was now pricing: a COVID-era housing market that analysts have since framed as a mix of data drift and concept drift, though that’s the after-the-fact interpretation, not Zillow’s own root-cause language. The model was confidently wrong, and confidently wrong models don’t throw exceptions. They throw write-downs.

That gap — between “model deployed, dashboards green” and “finance asks why a revenue line is bleeding” — is what this post is about. Not the mechanics of KS, PSI, or Jensen-Shannon (the drift lesson covers the math). The operational reality: what actually happens in the weeks in between, why the genuinely dangerous drift is the hardest to see, why most of your drift alerts are lying to you, and why the correct response to a drift alert is almost never “retrain.”

Two kinds of drift, and only one of them keeps you up at night

The textbook split is real and load-bearing, so worth stating precisely.

Data drift is a change in the input distribution P(X). Your users got younger; a sensor recalibrated; a new market came online and shifted the mean of three features. The inputs moved.

Concept drift is a change in the relationship P(y | X) — the same input now maps to a different answer. Fraudsters found a new pattern; pricing dynamics inverted; the thing you’re predicting started behaving differently for reasons your features don’t capture.

Here is why this distinction is not academic. Data drift you can detect today, from the inputs alone, with no labels — you already have the features. Concept drift you generally cannot, because to observe that P(y | X) has shifted you need y, the ground-truth label, and the label is exactly what arrives late or never.

The asymmetry that defines the problem: the cheap-to-detect drift is mostly harmless; the expensive drift is the one you are structurally blind to until the label confirms the loss.
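A toy simulation makes the asymmetry concrete. This is only a sketch under made-up assumptions: one hypothetical feature `x`, a frozen threshold model, and a "true" decision boundary that we control. Data drift moves the mean of `x`; concept drift moves the boundary while leaving `x` untouched.

```python
import random
import statistics

def model(x):
    # Frozen "production" model: predicts 1 when x exceeds 0.
    return 1 if x > 0 else 0

def sample(n, x_mean, boundary, rng):
    # x ~ Normal(x_mean, 1); the true label is 1 when x exceeds `boundary`.
    xs = [rng.gauss(x_mean, 1.0) for _ in range(n)]
    ys = [1 if x > boundary else 0 for x in xs]
    return xs, ys

def accuracy(xs, ys):
    return sum(model(x) == y for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)
ref_x, ref_y = sample(5000, x_mean=0.0, boundary=0.0, rng=rng)  # training-time world
dd_x, dd_y = sample(5000, x_mean=1.0, boundary=0.0, rng=rng)    # data drift: P(X) moved
cd_x, cd_y = sample(5000, x_mean=0.0, boundary=1.0, rng=rng)    # concept drift: P(y|X) moved

for name, xs, ys in [("reference", ref_x, ref_y),
                     ("data drift", dd_x, dd_y),
                     ("concept drift", cd_x, cd_y)]:
    print(f"{name:14s} input mean={statistics.fmean(xs):+.2f}  "
          f"accuracy={accuracy(xs, ys):.3f}")
```

In the data-drift sample the input mean visibly moves and accuracy does not change at all, because the model's boundary still matches the world. In the concept-drift sample the inputs look exactly like the reference data, yet accuracy falls to roughly two thirds, and the only way to see that is the `ys` column you don't have yet.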

How late is late? Chip Huyen’s framing of feedback-loop length in Designing Machine Learning Systems makes the range visceral. At one end, an e-commerce recommender gets a “clicked” label in minutes. Fraud detection waits for the dispute window — a month to three months. At the far end sit labels that take years (loan default) or are functionally never observed at all (much of ranking and recommendation).

Sit with the fraud number. If concept drift starts today and your only honest signal is the chargeback label, your accuracy metric does not even begin to move for weeks — and by the time it has moved enough to clear a threshold, you’ve been bleeding for up to a quarter. The metric doesn’t warn you. It writes the post-mortem.

The uncomfortable truth: most drift alerts are noise

The instinct, once you’ve internalized the above, is to wire up per-feature drift detectors and alert on everything. This is the single biggest manufacturer of alert fatigue in production ML, and the reason is subtle: a feature can drift hard without the model getting any worse.

NannyML ran the cleanest demonstration of this I know of, on the public Tetouan City power-consumption dataset. During the production month, univariate drift fired on the two most influential features — temperature and humidity. If you were paging on feature drift, you’d have been woken up for your top predictors. And every single one of those alerts was a false alarm: realized model performance stayed inside its thresholds the entire time. The inputs moved; the model didn’t care, because the relationship it relied on still held.

This is the trap. Univariate KS/PSI-per-feature monitoring ignores inter-feature relationships, is outlier-sensitive, and flags distributional motion that has no bearing on whether your predictions are still correct. Drift in P(X) is a question, not an incident. It tells you to go look; it does not tell you something is broken.

So if the easy-to-measure signal is mostly noise, and the signal that actually matters (performance) is gated behind months of label latency, what do you watch in the meantime? You watch the one thing that’s free, immediate, and label-independent: the model’s own outputs.

The cheapest catch-anything signal: watch your predictions

If you monitor exactly one thing, monitor the distribution of your model’s predictions over time. Evidently puts it about as directly as anyone: “Prediction drift is usually more important than feature drift. If you monitor one thing, look at the outputs,” and “without target values, this is the best proxy of the model behavior.”

The logic is that prediction drift is a strict downstream funnel. For the output distribution to shift, something upstream moved — an input distribution, a feature pipeline, an upstream model — and it moved enough to change the model’s behavior. That last clause is the whole value: prediction drift filters out the input drift that didn’t matter (the Tetouan case) and surfaces the input drift that did, all without a single label.

The canonical catch: a recommender that suddenly pushes the same item to half your user base shows up instantly as a collapsed output distribution. No labels, no waiting for click-through to crater. A fraud model whose positive rate doubles overnight is telling you something changed now, not in three months. It won’t tell you whether the change is good or bad — that still needs labels eventually — but it tells you to look today, which buys you the lead time the label can’t.
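A minimal sketch of what "watch your predictions" can look like in practice, assuming the model emits scores in [0, 1] and that you keep a reference window of past scores. The `psi` helper, the bin count, and the synthetic beta-distributed scores are all illustrative choices, not anyone's production setup.

```python
import math
import random

def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index between two samples of scores in [0, 1]."""
    def bin_fractions(scores):
        counts = [0] * bins
        for s in scores:
            idx = min(int(s * bins), bins - 1)  # a score of exactly 1.0 goes in the last bin
            counts[idx] += 1
        # eps keeps empty bins from producing log(0) or a division by zero
        return [max(c / len(scores), eps) for c in counts]

    ref = bin_fractions(reference)
    cur = bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))

rng = random.Random(42)
last_week = [rng.betavariate(2, 8) for _ in range(10_000)]          # baseline scores
this_week_stable = [rng.betavariate(2, 8) for _ in range(10_000)]   # same behavior
this_week_shifted = [rng.betavariate(4, 6) for _ in range(10_000)]  # scores moved up

print(f"stable week  PSI = {psi(last_week, this_week_stable):.4f}")
print(f"shifted week PSI = {psi(last_week, this_week_shifted):.4f}")
```

The stable week scores close to zero and the shifted week scores far higher. The often-quoted rule of thumb treats PSI above about 0.1 as worth a look and above about 0.25 as a significant shift, but those cutoffs are conventions to tune against your own history, not guarantees.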

Prediction monitoring is cheap, immediate, and catches a startling fraction of real incidents. But it has a blind spot: a model can be quietly, steadily wrong while its output distribution looks completely normal. To close that gap you need an actual performance number — and you need it before the labels show up.

Estimating performance you can’t yet measure

This is where the genuinely clever tooling lives. NannyML ships two methods for estimating model performance with no labels at all. (It is an open-source library; its acquisition by the data-quality company Soda was announced in June 2025, with a commitment to keep it open and maintained.)

For classification, CBPE — Confidence-Based Performance Estimation — uses the model’s own calibrated probabilities to reconstruct the expected confusion matrix, and from it estimates precision, recall, F1, AUROC, and accuracy. The intuition: a well-calibrated model that says “0.9” is right about 90% of the time, so you can assemble the expected outcomes across a batch of predictions without ever seeing the truth. For regression, DLE — Direct Loss Estimation — trains a second “nanny” model whose target is the per-prediction loss of your “child” model, exploiting the fact that estimating the magnitude of an error is easier than estimating its signed direction.
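The CBPE intuition fits in a few lines. The sketch below is not NannyML's API or implementation; it is a hand-rolled illustration of the idea for a binary classifier, assuming the scores are calibrated probabilities of the positive class. The function names are hypothetical.

```python
import random

def estimate_confusion(probs, threshold=0.5):
    """Expected confusion-matrix counts from calibrated P(y=1) scores alone."""
    tp = fp = tn = fn = 0.0
    for p in probs:
        if p >= threshold:   # model predicts positive
            tp += p          # correct with probability p
            fp += 1.0 - p
        else:                # model predicts negative
            tn += 1.0 - p    # correct with probability 1 - p
            fn += p
    return tp, fp, tn, fn

def estimated_metrics(probs, threshold=0.5):
    tp, fp, tn, fn = estimate_confusion(probs, threshold)
    return {
        "accuracy": (tp + tn) / len(probs),
        "precision": tp / (tp + fp) if tp + fp else float("nan"),
        "recall": tp / (tp + fn) if tp + fn else float("nan"),
    }

# Synthetic check: labels are drawn so the scores are calibrated by construction.
rng = random.Random(7)
probs = [rng.random() for _ in range(20_000)]
labels = [1 if rng.random() < p else 0 for p in probs]

est = estimated_metrics(probs)
realized = sum((p >= 0.5) == bool(y) for p, y in zip(probs, labels)) / len(probs)
print(f"estimated accuracy = {est['accuracy']:.3f}, realized = {realized:.3f}")
```

Note that `estimated_metrics` never touches `labels`. That is the whole appeal, and also the limitation: if the labels started following a different rule tomorrow while the scores stayed the same, the estimate would not move.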

The research on this family is encouraging. A 2024 benchmark from a Helsinki and NannyML group (analyzing the average-confidence family CBPE belongs to, not a head-to-head on the branded method) reported calibrated estimators hitting roughly 1.3% mean absolute error on accuracy estimation under covariate shift — but with a hard dependency: a 0.980 correlation between calibration error and estimation error. Translation: the estimate is only as good as your calibration. A miscalibrated model produces confident nonsense estimates.
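To see why calibration is the hard dependency, here is a small synthetic sketch, not a reproduction of the benchmark. It assumes a binned expected calibration error (ECE) as the calibration measure and a hypothetical "overconfident" distortion that pushes every score away from 0.5 without changing any decision.

```python
import random

def expected_calibration_error(probs, labels, bins=10):
    """Binned ECE: weighted mean gap between average score and observed positive rate."""
    buckets = [[] for _ in range(bins)]
    for p, y in zip(probs, labels):
        buckets[min(int(p * bins), bins - 1)].append((p, y))
    ece = 0.0
    for bucket in buckets:
        if not bucket:
            continue
        mean_score = sum(p for p, _ in bucket) / len(bucket)
        positive_rate = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / len(probs)) * abs(mean_score - positive_rate)
    return ece

def confidence_based_accuracy(probs):
    # Label-free accuracy estimate: average confidence in the predicted class.
    return sum(max(p, 1.0 - p) for p in probs) / len(probs)

rng = random.Random(11)
true_p = [rng.random() for _ in range(20_000)]
labels = [1 if rng.random() < p else 0 for p in true_p]

calibrated = true_p
# Hypothetical distortion: same side of 0.5, but more extreme scores.
overconfident = [min(1.0, max(0.0, 0.5 + 1.8 * (p - 0.5))) for p in true_p]

realized = sum((p >= 0.5) == bool(y) for p, y in zip(true_p, labels)) / len(labels)
for name, scores in [("calibrated", calibrated), ("overconfident", overconfident)]:
    print(f"{name:13s} ECE={expected_calibration_error(scores, labels):.3f}  "
          f"estimated acc={confidence_based_accuracy(scores):.3f}  "
          f"realized acc={realized:.3f}")
```

Both score sets make identical predictions, so realized accuracy is the same. Only the overconfident scores produce an inflated estimate, and the size of that inflation tracks the calibration gap.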

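And there’s a caveat so important it’s worth stating on its own, because it’s exactly the failure mode you’d adopt this tooling to avoid: these estimators are blind to concept drift. CBPE and DLE both assume that the relationship the model learned still holds. When P(y | X) shifts, the model stays just as confident while being wrong more often, so the estimate keeps looking healthy while real performance falls.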

This is not a reason to skip performance estimation. It’s a reason to know what your estimator can’t see and back it up. The mature setup layers three signals: CBPE or DLE as the fast leading indicator, prediction-distribution monitoring as a second independent lens, and a delayed or sampled ground-truth backstop. The layering works because the failure mode CBPE is blind to (concept drift) often does perturb the output distribution, and will eventually hit the sampled labels. No single signal is complete. The discipline is layering them so the gap in one is covered by another.

The real decision is not “detect” — it’s “retrain or not”

Suppose your leading indicators fire and you believe performance is genuinely degrading. The instinct — codified in far too many “MLOps” diagrams as a literal arrow from a drift alert to a retraining job — is to retrain automatically. This instinct is wrong, and it’s wrong in three distinct, expensive ways.

One: auto-retraining ships your upstream bugs into production. At any real organization, the time to diagnose and fix a broken upstream data pipeline routinely exceeds your retraining cadence. So a drift-triggered retrain pulls the corrupted features straight into the new model and bakes the bug in — you don’t fix the problem, you laminate it. Meta found this painful enough to build a system, “Gate,” specifically to block ML retraining on corrupted data partitions; in an Instagram case study it reported a 2.1x average improvement in precision over the baseline by catching bad partitions before they reached training. The whole point of “Moving Fast With Broken Data” is that broken data moves fast too, and unguarded retraining is the vehicle.

Two: scheduled retraining trains on anomalies. Calendar-based retraining implicitly assumes models degrade along a smooth decay curve. Real systems — fraud especially — don’t decay, they get shocked. One analysis documented recall dropping from 0.9375 to 0.7500 in a single week when fraud cases jumped 137% in seven days. A Black Friday spike that roughly doubles transaction volume looks identical to drift on a static threshold; retrain on that window and you’ve taught the model that an anomaly is the new normal. The proposed governance is refreshingly concrete: fit a decay model, and if it holds (R-squared at least 0.4) use scheduled retraining; if it doesn’t, go event-driven — alert when recall drops more than 8% below a four-week rolling mean, and require two consecutive breach weeks before retraining, so a transient seasonal spike can’t trigger a panic retrain.
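The event-driven rule is simple enough to write down. The sketch below makes two assumptions the analysis doesn't spell out: the 8% drop is read as relative to the rolling mean, and the rolling mean is taken over the four weeks immediately before the week being checked. The function name and parameters are hypothetical.

```python
def should_retrain(weekly_recall, window=4, max_relative_drop=0.08, consecutive=2):
    """True once `consecutive` weeks in a row fall below the rolling baseline."""
    streak = 0
    for week in range(window, len(weekly_recall)):
        # Baseline: mean recall over the `window` weeks before this one.
        baseline = sum(weekly_recall[week - window:week]) / window
        breached = weekly_recall[week] < baseline * (1.0 - max_relative_drop)
        streak = streak + 1 if breached else 0
        if streak >= consecutive:
            return True
    return False

sustained_shock = [0.94, 0.94, 0.94, 0.94, 0.75, 0.75]
one_week_spike = [0.94, 0.94, 0.94, 0.94, 0.75, 0.94]

print(should_retrain(sustained_shock))  # two breach weeks in a row
print(should_retrain(one_week_spike))   # recovers after one week
```

One design choice worth noticing: a breached week is folded into the next week's baseline, which lowers the bar slightly for the second breach. If you would rather hold the baseline fixed while a breach is in progress, exclude breached weeks from the window.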

The medical literature has the cleanest cautionary tale here. A study of perioperative-mortality models during COVID’s first wave watched a deep-learning model’s AUROC fall from 0.942 to 0.914 — and then, crucially, found that retraining it on the transient first-wave data made it worse later, dropping to AUROC 0.747. That’s a clean, real-world proof that retraining on an anomaly is not a neutral act. It can actively damage a model that would have recovered on its own.

Three: retraining is a change to an entangled system, so it must be governed like one. Google’s “High-Interest Credit Card of Technical Debt” gave us CACE (Changing Anything Changes Everything) precisely because ML systems carry hidden feedback loops, undeclared downstream consumers, and data entanglement. A retrain isn’t a refresh button; it’s a deploy that can shift behavior across slices you aren’t even looking at. That is why the standard governance gate is champion-challenger: the new model (the challenger) runs in shadow on real production traffic with its predictions logged but not served, and it must demonstrate that it’s at least as good as the incumbent (the champion), both offline on holdout and online via shadow or gradual rollout, before an alias flip promotes it. Databricks implements exactly this with Unity Catalog “champion”/“challenger” aliases; DataRobot ships its “Challengers.” “Better on the holdout set” is not a sufficient promotion criterion, because better-on-holdout routinely fails to hold in production.
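As a sketch of the gate itself, independent of any vendor's registry, the promotion check can be a plain function over logged evaluation results. Everything here is hypothetical: the `Evaluation` record, the field names, the minimum sample count, and the single higher-is-better metric are stand-ins for whatever your platform actually records.

```python
from dataclasses import dataclass

@dataclass
class Evaluation:
    holdout_score: float   # offline metric on a fixed holdout set (higher is better)
    shadow_score: float    # same metric on labelled shadow traffic
    shadow_samples: int    # how many shadow predictions have labels so far

def promotion_decision(champion, challenger, min_shadow_samples=1000, margin=0.0):
    """Return (promote, reason). The challenger must clear every gate."""
    if challenger.shadow_samples < min_shadow_samples:
        return False, "not enough labelled shadow traffic yet"
    if challenger.holdout_score < champion.holdout_score - margin:
        return False, "worse than champion on holdout"
    if challenger.shadow_score < champion.shadow_score - margin:
        return False, "worse than champion on shadow traffic"
    return True, "at least as good offline and in shadow"

champion = Evaluation(holdout_score=0.91, shadow_score=0.88, shadow_samples=5000)
holdout_only_winner = Evaluation(holdout_score=0.94, shadow_score=0.85, shadow_samples=5000)
real_winner = Evaluation(holdout_score=0.93, shadow_score=0.90, shadow_samples=5000)

print(promotion_decision(champion, holdout_only_winner))
print(promotion_decision(champion, real_winner))
```

The `holdout_only_winner` case is the one the gate exists for: it beats the champion offline and still gets blocked, because it loses on the traffic that actually matters.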

So the honest decision flow when a leading indicator fires looks nothing like an arrow to a cron job. It looks like a diagnostic: Is this real or seasonal? Is the upstream data clean, or am I about to retrain on garbage? Is this concept drift that retraining can actually fix, or a broken feature pipeline that it can’t? Only after those questions does a challenger get built — and even then it has to earn its promotion in shadow.
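Written as code, that diagnostic is a short chain of questions asked in order, with retraining as the last branch rather than the first. This is a hypothetical sketch: the three boolean inputs stand in for investigations a human or a separate check would have to carry out, and the action names are made up.

```python
from enum import Enum

class Action(Enum):
    WAIT_AND_WATCH = "wait and watch"
    FIX_UPSTREAM = "fix upstream data"
    KEEP_DIAGNOSING = "keep diagnosing"
    BUILD_CHALLENGER = "build challenger for shadow evaluation"

def triage(is_seasonal_or_transient, upstream_data_clean, looks_like_concept_drift):
    """Map the three diagnostic answers to a next step, in the order they are asked."""
    if is_seasonal_or_transient:
        return Action.WAIT_AND_WATCH    # don't teach the model an anomaly
    if not upstream_data_clean:
        return Action.FIX_UPSTREAM      # retraining would bake the bug in
    if not looks_like_concept_drift:
        return Action.KEEP_DIAGNOSING   # real and clean, but cause still unknown
    return Action.BUILD_CHALLENGER      # and it still has to win in shadow

print(triage(is_seasonal_or_transient=False,
             upstream_data_clean=False,
             looks_like_concept_drift=True).value)
```

Only one of the four outcomes leads toward a new model, and even that one ends at a challenger, not a deploy.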

As Shafeeq Ur Rahaman puts it, “a well-maintained system is a system where you can tell what is broken, not a system where you simply keep replacing the parts.” Auto-retraining is replacing the parts and hoping.

A note on the drift you can’t control

Everything above assumes tabular drift: a measurable distribution-shift problem on data you own, with metrics like KS and PSI that actually apply. There is a second, stranger kind of drift that this entire toolkit is blind to. When Chen, Zaharia, and Zou measured GPT-4’s prime-number identification accuracy, it collapsed from 97.6% in March 2023 to 2.4% by June 2023 — a silent, vendor-side model change you neither control nor get a changelog for. (That figure drew methodological criticism — some of the shift was format and prompt sensitivity rather than pure capability loss — and it properly belongs to a different conversation.)

The point for this post is the contrast. LLM behavior drift lives in embedding and reasoning space, where a token-frequency KS test on your features is simply blind, and where the cause is often a vendor swapping the model under you. It’s the same word — “drift” — for a fundamentally different engineering problem, and it’s covered in Model monitoring in 2026. The drift in this post is the one you can measure and own. Don’t let the shared vocabulary fool you into thinking one toolkit covers both.

What to take away

The thing that makes drift dangerous is not that it’s hard to detect — KS, PSI, and JS are easy, and the lesson will teach them to you in an afternoon. The thing that makes it dangerous is that it’s silent: no errors, green dashboards, successful deploys, and a damage signal that arrives months after the money is gone. Infrastructure health tells you nothing about whether your predictions are still correct.

Three lines, hard-won from other people’s write-downs:

Build leading indicators, because the lagging one is too late. Watch your prediction distribution first — it’s free, immediate, and catches more than you’d expect. Add label-free performance estimation (CBPE, DLE) for an actual number, but pair it with a delayed ground-truth backstop, because the estimator is blind to concept drift, which is the exact failure you fear most.
Treat data-drift alerts as questions, not incidents. Most are noise. Page on estimated business or performance impact, route raw distributional motion to a dashboard for human triage, and you’ll have an alert that’s actually worth waking someone for.
Make retraining a governed change, not a reflex. Never auto-retrain on a drift alert — you’ll ship an upstream bug or train on a seasonal anomaly. Diagnose first, build a challenger second, and let it prove itself in shadow before any promote. CACE: changing anything changes everything.

The broader lifecycle this sits inside — monitor, diagnose, decide, retrain, redeploy — is the subject of MLOps is a loop, not a pipeline, and the question of how you actually measure model quality well enough to trust these decisions lives in Evals that actually work. Drift is where the loop earns its keep. Zillow had the model. What it didn’t have, in time, was the signal that the model had quietly stopped being right.