MLOps · Medium · Asked at Meta, Google, Stripe, DoorDash

Why does a model that performed well in offline evaluation degrade in production?

For: MLOps Engineer, ML Engineer, Data Scientist, AI / LLM Engineer

The short answer

Production degradation stems from distributional shift between training and serving data, upstream pipeline changes, feedback loops, and the static nature of a trained model against a changing world. Offline evaluation on a held-out slice of historical data cannot simulate these dynamics.

How to think about it

A model is a snapshot of the world at training time. Production is the world as it is right now. Every gap between those two is a potential failure mode.

Root causes

Distributional shift is the broadest category. Covariate shift (P(X) changes), concept drift (P(Y|X) changes), and label shift (P(Y) changes) all break different parts of the model’s assumptions without necessarily raising any immediate alarm.

Train/serve skew is an engineering failure: features are computed differently at training time and at inference time. For example, if missing values of a feature are imputed with the training-set mean during training but with a mean recomputed from serving data at inference, the model’s effective input distribution shifts quietly.
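To make the imputation example concrete, here is a minimal sketch, assuming missing values are represented as None. The helper names fit_fill_value and impute are hypothetical, not taken from any library. The fill value is computed once from training data and reused at serve time, and the last line shows the skewed alternative for contrast.

```python
from statistics import fmean

def fit_fill_value(train_values):
    """Compute the fill value once, from training data only."""
    return fmean(v for v in train_values if v is not None)

def impute(values, fill_value):
    """Replace missing entries with a previously stored fill value."""
    return [fill_value if v is None else v for v in values]

train = [10.0, 12.0, None, 14.0]
serve = [30.0, None, 34.0]

# Consistent path: the fill value is stored alongside the model artifact.
fill = fit_fill_value(train)
consistent = impute(serve, fill)

# Skewed path: the fill value is recomputed from serving data,
# so the same missing entry receives a different value.
skewed = impute(serve, fit_fill_value(serve))
```

In this toy data the missing serve-time entry becomes 12.0 on the consistent path and 32.0 on the skewed path, even though no model code changed between the two.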

Upstream data changes include schema changes in source tables, a supplier silently changing a data field’s semantics, a new ETL job that truncates strings, or a sensor calibration change. The model’s input space changes without anyone touching the model code.
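One way to catch such changes before they reach the model is an explicit contract check on incoming records. The sketch below is illustrative only: the field names, bounds, and the validate_row helper are assumptions made for this example, and a real pipeline would more likely use a dedicated validation library.

```python
# Hypothetical feature contract; field names and bounds are illustrative.
EXPECTED = {
    "age": {"type": float, "min": 0.0, "max": 120.0},
    "country_code": {"type": str, "length": 2},
}

def validate_row(row, expected=EXPECTED):
    """Return a list of human-readable contract violations for one record."""
    problems = []
    for name, rule in expected.items():
        if name not in row:
            problems.append(f"{name}: missing")
            continue
        value = row[name]
        if not isinstance(value, rule["type"]):
            problems.append(
                f"{name}: expected {rule['type'].__name__}, "
                f"got {type(value).__name__}"
            )
            continue
        if "min" in rule and value < rule["min"]:
            problems.append(f"{name}: {value} below {rule['min']}")
        if "max" in rule and value > rule["max"]:
            problems.append(f"{name}: {value} above {rule['max']}")
        if "length" in rule and len(value) != rule["length"]:
            problems.append(
                f"{name}: length {len(value)}, expected {rule['length']}"
            )
    return problems

ok = validate_row({"age": 42.0, "country_code": "DE"})
# A semantic change, e.g. age suddenly reported in months.
unit_change = validate_row({"age": 504.0, "country_code": "DE"})
# A string truncated by an upstream job.
truncated = validate_row({"age": 42.0, "country_code": "D"})
# A renamed column after a schema change.
renamed = validate_row({"age_years": 42.0, "country_code": "DE"})
```

Range and length checks like these only catch changes that leave the declared bounds; subtler semantic changes still need distribution monitoring.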

Feedback loops happen when model predictions influence future training data. A recommendation model trained on clicks that it originally drove will amplify popular items and under-explore the rest; its training distribution drifts toward its own biases.
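A deliberately extreme toy simulation can illustrate the mechanism. It assumes a purely greedy "model" that always recommends the item with the most logged clicks, and items that all have the same true click probability. The function name and parameters are hypothetical, and real recommenders are far less degenerate than this.

```python
import random

def simulate_greedy_recommender(n_items=5, rounds=1000, click_prob=0.3, seed=0):
    """Simulate a recommender that is retrained on clicks it caused itself."""
    rng = random.Random(seed)
    clicks = [0] * n_items
    impressions = [0] * n_items
    for _ in range(rounds):
        # "Model": show the item with the most logged clicks so far
        # (ties resolve to the lowest index).
        shown = max(range(n_items), key=lambda i: clicks[i])
        impressions[shown] += 1
        # Every item is equally appealing, but only the shown one can be clicked.
        if rng.random() < click_prob:
            clicks[shown] += 1
    return impressions, clicks

impressions, clicks = simulate_greedy_recommender()
```

Although all items are equally appealing by construction, the click log ends up containing evidence for a single item, so any model retrained on that log inherits the bias. Adding some exploration to the serving policy is one common mitigation.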

Software rot covers dependency version bumps, numerical library changes, or infrastructure migrations that subtly alter preprocessing or numeric precision.

Concept drift from the real world is unavoidable: user behaviour evolves, economic conditions shift, competitors launch. The ground truth P(Y|X) is not stationary.

Detecting it in practice

Monitor input feature distributions (PSI, KS test) continuously.
Monitor prediction score distributions: a histogram of output probabilities that flattens or spikes is a useful early warning, available before labels arrive.
Track business proxy metrics (CTR, conversion, churn rate) tightly coupled to model outputs.
Compare model versions on live traffic via shadow deployment or A/B tests.
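The two drift statistics named above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production monitor: the function names, the choice of ten quantile bins, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def psi(reference, current, n_bins=10, eps=1e-6):
    """Population Stability Index of `current` against `reference`."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    # Interior bin edges taken from reference quantiles; the outer bins are open.
    inner_edges = np.quantile(reference, np.linspace(0.0, 1.0, n_bins + 1))[1:-1]
    ref_counts = np.bincount(
        np.searchsorted(inner_edges, reference, side="right"), minlength=n_bins
    )
    cur_counts = np.bincount(
        np.searchsorted(inner_edges, current, side="right"), minlength=n_bins
    )
    # Clip to avoid log(0) when a bin is empty.
    ref_frac = np.clip(ref_counts / reference.size, eps, None)
    cur_frac = np.clip(cur_counts / current.size, eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    reference = np.sort(np.asarray(reference, dtype=float))
    current = np.sort(np.asarray(current, dtype=float))
    grid = np.concatenate([reference, current])
    cdf_ref = np.searchsorted(reference, grid, side="right") / reference.size
    cdf_cur = np.searchsorted(current, grid, side="right") / current.size
    return float(np.max(np.abs(cdf_ref - cdf_cur)))

# Synthetic data: a training-time reference window and two serving windows.
rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, 5000)
same = rng.normal(0.0, 1.0, 5000)      # no drift
shifted = rng.normal(0.5, 1.0, 5000)   # mean shifted by half a standard deviation

psi_same, psi_shifted = psi(reference, same), psi(reference, shifted)
ks_same, ks_shifted = ks_statistic(reference, same), ks_statistic(reference, shifted)
```

The same functions can be applied to a model's output scores as well as to input features. Alert thresholds depend on the application and sample size; where SciPy is available, scipy.stats.ks_2samp computes the same statistic together with a p-value.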
