datarekha
MLOps | Hard | Asked at: Google, Meta, Stripe, Booking.com

What is train/serve skew and how do you prevent it?

The short answer

Train/serve skew occurs when the feature values a model sees at training time differ from those it sees at inference time, even for the same raw input. Typical causes are divergent preprocessing code paths, different data sources, and temporal leakage. The skew silently degrades performance without raising obvious errors.

How to think about it

Train/serve skew is one of the most common and hardest-to-debug sources of production degradation because the model code is unchanged — the damage is in the data pipeline.

How it happens

Dual code paths are the most common cause. A data scientist writes a Pandas pipeline to generate features for training. A backend engineer re-implements the “same” logic in Java or a SQL query for real-time serving. Subtle differences — different rounding, different null handling, different bucketing edges — shift every feature vector without raising an exception.
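As a minimal illustration of how a rounding difference can creep in, the sketch below compares Python's built-in round(), which rounds halves to the nearest even integer, with an emulation of Java's Math.round, which rounds halves toward positive infinity. The function names and the sample amounts are hypothetical; the point is only that both paths run without error while disagreeing on some inputs.

```python
import math

def bucket_training(amount: float) -> int:
    # Training path (hypothetical): Python's round() uses round-half-to-even.
    return round(amount)

def bucket_serving(amount: float) -> int:
    # Serving path (hypothetical): emulates Java's Math.round, i.e. floor(x + 0.5).
    return math.floor(amount + 0.5)

amounts = [0.5, 1.5, 2.5, 3.5, 4.2]

# Inputs for which the two "equivalent" implementations disagree.
mismatches = [a for a in amounts if bucket_training(a) != bucket_serving(a)]
print(mismatches)
```

Only the exact-half values that round to an odd integer under half-up differ here, which is why this kind of skew tends to survive casual spot checks.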

Different data sources cause skew when training reads from a historical warehouse snapshot while serving reads from a live OLTP database. Schema changes, timezone handling, or reprocessed historical data then create silent divergence between the two.

Temporal leakage in training is a related problem in which training sees more than serving ever can: the training set accidentally includes information not available at inference time (a column derived from future events, a join that pulls in post-event data). The model learns to rely on signals it cannot possibly have at serve time, so live performance collapses.

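Feature store staleness is a subtler variant: the feature store serves cached values for high-cardinality entities that were computed on a different schedule from the values used in training. For example, a user’s “last 7-day spend” may be recomputed once a day for the training set but refreshed every 6 hours in the serving cache, so the two paths rarely see the same value for the same user at the same moment.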

Prevention strategies

Shared feature definitions — write transformations once in a framework (Feast, Tecton, Vertex Feature Store) consumed by both the training pipeline and the serving path. One code path, no divergence.
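The idea of a single code path can be sketched in plain Python, without any feature store. This is not the API of Feast, Tecton, or any other framework; compute_features and the feature names are assumptions made for illustration. Null handling and bucket edges are decided once, and both the batch and online paths call the same function.

```python
from typing import Optional

def compute_features(raw: dict) -> dict:
    """Single source of truth for feature logic (hypothetical feature names)."""
    amount: Optional[float] = raw.get("amount")
    return {
        # Null handling is decided once, here, for both paths.
        "amount_filled": 0.0 if amount is None else float(amount),
        "amount_missing": 1 if amount is None else 0,
        # Bucket edges live in one place: width 100, capped at bucket 9.
        "amount_bucket": 0 if amount is None else min(int(amount // 100), 9),
    }

def build_training_rows(records: list[dict]) -> list[dict]:
    # Batch path: apply the shared function to historical records.
    return [compute_features(r) for r in records]

def serve_features(request: dict) -> dict:
    # Online path: apply the same function to a single live request.
    return compute_features(request)

history = [{"amount": 250.0}, {"amount": None}, {"amount": 1234.5}]
training_rows = build_training_rows(history)
live_rows = [serve_features(r) for r in history]
```

Because both paths delegate to one function, any change to rounding, defaults, or bucket edges reaches training and serving together.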

Training-serving consistency checks — log a sample of live feature vectors and compare their distributions against the training set. Run this as a CI gate before each model promotion.
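One possible way to compare a logged serving sample against the training set is the Population Stability Index (PSI), computed per feature. The sketch below is an assumption about how such a check could look, not a prescribed method: the bin count, the synthetic samples, and the 0.2 threshold are illustrative choices that would need tuning per feature.

```python
import math

def psi(expected: list[float], actual: list[float], bins: int = 10) -> float:
    """Population Stability Index between two samples of one feature."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant feature

    def histogram(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[max(0, min(idx, bins - 1))] += 1  # clamp out-of-range values
        # A small epsilon avoids log(0) for empty bins.
        return [(c + 1e-6) / (len(values) + 1e-6 * bins) for c in counts]

    e, a = histogram(expected), histogram(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

PSI_THRESHOLD = 0.2  # assumed gate value; tune per feature

def passes_gate(train_sample: list[float], live_sample: list[float]) -> bool:
    return psi(train_sample, live_sample) < PSI_THRESHOLD

# Synthetic samples for illustration only.
training = [float(i % 100) for i in range(1000)]
serving_ok = [float((i * 7) % 100) for i in range(1000)]   # same distribution
serving_shifted = [v + 40.0 for v in training]              # shifted by 40
```

In a promotion pipeline, passes_gate would run for every feature, and a single failure would block the release until the divergence is explained.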

Point-in-time correct joins — when building training data from event logs, join labels only to features that were available at the event timestamp. Most feature stores support this natively.
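A point-in-time join can be sketched with the standard library alone. The feature log, user IDs, and integer timestamps below are hypothetical. For each label event, the lookup returns the most recent feature value whose timestamp is at or before the event, and never a later one. Whether a value stamped exactly at the event time counts as available is a design choice; this sketch assumes it does.

```python
from bisect import bisect_right

# Hypothetical feature log: (timestamp, value) per user, sorted by timestamp.
feature_log = {
    "u1": [(1, 10.0), (5, 25.0), (9, 40.0)],
}

def feature_as_of(user_id: str, event_ts: int):
    """Return the latest feature value with timestamp <= event_ts, else None."""
    history = feature_log.get(user_id, [])
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, event_ts)
    return history[idx - 1][1] if idx > 0 else None

# Label events: (user_id, event_ts, label).
labels = [("u1", 0, 0), ("u1", 5, 1), ("u1", 7, 0)]

# Each training row only sees feature values known at its own event time.
training_rows = [(u, ts, feature_as_of(u, ts), y) for u, ts, y in labels]
```

The event at timestamp 7 receives the value written at timestamp 5, not the later one at timestamp 9, and the event at timestamp 0 receives no value at all. On DataFrames, pandas.merge_asof with its default backward direction performs the same kind of as-of lookup.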

Schema and type pinning — fix feature schemas in a registry; fail the serving pipeline loudly on any schema mismatch rather than silently coercing types.
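Failing loudly on a schema mismatch can be as simple as the sketch below. PINNED_SCHEMA stands in for whatever a real registry would return, and the feature names and the SchemaMismatch exception are hypothetical. The check rejects missing or unexpected fields and refuses to coerce types.

```python
# Hypothetical pinned schema: feature name -> exact Python type.
PINNED_SCHEMA = {"user_age": int, "last_7d_spend": float, "country": str}

class SchemaMismatch(Exception):
    """Raised when a feature vector does not match the pinned schema."""

def validate(features: dict) -> dict:
    missing = PINNED_SCHEMA.keys() - features.keys()
    extra = features.keys() - PINNED_SCHEMA.keys()
    if missing or extra:
        raise SchemaMismatch(f"missing={sorted(missing)} extra={sorted(extra)}")
    for name, expected in PINNED_SCHEMA.items():
        # Exact type comparison, so "31" is not coerced and bool does not pass as int.
        if type(features[name]) is not expected:
            raise SchemaMismatch(
                f"{name}: expected {expected.__name__}, "
                f"got {type(features[name]).__name__}"
            )
    return features

ok = {"user_age": 31, "last_7d_spend": 120.5, "country": "NL"}
validated = validate(ok)
```

Calling validate at the entry point of the serving path turns a silent type drift, such as an age arriving as the string "31", into an immediate and attributable error.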
