Training-Serving Skew
Why a model that scores beautifully offline dies the day it goes live. The two mechanisms — online-offline skew and point-in-time leakage — why your holdout set is blind to both, and the three fixes that actually work.
What you'll learn
- Online-offline skew — when one feature is computed by two diverging code paths
- Point-in-time leakage — when a naive key-only join pulls a value from after the label
- Why a holdout set and cross-validation are structurally blind to both
- The AS-OF / point-in-time-correct join that excludes the future by construction
- The three fixes — one shared definition, AS-OF joins, and online/offline parity tests
Before you start
The gap between a model’s offline score and its live performance is training-serving skew. Google’s Rules of Machine Learning defines it simply as “a difference between performance during training and performance during serving.” It is the single most common reason an offline star becomes an online dud, and the reason the whole feature-store category exists. This lesson is narrower than the broader data-leakage taxonomy: we focus on the two mechanisms that are specifically about the train path versus the serve path.
Skew is not drift
First, a clean separation, because these get conflated constantly.
Drift is the world changing over time: the model was right at launch, then the input distribution moved. Skew is a mismatch that exists at launch — the model is wrong on the very first live request, because the feature it sees was never the feature it learned. Different cause, different fix. Skew shows up in two flavours.
Flavour 1 — online-offline skew
The same feature is computed by two different code paths: the training pipeline (a Spark job over warehouse history) and the serving service (a Python function on the live request). They produce identical numbers on the day they ship, then diverge the moment either side is edited.
The bug requires two conditions to both hold, and this is the whole game:
- The feature logic is not shared — it is implemented twice.
- The two implementations have diverged — somebody changed one and not the other.
If the logic lives in exactly one place, the two copies cannot disagree, because there is only one. Hopsworks calls this the DRY framing of skew. Here is what divergence looks like in the wild — none of it exotic:
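One such case, as a minimal hypothetical sketch: the feature name days_since_signup and both functions below are invented for illustration and do not come from any real pipeline. Each implementation looks reasonable in isolation, the two agree on most inputs, and they still disagree when an event lands just after midnight.

```python
from datetime import datetime, timezone


def days_since_signup_training(signup: datetime, event: datetime) -> int:
    """Offline path (hypothetical): difference between calendar dates."""
    return (event.date() - signup.date()).days


def days_since_signup_serving(signup: datetime, event: datetime) -> int:
    """Online path (hypothetical): whole 24-hour periods elapsed."""
    return int((event - signup).total_seconds() // 86400)


def utc(*args: int) -> datetime:
    """Small helper so every timestamp in the example is timezone-aware UTC."""
    return datetime(*args, tzinfo=timezone.utc)


# Mid-day event: both paths return the same value.
agree = (utc(2024, 1, 1, 0, 0), utc(2024, 1, 3, 12, 0))
# Signup late at night, event just after midnight: the paths split.
disagree = (utc(2024, 1, 1, 23, 0), utc(2024, 1, 2, 1, 0))

for signup, event in (agree, disagree):
    print(
        days_since_signup_training(signup, event),
        days_since_signup_serving(signup, event),
    )
```

Neither function is wrong on its own terms. The skew comes purely from having two definitions of one feature, which is why the model can see a value at serving time that it never saw for the same raw input in training.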
Flavour 2 — point-in-time leakage
This is the temporal cousin, and it is nastier because it lives at the feature-join step. Picture assembling a training set. Each row is a labeled event — a transaction at 14:32 that turned out to be fraud — and you want the user’s features as they were at 14:32.
The naive instinct is to join the feature table on user_id. But feature tables get updated on their own cadence, so a join on the entity key alone pulls the latest value, which may have been computed at 18:00, after the fraud was flagged and the account frozen. Now account_is_frozen = true sits in your training row, and it is a perfect predictor of fraud precisely because the fraud is what set it. You have told the model the answer.
The fix is a point-in-time-correct join, also called an AS-OF join: for each label, retrieve the most recent feature value whose own timestamp is less than or equal to the label timestamp, matched on the entity key plus the timestamp key. The future is excluded by construction.
The key-only join ignores time entirely and grabs whatever is latest, so the instant a feature update lands after the label, it silently reaches into the future. The AS-OF join filters on feature_ts <= label_ts first, so the post-event snapshot is invisible to it: not detected and discarded, but never a candidate in the first place. A close relative is rolling-window leakage: computing an aggregate over a window that ends at now instead of at the example timestamp. The same rule prevents both: every value and every window must end at or before the label.
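The difference between the two joins can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not a production implementation: the tuples, the user id u1, and the minutes-since-midnight timestamps are all made up to mirror the 14:32 transaction and 18:00 snapshot described above. In pandas, merge_asof with the entity key passed as by= performs a comparable backward-looking match.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical feature snapshots: (user_id, feature_ts, account_is_frozen).
# Timestamps are minutes since midnight to keep the example small.
FEATURES = [
    ("u1", 9 * 60, False),   # 09:00 snapshot, before the transaction
    ("u1", 18 * 60, True),   # 18:00 snapshot, after the fraud was flagged
]

# Hypothetical labeled events: (user_id, label_ts, is_fraud).
LABELS = [("u1", 14 * 60 + 32, True)]  # transaction at 14:32


def key_only_join(labels, features):
    """Naive join: take the latest feature value per user, ignoring time."""
    latest = {}
    for user_id, _, value in sorted(features, key=lambda f: f[1]):
        latest[user_id] = value  # later snapshots overwrite earlier ones
    return [(u, ts, latest.get(u), y) for u, ts, y in labels]


def as_of_join(labels, features):
    """Point-in-time join: latest feature with feature_ts <= label_ts."""
    by_user = defaultdict(list)
    for user_id, ts, value in sorted(features, key=lambda f: f[1]):
        by_user[user_id].append((ts, value))
    rows = []
    for user_id, label_ts, y in labels:
        history = by_user[user_id]
        timestamps = [ts for ts, _ in history]
        # Number of snapshots taken at or before the label timestamp.
        idx = bisect_right(timestamps, label_ts)
        value = history[idx - 1][1] if idx > 0 else None
        rows.append((user_id, label_ts, value, y))
    return rows


print(key_only_join(LABELS, FEATURES))  # picks up the 18:00 snapshot
print(as_of_join(LABELS, FEATURES))     # picks up the 09:00 snapshot
```

The AS-OF version never inspects the 18:00 row for this label, because bisect_right only counts snapshots at or before 14:32. When no snapshot exists yet, the sketch returns None instead of borrowing a later value.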
Why your holdout set is blind to both
Here is the part that makes skew so dangerous, and it is the heart of the lesson.
A holdout set, k-fold cross-validation, and a careful train/test split are all computed from the offline pipeline. So a bug in that pipeline contaminates the diagnostic exactly as much as the model. The leaked feature is just as predictive in your validation fold as in your training fold — it is the same leak. The offline number is not just good, it is gorgeous, and every offline defense waves it through.
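A toy simulation makes the blindness concrete. Everything in it is an assumption made for illustration: the 20% fraud rate, the row counts, and the one-feature majority-vote rule standing in for a model. The offline rows carry the leaked flag in both the training and holdout folds. The live rows do not, because at serving time the flag has not been set yet.

```python
import random


def make_rows(n, seed, leaked):
    """Build (account_is_frozen, is_fraud) rows.

    leaked=True mimics the offline key-only join, where the flag was set
    after the fraud. leaked=False mimics serving, where it is not set yet.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        is_fraud = rng.random() < 0.2
        account_is_frozen = is_fraud if leaked else False
        rows.append((account_is_frozen, is_fraud))
    return rows


def fit_rule(rows):
    """Toy model: for each feature value, predict the majority label."""
    counts = {}
    for x, y in rows:
        positives, total = counts.get(x, (0, 0))
        counts[x] = (positives + int(y), total + 1)
    return {x: 2 * positives > total for x, (positives, total) in counts.items()}


def accuracy(rule, rows):
    """Share of rows where the rule's prediction matches the label."""
    return sum(rule.get(x, False) == y for x, y in rows) / len(rows)


def recall(rule, rows):
    """Share of true fraud rows that the rule flags as fraud."""
    frauds = [x for x, y in rows if y]
    return sum(rule.get(x, False) for x in frauds) / len(frauds)


offline = make_rows(1000, seed=0, leaked=True)
train, holdout = offline[:800], offline[800:]
live = make_rows(1000, seed=1, leaked=False)

rule = fit_rule(train)
holdout_accuracy = accuracy(rule, holdout)  # the leak is in both folds
live_recall = recall(rule, live)            # the leak is absent when serving
print(holdout_accuracy, live_recall)
```

The holdout fold scores perfectly because it was built by the same leaky pipeline as the training fold. On the live rows the rule catches no fraud at all, and nothing computed offline could have warned you.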
The most famous casualty is the Epic Sepsis Model, deployed across hundreds of US hospitals with a developer-reported AUC of roughly 0.76 to 0.83. A University of Michigan team externally validated it (Wong et al., JAMA Internal Medicine, August 1, 2021) over 38,455 hospitalizations and measured an actual AUC of 0.63 (95% CI 0.62 to 0.64); at its operating point, sensitivity was 33% and positive predictive value 12%. Much of its apparent signal has been attributed to operational artifacts: it used antibiotic-order information as an input, so the score tended to rise after a clinician already suspected sepsis, a textbook label leak. The model learned how data gets recorded, not what it measures.
The three fixes
1. One feature definition, both paths. The structural fix — the only one that makes skew impossible rather than merely detectable. Google’s Rule #32 is blunt: re-use code between training and serving, because it “eliminates a source of training-serving skew.” This is what a feature store is for: define a transformation once, consume it from both the offline training job and the online API. One honest caveat — a feature store kills skew only when the transformation definition is shared. If it ingests precomputed features but your service still hand-writes an on-demand transform, the second implementation walks the bug right back in.
2. Point-in-time-correct (AS-OF) joins. For the temporal failure mode, apply the AS-OF join everywhere you assemble a training set from event-stamped tables. One trap inside the fix: do not backfill by re-running today’s feature code over historical raw data. That bakes today’s logic and bugfixes into the past and breaks point-in-time correctness, unless your store does genuine time travel over historical values.
3. Online/offline parity tests, plus log-and-wait. When you genuinely cannot share code — different runtimes, a latency budget that forbids Spark on the hot path — Google’s Rule #29 gives the fallback: log the features you actually served, and train on those exact vectors. They cannot diverge because they are literally the same numbers. The active version is a parity test: push one raw input through both pipelines and assert the output feature vectors are identical, run in CI on every change. And measure the gap that matters — Rule #37 says track training-to-holdout, holdout-to-next-day, and next-day-to-live separately; only the last one reveals true serving skew.
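Fixes 1 and 3 can be sketched together. This is a minimal illustration, not a real feature-store API: compute_features, the two feature names, and the raw input fields are all hypothetical. The first two paths share one definition, so they cannot disagree. The third is a hand-written copy with one changed default, and the parity check is what catches it.

```python
import math


def compute_features(raw: dict) -> dict:
    """Single hypothetical feature definition, imported by both paths."""
    amount = raw.get("amount")
    return {
        "amount_log": math.log1p(amount) if amount is not None else 0.0,
        "is_night": int(raw["hour"] < 6 or raw["hour"] >= 22),
    }


def offline_features(raw: dict) -> dict:
    """Training path: calls the shared definition."""
    return compute_features(raw)


def online_features(raw: dict) -> dict:
    """Serving path: calls the same shared definition."""
    return compute_features(raw)


def handwritten_online_features(raw: dict) -> dict:
    """A second implementation, as a service might hand-write it.

    It differs in one detail: a missing amount becomes -1.0, not 0.0.
    """
    amount = raw.get("amount")
    return {
        "amount_log": math.log1p(amount) if amount is not None else -1.0,
        "is_night": int(raw["hour"] < 6 or raw["hour"] >= 22),
    }


def parity_mismatches(offline_fn, online_fn, raw_inputs):
    """Push each raw input through both paths and collect disagreements."""
    mismatches = []
    for raw in raw_inputs:
        offline, online = offline_fn(raw), online_fn(raw)
        if offline != online:
            mismatches.append((raw, offline, online))
    return mismatches


# Include edge cases (missing values, boundary hours), where copies tend to split.
RAW_INPUTS = [
    {"amount": 120.0, "hour": 14},
    {"amount": None, "hour": 23},
    {"amount": 0.0, "hour": 6},
]

# Shared definition: parity holds by construction.
assert parity_mismatches(offline_features, online_features, RAW_INPUTS) == []

# Hand-written copy: the missing-amount row is reported as a mismatch.
for raw, offline, online in parity_mismatches(
    offline_features, handwritten_online_features, RAW_INPUTS
):
    print("parity failure on", raw, offline, online)
```

In CI, a non-empty mismatch list would fail the build. Exact equality works here because both paths run the same arithmetic in the same runtime. Across different runtimes a small numeric tolerance may be needed, which is a design choice this sketch leaves open.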
Going deeper
This lesson is the interactive, pedagogical companion to the narrative blog post on training-serving skew, which carries the full Uber Michelangelo story, the vendor landscape, and the honest case for not buying a feature store. From here: feature stores are the platform built to enforce fix #1; the broader data-leakage taxonomy covers the random-split and target-leak cousins this lesson set aside; and drift is what comes for your model after you have killed skew at launch.
Practice this in an interview
Train/serve skew occurs when the feature values a model sees at training time differ from those it sees at inference time, even for the same raw input. It is caused by divergent preprocessing code paths, different data sources, or temporal leakage, and it silently degrades performance without raising obvious errors.
The most common cause is training-serving skew: the distribution of features at serving time differs from the training data. The fix requires instrumenting the pipeline to log serving inputs, compare their distribution to training data, and identify whether the gap is due to data drift, feature engineering bugs, label leakage, or infrastructure inconsistencies.
Production degradation stems from distributional shift between training and serving data, upstream pipeline changes, feedback loops, and the static nature of a trained model against a changing world. Offline evaluation on a held-out slice of historical data cannot simulate these dynamics.
Data leakage happens when information that would not be available at prediction time influences model training, producing overly optimistic evaluation metrics that collapse in production. Common sources include fitting preprocessors on the full dataset, including target-derived features, and using future data in time-series pipelines.