datarekha
Infrastructure June 3, 2026

Training-serving skew: the bug feature stores exist to kill

The model scores 0.95 offline and dies in production. Almost always the cause is one bug: training-serving skew, where the feature the model learned offline is computed differently online — or worse, was joined from the future. Here's the bug, why your holdout set can't catch it, and the three fixes that actually work.

12 min read · by datarekha · feature-stores · mlops · data-leakage · point-in-time

There is a particular flavor of heartbreak in machine learning. The model hits 0.95 AUC on the holdout set. Cross-validation confirms it. The notebook is clean, the metrics are gorgeous, the deck gets made. You ship. And within a week the dashboard shows the model performing barely above chance — sometimes worse, because now it’s confidently wrong.

Nine times out of ten this is not a modeling problem. It is a data problem with a specific name: training-serving skew. The model learned one feature distribution offline and is being fed a different one online. The single most important thing to understand about this bug is that your standard offline diagnostics cannot see it. The holdout set is computed from the same offline pipeline as the training set, so whatever is wrong is wrong in both. Your validation score is measuring the bug agreeing with itself.

The companion to this post — Feature stores in 2026 — is a survey of who builds feature stores and how Tecton, Feast, and Hopsworks compare. This one is narrower and angrier. It is about the one failure mode that actually justifies reaching for a feature store at all, why it is so good at evading detection, and the three patterns that kill it. And, because not every team needs the platform, it ends with the honest case for not buying one. It is also the bug that ships at launch, before any drift has occurred — the day-one cousin of the slow rot covered in MLOps is a loop.

What skew actually is

Google’s Rules of Machine Learning, Martin Zinkevich’s guide distilled from Google’s internal practice and still the best single text on this, defines training-serving skew plainly: “a difference between performance during training and performance during serving.” It names three causes: a discrepancy in how you handle data in the training versus serving pipelines, a change in the data between training and serving time, and a feedback loop between the model and the system it feeds.

The first cause is the one that bites everyone, and it is almost always mundane. Hopsworks formalizes it usefully as online-offline feature skew: a difference between the implementation of a transformation in the online inference path and the corresponding transformation in the training or feature pipeline. Two conditions must both hold for the bug to exist — the feature logic is not DRY across the offline and online paths (it’s implemented twice), and the two implementations have drifted apart. Both are required. That is the whole game, and it tells you the fix before we even get there: if the logic exists in exactly one place, the two copies cannot disagree, because there is only one.

Here is the catalog of how the two copies drift in real systems. None of these are exotic.

SAME FEATURE, TWO CODE PATHS, DIFFERENT VALUES

Training path (Spark job over warehouse history) → Serving path (Python service on the live request)

  • missing value: imputed as NULL by the warehouse → sent as 0 by the API client
  • fillna(median): runs in the training notebook → absent online; raw value passes through
  • purchases_last_30d: window coded as 30 days offline → coded as 15 days in the service
  • num_transfers_last_day: trailing 24h in the spec → built as calendar-current-day (00:00→now)
  • category embedding: static snapshot at train time → fetched from a live API at serve time

Every row is a documented production skew pattern. The NULL-vs-0 mismatch and the 30-day window built as 15 come straight from Nubank’s real-time ML guide; the rest are standard cases from Google’s Rules of ML and practitioner write-ups. None of them are visible to a holdout set.
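
To make the first two rows concrete, here is a minimal sketch of the missing-value mismatch. Everything in it is hypothetical: the `income` feature, the training values, and both function names are invented for illustration, and the real paths would be a Spark job and a live service rather than two Python functions.

```python
from statistics import median
from typing import Optional

# Hypothetical training-time statistic: median of the non-missing history.
TRAIN_INCOMES = [30_000.0, 42_000.0, 55_000.0, 61_000.0]
TRAIN_MEDIAN = median(TRAIN_INCOMES)  # 48_500.0


def offline_income(raw: Optional[float]) -> float:
    """Training path: the warehouse stores missing as NULL (None),
    and the notebook imputes it with the training median."""
    return TRAIN_MEDIAN if raw is None else raw


def online_income(payload: dict) -> float:
    """Serving path: the API client sends 0 for missing and no
    imputation runs, so the raw value passes straight through."""
    return float(payload.get("income", 0))


# The same user with an unknown income, as seen by each path.
offline_value = offline_income(None)
online_value = online_income({"user_id": 7})
print(offline_value, online_value)  # the two paths disagree
```

Neither function is wrong on its own terms, which is why code review of either side in isolation tends to miss this.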

The deepest version of this is structural: the training pipeline computes the feature in Apache Spark, and the low-latency serving service recomputes it in NumPy or hand-written Python because Spark is too slow for a per-request hot path. Two languages, two implementations of “the same” aggregation, maintained by two people. They are identical on the day they ship and they begin drifting the moment either side is edited. This is why Nubank frames skew as fundamentally an organizational problem: “real-time models are usually trained and deployed by different people — data scientists and machine learning engineers, respectively.” The feature spec is what gets lost in the handoff.

The temporal cousin: point-in-time leakage

There is a second, nastier failure mode that lives at the feature-join step, and it is worth separating cleanly from the broader “data leakage” taxonomy — the random-split contamination and post-event target features covered in the data leakage lesson. This is not train/test contamination from a careless split. This is leakage from time — joining a feature value that did not exist yet when the label event happened.

Picture a training set. Each row is a labeled event: a transaction at 14:32 that turned out to be fraud. You want to attach the user’s features as they were at 14:32. The naive instinct is to join the feature table on user_id. But feature tables, as Hopsworks puts it, “are typically updated at different cadences by different data pipelines”, so a join on the entity key alone pulls the latest value of that feature, which may have been computed at 18:00, after the fraud was already flagged and the account frozen. You have just told the model the answer. The feature account_is_frozen is a perfect predictor of fraud in your training set, but only because the fraud is what caused it to be set.

The correct join is a point-in-time-correct join, also called an AS-OF join: for each label, start from the label’s timestamp and retrieve the most recent feature value whose own event timestamp is less than or equal to the label timestamp, matched on the entity key plus the timestamp key. The future is structurally excluded.

[Figure: joining a feature to a label event. Timeline: feature v1 at 13:00, label event at 14:32 (fraud), feature v2 at 18:00 (post-freeze). The AS-OF join takes feature ts ≤ label ts; the naive key join grabs the latest value and leaks the future.]
The AS-OF join (green) takes the feature as it stood at or before the label. The naive key-only join (red) grabs the most recent value — here a post-event snapshot taken after the account was frozen, which is a direct leak of the label.
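
The same timeline can be written as a small sketch using only the standard library. The user id, the date, and the function names are assumptions made up for the example; a feature store or a SQL engine would do this over whole tables rather than one key at a time.

```python
from bisect import bisect_right
from datetime import datetime
from typing import Optional

# Hypothetical feature history for one user: (event timestamp, value),
# sorted by timestamp. The second value was written after the freeze.
FEATURE_HISTORY = {
    "u1": [
        (datetime(2026, 1, 5, 13, 0), {"account_is_frozen": False}),
        (datetime(2026, 1, 5, 18, 0), {"account_is_frozen": True}),
    ]
}


def naive_join(user_id: str) -> dict:
    """Key-only join: returns the latest value, whatever its timestamp."""
    return FEATURE_HISTORY[user_id][-1][1]


def as_of_join(user_id: str, label_ts: datetime) -> Optional[dict]:
    """Point-in-time join: latest value with feature ts <= label ts."""
    history = FEATURE_HISTORY[user_id]
    timestamps = [ts for ts, _ in history]
    # bisect_right gives the number of entries with ts <= label_ts.
    idx = bisect_right(timestamps, label_ts)
    return history[idx - 1][1] if idx > 0 else None


label_ts = datetime(2026, 1, 5, 14, 32)  # the fraud label event
leaky = naive_join("u1")             # post-freeze value: leaks the label
correct = as_of_join("u1", label_ts)  # value as it stood at 13:00
```

If you work in pandas, `merge_asof` with its default backward direction performs the same lookup over sorted frames. Returning `None` when no earlier value exists is a design choice in this sketch; the alternative of falling back to the earliest later value would reintroduce the leak.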

A close relative is rolling-window leakage: computing a count or aggregate over a window that ends at now (query time) instead of at the example timestamp, so your training rows silently include activity that happened after the label. The rule of thumb that prevents both: retrieve feature values where the feature event timestamp is at or before the example timestamp, and compute windowed aggregates over a window that ends at the example timestamp, never at the current clock.
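
A minimal sketch of the window rule, with invented purchase timestamps and a hypothetical `purchases_in_window` helper. The only difference between the leaky and the correct count is where the window ends.

```python
from datetime import datetime, timedelta

# Hypothetical purchase timestamps for one user.
PURCHASES = [
    datetime(2026, 1, 1, 9, 0),
    datetime(2026, 1, 10, 9, 0),
    datetime(2026, 1, 20, 9, 0),  # happens after the example below
    datetime(2026, 1, 25, 9, 0),  # happens after the example below
]


def purchases_in_window(events, window_end, days=30):
    """Count events in the half-open window (window_end - days, window_end]."""
    start = window_end - timedelta(days=days)
    return sum(1 for ts in events if start < ts <= window_end)


example_ts = datetime(2026, 1, 15, 12, 0)  # when the labeled event happened
query_time = datetime(2026, 2, 1, 0, 0)    # when the training set was built

# Window ends at the current clock: counts two purchases from the future.
leaky_count = purchases_in_window(PURCHASES, query_time)
# Window ends at the example timestamp: only what was known at the time.
correct_count = purchases_in_window(PURCHASES, example_ts)
```

Here the leaky count is 3 and the correct count is 2. Making the window end an explicit argument, instead of reading the clock inside the function, is what makes the leaky call visible in review.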

Why is this so dangerous? Because leakage inflates the training score and the holdout score in lockstep. The leaked feature is just as predictive in your validation fold as in your training fold — it’s the same leak. So the offline number is not just good, it is spectacular, and every offline defense you have waves it through. Google’s Rule #31 warns about exactly this trap in one line: “Beware that if you join data from a table at training and serving time, the data in the table may change.”

A model that actually shipped

The most concrete public example of “great on paper, poor in practice” is the Epic Sepsis Model. Epic’s proprietary early-warning model was deployed across hundreds of US hospitals with a developer-reported AUC in the range of 0.76 to 0.83. In 2021, a University of Michigan team published an external validation in JAMA Internal Medicine (Wong et al., August 1, 2021), covering 38,455 hospitalizations: the model’s actual discriminative performance was an AUC of 0.63 (95% CI 0.62–0.64). At its operating point, sensitivity was 33% and positive predictive value was 12% — meaning it missed two-thirds of sepsis cases and, of the alerts it did fire, fewer than one in eight was a true case.

I am not claiming the entire gap is training-serving skew; healthcare deployment is its own swamp of population shift and label definitions. But it is the canonical cautionary tale of a model whose offline reputation and production reality diverged by a chasm, and documented critiques pin much of its apparent signal on artifacts of how and when data is recorded rather than on physiology. Most concretely, the model used antibiotic-order information as an input — so its score tends to rise after a clinician already suspects sepsis and has acted, a textbook label leak — and missing or late-entered vitals end up carrying signal of their own. The model had learned operational and temporal artifacts of how data gets entered, not clinical reality. That is the leakage failure mode wearing a lab coat: features that correlate with the label in the recorded data for reasons that have nothing to do with the phenomenon, and that do not transfer when you act on them.

The lesson generalizes far beyond medicine. When a feature’s predictive power comes from how and when it was recorded rather than what it measures, you get a model that is brilliant offline and useless the moment it meets a live request.

The three fixes

1. One feature definition, both paths

This is the structural fix, and it is the only one that makes skew impossible rather than merely detectable. Google’s Rule #32 is blunt: re-use code between your training and serving pipelines, because doing so “eliminates a source of training-serving skew.” Hopsworks’ DRY framing says the same thing from the other direction — skew requires two divergent implementations, so if there is exactly one implementation, there is nothing to diverge.

This is the whole reason a feature store can earn its keep. A feature store lets you define a transformation once and consume it from both the offline training job and the online serving API. Uber built the Michelangelo Palette feature store explicitly because, in their words, “training/serving skew can result from ad hoc feature engineering and is extremely hard to debug.” Their mechanism is a dual store kept in sync — offline Hive for bulk training access, online Cassandra for latest-value serving — feeding both paths from the same feature definitions. Around the time of their 2019 QCon.ai/InfoQ talk on Palette, Uber reported the platform hosting over 20,000 shared features (the current number is certainly different; treat that as a 2019-era snapshot).

One honest caveat, because the industry oversells this: a feature store does not automatically eliminate skew. It eliminates skew only when the transformation definition itself is shared. If your store ingests precomputed features but you still hand-write a request-time, on-demand transform in the serving service, you have reintroduced the second implementation and the bug walks right back in. This is precisely the architectural difference the vendor landscape turns on: Feast ingests precomputed features and builds point-in-time-correct training sets but does not own your core transformations, whereas Tecton additionally manages batch, streaming, and on-demand transformations end to end — which is what makes a single shared definition easier to enforce rather than merely offer.
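
Without any platform, the single-definition idea can be sketched as one function that both paths import. This is an illustration under assumptions, not how Feast, Tecton, or Palette implement it; the function names and the `WINDOW_DAYS` constant are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Sequence

WINDOW_DAYS = 30  # the window length lives in exactly one place


def purchases_last_30d(purchase_times: Sequence[datetime], as_of: datetime) -> int:
    """Single definition of the feature, parameterized by the as-of time."""
    start = as_of - timedelta(days=WINDOW_DAYS)
    return sum(1 for ts in purchase_times if start < ts <= as_of)


def build_training_row(purchase_times, label_ts: datetime, label: int) -> dict:
    """Offline path: as_of is the label timestamp."""
    return {
        "purchases_last_30d": purchases_last_30d(purchase_times, label_ts),
        "label": label,
    }


def build_serving_features(purchase_times, request_ts: datetime) -> dict:
    """Online path: as_of is the request timestamp."""
    return {"purchases_last_30d": purchases_last_30d(purchase_times, request_ts)}


history = [datetime(2026, 1, 1), datetime(2026, 1, 20)]
ts = datetime(2026, 1, 25)
train_row = build_training_row(history, ts, label=1)
serve_row = build_serving_features(history, ts)
```

Given the same history and the same timestamp, both paths return the same value because there is no second implementation to disagree with. The sketch only works when both runtimes can import the same code, which is exactly the condition that fails when training runs in Spark and serving runs somewhere else.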

2. Point-in-time-correct joins

For the temporal failure mode, the fix is the AS-OF join described above, applied everywhere you assemble a training set from event-stamped feature tables. This is a feature-engineering discipline first and a tool feature second. You can do it by hand in SQL, and plenty of teams do, but it is error-prone exactly because the failure is silent — a wrong join still produces a number, and the number looks great. Feature stores (Feast, Tecton, Hopsworks, Databricks) provide AS-OF joins and “time travel” so you are not reconstructing the state of the world as of each label with hand-rolled window functions.

One trap inside the fix: do not backfill a training set by re-running today’s feature code over historical raw data. That bakes today’s logic and today’s bugfixes into the past and breaks point-in-time correctness unless your store does genuine time travel over historical feature values. Recomputing history with current code is how teams accidentally launder a leak into their backfill.

3. Online/offline parity tests

When you genuinely cannot share code — different runtimes, latency budgets that forbid Spark on the hot path — Google’s Rule #29 gives the fallback: “save the set of features used at serving time, and then pipe those features to a log to use them at training time.” Log what you actually served, train on that, and the two distributions cannot diverge because they are literally the same vectors. This is the “log-and-wait” pattern.
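
A rough sketch of the pattern follows. The handler, the stand-in scoring rule, the JSON-lines format, and the in-memory log are all assumptions for the example; a real system would write to a durable log and join labels to these rows once they arrive.

```python
import io
import json


def serve(request_id: str, features: dict, log) -> float:
    """Hypothetical serving handler: score, then log exactly what was scored."""
    score = 0.1 * features["num_transfers_last_day"]  # stand-in for a real model
    log.write(json.dumps({"request_id": request_id, "features": features}) + "\n")
    return score


def load_training_features(log) -> dict:
    """Training side: read the served vectors back, keyed by request id."""
    rows = (json.loads(line) for line in log if line.strip())
    return {row["request_id"]: row["features"] for row in rows}


log = io.StringIO()  # stands in for a durable feature log
served = {"num_transfers_last_day": 3, "account_age_days": 412}
serve("req-1", served, log)

log.seek(0)
training_features = load_training_features(log)
```

The training side never recomputes anything, so there is no second code path. The cost is the wait: a new feature has no history until it has been served and logged for long enough to train on.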

The active version is a parity test: push the same raw input through both the offline and online feature pipelines and assert the output feature vectors are byte-identical. Run it in CI on every change to either path. A December 2025 practitioner write-up puts the principle sharply — “most model regressions are really feature regressions — offline features and online features that should match but don’t” — and proposes a standard battery of offline/online consistency checks: schemas, joins, windows, leakage, and parity tests. Those five checks catch more incidents than watching model-quality metrics does, because they fire the moment a feature drifts rather than weeks later when the business metric finally sags.
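
Here is one way such a parity check might look. Both feature functions are hypothetical stand-ins for the real batch and serving implementations. Note one deliberate relaxation: the sketch compares floats with a relative tolerance rather than byte-for-byte, which is my assumption, since two runtimes can sum the same numbers in a different order and legitimately differ in the last bits.

```python
import math
from typing import Callable, Iterable


def offline_features(raw: dict) -> dict:
    """Hypothetical offline implementation (stand-in for the batch job)."""
    amounts = raw["amounts"]
    return {"txn_count": len(amounts), "avg_amount": sum(amounts) / len(amounts)}


def online_features(raw: dict) -> dict:
    """Hypothetical online implementation (stand-in for the serving code)."""
    amounts = raw["amounts"]
    total = 0.0
    for a in amounts:
        total += a
    return {"txn_count": len(amounts), "avg_amount": total / len(amounts)}


def parity_mismatches(offline: Callable, online: Callable,
                      raw_inputs: Iterable[dict], rel_tol: float = 1e-9) -> list:
    """Return (input, feature, offline value, online value) per disagreement."""
    mismatches = []
    for raw in raw_inputs:
        off, on = offline(raw), online(raw)
        # Union of names so a feature missing from one path is also reported.
        for name in sorted(set(off) | set(on)):
            a, b = off.get(name), on.get(name)
            if a is None or b is None or not math.isclose(a, b, rel_tol=rel_tol):
                mismatches.append((raw, name, a, b))
    return mismatches


FIXTURES = [{"amounts": [10.0, 20.0, 30.0]}, {"amounts": [0.1, 0.2, 0.3]}]
assert parity_mismatches(offline_features, online_features, FIXTURES) == []
```

Returning the list of mismatches instead of a bare boolean means a failing CI run names the feature and the input that disagreed. The fixtures should include the ugly cases from the catalog above: missing values, window boundaries, and empty histories.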

And measure the gap that actually matters. Google’s Rule #37 says to track three separate gaps — training-to-holdout, holdout-to-next-day, and next-day-to-live. The one that reveals true serving skew is next-day-to-live; the others can look perfect while production burns. Nubank adds the operational refinement: monitor feature distributions at percentiles, not just means, because skew loves to hide in the tails while the average stays put. Their recommended rollout is shadow mode — serve the model in parallel, log its outputs, compare against offline before you let it make a single real decision.
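
The percentile advice can be illustrated with synthetic numbers. The data below is constructed, not taken from Nubank, so that the mean and median are identical across the two samples while the upper tail is not; the helper names and the choice of percentiles are assumptions.

```python
from statistics import mean, quantiles


def percentile_profile(values, points=(50, 90, 99)):
    """Return {percentile: value} from the 99 cut points of a 100-quantile split."""
    cuts = quantiles(values, n=100, method="inclusive")
    return {p: cuts[p - 1] for p in points}


def percentile_gaps(offline_values, online_values, points=(50, 90, 99)):
    """Absolute offline-vs-online gap at each percentile."""
    off = percentile_profile(offline_values, points)
    on = percentile_profile(online_values, points)
    return {p: abs(off[p] - on[p]) for p in points}


# Synthetic offline sample: the integers 1..100.
offline = list(range(1, 101))
# Synthetic online sample: bottom five shifted down by 100, top five shifted
# up by 100, so the sum (and therefore the mean) is unchanged.
online = (
    [v - 100 for v in offline[:5]]
    + offline[5:95]
    + [v + 100 for v in offline[95:]]
)

mean_gap = abs(mean(offline) - mean(online))
gaps = percentile_gaps(offline, online)
```

In this construction the mean gap is zero and the gaps at the 50th and 90th percentiles are zero, while the 99th percentile moves by about 100. A monitor that only tracked the average would report nothing.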

When you do NOT need a feature store

Here is the counter-take the vendor decks skip, and it is the honest bookend to the landscape post. A feature store is real operational weight. It earns that weight under three conditions stacked together: you have real-time serving, you have multiple teams sharing features, and you have point-in-time-sensitive labels. Remove any one and the calculus shifts hard toward something simpler.

Xebia’s position — provocatively titled “You Still Don’t Need A Feature Store” — maps cleanly onto the three problems above:

  • If your only problem is train-serve skew on a single model: in-model preprocessing or one shared transform function imported by both paths solves it. That is fix #1 without the platform.
  • If your problem is recomputation cost — you don’t want to recompute expensive features on every request: a precomputed-features table in a plain key-value store (Redis, DynamoDB) is enough. That is a cache, not a feature store.
  • If your problem is duplicated work across the org — people reinventing the same feature: a Git-versioned feature catalog (the definitions live in code, reviewed in PRs) gets you most of the way. That is discipline, not a product.

And the bluntest line of all: batch-only, offline-serving models rarely need a feature store at all. If your model scores a nightly batch and writes predictions to a table, there is no online path, so there is no online-offline skew to eliminate. A versioned feature table plus the same transform function for training and scoring is the entire solution.

What to take away

  • The bug is singular and the symptom is universal. A model that scores beautifully offline and dies online is, until proven otherwise, suffering from training-serving skew or point-in-time leakage. Start there before you touch the model.
  • Your offline metrics are accomplices, not witnesses. Holdout sets and cross-validation are computed from the same pipeline as training, so they inflate right along with the bug. The only honest signal is the next-day-to-live gap and an online/offline parity test.
  • The structural fix is one definition for both paths. Skew is impossible — not just catchable — when the offline and online paths consume the same feature logic. Everything else is mitigation.
  • Temporal correctness is a discipline, not a checkbox. AS-OF joins and windows that end at the example timestamp prevent leakage; a feature store gives you the tooling so you don’t hand-roll it wrong.
  • A feature store is the right answer to a specific, stacked problem. Real-time serving plus multiple teams plus point-in-time-sensitive labels justifies the platform. A single batch model justifies a versioned table and a cache. Buy the fix for your failure mode.

Skew is unglamorous. There is no clever architecture in it, no frontier model, no leaderboard. It is a fillna that lives in one notebook, a join on the wrong key, a window someone typed as 15 instead of 30. But it is the bug that kills more production models than every fancy failure combined, and it is the entire reason the feature store category exists. The teams who ship reliable models are not the ones with the best features — they are the ones who made absolutely certain that the feature the model trained on is the exact same feature it sees in production. That is the whole job.


Further reading: Google’s Rules of Machine Learning (Rules #29, #31, #32, #37) is the canonical primary source on skew. Nubank’s Dealing with Train-serve Skew in Real-time ML Models is the best in-the-trenches bug list. Hopsworks’ dictionary entries on point-in-time-correct joins and online-offline feature skew formalize the mechanics. Xebia’s You Still Don’t Need A Feature Store is the counter-take. And the external validation of the Epic Sepsis Model in JAMA Internal Medicine is the cautionary tale worth reading in full.
