datarekha
Machine Learning · Hard · Asked at Google, Amazon, Meta, Stripe, Netflix

What is feature leakage and how do you prevent it during feature engineering and preprocessing?

The short answer

Feature leakage occurs when information from the test set or from the future leaks into training features, making a model appear more accurate than it will be in production. It arises from fitting preprocessing steps on the full dataset, using post-event information as a predictor, or computing aggregates across train-test boundaries. Prevention requires strict pipeline discipline: all stateful transformations must be fit only on training data.

How to think about it

Leakage is one of the most common errors in production ML and one of the hardest to catch. A model with leakage looks excellent in offline evaluation and then underperforms in deployment, because the leaked signal is not available at prediction time.

Common leakage patterns in feature engineering

Preprocessing fitted on the full dataset. Calling scaler.fit(X) or encoder.fit(X) on the entire dataset before splitting means the scaler knows the test set’s mean, variance, or vocabulary. The test set is no longer truly held out.
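The difference between the leaky and the correct fit can be sketched as follows. The data is synthetic and the variable names are illustrative assumptions, not part of the original answer.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))  # synthetic features (assumption)

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# Leaky: mean and variance are estimated from train and test rows together.
leaky_scaler = StandardScaler().fit(X)

# Correct: statistics come from the training rows only.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # transform only, never fit
```

For a scaler the numerical effect is often small, but the same mistake with an encoder, imputer, or feature selector can change results substantially.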

Target-derived features computed globally. Computing user_average_spend for each user across the entire dataset and then using it to predict that user’s next purchase folds future purchase amounts, including the one being predicted, into the aggregate. In production, only purchases made before the prediction time are available.
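A minimal pandas sketch of the point-in-time version of this aggregate, assuming a hypothetical purchases table with user_id, ts, and amount columns (these names are not from the original):

```python
import pandas as pd

purchases = pd.DataFrame({
    "user_id": ["a", "a", "a", "b", "b"],
    "ts": pd.to_datetime(
        ["2024-01-01", "2024-01-05", "2024-01-09", "2024-01-02", "2024-01-07"]
    ),
    "amount": [10.0, 30.0, 50.0, 20.0, 40.0],
}).sort_values(["user_id", "ts"]).reset_index(drop=True)

# Leaky: each row's mean includes that row and every later purchase.
purchases["avg_spend_leaky"] = (
    purchases.groupby("user_id")["amount"].transform("mean")
)

# Point-in-time: shift by one within each user, so the expanding mean
# only sees purchases that happened strictly before the current row.
purchases["avg_spend_prior"] = purchases.groupby("user_id")["amount"].transform(
    lambda s: s.shift(1).expanding().mean()
)
```

Each user's first purchase has no history, so avg_spend_prior is missing there; how to fill it (for example with a training-set prior) is a separate modelling decision.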

Time-series ordering violated. Creating rolling or lag features after shuffling the data means the window for a given row is built from arbitrary neighbours rather than its predecessors, so some training rows include future values in their window.
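One way to keep the ordering intact, sketched on a toy daily series (the column names and the two-day holdout are illustrative assumptions): sort by timestamp first, shift before rolling, and split on a time cutoff.

```python
import pandas as pd

series = pd.DataFrame({
    "ts": pd.date_range("2024-01-01", periods=6, freq="D"),
    "value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Shuffle to mimic data that arrives in arbitrary order.
shuffled = series.sample(frac=1.0, random_state=0)

# Restore chronological order before any lag or rolling computation.
ordered = shuffled.sort_values("ts").reset_index(drop=True)
ordered["lag_1"] = ordered["value"].shift(1)
# shift(1) before rolling keeps the current row out of its own window.
ordered["roll_mean_3"] = ordered["value"].shift(1).rolling(window=3).mean()

# Split by time, not at random: the last two days form the held-out set.
cutoff = ordered["ts"].iloc[-2]
train = ordered[ordered["ts"] < cutoff]
test = ordered[ordered["ts"] >= cutoff]
```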

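Label in disguise. A feature that is mechanically derived from the target, or that only becomes known once the outcome has occurred, leaks the label. For example, claim_paid_amount is determined only after the approval decision, so including it to predict whether a claim was approved hands the model the answer.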

Prevention checklist

| Rule | Implementation |
| --- | --- |
| Stateful transforms fit on train only | Wrap in sklearn.pipeline.Pipeline |
| Temporal features use only past data | Sort by time; split before aggregating |
| Aggregates computed inside CV folds | Use cross_val_score with pipeline |
| Target encoding inside CV | Use category_encoders.TargetEncoder in pipeline |
| No features derived from the label | Audit feature definitions against the target |
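To make the "target encoding inside CV" rule concrete, here is a hand-rolled out-of-fold encoder. It is a sketch of the idea rather than the category_encoders implementation, and the function name oof_target_encode and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def oof_target_encode(categories, target, n_splits=5, seed=0):
    """Encode each row with its category's target mean from the other folds."""
    encoded = pd.Series(np.nan, index=categories.index, dtype=float)
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kfold.split(categories):
        fit_cats = categories.iloc[fit_idx]
        fit_target = target.iloc[fit_idx]
        means = fit_target.groupby(fit_cats).mean()
        fallback = fit_target.mean()  # used for categories unseen in the fit folds
        values = categories.iloc[enc_idx].map(means).fillna(fallback)
        encoded.iloc[enc_idx] = values.to_numpy()
    return encoded


# Toy data: one category appears exactly once, with target 1.
cats = pd.Series(["x"] * 10 + ["rare"])
y_toy = pd.Series([0, 1] * 5 + [1])
encoded = oof_target_encode(cats, y_toy)
```

A naive global encoding would map the single "rare" row to its own label. The out-of-fold version never sees that row's target when encoding it, so it falls back to the mean of the other folds. At prediction time, the encoding would be fit once on the full training set.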

The pipeline pattern

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale",  StandardScaler()),
    ("clf",    LogisticRegression()),
])

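# X (feature matrix) and y (binary target) are assumed to be loaded already.
# cross_val_score clones the pipeline and re-fits imputer, scaler, and
# classifier on each training fold.
scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")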

Everything inside the pipeline is re-fit per fold. The test fold is only transformed, never used to estimate parameters.
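The effect of per-fold re-fitting is easiest to see with a step that uses the labels. The sketch below swaps in feature selection (SelectKBest) for the imputer and scaler, purely as an illustration, and runs it on pure noise, where no honest model should beat chance by much.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X_noise = rng.normal(size=(100, 2000))  # pure noise features
y_noise = rng.integers(0, 2, size=100)  # labels independent of the features

# Leaky: the selector sees every label, including rows later used for validation.
X_selected = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky_acc = cross_val_score(
    LogisticRegression(max_iter=1000), X_selected, y_noise, cv=5
).mean()

# Honest: selection is a pipeline step, so it is re-fit on each training fold.
honest_pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),
    ("clf", LogisticRegression(max_iter=1000)),
])
honest_acc = cross_val_score(honest_pipe, X_noise, y_noise, cv=5).mean()

print(f"leaky: {leaky_acc:.2f}  honest: {honest_acc:.2f}")
```

The leaky score should come out well above the honest one even though the features carry no signal, which is the gap between offline evaluation and production that leakage creates.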

Detecting leakage after the fact

  • Suspiciously high CV score relative to known baselines for the domain.
  • A feature has unusually high importance in a tree model but makes no business sense.
  • Performance drops significantly when you introduce a strict temporal split compared to a random split.
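As a rough screen for the second signal, each feature can be scored on its own against the target. The helper name flag_suspicious_features, the 0.95 threshold, and the synthetic claims table are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score


def flag_suspicious_features(df, target, threshold=0.95):
    """Return features whose single-feature ROC AUC is at or above the threshold."""
    flagged = {}
    for col in df.columns.drop(target):
        auc = roc_auc_score(df[target], df[col])
        auc = max(auc, 1.0 - auc)  # the direction of the relationship does not matter
        if auc >= threshold:
            flagged[col] = auc
    return flagged


rng = np.random.default_rng(0)
approved = rng.integers(0, 2, size=500)
claims = pd.DataFrame({
    "claim_amount": rng.uniform(100, 5000, size=500),
    # Paid amount is zero unless the claim was approved: a label in disguise.
    "claim_paid_amount": approved * rng.uniform(100, 5000, size=500),
    "approved": approved,
})

flagged = flag_suspicious_features(claims, target="approved")
```

A flagged feature is not proof of leakage, only a prompt to check when and how that column is populated relative to the prediction time.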