What is feature leakage and how do you prevent it during feature engineering and preprocessing?
Feature leakage occurs when information from the test set or from the future leaks into training features, making a model appear more accurate than it will be in production. It arises from fitting preprocessing steps on the full dataset, using post-event information as a predictor, or computing aggregates across train-test boundaries. Prevention requires strict pipeline discipline: all stateful transformations must be fit only on training data.
How to think about it
Leakage is one of the most common and hardest-to-catch errors in production ML. A model with leakage looks excellent in offline evaluation and then fails in deployment, because the leaked information is not available at prediction time.
Common leakage patterns in feature engineering
Preprocessing fitted on the full dataset. Calling scaler.fit(X) or encoder.fit(X) on the entire dataset before splitting means the fitted transformer has already seen the test set’s mean, variance, or category vocabulary. The test set is no longer truly held out.
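A minimal sketch of the difference, using synthetic data (the arrays and the 75/25 split are assumptions made for illustration): split first, then fit the scaler on the training rows only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y = rng.integers(0, 2, size=200)

# Split first, so the test rows never influence any fitted statistic.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Leaky: mean and variance estimated from train and test rows together.
leaky_scaler = StandardScaler().fit(X)

# Correct: mean and variance estimated from the training rows only.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # transform only, never fit
```

The two scalers end up with different means and variances; only the second reflects what would be known before the held-out rows arrive.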
Target-derived features computed globally. Computing user_average_spend for each user across the entire dataset and then using it as a feature for predicting that user’s next purchase folds future purchase amounts into the aggregate. In production, only purchases made before the prediction time are available.
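One way to keep such an aggregate point-in-time is to shift within each user before averaging. The sketch below uses pandas with a hypothetical purchase log; the frame and the column names user_id, ts, and amount are assumptions, not part of any real schema.

```python
import pandas as pd

# Hypothetical purchase log; column names are assumptions for this sketch.
purchases = pd.DataFrame({
    "user_id": ["a", "a", "a", "b", "b"],
    "ts": pd.to_datetime(
        ["2024-01-01", "2024-01-05", "2024-01-09", "2024-01-02", "2024-01-03"]
    ),
    "amount": [10.0, 20.0, 30.0, 5.0, 15.0],
})

purchases = purchases.sort_values(["user_id", "ts"]).reset_index(drop=True)

# Leaky: every row sees the user's full history, including later purchases.
purchases["avg_spend_leaky"] = (
    purchases.groupby("user_id")["amount"].transform("mean")
)

# Point-in-time: shift(1) drops the current row, then the expanding mean
# averages only strictly earlier purchases by the same user.
purchases["avg_spend_past"] = purchases.groupby("user_id")["amount"].transform(
    lambda s: s.shift(1).expanding().mean()
)
```

A user's first purchase has no history, so avg_spend_past is NaN there; how to fill it (a global prior, a sentinel, or leaving it missing) is a separate modelling choice.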
Time-series ordering violated. Creating rolling or lag features after shuffling the data means the window for a training row is no longer "the rows before it in time", so some rows include future values in their window.
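A short sketch of backward-looking lag and rolling features, assuming a hypothetical daily series with columns date and sales: sort chronologically, then shift before rolling so the current row is never part of its own window.

```python
import pandas as pd

# Hypothetical daily series; the column names are assumptions.
df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=6, freq="D"),
    "sales": [10.0, 12.0, 11.0, 15.0, 14.0, 18.0],
})

# Sort chronologically before building any window; never shuffle first.
df = df.sort_values("date").reset_index(drop=True)

# lag_1 is the previous day's value.
df["lag_1"] = df["sales"].shift(1)

# The rolling mean is taken over the shifted series, so the window for
# row t covers rows t-3..t-1 and excludes row t itself.
df["roll_mean_3"] = df["sales"].shift(1).rolling(window=3).mean()
```

Calling rolling(3).mean() directly on sales would include the current day's value, which is usually the quantity being predicted.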
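Label in disguise. A feature that is mechanically derived from the target, or near-perfectly correlated with it, hands the model the answer. For example, including claim_paid_amount to predict whether a claim was approved: a paid amount is only recorded once the approval decision has been made.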
Prevention checklist
| Rule | Implementation |
|---|---|
| Stateful transforms fit on train only | Wrap in sklearn.pipeline.Pipeline |
| Temporal features use only past data | Sort by time; split before aggregating |
| Aggregates computed inside CV folds | Pass the whole pipeline to cross_val_score |
| Target encoding inside CV | Use category_encoders.TargetEncoder as a pipeline step |
| No features derived from the label | Audit feature definitions against the target |
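To make the "target encoding inside CV" row concrete, here is a hand-rolled out-of-fold target encoding. It illustrates the idea only and is not how category_encoders.TargetEncoder is implemented; the city and y columns are hypothetical, and the fold count is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Synthetic data; "city" and "y" are hypothetical names.
df = pd.DataFrame({
    "city": ["x", "x", "y", "y", "x", "y", "x", "y"],
    "y":    [1,   0,   1,   1,   0,   0,   1,   1],
})

encoded = pd.Series(np.nan, index=df.index, dtype=float)
kf = KFold(n_splits=4, shuffle=True, random_state=0)

for fit_idx, enc_idx in kf.split(df):
    fit_part = df.iloc[fit_idx]
    # Category means come only from rows outside the fold being encoded.
    means = fit_part.groupby("city")["y"].mean()
    # Fallback for categories that do not appear in fit_part.
    prior = fit_part["y"].mean()
    encoded.iloc[enc_idx] = (
        df["city"].iloc[enc_idx].map(means).fillna(prior).to_numpy()
    )

df["city_te"] = encoded
```

No row's own label contributes to its encoded value, which is the property a globally computed category mean lacks.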
The pipeline pattern
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# X, y: feature matrix and labels, assumed to be loaded already
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])

# cross_val_score re-fits imputer + scaler on each training fold
scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
Everything inside the pipeline is re-fit per fold. The test fold is only transformed, never used to estimate parameters.
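The effect of fitting a step outside the folds is easiest to see on data with no signal at all. The sketch below is an illustration on synthetic noise, using supervised feature selection as the stateful step; the sizes and k=20 are arbitrary assumptions. It compares selecting features on the full dataset against selecting them inside the pipeline.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Pure noise: no feature carries real signal about y.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))
y = rng.integers(0, 2, size=100)

# Leaky: the selector sees every label, including those of later test folds.
X_selected = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_score = cross_val_score(
    LogisticRegression(max_iter=1000), X_selected, y, cv=5, scoring="accuracy"
).mean()

# Honest: selection is a pipeline step, so it is re-fit on each training fold.
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),
    ("clf", LogisticRegression(max_iter=1000)),
])
honest_score = cross_val_score(pipe, X, y, cv=5, scoring="accuracy").mean()

print(f"leaky: {leaky_score:.2f}  honest: {honest_score:.2f}")
```

On noise like this, the leaky estimate typically lands well above chance while the pipeline estimate stays near 0.5, which is the only honest answer when the labels are random.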
Detecting leakage after the fact
- Suspiciously high CV score relative to known baselines for the domain.
- A feature has unusually high importance in a tree model but makes no business sense.
- Performance drops significantly when you introduce a strict temporal split compared to a random split.
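One simple screen for the second symptom is to score each feature on its own and flag any that predict the target almost perfectly. This is a rough sketch on synthetic data: the feature names are made up, and the 0.95 AUC threshold is an assumed cut-off that should be set against the domain's known baselines.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 500
y = rng.integers(0, 2, size=n)

# Synthetic features: one weak signal, one pure noise, one near-copy of y.
features = {
    "weak_a": y + rng.normal(scale=3.0, size=n),
    "weak_b": rng.normal(size=n),
    "disguised_label": y + rng.normal(scale=0.05, size=n),
}

AUC_THRESHOLD = 0.95  # assumed cut-off; tune to the domain's baselines

suspects = []
for name, values in features.items():
    # Cross-validated AUC of a model that sees this one feature only.
    auc = cross_val_score(
        LogisticRegression(), values.reshape(-1, 1), y, cv=5, scoring="roc_auc"
    ).mean()
    if auc > AUC_THRESHOLD:
        suspects.append(name)

print(suspects)
```

A flagged feature is not proof of leakage, only a prompt to check how and when that column is populated relative to the target.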