What is data leakage in machine learning, and what are the most common ways it occurs?
Data leakage happens when information that would not be available at prediction time influences model training, producing overly optimistic evaluation metrics that collapse in production. Common sources include fitting preprocessors on the full dataset, including target-derived features, and using future data in time-series pipelines.
How to think about it
Leakage is insidious because it silently inflates reported metrics while leaving the production model broken. It takes four common forms:
1. Preprocessing leakage (train-test contamination)
Fitting a transformer on the entire dataset before splitting lets test-set statistics influence training-set scaling. Even a mean imputer computed over all rows leaks test-set information.
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# WRONG: scaler sees test data before the split
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2)
# CORRECT: split first, then fit the scaler on the training rows only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
Use sklearn.pipeline.Pipeline to automate this: when the pipeline is passed to a cross-validation routine such as cross_val_score or GridSearchCV, its transformers are refit on the training portion of each fold only.
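A minimal sketch of that pattern follows. The synthetic dataset from make_classification, the step names "scale" and "clf", and the choice of LogisticRegression are assumptions made for illustration; any estimator would work the same way.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real feature matrix and labels (assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Scaler and model travel together as one estimator.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# cross_val_score clones the pipeline and refits it on each fold's training
# portion, so the scaler never sees that fold's held-out rows.
scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
print(f"mean AUC across folds: {scores.mean():.3f}")
```

Because the scaler lives inside the estimator, there is no separate fit step that could accidentally be run on the full dataset.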
2. Target leakage (feature leakage)
A feature is derived from, or causally downstream of, the target, so its value is not actually known at inference time.
Examples:
- Predicting loan default using days_past_due: that field updates after the default event.
- Predicting hospital readmission using discharge summary notes that are written after the outcome is known.
- Predicting churn using a cancellation_reason column populated only after the customer churns.
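One way to guard against this is to record, per column, whether its value is known at prediction time and to build the feature list from that record. The sketch below is illustrative only: the loans table, the KNOWN_AT_PREDICTION mapping, and the select_safe_features helper are hypothetical names, and the availability flags have to come from domain knowledge, not from the data.

```python
import pandas as pd

# Hypothetical loan table; column names and values are illustrative only.
loans = pd.DataFrame({
    "income": [52000, 31000, 78000],
    "credit_utilization": [0.35, 0.90, 0.10],
    "days_past_due": [0, 120, 0],  # updated after the default event
    "defaulted": [0, 1, 0],        # target
})

# Hand-maintained record: is the column's value known when the prediction is made?
KNOWN_AT_PREDICTION = {
    "income": True,
    "credit_utilization": True,
    "days_past_due": False,
}

def select_safe_features(df, availability, target):
    """Return non-target columns flagged as known at prediction time.

    Raises if a column has no availability record, so new columns
    cannot slip into the model unreviewed.
    """
    candidates = [c for c in df.columns if c != target]
    undocumented = [c for c in candidates if c not in availability]
    if undocumented:
        raise ValueError(f"No availability record for: {undocumented}")
    return [c for c in candidates if availability[c]]

safe_cols = select_safe_features(loans, KNOWN_AT_PREDICTION, target="defaulted")
X = loans[safe_cols]
y = loans["defaulted"]
```

Failing loudly on undocumented columns is a deliberate choice here: a silently accepted new column is how most target leakage gets in.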
3. Temporal leakage in time series
Using future rows to compute lag features, rolling averages, or target-encoding statistics that incorporate future targets.
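The difference between a leaky and a past-only rolling feature can be sketched with pandas. The sales frame and its units column are made-up data; the point is only the window alignment.

```python
import pandas as pd

# Made-up daily series for illustration.
sales = pd.DataFrame(
    {"units": [10, 12, 9, 15, 14, 20]},
    index=pd.date_range("2024-01-01", periods=6, freq="D"),
)

# Leaky: a centered window averages the previous, current, and NEXT day,
# so each row's feature includes a value from the future.
sales["leaky_avg"] = sales["units"].rolling(window=3, center=True).mean()

# Past-only: shift by one step first, so the window for a given day
# covers the three days strictly before it.
sales["past_avg"] = sales["units"].shift(1).rolling(window=3).mean()

print(sales)
```

The first three rows of past_avg are NaN because there is not yet enough history; dropping or imputing them is a modelling decision, but filling them from later rows would reintroduce the leak.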
4. Duplicate / near-duplicate leakage
If near-identical samples (augmented images, duplicated survey responses) span train and test, the model effectively memorizes test examples during training.
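When each sample can be traced to a source (an original image, a respondent), splitting by that source keeps all of its copies on one side. Here is a minimal sketch using scikit-learn's GroupShuffleSplit; the source_id array and the placeholder features are assumptions, and real near-duplicates without a known source would first need a deduplication or hashing step.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical setup: 6 source images, each with 3 augmented copies.
source_id = np.repeat(np.arange(6), 3)        # group label per sample
X = np.arange(len(source_id)).reshape(-1, 1)  # placeholder features

# Split on groups so every copy of a source lands on the same side.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.33, random_state=0)
train_idx, test_idx = next(splitter.split(X, groups=source_id))

train_groups = set(source_id[train_idx])
test_groups = set(source_id[test_idx])
print("shared sources:", train_groups & test_groups)
```

For cross-validation, GroupKFold applies the same idea across folds.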
Detecting leakage signals:
- Suspiciously high cross-validation scores (for example, AUC above 0.99 on real-world messy data).
- A single feature that dominates feature importance or is nearly perfectly predictive on its own.
- Model performance drops sharply between offline evaluation and production.
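A quick screen for the single-feature signal is to score each feature on its own against the target. The sketch below uses synthetic data; the feature names, the single_feature_auc helper, and the 0.99 cutoff are assumptions for illustration, and a flagged feature is a prompt for investigation, not proof of leakage.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, size=n)

# Synthetic features: one weakly informative, one pure noise,
# and one that is almost a copy of the target.
features = {
    "weak_signal": y + rng.normal(0, 2.0, size=n),
    "noise": rng.normal(0, 1.0, size=n),
    "leaky": y + rng.normal(0, 0.05, size=n),
}

def single_feature_auc(values, target):
    """AUC of the raw feature used as a score, ignoring its direction."""
    auc = roc_auc_score(target, values)
    return max(auc, 1 - auc)

aucs = {name: single_feature_auc(v, y) for name, v in features.items()}

# Flag features that separate the classes almost perfectly on their own.
suspects = [name for name, auc in aucs.items() if auc > 0.99]
print(aucs)
print("investigate:", suspects)
```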