Machine Learning Medium Asked at GoogleAsked at AmazonAsked at MetaAsked at StripeAsked at Uber

What is data leakage in machine learning, and what are the most common ways it occurs?

For Data Scientist ML Engineer AI / LLM Engineer

The short answer

Data leakage happens when information that would not be available at prediction time influences model training, producing overly optimistic evaluation metrics that collapse in production. Common sources include fitting preprocessors on the full dataset, including target-derived features, and using future data in time-series pipelines.

How to think about it

Leakage is insidious because it silently inflates reported metrics while leaving the production model broken. There are two root forms:

1. Preprocessing leakage (train-test contamination)

Fitting a transformer on the entire dataset before splitting lets test-set statistics influence training-set scaling. Even a mean imputer computed over all rows leaks test-set information.

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# WRONG — scaler sees test data
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)
X_train, X_test = train_test_split(X_scaled)

# CORRECT — scaler fit only on train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test  = scaler.transform(X_test)

Use sklearn.pipeline.Pipeline to automate this: the pipeline fits transformers on training data only inside every CV fold.

2. Target leakage (feature leakage)

A feature is derived from or causally downstream of the target, making it unavailable at inference time.

Examples:

Predicting loan default using days_past_due — that field updates after the default event.
Predicting hospital readmission using discharge summary notes that are written after the outcome is known.
Predicting churn using a cancellation_reason column populated only after the customer churns.

3. Temporal leakage in time series

Using future rows to compute lag features, rolling averages, or target-encoding statistics that incorporate future targets.

4. Duplicate / near-duplicate leakage

If near-identical samples (augmented images, duplicated survey responses) span train and test, the model effectively memorizes test examples during training.

Detecting leakage signals:

Suspiciously high cross-validation scores (AUC above 0.99 on real-world messy data).
A single feature driving near-perfect importance.
Model performance drops sharply between offline eval and production.

Learn it properly Data leakage

What is data leakage in machine learning, and what are the most common ways it occurs?

Keep practising

Explore further