datarekha
Machine Learning · Medium · Asked at Google, Amazon, Microsoft, Stripe

What are the strategies for handling missing values in a machine learning pipeline, and how do you choose between them?

The short answer

Missing data can be dropped, imputed with a statistic (mean, median, mode), or imputed with a model. The right choice depends on the missing mechanism (MCAR, MAR, MNAR), the fraction of missing data, and the downstream model. Dropping rows is only safe when missingness is rare and random; imputation must always be fit on training data only.

How to think about it

The first question to ask is not how to impute but why values are missing — the mechanism changes what is safe.

Missing mechanisms

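MCAR (Missing Completely At Random): missingness is independent of observed and unobserved values. A sensor randomly fails. Deleting incomplete rows introduces no bias, only a loss of sample size. Mean or median imputation keeps the column's centre roughly intact but shrinks its variance and weakens its correlations with other features.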

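MAR (Missing At Random): missingness depends on other observed variables but not on the missing value itself. Income is missing more often for younger respondents. Imputing conditioned on the observed variables (here, age) is safe; an unconditional mean of the observed incomes is not, because the observed rows over-represent older respondents.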

MNAR (Missing Not At Random): missingness depends on the value that is missing. High earners skip income questions. Any simple imputation will bias the model; the fact of missingness is itself a signal.
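The three mechanisms can be made concrete with a small simulation. This is a minimal sketch on synthetic data: the age and income variables, the missingness rates, and the seed are all made up for illustration and do not come from any real survey.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Synthetic population: income rises with age (illustrative numbers only).
age = rng.uniform(20, 70, size=n)
income = 1_000 * age + rng.normal(0, 10_000, size=n)

# MCAR: every income value has the same 30% chance of being missing.
mcar_mask = rng.random(n) < 0.3

# MAR: missingness depends on age, which is always observed.
mar_mask = rng.random(n) < np.where(age < 40, 0.5, 0.1)

# MNAR: missingness depends on the income value itself (high earners skip).
high_earner = income > np.quantile(income, 0.75)
mnar_mask = rng.random(n) < np.where(high_earner, 0.6, 0.1)

true_mean = income.mean()
observed_means = {
    name: income[~mask].mean()
    for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]
}

for name, mean in observed_means.items():
    print(f"{name}: observed mean minus true mean = {mean - true_mean:,.0f}")
```

In this setup the MCAR gap stays close to zero, the MAR gap is positive (older, higher-income respondents are over-represented among the observed rows), and the MNAR gap is negative. The MAR gap can be corrected by conditioning on age; the MNAR gap cannot be corrected from the observed data alone.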

Strategy selection

| Approach | When to use |
|---|---|
| Drop rows | MCAR, <5% missing, large dataset |
| Mean imputation | Numeric, MCAR/MAR, symmetric distribution |
| Median imputation | Numeric with outliers or skew |
| Mode imputation | Categorical |
| Model-based (KNN, IterativeImputer) | MAR with complex structure |
| Add indicator column | MNAR or when missingness pattern is predictive |

Adding a binary feature_was_missing indicator alongside the imputed value lets the model learn whether missingness itself predicts the target — essential for MNAR.
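In scikit-learn, one way to get such an indicator is the add_indicator option of SimpleImputer, which appends one binary column per feature that had missing values at fit time. A minimal sketch with a made-up age and income array:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (illustrative values): column 0 is age, column 1 is income.
X = np.array([
    [25.0, 50_000.0],
    [32.0, np.nan],
    [47.0, 82_000.0],
    [51.0, np.nan],
    [38.0, 61_000.0],
])

# add_indicator=True appends a 0/1 column for each feature with missing values.
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_out = imputer.fit_transform(X)

print(X_out.shape)   # two original columns plus one indicator column
print(X_out[:, 2])   # 1.0 where income was missing, 0.0 elsewhere
```

Only income had missing values here, so a single indicator column is added; age gets none.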

Practical code

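from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# num_cols, cat_cols, X_train and y_train are assumed to be defined elsewhere.

# Numeric columns: the median is robust to outliers and skew.
num_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
])
# Categorical columns: fill with the mode, then encode so the classifier receives numbers.
cat_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

pre = ColumnTransformer([
    ("num", num_pipe, num_cols),
    ("cat", cat_pipe, cat_cols),
])

# Fitting the whole pipeline means the imputers learn their statistics from X_train only.
model = Pipeline([("pre", pre), ("clf", GradientBoostingClassifier())])
model.fit(X_train, y_train)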
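Wrapping the imputer in a pipeline is what keeps its statistics tied to the training split. The sketch below shows the difference on a tiny made-up array: the fill value learned from training rows alone versus the value a leaky fit on train and test together would produce.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy split (illustrative values): the test set contains an extreme value.
X_train = np.array([[1.0], [2.0], [np.nan], [3.0]])
X_test = np.array([[np.nan], [100.0]])

# Correct: learn the median from training rows, then reuse it on the test rows.
imputer = SimpleImputer(strategy="median")
imputer.fit(X_train)
X_test_filled = imputer.transform(X_test)
train_only_median = imputer.statistics_[0]

# Leaky: the median is computed over train and test together.
leaky_median = np.nanmedian(np.vstack([X_train, X_test]))

print(train_only_median, leaky_median)
```

The two medians differ because the test value 100.0 influences the leaky one, which means information from the evaluation data has reached the preprocessing step.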

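For complex MAR patterns, replace SimpleImputer with IterativeImputer, a multivariate imputer that models each feature as a function of the others. It is still marked experimental in scikit-learn, so it has to be enabled with from sklearn.experimental import enable_iterative_imputer before it can be imported from sklearn.impute.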
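To see why a multivariate imputer helps when features are related, here is a minimal sketch on synthetic data where the second column is roughly twice the first. The data, the 20% missing rate, and the mae helper are assumptions made for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (must come first)
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer

rng = np.random.default_rng(0)

# Two strongly related synthetic features: x2 is about 2 * x1.
x1 = rng.normal(size=200)
x2 = 2.0 * x1 + rng.normal(scale=0.1, size=200)
X = np.column_stack([x1, x2])

# Hide 40 values of the second column.
X_missing = X.copy()
missing_rows = rng.choice(200, size=40, replace=False)
X_missing[missing_rows, 1] = np.nan

X_mean = SimpleImputer(strategy="mean").fit_transform(X_missing)
X_knn = KNNImputer(n_neighbors=5).fit_transform(X_missing)
X_iter = IterativeImputer(random_state=0).fit_transform(X_missing)


def mae(filled):
    """Mean absolute error on the hidden entries (hypothetical helper)."""
    return np.abs(filled[missing_rows, 1] - X[missing_rows, 1]).mean()


for name, filled in [("mean", X_mean), ("knn", X_knn), ("iterative", X_iter)]:
    print(f"{name}: MAE on hidden values = {mae(filled):.3f}")
```

Mean imputation ignores x1 entirely, so its error is much larger than that of the two imputers that use the relationship between columns. On real data the gap depends on how strongly the features are related.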

Tree models and missing values

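Gradient-boosted tree libraries such as LightGBM and XGBoost can handle NaN natively by learning, at each split, which direction missing values should follow. In XGBoost this is the default behaviour and is not specific to tree_method="hist". scikit-learn's HistGradientBoostingClassifier works the same way. For these models, imputation is optional, but adding a missingness indicator can still boost performance when the pattern is informative.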
