What are the strategies for handling missing values in a machine learning pipeline, and how do you choose between them?
Missing data can be dropped, imputed with a statistic (mean, median, mode), or imputed with a model. The right choice depends on the missingness mechanism (MCAR, MAR, MNAR), the fraction of missing data, and the downstream model. Dropping rows is only safe when missingness is rare and completely random. Any imputer must be fit on the training data only and then applied unchanged to validation and test data; otherwise test-set statistics leak into training.
How to think about it
The first question to ask is not how to impute but why values are missing, because the mechanism determines which strategies are safe.
Missing mechanisms
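MCAR (Missing Completely At Random): missingness is independent of both observed and unobserved values, for example a sensor that fails at random. Row deletion introduces no bias, although it discards data. Simple imputation keeps the mean unbiased but shrinks the variance and weakens correlations with other features.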
MAR (Missing At Random): missingness depends on other observed variables but not on the missing value itself. Income is missing more often for younger respondents. Imputing conditioned on observed variables is safe.
MNAR (Missing Not At Random): missingness depends on the value that is missing. High earners skip income questions. Any simple imputation will bias the model; the fact of missingness is itself a signal.
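To make the three mechanisms concrete, here is a small simulation sketch. The data are synthetic and the names (`age`, `income`, the missingness rates) are assumptions chosen only to mirror the examples above. It compares the mean of the observed incomes with the true mean under each mechanism, and checks that under MAR the bias disappears once you condition on the observed variable.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data: age is always observed, income rises with age.
age = rng.uniform(20, 60, size=n)
income = 1_000 * age + rng.normal(0, 5_000, size=n)

# MCAR: every value has the same 30% chance of being missing.
mcar_mask = rng.random(n) < 0.3
# MAR: missingness depends on age (observed), not on income itself.
mar_mask = rng.random(n) < np.where(age < 40, 0.5, 0.1)
# MNAR: missingness depends on the income value that goes missing.
mnar_mask = rng.random(n) < np.where(income > np.median(income), 0.5, 0.1)

true_mean = income.mean()
observed_means = {
    "MCAR": income[~mcar_mask].mean(),
    "MAR": income[~mar_mask].mean(),
    "MNAR": income[~mnar_mask].mean(),
}
for name, mean in observed_means.items():
    print(f"{name}: observed mean {mean:,.0f} vs true mean {true_mean:,.0f}")

# Under MAR, conditioning on the observed driver (age group) removes the bias.
young = age < 40
true_young_mean = income[young].mean()
mar_young_observed_mean = income[young & ~mar_mask].mean()
print(f"MAR, age < 40 only: {mar_young_observed_mean:,.0f} vs {true_young_mean:,.0f}")
```

In this setup the MCAR mean stays close to the truth, the naive MAR mean drifts upward because older, higher-income respondents are over-represented, and the MNAR mean drifts downward with no observed variable available to correct it.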
Strategy selection
| Approach | When to use |
|---|---|
| Drop rows | MCAR, <5% missing, large dataset |
| Mean imputation | Numeric, MCAR/MAR, symmetric distribution |
| Median imputation | Numeric with outliers or skew |
| Mode imputation | Categorical |
| Model-based (KNN, IterativeImputer) | MAR with complex structure |
| Add indicator column | MNAR or when missingness pattern is predictive |
Adding a binary feature_was_missing indicator alongside the imputed value lets the model learn whether missingness itself predicts the target. This matters most for MNAR, where the imputed value alone cannot recover the information that was lost.
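One way to get the indicator without writing it by hand is the `add_indicator` option of scikit-learn's `SimpleImputer`. The sketch below uses a tiny made-up array; the values are illustrative only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data: column 0 has missing values, column 1 is complete.
X = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, 30.0],
    [np.nan, 40.0],
])

# add_indicator=True appends one 0/1 column per feature that had
# missing values when the imputer was fit.
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_out = imputer.fit_transform(X)

# Columns of X_out: imputed column 0, column 1, was-missing flag for column 0.
print(X_out)
```

By default only features that contained missing values during `fit` get an indicator column, so the output here has three columns rather than four.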
Practical code
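from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Assumed to exist already: X_train (a pandas DataFrame), y_train, and
# num_cols / cat_cols (lists of numeric and categorical column names).

# Median is robust to outliers and skew in numeric features.
num_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
])
# Fill categoricals with the mode, then one-hot encode them
# (assumes string categories; the classifier needs numeric input).
cat_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
pre = ColumnTransformer([
    ("num", num_pipe, num_cols),
    ("cat", cat_pipe, cat_cols),
])
# The imputers live inside the pipeline, so their statistics are
# learned from X_train only when fit is called.
model = Pipeline([("pre", pre), ("clf", GradientBoostingClassifier())])
model.fit(X_train, y_train)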
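The pipeline applies the train-only rule implicitly. Done by hand, the same rule looks roughly like the sketch below, which uses synthetic data (an assumption for illustration) and checks that the stored fill values are the training medians rather than medians of the full dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic feature matrix with roughly 20% of entries knocked out at random.
X = rng.normal(size=(200, 3))
X[rng.random(X.shape) < 0.2] = np.nan

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

imputer = SimpleImputer(strategy="median")
imputer.fit(X_train)                      # statistics come from training rows only
X_train_imp = imputer.transform(X_train)
X_test_imp = imputer.transform(X_test)    # test rows are filled with training medians

# The fitted fill values match the per-column medians of the training split.
train_medians = np.nanmedian(X_train, axis=0)
print(imputer.statistics_, train_medians)
```

Calling `fit_transform` on the full dataset before splitting would instead let test rows influence the fill values, which is the leak to avoid.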
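For complex MAR patterns, replace SimpleImputer with IterativeImputer, a multivariate imputer that models each feature as a function of the others. It is still marked experimental in scikit-learn, so sklearn.experimental.enable_iterative_imputer must be imported before IterativeImputer can be imported from sklearn.impute.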
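A minimal sketch of IterativeImputer on synthetic data follows. The two correlated features and the 30% random mask are assumptions made for illustration, and the comparison against a global mean is only there to show why a multivariate imputer helps when features carry information about each other.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, must come first
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Two strongly related features: x2 is roughly 2 * x1.
x1 = rng.normal(size=500)
x2 = 2.0 * x1 + rng.normal(scale=0.1, size=500)
X = np.column_stack([x1, x2])

# Hide about 30% of x2 (at random here, to keep the sketch simple).
mask = rng.random(500) < 0.3
X_missing = X.copy()
X_missing[mask, 1] = np.nan

# Each feature with missing values is regressed on the others, iteratively.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X_missing)

# Mean absolute error on the hidden entries: iterative vs. global-mean fill.
iter_err = np.abs(X_imputed[mask, 1] - X[mask, 1]).mean()
mean_err = np.abs(np.nanmean(X_missing[:, 1]) - X[mask, 1]).mean()
print(f"iterative MAE: {iter_err:.3f}, mean-fill MAE: {mean_err:.3f}")
```

Inside a pipeline, this imputer can take the place of the SimpleImputer step for the numeric columns; it is slower, so it is worth checking whether the gain survives cross-validation.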
Tree models and missing values
Gradient-boosted tree libraries such as LightGBM and XGBoost handle NaN natively: at each split they learn a default direction for missing values. In XGBoost this applies to its tree boosters in general, not only to tree_method="hist". For these models, imputation is optional, but adding a missingness indicator can still improve performance when the pattern is informative.