Machine Learning | Hard | Asked at: Google, Meta, Uber, Airbnb, Booking.com, Netflix

What are the pitfalls of impurity-based feature importance in tree ensembles, and how do you get a more reliable estimate?

For: Data Scientist, ML Engineer, AI / LLM Engineer

The short answer

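Impurity-based importance (mean decrease in impurity, MDI) is computed from training-set statistics and is systematically biased toward high-cardinality and continuous features because they offer more candidate splits. Permutation importance and SHAP values, evaluated on held-out data, are less biased alternatives: permutation importance measures how much a held-out metric drops when a feature is scrambled, and SHAP attributes each individual prediction to the features that produced it.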

How to think about it

Why impurity-based importance is biased

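When a tree selects a split, features with many distinct values (e.g., a user ID, a ZIP code, a random float) offer many candidate thresholds, so one of those thresholds is statistically more likely to give the best impurity reduction by chance, even if the feature carries no true signal. The importance score accumulates these accidental splits, inflating the apparent importance of such features. Because the score is derived from training-set statistics, it also rewards splits that merely memorize the training data.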

The bias was formally demonstrated by Strobl et al. (2007): in a dataset with one real predictor and several noise predictors of varying cardinalities, impurity importance ranked a high-cardinality noise feature above the true predictor.
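The bias is easy to reproduce on synthetic data. The sketch below is not the setup from Strobl et al.; it is a simplified illustration with made-up names (signal, noise_binary, noise_float, rf_demo) and an assumed 20% label-flip rate. It fits a forest on one informative binary feature plus two pure-noise features, then compares MDI with permutation importance on a held-out half.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000

# Hypothetical features: one real binary predictor and two noise columns
signal = rng.integers(0, 2, size=n)        # binary, truly predictive
noise_binary = rng.integers(0, 2, size=n)  # binary, pure noise
noise_float = rng.random(n)                # continuous, pure noise

# Label follows the signal, flipped 20% of the time (assumed noise level)
y = np.where(rng.random(n) < 0.2, 1 - signal, signal)
X = np.column_stack([signal, noise_binary, noise_float])
names = ["signal", "noise_binary", "noise_float"]

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.5, random_state=0)
rf_demo = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# MDI comes from the training data; permutation importance uses the held-out half
mdi = dict(zip(names, rf_demo.feature_importances_))
perm = permutation_importance(rf_demo, X_va, y_va, n_repeats=10, random_state=0)
pfi = dict(zip(names, perm.importances_mean))

for name in names:
    print(f"{name:>13}  MDI={mdi[name]:.3f}  permutation={pfi[name]:.3f}")
```

In a run like this, the continuous noise column typically receives a much larger MDI share than the binary noise column, while permutation importance on held-out data keeps both noise columns near zero.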

Impurity-based importance (what sklearn computes by default)

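import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# X_train, y_train, and feature_names are assumed to be defined already
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)

# feature_importances_ holds the normalized mean decrease in impurity (MDI),
# computed from training-set statistics only
imp = pd.Series(rf.feature_importances_, index=feature_names).sort_values(ascending=False)
# Biased toward high-cardinality features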
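To make concrete what feature_importances_ accumulates, here is a minimal sketch that recomputes MDI for a single decision tree from its fitted structure. The helper manual_mdi and the demo data are illustrative names, not part of scikit-learn; the sketch assumes the tree has at least one split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Small synthetic problem, used only for illustration
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
tree_demo = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_demo, y_demo)

def manual_mdi(fitted_tree, n_features):
    """Sum the weighted impurity decrease of every split, grouped by feature."""
    t = fitted_tree.tree_
    importances = np.zeros(n_features)
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:  # leaf node: no split, nothing to credit
            continue
        decrease = (
            t.weighted_n_node_samples[node] * t.impurity[node]
            - t.weighted_n_node_samples[left] * t.impurity[left]
            - t.weighted_n_node_samples[right] * t.impurity[right]
        )
        importances[t.feature[node]] += decrease
    # Normalize so the importances sum to 1
    return importances / importances.sum()

mdi_by_hand = manual_mdi(tree_demo, X_demo.shape[1])
print(np.allclose(mdi_by_hand, tree_demo.feature_importances_))
```

Every split credits its feature with the impurity it removed on the training samples, whether or not that split generalizes, which is why chance splits on high-cardinality features add up.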

Permutation importance

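Permutation importance shuffles one feature column at a time on the validation set and measures the drop in a chosen metric. If a feature is truly useful, scrambling it hurts performance; if it is irrelevant, shuffling changes almost nothing. One caveat: when two features are strongly correlated, shuffling one still leaves the model access to the other, so both can look less important than they are.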

from sklearn.inspection import permutation_importance

result = permutation_importance(
    rf, X_val, y_val,
    n_repeats=10,
    scoring="roc_auc",
    random_state=0
)

perm_imp = pd.Series(
    result.importances_mean, index=feature_names
).sort_values(ascending=False)
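The shuffle-and-rescore procedure can also be sketched by hand. The function below is an illustrative reimplementation, not scikit-learn's code; manual_permutation_importance, the demo data, and the choice of ROC AUC for a binary classifier are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def manual_permutation_importance(model, X, y, n_repeats=5, seed=0):
    """Mean drop in ROC AUC after shuffling each column in turn."""
    rng = np.random.default_rng(seed)
    baseline = roc_auc_score(y, model.predict_proba(X)[:, 1])
    drops = np.zeros((X.shape[1], n_repeats))
    for j in range(X.shape[1]):
        for r in range(n_repeats):
            X_perm = X.copy()
            # Break the link between column j and the target
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            score = roc_auc_score(y, model.predict_proba(X_perm)[:, 1])
            drops[j, r] = baseline - score
    return drops.mean(axis=1)

# With shuffle=False the first two columns are informative, the rest are noise
X_all, y_all = make_classification(
    n_samples=1000, n_features=5, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
X_fit, X_hold, y_fit, y_hold = train_test_split(X_all, y_all, random_state=0)
model_demo = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_fit, y_fit)

demo_importances = manual_permutation_importance(model_demo, X_hold, y_hold)
print(demo_importances.round(3))
```

Averaging over several shuffles reduces the noise from any single permutation, which is the role n_repeats plays in the scikit-learn call above.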

SHAP values

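SHAP (SHapley Additive exPlanations) decomposes each prediction into per-feature contributions using the Shapley value from cooperative game theory, so the contributions for a row sum to the difference between that row's prediction and the explainer's base value (the expected model output). It accounts for feature interactions and divides credit among correlated features, where permutation importance can understate all of them; attributions for strongly correlated features still need careful interpretation.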

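import numpy as np
import shap

explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_val)
# Older SHAP releases return a list with one array per class; newer ones return
# one array of shape (n_samples, n_features, n_classes). Pick class 1 either way.
sv_pos = shap_values[1] if isinstance(shap_values, list) else np.asarray(shap_values)[..., 1]
shap.summary_plot(sv_pos, X_val, feature_names=feature_names)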
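To show what a Shapley value is, the sketch below computes exact Shapley values by brute force for a toy three-feature function. This is not how TreeExplainer works internally (it uses a much faster tree-specific algorithm); toy_model, exact_shapley, and the all-zeros baseline are assumptions chosen for illustration, and "removing" a feature here simply means replacing it with its baseline value.

```python
from itertools import combinations
from math import factorial

def toy_model(x):
    # One additive term plus one interaction between features 1 and 2
    return 2 * x[0] + x[1] * x[2]

def exact_shapley(f, x, baseline):
    """Brute-force Shapley values; cost grows exponentially with len(x)."""
    n = len(x)

    def value(subset):
        # Features outside the subset are replaced by their baseline value
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return f(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(n):
            for S in combinations(others, k):
                # Shapley weight for a coalition of size k
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (value(set(S) | {i}) - value(set(S)))
        phi.append(total)
    return phi

x_row = [1.0, 2.0, 3.0]
base = [0.0, 0.0, 0.0]
phi = exact_shapley(toy_model, x_row, base)
print(phi)  # approximately [2.0, 3.0, 3.0]
```

The additive term is credited entirely to feature 0, the interaction x[1] * x[2] is split evenly between its two features, and the three values sum to toy_model(x_row) - toy_model(base).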

Which to use

Method	Bias	Speed	Handles correlated features
Impurity (MDI)	High (cardinality bias)	Fast	No
Permutation	Low	Moderate	Partial
SHAP	Lowest	Slow for large ensembles	Yes
