datarekha
Machine Learning | Hard | Asked at Google, Meta, Uber, Airbnb, Booking.com, Netflix

What are the pitfalls of impurity-based feature importance in tree ensembles, and how do you get a more reliable estimate?

The short answer

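Impurity-based importance (mean decrease in impurity, MDI) is systematically biased toward high-cardinality and continuous features because they offer more candidate splits. It is also computed on the training data, so it rewards splits that merely overfit. Permutation importance, evaluated on held-out data, measures how much predictive performance actually depends on a feature. SHAP values attribute each individual prediction to the features that produced it. Both are less affected by cardinality bias than MDI.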

How to think about it

Why impurity-based importance is biased

When a tree selects a split, high-cardinality features (e.g., a user ID, a ZIP code, a random float) have many candidate thresholds and are statistically more likely to be chosen by chance — even if they carry no true signal. The importance score accumulates these accidental splits, inflating the apparent importance of such features.

The bias was demonstrated in simulation by Strobl et al. (2007): in a dataset with one real predictor and several noise predictors of varying cardinalities, impurity importance ranked a high-cardinality noise feature above the true predictor.
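The effect can be reproduced on synthetic data. The following is a minimal sketch, not a replication of the Strobl et al. setup: the feature names (`signal`, `noise_binary`, `noise_float`), the 25% label-flip rate, and the sample size are all assumptions chosen for illustration. It compares MDI against permutation importance on a held-out split.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000

# One genuinely predictive binary feature and two pure-noise features.
signal = rng.integers(0, 2, size=n)        # low cardinality, real signal
noise_binary = rng.integers(0, 2, size=n)  # low cardinality, no signal
noise_float = rng.random(n)                # high cardinality, no signal

# The label equals `signal` except for a random 25% of rows, which are flipped.
flip = rng.random(n) < 0.25
y = np.where(flip, 1 - signal, signal)

X = np.column_stack([signal, noise_binary, noise_float])
names = ["signal", "noise_binary", "noise_float"]

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# MDI comes from the training data; permutation importance uses the held-out split.
mdi = dict(zip(names, rf.feature_importances_))
perm = permutation_importance(rf, X_val, y_val, n_repeats=10, random_state=0)
perm_mean = dict(zip(names, perm.importances_mean))

for name in names:
    print(f"{name:>13}  MDI={mdi[name]:.3f}  permutation={perm_mean[name]:.3f}")
```

Because the trees are grown until their leaves are pure, they need many extra splits to memorise the flipped labels, and only the continuous noise column can supply them. In a run like this one would expect `noise_float` to collect a large share of the MDI while its permutation importance stays near zero.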

Impurity-based importance (what sklearn computes by default)

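import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# X_train, y_train, and feature_names are assumed to be defined already.
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)

# feature_importances_ is the MDI, computed from the training data only.
imp = pd.Series(rf.feature_importances_, index=feature_names).sort_values(ascending=False)
# Biased toward high-cardinality features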

Permutation importance

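Permutation importance shuffles one feature column at a time on the validation set and measures the resulting drop in a chosen metric. If the model relies on a feature, scrambling it hurts performance; if the feature is irrelevant, shuffling it changes almost nothing. Repeating the shuffle several times (n_repeats) averages out the randomness of any single permutation.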

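from sklearn.inspection import permutation_importance

# X_val, y_val: held-out data that the forest was not trained on.
result = permutation_importance(
    rf, X_val, y_val,
    n_repeats=10,
    scoring="roc_auc",  # as written, this assumes a binary target
    random_state=0,
)

# Mean drop in ROC AUC per feature, averaged over the 10 shuffles.
perm_imp = pd.Series(
    result.importances_mean, index=feature_names
).sort_values(ascending=False)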
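To make the mechanics concrete, here is a hand-rolled sketch of the same idea. It is not how scikit-learn implements `permutation_importance` internally. The helper name `manual_permutation_importance` and the toy dataset are hypothetical, and the sketch assumes a binary classifier that exposes `predict_proba`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def manual_permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean and std of the ROC AUC drop when each column of X is shuffled on its own."""
    rng = np.random.default_rng(seed)
    baseline = roc_auc_score(y, model.predict_proba(X)[:, 1])
    drops = np.zeros((X.shape[1], n_repeats))
    for j in range(X.shape[1]):
        for r in range(n_repeats):
            X_shuffled = X.copy()
            # Break the link between feature j and the target
            # while keeping the feature's marginal distribution.
            X_shuffled[:, j] = rng.permutation(X_shuffled[:, j])
            score = roc_auc_score(y, model.predict_proba(X_shuffled)[:, 1])
            drops[j, r] = baseline - score
    return drops.mean(axis=1), drops.std(axis=1)


# Toy data (an assumption for illustration): with shuffle=False the two
# informative columns come first and the remaining four are pure noise.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=6, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

mean_drop, std_drop = manual_permutation_importance(model, X_va, y_va)
for j, (m, s) in enumerate(zip(mean_drop, std_drop)):
    print(f"feature {j}: mean AUC drop = {m:.3f} (std {s:.3f})")
```

Reporting the standard deviation alongside the mean helps separate a genuinely small importance from one that is indistinguishable from shuffle noise.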

SHAP values

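SHAP (SHapley Additive exPlanations) decomposes each prediction into per-feature contributions using the Shapley value from cooperative game theory. Because each contribution is averaged over all coalitions of the other features, credit for an interaction is shared among the features involved, and correlated features tend to split credit according to how the model actually uses them. This is generally a more faithful treatment than permuting one column at a time. Averaging the absolute SHAP values per feature gives a global importance ranking.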

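import shap

explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_val)
# Positive class: older shap releases return a list of per-class arrays,
# newer ones a single (n_samples, n_features, n_classes) array.
pos = shap_values[1] if isinstance(shap_values, list) else shap_values[..., 1]
shap.summary_plot(pos, X_val, feature_names=feature_names)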
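The Shapley value itself can be illustrated without any library. The sketch below enumerates every coalition for a hypothetical three-feature function, with an all-zeros baseline standing in for "feature absent". Both the function and the baseline are assumptions for illustration. TreeExplainer does not work this way internally: it uses a tree-specific algorithm and averages over data rather than substituting a single baseline.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating every coalition (only feasible for tiny n)."""
    players = range(n_features)
    phi = [0.0] * n_features
    for i in players:
        others = [p for p in players if p != i]
        for size in range(len(others) + 1):
            # Probability that a coalition of this size precedes player i
            # in a uniformly random ordering of the players.
            weight = factorial(size) * factorial(n_features - size - 1) / factorial(n_features)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


# Hypothetical model f(x) = 2*x0 + x1*x2, explained at x = (1, 1, 1).
x = (1.0, 1.0, 1.0)
baseline = (0.0, 0.0, 0.0)


def f(v):
    return 2 * v[0] + v[1] * v[2]


def value_fn(coalition):
    # Features in the coalition keep their real value; the rest fall back to the baseline.
    v = [x[i] if i in coalition else baseline[i] for i in range(3)]
    return f(v)


phi = shapley_values(value_fn, 3)
print(phi)  # approximately [2.0, 0.5, 0.5]
```

The additive term gives feature 0 its full effect of 2, while the interaction term x1*x2 is split evenly between features 1 and 2. The contributions sum to f(x) minus f(baseline), which is the additivity property that gives SHAP its name.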

Which to use

| Method | Bias | Speed | Handles correlated features |
|---|---|---|---|
| Impurity (MDI) | High (cardinality bias) | Fast | No |
| Permutation | Low | Moderate | Partial |
| SHAP | Lowest | Slow for large ensembles | Yes |
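The "Partial" entry for permutation importance can be illustrated with an extreme case of correlation: an exact duplicate column. This is a minimal sketch on synthetic data, and the helper name `importance_of_first_column` is hypothetical. When the forest can fall back on the copy, shuffling the original column hurts less, so the feature looks less important than it really is.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)       # the only informative feature
noise = rng.normal(size=n)   # unrelated to the target
y = (x + 0.5 * rng.normal(size=n) > 0).astype(int)


def importance_of_first_column(X):
    """Fit a forest and return the permutation importance of column 0 on held-out data."""
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.5, random_state=0)
    forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
    result = permutation_importance(forest, X_va, y_va, n_repeats=10, random_state=0)
    return result.importances_mean[0]


# Same informative feature, once on its own and once next to an exact copy.
alone = importance_of_first_column(np.column_stack([x, noise]))
with_copy = importance_of_first_column(np.column_stack([x, x, noise]))

print(f"importance of x alone:       {alone:.3f}")
print(f"importance of x with a copy: {with_copy:.3f}")
```

A common mitigation, under the same assumptions, is to group strongly correlated features and permute or drop each group together, so the importance is assigned to the group rather than diluted across its members.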