What are the pitfalls of impurity-based feature importance in tree ensembles, and how do you get a more reliable estimate?
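Impurity-based importance (mean decrease in impurity, MDI) is systematically biased toward high-cardinality and continuous features because they offer more candidate splits. It is also computed from the training data, so it can reward features the model merely overfit to. Permutation importance, evaluated on held-out data, measures the actual drop in predictive performance when a feature is scrambled. SHAP values attribute each individual prediction to the features that produced it. Both are generally less affected by cardinality than MDI.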
How to think about it
Why impurity-based importance is biased
When a tree selects a split, high-cardinality features (e.g., a user ID, a ZIP code, a random float) offer many candidate thresholds, so one of them is statistically more likely to reduce impurity by chance, even if the feature carries no true signal. Because MDI is accumulated from splits made on the training data, these accidental splits add up and inflate the apparent importance of such features.
The bias was formally demonstrated by Strobl et al. (2007): in a dataset with one real predictor and several noise predictors of varying cardinalities, impurity importance ranked a high-cardinality noise feature above the true predictor.
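The cardinality effect is easy to reproduce in a null setting. The following is a minimal sketch in the spirit of that kind of simulation, not a reproduction of the paper's design: every feature is pure noise and the labels are random, so an unbiased measure would score all three features about equally. The column names, sample size, and forest settings are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000

# Every feature is pure noise; they differ only in the number of distinct values.
X_noise = pd.DataFrame({
    "noise_binary": rng.integers(0, 2, n),
    "noise_10_levels": rng.integers(0, 10, n),
    "noise_continuous": rng.normal(size=n),
})
y_noise = rng.integers(0, 2, n)  # labels are independent of all features

rf_noise = RandomForestClassifier(n_estimators=100, random_state=0)
rf_noise.fit(X_noise, y_noise)

# MDI scores, normalized by scikit-learn to sum to 1
mdi_noise = pd.Series(rf_noise.feature_importances_, index=X_noise.columns)
print(mdi_noise.sort_values(ascending=False))
```

Although no feature is predictive, the scores should increase with the number of distinct values, because fully grown trees can keep splitting on a continuous column long after a binary column is exhausted.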
Impurity-based importance (what sklearn computes by default)
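import pandas as pd
from sklearn.ensemble import RandomForestClassifier
# X_train, y_train, and feature_names are assumed to be defined already
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)
# feature_importances_ holds the MDI scores, normalized to sum to 1
imp = pd.Series(rf.feature_importances_, index=feature_names).sort_values(ascending=False)
# Computed from training-set splits; biased toward high-cardinality features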
Permutation importance
Permutation importance shuffles one feature column at a time on the validation set and measures the resulting drop in a chosen metric. If a feature is truly useful, scrambling it hurts performance; if it is irrelevant, the score barely changes.
from sklearn.inspection import permutation_importance
result = permutation_importance(
    rf, X_val, y_val,
    n_repeats=10,
    scoring="roc_auc",
    random_state=0,
)
perm_imp = pd.Series(
    result.importances_mean, index=feature_names
).sort_values(ascending=False)
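To make the shuffle-and-score mechanism concrete, here is a from-scratch sketch of the same idea. It is not scikit-learn's implementation. The synthetic data, the helper name `manual_permutation_importance`, and the choice of ROC AUC are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: with shuffle=False, columns 0-1 are informative
# and columns 2-4 are pure noise.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=5, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
demo_rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)


def manual_permutation_importance(model, X, y, n_repeats=5, seed=0):
    """Mean drop in ROC AUC after shuffling each column of X in turn."""
    rng = np.random.default_rng(seed)
    baseline = roc_auc_score(y, model.predict_proba(X)[:, 1])
    drops = np.zeros((X.shape[1], n_repeats))
    for j in range(X.shape[1]):
        for r in range(n_repeats):
            X_perm = X.copy()
            # Shuffling breaks the link between column j and the target
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            permuted = roc_auc_score(y, model.predict_proba(X_perm)[:, 1])
            drops[j, r] = baseline - permuted
    return drops.mean(axis=1)


manual_imp = manual_permutation_importance(demo_rf, X_te, y_te)
print(manual_imp.round(3))
```

The noise columns should score near zero (small negative values are possible), while at least one informative column shows a clear drop. Repeating the shuffle several times and averaging reduces the variance that comes from any single permutation.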
SHAP values
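SHAP (SHapley Additive exPlanations) decomposes each prediction into per-feature contributions using the Shapley value from cooperative game theory. Because it averages a feature's marginal contribution over coalitions of the other features, it accounts for interactions, and it tends to split credit among correlated features rather than assigning it all to one of them. Attributions for strongly correlated features still need careful interpretation, and SHAP describes how the model uses a feature, not whether that feature improves held-out performance.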
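import numpy as np
import shap
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_val)
# Pick the positive class. Depending on the shap version, a binary classifier yields either
# a list with one array per class or one array of shape (n_samples, n_features, n_classes).
sv_pos = shap_values[1] if isinstance(shap_values, list) else np.asarray(shap_values)[..., 1]
shap.summary_plot(sv_pos, X_val, feature_names=feature_names)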
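The Shapley decomposition itself can be illustrated without the shap library. The sketch below computes exact Shapley values by brute force for a hypothetical three-feature `toy_model`, treating an "absent" feature as one replaced by a baseline value. That value function is an assumption chosen for simplicity; it is not how `TreeExplainer` works internally, and brute-force enumeration is only feasible for a handful of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x; absent features take their baseline value."""
    n = len(x)

    def value(subset):
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return f(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(n):
            for S in combinations(others, k):
                # Shapley weight for a coalition of size k
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += w * (value(set(S) | {i}) - value(set(S)))
        phi.append(total)
    return phi


def toy_model(z):
    # One additive term plus one interaction term
    return 2.0 * z[0] + z[1] * z[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)  # approximately [2.0, 3.0, 3.0]
```

The contributions sum to `toy_model(x) - toy_model(baseline)`, which is 8 here. The additive term is credited entirely to the first feature, and the interaction term `z[1] * z[2]` is split evenly between the two features involved.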
Which to use
| Method | Bias | Speed | Handles correlated features |
|---|---|---|---|
| Impurity (MDI) | High (cardinality bias) | Fast | No |
| Permutation | Low | Moderate | Partial |
| SHAP | Low | Slow for large ensembles | Partial (credit is split across correlated features) |