Feature selection
Fewer features can beat more: less overfitting, faster models, clearer explanations. This lesson covers filter, wrapper, and embedded methods, plus permutation importance, for choosing what to keep.
What you'll learn
- The three families — filter, wrapper, and embedded selection
- Permutation importance — a model-agnostic way to rank features
- Why fewer features often generalize better (and the leakage trap: choosing features using the same rows you later evaluate on)
Before you start
More features feel like more information, but past a point they hurt: every useless column adds noise, invites overfitting, slows the model, and clouds the explanation. Feature selection is choosing the subset that actually carries signal — and on real tabular problems, a smaller, well-chosen set often beats throwing everything in.
Three families
There are three ways to decide what to keep, trading speed for thoroughness:
- Filter — score each feature independently of any model (correlation with the target, chi-squared, mutual information) and keep the top ones. Fast and model-agnostic, but blind to feature interactions (two weak features that are strong together get dropped).
- Wrapper — actually train the model on different feature subsets and keep the best. Recursive Feature Elimination (RFE) repeatedly drops the weakest feature and refits. Accurate but expensive (many model fits).
- Embedded — selection happens inside training. L1/Lasso regularization drives useless coefficients to exactly zero; tree importances rank features as a byproduct. Efficient and usually the practical default.
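The filter and embedded families from the list above can be sketched with scikit-learn. This is a minimal illustration under stated assumptions, not a recommended configuration: the synthetic data, k=5, and the L1-penalized LinearSVC with C=0.01 are all choices made for the example. Both selectors are fit on the training split only, so the held-out rows never influence which columns are kept.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel, SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Assumed toy data: with shuffle=False and n_redundant=0, the 5 informative
# features are columns 0-4 and the other 15 columns are noise.
X, y = make_classification(
    n_samples=1500, n_features=20, n_informative=5, n_redundant=0,
    shuffle=False, random_state=0,
)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

# Filter: score each column against the target on its own, keep the top 5.
score_func = partial(mutual_info_classif, random_state=0)
filter_sel = SelectKBest(score_func=score_func, k=5).fit(Xtr, ytr)
filter_idx = filter_sel.get_support(indices=True)
print("filter (mutual information) kept:", filter_idx.tolist())

# Embedded: an L1-penalized linear model zeroes out coefficients while it
# trains; SelectFromModel keeps the columns whose coefficient survived.
l1_model = LinearSVC(penalty="l1", dual=False, C=0.01, max_iter=5000)
embedded_sel = SelectFromModel(l1_model).fit(Xtr, ytr)
embedded_idx = embedded_sel.get_support(indices=True)
print("embedded (L1) kept:", embedded_idx.tolist())
```

The filter needs a target count (k) up front, while the L1 model decides how many columns survive through the strength of its penalty; a smaller C prunes more aggressively.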
Permutation importance — the honest ranker
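A model-agnostic favorite: permutation importance measures how much the score drops when you randomly shuffle one feature's values. A big drop means the model relied on that feature; no drop means it was dead weight. Unlike a tree's built-in, impurity-based importance, which is computed from the training data, it can be measured on held-out data, and it works for any fitted model. One caveat: when two features are strongly correlated, shuffling one leaves the model the other to lean on, so both can look less important than they are.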
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# 20 features, only 5 informative. n_redundant=0 and shuffle=False place the
# informative ones in columns 0-4, so the result can be checked.
X, y = make_classification(n_samples=1500, n_features=20, n_informative=5,
                           n_redundant=0, shuffle=False, random_state=0)
informative = set(range(5))
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(Xtr, ytr)

# Permutation importance: mean score drop on the held-out split over 10 shuffles
imp = permutation_importance(rf, Xte, yte, n_repeats=10, random_state=0).importances_mean
top = np.argsort(imp)[::-1][:5]
print("top 5 by permutation importance:", top.tolist())

# RFE (wrapper): refit, drop the weakest feature, repeat until 5 remain
keep = np.where(RFE(rf, n_features_to_select=5).fit(Xtr, ytr).support_)[0]
print("RFE selected 5 features:", keep.tolist())

# Count how many truly informative columns each method recovered
print("\ninformative features recovered:",
      f"permutation {len(informative & set(top.tolist()))}/5,",
      f"RFE {len(informative & set(keep.tolist()))}/5")
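The shuffling idea itself is short enough to write by hand. The sketch below is a simplified illustration of the technique, not scikit-learn's implementation: manual_permutation_importance is a hypothetical helper, and the small dataset and forest size are assumptions chosen to keep it quick.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def manual_permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean drop in model.score when each column is shuffled (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    baseline = model.score(X, y)
    drops = np.zeros(X.shape[1])
    for col in range(X.shape[1]):
        for _ in range(n_repeats):
            X_shuffled = X.copy()
            # Shuffling one column breaks its link to y but keeps its distribution.
            X_shuffled[:, col] = rng.permutation(X_shuffled[:, col])
            drops[col] += baseline - model.score(X_shuffled, y)
    return drops / n_repeats


# Assumed toy data: columns 0-2 are informative, columns 3-7 are noise.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(Xtr, ytr)

# Measure on the held-out split, never on the rows the model was trained on.
drops = manual_permutation_importance(model, Xte, yte)
for col in np.argsort(drops)[::-1]:
    print(f"feature {col}: mean score drop {drops[col]:+.3f}")
```

Noise columns can come out slightly negative: shuffling a feature the model barely uses sometimes nudges the score up by chance, which is why the drop is averaged over several repeats.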
Next
To combine features instead of dropping them, see PCA; to explain the survivors, see SHAP and interpretability.
Practice this in an interview
Filter methods score features by statistical relevance to the target independently of any model, so they're fast but ignore feature interactions. Wrapper methods (like recursive feature elimination) search subsets by training a model and evaluating performance, which is accurate but computationally expensive. Embedded methods select features as part of model training (like lasso or tree importances), giving a good balance of accuracy and efficiency.
Filter methods score features independently of the model using statistics like mutual information or correlation; they are fast but ignore feature interactions. Wrapper methods search subsets by actually training the model, finding better subsets at high computational cost. Embedded methods perform selection during training — LASSO and tree-based feature importances are the most common — offering a balance of quality and speed.
Feature selection keeps a subset of the original features and discards the rest, so the surviving features stay interpretable. Dimensionality reduction like PCA creates new features that are combinations of the originals, compressing information but losing direct interpretability. Choose feature selection when you need to explain which inputs matter, and PCA when you mainly need a compact representation and don't need named features.
Feature engineering is creating, transforming, or selecting input variables so a model can capture patterns more easily. Common techniques include scaling, encoding categoricals, binning, interaction and ratio features, date/time decomposition, and domain-derived aggregates. It often matters more than the choice of algorithm because models can only learn from the signal present in their inputs.