datarekha

Feature selection

Fewer features can beat more — less overfitting, faster models, clearer explanations. Filter, wrapper, and embedded methods, plus permutation importance, for choosing what to keep.

7 min read · Intermediate · Machine Learning · Lesson 22 of 33

What you'll learn

  • The three families — filter, wrapper, and embedded selection
  • Permutation importance — a model-agnostic way to rank features
  • Why fewer features often generalize better (and the leakage trap)

Before you start

More features feel like more information, but past a point they hurt: every useless column adds noise, invites overfitting, slows the model, and clouds the explanation. Feature selection is choosing the subset that actually carries signal — and on real tabular problems, a smaller, well-chosen set often beats throwing everything in.

Three families

There are three ways to decide what to keep, trading speed for thoroughness:

Family     How it works                Examples                          Trade-off
Filter     score each feature alone    correlation, chi², mutual info    fast, model-agnostic
Wrapper    search subsets by score     RFE, forward/backward             accurate, expensive
Embedded   selection during training   L1/Lasso, tree importance         built-in, efficient

Figure: Filter scores features independently; wrapper searches subsets; embedded selects while the model trains.

  • Filter: score each feature independently of any model (correlation with the target, chi-squared, mutual information) and keep the top ones. Fast and model-agnostic, but blind to feature interactions: two features that are weak alone but strong together can both get dropped.
  • Wrapper: actually train the model on different feature subsets and keep the best-scoring subset. Recursive Feature Elimination (RFE) repeatedly drops the weakest feature and refits. Accurate but expensive, because it needs many model fits.
  • Embedded: selection happens inside training. L1/Lasso regularization drives the coefficients of unhelpful features to exactly zero; tree importances rank features as a byproduct of fitting. Efficient and usually the practical default.
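
To make the filter and embedded families concrete, here is a minimal sketch under stated assumptions: the data is synthetic, and an L1-penalized linear SVM (`LinearSVC`) stands in for Lasso because the toy problem is classification. The values `k=5` and `C=0.05` are illustrative choices, not recommendations.

```python
from functools import partial

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel, SelectKBest, mutual_info_classif
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic data (an assumption for illustration): with shuffle=False the
# 5 informative columns come first, then 2 redundant ones, then 13 noise columns.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           n_redundant=2, shuffle=False, random_state=0)
X = StandardScaler().fit_transform(X)

# Filter: score every column on its own with mutual information, keep the top 5.
# partial(...) only pins the estimator's random_state so the run is repeatable.
score = partial(mutual_info_classif, random_state=0)
filt = SelectKBest(score_func=score, k=5).fit(X, y)
filter_keep = np.flatnonzero(filt.get_support())

# Embedded: the L1 penalty pushes weak coefficients to zero during training;
# SelectFromModel keeps the columns whose coefficient survived.
l1_model = LinearSVC(C=0.05, penalty="l1", dual=False, max_iter=5000, random_state=0)
emb = SelectFromModel(l1_model).fit(X, y)
embedded_keep = np.flatnonzero(emb.get_support())

print("filter kept:  ", filter_keep.tolist())
print("embedded kept:", embedded_keep.tolist())
```

The filter always returns exactly `k` columns, while the embedded selector returns however many coefficients the penalty leaves nonzero, so the strength of the penalty (`C` here) is the knob that controls how aggressive the selection is.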

Permutation importance — the honest ranker

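A model-agnostic favorite: permutation importance measures how much the score drops when you randomly shuffle one feature’s values. A big drop means the model relied on it; no drop means it was dead weight. Unlike a tree’s built-in (impurity-based) importance, which is computed from the training data, it can be measured on held-out data and works for any fitted model. One caveat: when two features are strongly correlated, shuffling one of them can show little drop because the model falls back on the other.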

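import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# 20 features. With shuffle=False, columns 0-4 are informative, columns 5-6 are
# redundant (linear combinations of the informative ones; n_redundant=2 is the
# default), and columns 7-19 are pure noise.
X, y = make_classification(n_samples=1500, n_features=20, n_informative=5,
                           n_redundant=2, shuffle=False, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(Xtr, ytr)

# Permutation importance: mean drop in held-out accuracy over 10 shuffles per feature.
imp = permutation_importance(rf, Xte, yte, n_repeats=10, random_state=0).importances_mean
top = np.argsort(imp)[::-1][:6]
print("top features by permutation importance:", top.tolist())

# RFE (wrapper): refit the forest repeatedly, dropping the weakest feature each
# round (ranked by the forest's own feature_importances_) until 5 remain.
keep = np.where(RFE(rf, n_features_to_select=5).fit(Xtr, ytr).support_)[0]
print("RFE selected 5 features:", keep.tolist())

# Compare against the known signal columns instead of asserting the outcome.
signal = set(range(7))
print("\npermutation top-6 inside signal columns 0-6:", len(signal & set(top.tolist())), "of 6")
print("RFE picks inside signal columns 0-6:", len(signal & set(keep.tolist())), "of 5")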
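
The third learning goal and the last quick-check question both point at the leakage trap: if features are chosen using all rows before cross-validation, the held-out folds have already influenced the selection and the score comes out too optimistic. The sketch below illustrates this on pure noise, where no honest model should beat roughly 50% accuracy. The sample sizes, `k=20`, and the choice of `f_classif` with logistic regression are assumptions made for the demonstration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Pure noise: labels are random, so any apparent skill is an artifact.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))
y = rng.integers(0, 2, size=100)

# Leaky: pick the 20 "best" columns using ALL rows, then cross-validate.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=5).mean()

# Honest: the selector sits inside the pipeline, so it is refit on the
# training part of every fold and never sees that fold's held-out rows.
pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000))
honest = cross_val_score(pipe, X, y, cv=5).mean()

print(f"selection outside CV: {leaky:.2f}")
print(f"selection inside CV:  {honest:.2f}")
```

Wrapping the selector and the model in one pipeline is the design choice that matters here: anything that looks at the target, including feature scoring, gets treated as part of training.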

Quick check

Q1. What's the key weakness of filter methods (scoring each feature independently)?
Q2. What does permutation importance measure?
Q3. Why must feature selection happen inside the cross-validation loop?

Next

To combine features instead of dropping them, see PCA; to explain the survivors, see SHAP and interpretability.


Practice this in an interview

Compare filter, wrapper, and embedded feature-selection methods. When would you use each?

Filter methods score features by statistical relevance to the target independently of any model, so they're fast but ignore feature interactions. Wrapper methods (like recursive feature elimination) search subsets by training a model and evaluating performance, which is accurate but computationally expensive. Embedded methods select features as part of model training (like lasso or tree importances), giving a good balance of accuracy and efficiency.

What are filter, wrapper, and embedded feature selection methods, and when do you use each?

Filter methods score features independently of the model using statistics like mutual information or correlation; they are fast but ignore feature interactions. Wrapper methods search subsets by actually training the model, finding better subsets at high computational cost. Embedded methods perform selection during training — LASSO and tree-based feature importances are the most common — offering a balance of quality and speed.

What's the difference between feature selection and dimensionality reduction like PCA?

Feature selection keeps a subset of the original features and discards the rest, so the surviving features stay interpretable. Dimensionality reduction like PCA creates new features that are combinations of the originals, compressing information but losing direct interpretability. Choose feature selection when you need to explain which inputs matter, and PCA when you mainly need a compact representation and don't need named features.
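
As a small illustration of that contrast, the sketch below applies both approaches to the Iris dataset (chosen here only because it ships with scikit-learn and has named columns). Selection returns two of the original columns by name, while PCA returns two new axes whose weights mix all four columns.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

data = load_iris()
X, y = data.data, data.target
names = np.array(data.feature_names)

# Selection: keep 2 of the original columns, so their names survive.
selector = SelectKBest(f_classif, k=2).fit(X, y)
kept_names = names[selector.get_support()]
print("selected columns:", kept_names.tolist())

# PCA: build 2 new axes; each is a weighted mix of all 4 original columns.
pca = PCA(n_components=2).fit(X)
print("PCA weights (rows = components, columns = original features):")
print(np.round(pca.components_, 2))
```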

What is feature engineering, and can you walk through how you'd engineer features to improve a model?

Feature engineering is creating, transforming, or selecting input variables so a model can capture patterns more easily. Common techniques include scaling, encoding categoricals, binning, interaction and ratio features, date/time decomposition, and domain-derived aggregates. It often matters more than the choice of algorithm because models can only learn from the signal present in their inputs.
