Compare filter, wrapper, and embedded feature-selection methods. When would you use each?

Filter methods score features by statistical relevance to the target independently of any model, so they're fast but ignore feature interactions. Wrapper methods (like recursive feature elimination) search subsets by training a model and evaluating performance, which is accurate but computationally expensive. Embedded methods select features as part of model training (like lasso or tree importances), giving a good balance of accuracy and efficiency.

What are filter, wrapper, and embedded feature selection methods, and when do you use each?

Filter methods score features independently of the model using statistics like mutual information or correlation; they are fast but ignore feature interactions. Wrapper methods search subsets by actually training the model, finding better subsets at high computational cost. Embedded methods perform selection during training — LASSO and tree-based feature importances are the most common — offering a balance of quality and speed.
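As a rough illustration of the three families side by side, here is a minimal scikit-learn sketch. The synthetic dataset, the choice of `k=4`, the `LinearSVC` model, and `C=0.01` are all assumptions made for the example, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Synthetic data (assumed for illustration): 12 columns, 4 of them informative.
X, y = make_classification(n_samples=600, n_features=12, n_informative=4,
                           n_redundant=0, n_clusters_per_class=1, random_state=0)

def mi_score(X, y):
    # Fixed seed so the mutual-information estimate is repeatable.
    return mutual_info_classif(X, y, random_state=0)

# Filter: score each column against y with no model in the loop.
filter_sel = SelectKBest(score_func=mi_score, k=4).fit(X, y)

# Wrapper: refit the model repeatedly, dropping the weakest feature each round.
wrapper_sel = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4).fit(X, y)

# Embedded: an L1 penalty pushes some coefficients to zero during a single fit.
l1_model = LinearSVC(penalty="l1", dual=False, C=0.01, max_iter=5000)
embedded_sel = SelectFromModel(l1_model).fit(X, y)

for name, sel in [("filter", filter_sel), ("wrapper", wrapper_sel), ("embedded", embedded_sel)]:
    print(name, np.flatnonzero(sel.get_support()).tolist())
```

The filter and wrapper selectors are told how many features to keep, while the embedded one keeps whatever the penalty leaves non-zero, so its count depends on `C`.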

What's the difference between feature selection and dimensionality reduction like PCA?

Feature selection keeps a subset of the original features and discards the rest, so the surviving features stay interpretable. Dimensionality reduction like PCA creates new features that are combinations of the originals, compressing information but losing direct interpretability. Choose feature selection when you need to explain which inputs matter, and PCA when you mainly need a compact representation and don't need named features.
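The contrast can be seen in a small sketch on the iris dataset bundled with scikit-learn. The choice of `f_classif` as the scoring function and of two output dimensions is an assumption for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

data = load_iris()
X, y, names = data.data, data.target, np.array(data.feature_names)

# Feature selection: two of the four original columns survive, names intact.
selector = SelectKBest(f_classif, k=2).fit(X, y)
kept = names[selector.get_support()]
print("selected:", kept.tolist())

# PCA: two new axes, each a weighted mix of all four original columns.
pca = PCA(n_components=2).fit(X)
for i, weights in enumerate(pca.components_):
    mix = ", ".join(f"{w:+.2f}*{n}" for w, n in zip(weights, names))
    print(f"PC{i + 1} = {mix}")
```

The selector's output can be reported by column name, whereas each principal component has to be described as a blend of every input.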

What is feature engineering, and can you walk through how you'd engineer features to improve a model?

Feature engineering is creating, transforming, or selecting input variables so a model can capture patterns more easily. Common techniques include scaling, encoding categoricals, binning, interaction and ratio features, date/time decomposition, and domain-derived aggregates. It often matters more than the choice of algorithm because models can only learn from the signal present in their inputs.
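A few of these techniques can be sketched with pandas. The order table below is hypothetical data invented for the example, and the column names are placeholders.

```python
import pandas as pd

# Hypothetical order data, invented for illustration.
orders = pd.DataFrame({
    "customer": ["a", "a", "b", "b", "b"],
    "placed_at": pd.to_datetime(["2024-01-05 09:30", "2024-02-17 22:10",
                                 "2024-01-22 14:00", "2024-03-02 08:45",
                                 "2024-03-06 19:20"]),
    "amount": [120.0, 80.0, 15.0, 30.0, 45.0],
    "items": [4, 2, 1, 3, 3],
    "channel": ["web", "app", "web", "web", "app"],
})

# Date/time decomposition: expose parts the raw timestamp hides.
orders["hour"] = orders["placed_at"].dt.hour
orders["weekday"] = orders["placed_at"].dt.dayofweek  # Monday = 0
orders["is_weekend"] = orders["weekday"] >= 5

# Ratio feature: amount per item can say more than either column alone.
orders["amount_per_item"] = orders["amount"] / orders["items"]

# Domain-derived aggregate: each customer's mean spend, broadcast back to rows.
orders["customer_mean_amount"] = orders.groupby("customer")["amount"].transform("mean")

# Categorical encoding: one indicator column per channel.
features = pd.get_dummies(orders.drop(columns=["customer", "placed_at"]),
                          columns=["channel"])
print(features)
```

In a real pipeline, aggregates like the per-customer mean would be computed on the training split only, so that information from the test rows does not leak in.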

Feature selection — Machine Learning

More features feel like more information, but past a point they hurt: every useless column adds noise, invites overfitting, slows the model, and clouds the explanation. Feature selection is choosing the subset that actually carries signal — and on real tabular problems, a smaller, well-chosen set often beats throwing everything in.

Three families

There are three ways to decide what to keep, trading speed for thoroughness:

Filter scores features independently; wrapper searches subsets; embedded selects while the model trains.

Filter — score each feature independently of any model (correlation with the target, chi-squared, mutual information) and keep the top ones. Fast and model-agnostic, but blind to feature interactions (two weak features that are strong together get dropped).
Wrapper — actually train the model on different feature subsets and keep the best. Recursive Feature Elimination (RFE) repeatedly drops the weakest feature and refits. Accurate but expensive (many model fits).
Embedded — selection happens inside training. L1/Lasso regularization drives useless coefficients to exactly zero; tree importances rank features as a byproduct. Efficient and usually the practical default.
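The interaction blind spot of filter methods can be reproduced with a toy XOR target. This is a sketch on invented data: the sample size, the forest settings, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 2000
a = rng.integers(0, 2, n)
b = rng.integers(0, 2, n)
noise = rng.integers(0, 2, n)
y = a ^ b  # the target depends on a and b only jointly (XOR)
X = np.column_stack([a, b, noise])

# Filter view: scored one at a time, a and b look about as useless as noise.
mi = mutual_info_classif(X, y, discrete_features=True, random_state=0)
print("per-feature mutual information:", mi.round(4))

# Model view: compare a forest that sees a and b together with one that sees a alone.
forest = RandomForestClassifier(n_estimators=50, random_state=0)
acc_pair = cross_val_score(forest, X[:, :2], y, cv=5).mean()
acc_single = cross_val_score(forest, X[:, :1], y, cv=5).mean()
print("accuracy with a and b:", round(acc_pair, 3))
print("accuracy with a alone:", round(acc_single, 3))
```

A per-feature filter would rank `a` and `b` alongside the noise column, while a wrapper or embedded method that evaluates them together has a chance to keep both.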

Permutation importance — the honest ranker

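A model-agnostic favorite: permutation importance measures how much the score drops when you randomly shuffle one feature’s values. A big drop means the model relied on it; no drop means the model did not need it, either because it is dead weight or because a correlated feature carries the same signal. Unlike a tree’s built-in (impurity-based) importance, which is computed from the training data, it can be measured on held-out data and works for any fitted model.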
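To make the shuffle-and-rescore idea concrete, here is a hand-rolled sketch of the procedure. The helper `permutation_drop` is a hypothetical name written for this example, not a scikit-learn function, and the dataset and logistic-regression model are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def permutation_drop(model, X, y, column, n_repeats=10, seed=0):
    """Mean drop in model.score when one column of X is shuffled."""
    rng = np.random.default_rng(seed)
    baseline = model.score(X, y)
    drops = []
    for _ in range(n_repeats):
        X_perm = X.copy()
        # Shuffling breaks the link between this column and the target.
        X_perm[:, column] = rng.permutation(X_perm[:, column])
        drops.append(baseline - model.score(X_perm, y))
    return float(np.mean(drops))

# shuffle=False keeps the column order: 0-2 informative, 3-5 noise.
X, y = make_classification(n_samples=1000, n_features=6, n_informative=3,
                           n_redundant=0, n_clusters_per_class=1,
                           shuffle=False, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Score on the held-out split so the ranking reflects generalization.
drops = [permutation_drop(model, X_te, y_te, j) for j in range(X.shape[1])]
for j, d in enumerate(drops):
    print(f"feature {j}: mean accuracy drop {d:+.3f}")
```

The model is fitted once and only re-scored, which is why this is much cheaper than a wrapper search that refits for every candidate subset.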


import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.feature_selection import RFE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# 20 features: 5 informative, 2 redundant (linear mixes of the informative ones), 13 noise.
# shuffle=False keeps that column order, so columns 0-6 carry signal and 7-19 are noise.
X, y = make_classification(n_samples=1500, n_features=20, n_informative=5, shuffle=False, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(Xtr, ytr)
# Permutation importance is scored on the held-out split, not the training data.
imp = permutation_importance(rf, Xte, yte, n_repeats=10, random_state=0).importances_mean
top = np.argsort(imp)[::-1][:6]
print("top features by permutation importance:", top.tolist())

# RFE (wrapper): refit, drop the weakest feature, repeat until 5 remain
keep = np.where(RFE(rf, n_features_to_select=5).fit(Xtr, ytr).support_)[0]
print("RFE selected 5 features:", keep.tolist())
print("\nColumns 0-6 carry signal here; both lists should draw mostly from them.")

Quick check

Q1. What's the key weakness of filter methods (scoring each feature independently)?

Q2. What does permutation importance measure?

Q3. Why must feature selection happen inside the cross-validation loop?

To combine features instead of dropping them, see PCA; to explain the survivors, see SHAP and interpretability.
