What are filter, wrapper, and embedded feature selection methods, and when do you use each?
Filter methods score features independently of the model using statistics like mutual information or correlation; they are fast, but because they typically score one feature at a time they ignore feature interactions. Wrapper methods search subsets by actually training the model, often finding better subsets at high computational cost. Embedded methods perform selection during training (LASSO and tree-based feature importances are the most common), offering a balance of quality and speed.
How to think about it
Feature selection can reduce overfitting, shorten training time, and improve interpretability. Which method to reach for depends on dataset size, feature count, and whether you can afford repeated model training.
Filter methods
Score each feature independently, then threshold. Common statistics:
- Variance threshold: drop near-constant features.
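- Mutual information: measures any statistical dependence between a feature and the target, including nonlinear dependence.
- ANOVA F-statistic: tests whether a numeric feature's mean differs across classes; suited to classification tasks, but it captures only linear separation.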
- Pearson correlation: linear dependence; misses nonlinear relationships.
from sklearn.feature_selection import SelectKBest, mutual_info_classif
selector = SelectKBest(mutual_info_classif, k=20)  # keep the 20 highest-scoring features
X_selected = selector.fit_transform(X_train, y_train)
Use when: the dataset is large and model-training loops would be prohibitive; as a fast first pass before wrapper or embedded methods.
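The variance threshold and ANOVA F-statistic from the list above can be chained as a two-step filter. The following is a minimal sketch, not a definitive recipe: the synthetic dataset, the appended constant column, and the choice of k=10 are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif

# Synthetic data (an assumption for illustration): 30 features, 5 informative.
X, y = make_classification(
    n_samples=500, n_features=30, n_informative=5, n_redundant=0, random_state=0
)
# Append a constant column (index 30) so the variance filter has something to drop.
X = np.hstack([X, np.ones((X.shape[0], 1))])

# Step 1: drop zero-variance features.
vt = VarianceThreshold(threshold=0.0)
X_var = vt.fit_transform(X)

# Step 2: keep the 10 features with the highest ANOVA F-statistic.
kbest = SelectKBest(f_classif, k=10)
X_top = kbest.fit_transform(X_var, y)

# Map the surviving columns back to their indices in the original matrix.
kept = vt.get_support(indices=True)[kbest.get_support(indices=True)]
print("kept original column indices:", kept)
```

Running the variance filter first avoids computing an F-statistic on a constant column, which carries no information about the target.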
Wrapper methods
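Evaluate candidate subsets by training the model on each one. Exhaustive search over all 2^p subsets of p features is rarely feasible, so practical strategies search greedily: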
- Recursive Feature Elimination (RFE): train the model, remove the weakest feature, repeat.
- Forward / backward selection: greedily add or remove features based on validation score.
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10)  # ranks features by coefficient magnitude, so standardize X_train first
rfe.fit(X_train, y_train)
X_reduced = rfe.transform(X_train)
Use when: the feature count is moderate (<100) and you can afford multiple model fits. Often finds a stronger subset for the chosen model than filter methods do, but greedy search does not guarantee the best one, and the approach is slow and prone to overfitting on small datasets.
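Forward selection driven by a validation score can be sketched with scikit-learn's `SequentialFeatureSelector`. This is an illustrative example under stated assumptions: the synthetic dataset, the target of 4 features, accuracy as the score, and the scaling pipeline are choices made here, not part of the original text.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): 12 features, 4 informative.
X, y = make_classification(
    n_samples=300, n_features=12, n_informative=4, n_redundant=0, random_state=0
)

# Scaling lives inside the pipeline so it is refit on each cross-validation fold.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Greedy forward search: at each step, add the feature that most improves
# the 5-fold cross-validated accuracy.
sfs = SequentialFeatureSelector(
    model, n_features_to_select=4, direction="forward", scoring="accuracy", cv=5
)
sfs.fit(X, y)

selected = sfs.get_support(indices=True)
X_forward = sfs.transform(X)
print("selected feature indices:", selected)
```

Setting `direction="backward"` starts from the full set and removes features instead. Either way, every step refits the model once per remaining candidate and fold, which is where the cost of wrapper methods comes from.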
Embedded methods
Selection happens as part of fitting the model:
- LASSO (L1 regularization): drives irrelevant feature coefficients exactly to zero.
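- Tree feature importances: impurity-based importances computed while fitting a Random Forest or gradient-boosted trees. Permutation importance is often used alongside them, but it is computed after fitting rather than during it.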
- ElasticNet: combines L1 and L2, useful when features are correlated.
from sklearn.linear_model import LassoCV
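lasso = LassoCV(cv=5).fit(X_train, y_train)  # LassoCV is a regressor; standardize features first so the penalty treats them equally
important = X_train.columns[lasso.coef_ != 0]  # assumes X_train is a pandas DataFrame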
Use when: you want selection and modeling in one step, especially on high-dimensional data where wrapper searches are too slow.
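Selection by tree feature importances can be sketched with `SelectFromModel` wrapped around a Random Forest. Treat this as one possible setup rather than a recommended configuration: the synthetic dataset and the `"median"` threshold are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data (an assumption for illustration): 20 features, 4 informative.
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=4, n_redundant=0, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)

# Keep features whose impurity-based importance is at or above the median importance.
selector = SelectFromModel(forest, threshold="median")
X_embedded = selector.fit_transform(X, y)

# The fitted forest is available as selector.estimator_.
importances = selector.estimator_.feature_importances_
mask = selector.get_support()
print("features kept:", int(mask.sum()), "of", X.shape[1])
```

The forest is fit once and the importances fall out of that fit, so the cost is a single model training run rather than a search over subsets.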
Comparison
| Method | Speed | Interaction-aware | Model-dependent |
|---|---|---|---|
| Filter | Fast | No | No |
| Wrapper | Slow | Yes | Yes |
| Embedded | Medium | Partially | Yes |