Compare filter, wrapper, and embedded feature-selection methods. When would you use each?
Filter methods score features by statistical relevance to the target independently of any model, so they're fast but ignore feature interactions. Wrapper methods (like recursive feature elimination) search subsets by training a model and evaluating performance, which is accurate but computationally expensive. Embedded methods select features as part of model training (like lasso or tree importances), giving a good balance of accuracy and efficiency.
How to think about it
The crisp answer
The three families trade compute cost against how model-aware the selection is. Filter methods rank features by a statistical score independent of any model (fast, ignores interactions). Wrapper methods evaluate feature subsets by actually training a model (often more accurate for that model, but expensive). Embedded methods do selection during training (a balance of both).
How each works
As the bugfree.ai feature-selection guide lays out:
- Filter: correlation, chi-square, mutual information, ANOVA F-test, variance threshold. Score each feature and keep the top-ranked ones. Cheap and model-agnostic, but univariate scores are blind to how features interact and tend to keep redundant features.
- Wrapper: forward selection, backward elimination, recursive feature elimination (RFE). Train a model on candidate subsets and keep the best by validation score. Captures interactions but is combinatorially expensive and can overfit the selection.
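- Embedded: lasso (L1) zeroing coefficients, tree/gradient-boosting feature importances, elastic net. Selection happens inside model fitting, so it’s efficient and model-aware. Tree-based importances can reflect interactions; a linear lasso does so only if interaction terms are included as features.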
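The three families above can be sketched side by side. This is a minimal illustration, assuming scikit-learn is installed; the synthetic dataset, the choice of `k=5`, and the lasso `alpha=1.0` are arbitrary assumptions for the demo, not recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_regression
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic regression data (illustrative assumption): 20 features, 5 informative.
X, y = make_regression(
    n_samples=500, n_features=20, n_informative=5, noise=10.0, random_state=0
)

# Filter: score each feature on its own (ANOVA F-test), keep the top 5.
# No model is trained for prediction; features are ranked one at a time.
filter_sel = SelectKBest(score_func=f_regression, k=5).fit(X, y)
filter_mask = filter_sel.get_support()

# Wrapper: recursive feature elimination. The model is refit once per
# round and the weakest feature is dropped (step=1) until 5 remain.
wrapper_sel = RFE(LinearRegression(), n_features_to_select=5, step=1).fit(X, y)
wrapper_mask = wrapper_sel.get_support()

# Embedded: the L1 penalty drives some coefficients to zero during a single
# fit; SelectFromModel keeps the features whose coefficients stay non-zero.
embedded_sel = SelectFromModel(Lasso(alpha=1.0)).fit(X, y)
embedded_mask = embedded_sel.get_support()

print("filter  :", filter_mask.nonzero()[0])
print("wrapper :", wrapper_mask.nonzero()[0])
print("embedded:", embedded_mask.nonzero()[0])
```

Note that the filter and wrapper selectors are told how many features to keep, while the lasso-based selector decides the count itself through the penalty strength.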
When to use each
- Filter: very high-dimensional data or a fast first-pass screen.
- Wrapper: smaller feature sets where accuracy justifies the compute.
- Embedded: the practical default — get model-aware selection “for free” during training.
The common trap
Running feature selection on the whole dataset before cross-validation: that leaks target information and inflates scores, so selection must happen inside each CV fold. Also, a Pearson-correlation filter only catches linear relationships, and scoring features one at a time ignores redundancy among them. Follow-up: “Why is RFE expensive?” It retrains the model repeatedly, removing the weakest feature each round.
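The leakage trap can be made concrete with a small experiment. This is a hedged sketch, assuming scikit-learn and NumPy are available; the data are pure noise by construction (an assumption chosen for the demo), so any score well above chance comes from leakage rather than real signal.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5000))  # pure-noise features
y = rng.integers(0, 2, size=100)  # labels unrelated to X

# Leaky: select features using ALL rows (including future validation rows),
# then cross-validate on the already-filtered matrix.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky_score = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y, cv=5
).mean()

# Correct: put the selector in a pipeline so it is refit on the training
# portion of each fold and never sees that fold's validation rows.
pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000))
honest_score = cross_val_score(pipe, X, y, cv=5).mean()

print(f"selection before CV (leaky): {leaky_score:.2f}")
print(f"selection inside CV folds:   {honest_score:.2f}")
```

Because the labels are random, the pipeline version should land near chance, while the leaky version typically looks far better than it has any right to.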