Model selection & nested CV
If you tune hyperparameters and report the same CV score, you're lying to yourself. Nested cross-validation separates tuning from evaluation so your reported number is honest.
What you'll learn
- Why tuning and evaluating on the same CV gives an optimistically biased score
- How nested CV separates the inner (tune) and outer (evaluate) loops
- When you actually need it vs a simple train/val/test split
Here’s a subtle way to fool yourself. You run GridSearchCV over 200 hyperparameter
combinations, pick the best, and report its cross-validation score as your model’s
performance. That number is optimistically biased — and the more combinations
you tried, the more biased it is. The fix is nested cross-validation.
The leak: tuning and scoring on the same folds
Recall how cross-validation works: split the data into folds, train on all but one, validate on the held-out fold, and rotate until every fold has served as the validation set once.
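To make the rotation concrete, here is a minimal sketch in plain Python. The helper `k_fold_indices` is a hypothetical name introduced for illustration, and it assumes contiguous folds with no shuffling; library splitters offer more options.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous, near-equal folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


splits = list(k_fold_indices(n_samples=10, k=5))
for fold, (train_idx, val_idx) in enumerate(splits):
    print(f"fold {fold}: train={train_idx} validate={val_idx}")
```

Each sample lands in the validation fold exactly once and in the training set for the other four rounds.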
The problem: if you use that same CV loop to choose hyperparameters and to report the score, you’ve used the validation folds twice — once to pick the winner, once to grade it. With enough combinations, some configuration scores well on those particular folds by luck, and you report that lucky number. It’s a softer cousin of data leakage: the evaluation has seen the choices it should be judging.
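A small simulation can show the effect in isolation. This is a sketch under a deliberately extreme assumption: all 200 hypothetical configurations guess labels at random, so every one has a true accuracy of 0.5. The sizes (`n_val`, `n_configs`) are illustrative choices, not values from any real search.

```python
import random

rng = random.Random(0)

n_val = 100          # validation samples reused for every configuration (assumed)
n_configs = 200      # hyperparameter combinations tried (assumed)
true_accuracy = 0.5  # every configuration is really a coin flip

# Fixed validation labels, shared by all configurations.
y_val = [rng.randint(0, 1) for _ in range(n_val)]


def validation_score(rng, y_true):
    """Accuracy of a configuration that guesses each label at random."""
    guesses = [rng.randint(0, 1) for _ in y_true]
    return sum(g == t for g, t in zip(guesses, y_true)) / len(y_true)


scores = [validation_score(rng, y_val) for _ in range(n_configs)]
best_score = max(scores)
mean_score = sum(scores) / len(scores)

print(f"mean score over all configurations: {mean_score:.3f}")
print(f"best score (the one you would report): {best_score:.3f}")
print(f"true accuracy of every configuration: {true_accuracy:.3f}")
```

The mean stays close to 0.5, but the maximum sits above it purely because it was selected after looking at the scores. Reporting that maximum is the same mistake as reporting the best score from a search.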
Nested CV: two loops
Nested cross-validation separates the two jobs into two loops:
- Inner loop — for each outer training split, run a full CV search to tune the hyperparameters.
- Outer loop — evaluate the tuned model on the held-out outer fold it never touched during tuning. Average those outer scores.
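Because each outer fold's score comes from a model tuned without seeing it, the average is a nearly unbiased estimate of how your whole tuning procedure performs on new data. Note that it grades the procedure, not one fixed set of hyperparameters: each outer split may pick a different winner.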
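In scikit-learn, the two loops can be sketched by passing a `GridSearchCV` object to `cross_val_score`. The dataset, the SVM estimator, the grid values, and the fold counts below are all assumptions chosen to keep the example small.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Illustrative grid; these values are assumptions, not recommendations.
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1]}

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: GridSearchCV tunes using only the training data it is handed.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=inner_cv)

# Outer loop: cross_val_score reruns the whole search on each outer training
# split, then scores the tuned model on the held-out outer fold.
outer_scores = cross_val_score(search, X, y, cv=outer_cv)

print(f"nested CV accuracy: {outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")
```

The cost is the main trade-off: with 5 outer folds, 3 inner folds, and 6 combinations, this runs 90 inner fits plus 5 refits, which is why a single train/val/test split is often preferred on large datasets.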
Next
That rounds out honest evaluation. Next, trimming the inputs themselves — feature selection — and the unsupervised pillar (clustering, PCA).
Practice this in an interview
Nested cross-validation separates hyperparameter tuning from performance estimation using an inner loop for model selection and an outer loop for evaluation. It solves the optimistic-bias problem: if you tune and evaluate on the same folds, the validation data leaks into model selection and your reported score overestimates real-world performance. The inner loop never touches the outer test fold, giving an unbiased estimate of the whole pipeline's generalization.
Split data into train, validation, and test sets (or use cross-validation), tune and compare models only on train/validation, and touch the test set exactly once at the end. Fit all preprocessing inside the cross-validation pipeline so transformers never see validation data, and for tuning plus honest evaluation use nested cross-validation. For time series, use forward-chaining splits to avoid leaking future information.
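Two of those points can be sketched with scikit-learn: keeping preprocessing inside the cross-validated estimator, and forward-chaining splits. The dataset, the model, and the 12-step toy series are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# The scaler is part of the estimator, so cross_val_score refits it on each
# training fold and never lets it see that fold's validation rows.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=5)
print(f"pipeline CV accuracy: {scores.mean():.3f}")

# Forward chaining on a toy series of 12 time steps (assumed data):
# every validation block comes strictly after its training block.
series = np.arange(12).reshape(-1, 1)
time_splits = list(TimeSeriesSplit(n_splits=3).split(series))
for train_idx, val_idx in time_splits:
    print(f"train={train_idx.tolist()} validate={val_idx.tolist()}")
```

Scaling the full dataset before calling `cross_val_score` would look almost identical in code, which is what makes that form of leakage easy to miss.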
K-fold CV partitions the data into k roughly equal folds. It trains on k-1 folds and validates on the remaining one, repeats this k times so each fold is held out once, then averages the k scores. It gives a lower-variance estimate of generalization error than a single split and is preferred when the dataset is small enough that a single held-out set would be too noisy or wasteful.
The train set fits the model, the validation set tunes hyperparameters and guides model selection, and the held-out test set provides an unbiased estimate of final generalization error. Using the test set during development causes optimistic bias because the evaluation signal leaks into decisions.
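The three roles can be sketched as follows. The 60/20/20 proportions, the dataset, and the candidate values of `C` are assumptions; the point is only that the test set is scored once, after every choice has been made.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Carve off the test set first (20%), then split the rest into train/validation.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_dev, y_dev, test_size=0.25, stratify=y_dev, random_state=0
)

# Tune on the validation set only.
best_C, best_val = None, -1.0
for C in [0.1, 1.0, 10.0]:
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    val_score = model.score(X_val, y_val)
    if val_score > best_val:
        best_C, best_val = C, val_score

# Refit on train + validation with the chosen C, then touch the test set once.
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_dev, y_dev)
test_score = final_model.score(X_test, y_test)
print(f"chosen C={best_C}, validation={best_val:.3f}, test={test_score:.3f}")
```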