Why do you need nested cross-validation, and what problem does it solve over regular cross-validation?
Nested cross-validation separates hyperparameter tuning from performance estimation, using an inner loop for model selection and an outer loop for evaluation. It solves the optimistic-bias problem: if you tune and evaluate on the same folds, the validation data leaks into model selection and your reported score overestimates real-world performance. The inner loop never touches the outer test fold, so the outer scores give a nearly unbiased estimate of how the whole pipeline, tuning included, generalizes.
How to think about it
The crisp answer
Nested cross-validation uses two loops: an inner loop that selects hyperparameters and an outer loop that estimates performance. You need it because tuning and evaluating on the same data produces an optimistically biased score — the validation folds have implicitly influenced model selection, so they no longer give an honest estimate of generalization.
The problem it solves
In ordinary k-fold CV used for tuning, you try many hyperparameter settings and pick the one with the best validation score. But taking the maximum over many configurations means you have partly fit to the noise of those folds, which is a subtle form of data leakage. The reported best score is therefore biased upward, and the bias tends to grow with the number of configurations tried. The scikit-learn nested CV example illustrates this: the gap between the non-nested and nested scores is an estimate of that optimism.
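The selection effect can be seen without any model at all. The following is a minimal simulation sketch, assuming every configuration has the same true accuracy and that each CV score is that accuracy plus Gaussian noise; the function names, the 0.70 accuracy, and the 0.03 noise level are hypothetical values chosen for illustration.

```python
import random
import statistics


def best_of_noisy_scores(n_configs, true_acc=0.70, noise_sd=0.03, rng=None):
    """Highest observed CV score among configs that are all equally good.

    Assumption: each observed score = true accuracy + Gaussian fold noise.
    """
    rng = rng or random.Random()
    return max(rng.gauss(true_acc, noise_sd) for _ in range(n_configs))


def mean_selected_score(n_configs, n_trials=2000, seed=0):
    """Average 'best score' you would report, over many repeated experiments."""
    rng = random.Random(seed)
    return statistics.mean(
        best_of_noisy_scores(n_configs, rng=rng) for _ in range(n_trials)
    )


one_config = mean_selected_score(1)      # no selection: close to the true 0.70
fifty_configs = mean_selected_score(50)  # max over 50 noisy scores: inflated

print(f"1 config:   {one_config:.3f}")
print(f"50 configs: {fifty_configs:.3f}")
```

No configuration here is actually better than 0.70, yet the reported best score rises as more configurations are tried. That inflation is the optimism the outer loop is meant to remove.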
How the structure fixes it
- Outer loop: split into K folds; each outer test fold is held out purely for evaluation.
- Inner loop: within each outer training portion, run another CV to choose hyperparameters.
- The chosen model is then scored once on the untouched outer test fold.
Because hyperparameter selection only ever sees the outer training portion (which the inner loop splits into its own training and validation folds), the outer estimate reflects the entire procedure, not one lucky configuration.
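The two loops above can be sketched with scikit-learn by passing a GridSearchCV object (the inner loop) to cross_val_score (the outer loop). This is a minimal sketch, not a definitive recipe: the iris dataset, the RBF SVM, the grid values, and the fold counts are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Illustrative grid; the values are assumptions, not recommendations.
param_grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1]}

inner_cv = KFold(n_splits=4, shuffle=True, random_state=0)  # model selection
outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)  # evaluation

# Inner loop: a grid search that tunes on whatever data it is given.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=inner_cv)

# Outer loop: each outer training portion is handed to the search, and the
# refit best model is scored once on the held-out outer test fold.
nested_scores = cross_val_score(search, X, y, cv=outer_cv)

# Non-nested baseline: tune and report the best score on the same folds.
non_nested_score = search.fit(X, y).best_score_

print(f"nested:     {nested_scores.mean():.3f} +/- {nested_scores.std():.3f}")
print(f"non-nested: {non_nested_score:.3f}")
```

On a single small dataset the difference between the two numbers is noisy and can even come out negative for a given split; the optimism shows up on average over repeated runs.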
Concrete example
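Suppose you are comparing an SVM against a random forest and tuning each. The choice between the two model families is itself part of model selection, so it belongs inside the inner loop. Nested CV then gives a nearly unbiased estimate of “how well does my model-selection process generalize,” and that process is what you would actually deploy.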
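One way to sketch this comparison is to treat the estimator itself as a hyperparameter of a scikit-learn Pipeline, so the inner search picks both the model family and its settings. The dataset, the grids, and the step names "scale" and "clf" are assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# The "clf" step is a placeholder that the grid below swaps out.
pipe = Pipeline([("scale", StandardScaler()), ("clf", SVC())])

# Two sub-grids: one per model family, each with its own hyperparameters.
param_grid = [
    {"clf": [SVC()], "clf__C": [0.1, 1, 10]},
    {"clf": [RandomForestClassifier(random_state=0)],
     "clf__n_estimators": [50, 100]},
]

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop chooses family + settings; outer loop scores that whole choice.
search = GridSearchCV(pipe, param_grid, cv=inner_cv)
scores = cross_val_score(search, X, y, cv=outer_cv)

print(f"selection procedure accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The outer score describes the procedure "scale, then pick the better of a tuned SVM or a tuned random forest," not either model alone. Different outer folds may end up selecting different families.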
The common trap
Thinking nested CV produces the final model. It doesn’t; it estimates the performance of the procedure. Once you trust that estimate, you rerun the tuning procedure on all the data and deploy the model it produces. Nested CV is also expensive (on the order of K_outer × K_inner × configs model fits), so people skip it and overstate their results. Follow-up: “When can you skip it?” With a large dedicated hold-out test set that is never used for tuning, a single tuned CV plus that test set may suffice.
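To make the cost concrete, here is a small arithmetic sketch of the number of model fits. It assumes an exhaustive grid search and, optionally, one refit of the best configuration per outer fold; both function names are hypothetical.

```python
def nested_cv_fit_count(k_outer, k_inner, n_configs, refit=True):
    """Model fits needed for nested CV with an exhaustive grid search.

    Assumption: when refit is True, the best config is refit once on each
    outer training portion before being scored on the outer test fold.
    """
    inner_fits = k_inner * n_configs   # search cost inside one outer fold
    refit_fits = 1 if refit else 0
    return k_outer * (inner_fits + refit_fits)


def final_model_fit_count(k_inner, n_configs):
    """Fits for the deployed model: one search on all data plus one refit."""
    return k_inner * n_configs + 1


search_only = nested_cv_fit_count(5, 4, 6, refit=False)  # K_outer*K_inner*configs
with_refit = nested_cv_fit_count(5, 4, 6)
final_fits = final_model_fit_count(4, 6)

print(search_only, with_refit, final_fits)
```

With 5 outer folds, 4 inner folds, and 6 configurations, the estimate alone costs roughly five times as many fits as producing the deployed model, which is why the hold-out shortcut is attractive when enough data is available.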