datarekha

Model selection & nested CV

If you tune hyperparameters with cross-validation and then report that same CV score as your model's performance, you're lying to yourself. Nested cross-validation separates tuning from evaluation so your reported number is honest.

7 min read Advanced Machine Learning Lesson 23 of 33

What you'll learn

  • Why tuning and evaluating on the same CV gives an optimistically biased score
  • How nested CV separates the inner (tune) and outer (evaluate) loops
  • When you actually need it vs a simple train/val/test split

Here’s a subtle way to fool yourself. You run GridSearchCV over 200 hyperparameter combinations, pick the best, and report its cross-validation score as your model’s performance. That number is optimistically biased — and the more combinations you tried, the more biased it is. The fix is nested cross-validation.

The leak: tuning and scoring on the same folds

Recall how cross-validation works: split the data into folds, train on all but one, validate on the held-out fold, and rotate until every fold has served as the validation set once.

The problem: if you use that same CV loop to choose hyperparameters and to report the score, you’ve used the validation folds twice — once to pick the winner, once to grade it. With enough combinations, some configuration scores well on those particular folds by luck, and you report that lucky number. It’s a softer cousin of data leakage: the evaluation has seen the choices it should be judging.
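To see the "lucky number" effect in isolation, here is a minimal simulation using only the standard library. It assumes 200 configurations that are all truly coin flips (50% accuracy) and scores each one on 100 validation examples. The constants and the `observed_accuracy` helper are made up for illustration, and the configurations are scored independently, which is a simplification: in a real grid search they share folds and their scores are correlated.

```python
import random

random.seed(0)

N_VALIDATION = 100   # examples used to score each configuration (assumed)
TRUE_ACCURACY = 0.5  # every configuration is really a coin flip


def observed_accuracy():
    """Score one configuration: the fraction of validation examples it happens to get right."""
    hits = sum(random.random() < TRUE_ACCURACY for _ in range(N_VALIDATION))
    return hits / N_VALIDATION


# "Try" 200 configurations that are all equally useless.
scores = [observed_accuracy() for _ in range(200)]

best_of_10 = max(scores[:10])
best_of_200 = max(scores)
average = sum(scores) / len(scores)

print(f"average over all 200: {average:.3f}")
print(f"best of first 10:     {best_of_10:.3f}")
print(f"best of all 200:      {best_of_200:.3f}")
```

The average stays near the true 0.5, but the best score can only go up as you try more configurations. Reporting the best one is reporting the luckiest draw, not the model's real accuracy.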

Nested CV: two loops

Nested cross-validation separates the two jobs into two loops:

  • Inner loop — for each outer training split, run a full CV search to tune the hyperparameters.
  • Outer loop — evaluate the tuned model on the held-out outer fold it never touched during tuning. Average those outer scores.

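Because each outer fold’s score comes from a model tuned without seeing that fold, the average is an approximately unbiased estimate of how your whole tuning procedure performs. Note what is being graded: the procedure (search plus fit), not one particular hyperparameter setting, since each outer split may pick a different winner.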
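In scikit-learn, one common way to sketch the two loops is to pass a `GridSearchCV` object (the inner loop) to `cross_val_score` (the outer loop). The dataset, the SVC pipeline, the parameter grid, and the fold counts below are assumptions chosen only to keep the example small; they are not recommendations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Inner loop tunes; outer loop evaluates. Different seeds keep the splits distinct.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Scaling sits inside the pipeline so it is re-fit on every training split.
model = make_pipeline(StandardScaler(), SVC())
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01]}

# The search object behaves like an estimator whose fit() includes tuning.
search = GridSearchCV(model, param_grid, cv=inner_cv)

# For each outer split, the search is cloned and fit on outer-train only,
# then scored on the outer-test fold it never saw.
outer_scores = cross_val_score(search, X, y, cv=outer_cv)

print(f"nested CV accuracy: {outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")
```

Once you have this estimate, a typical next step is to fit the same search on all the data to get the final model. The nested score describes how that procedure is expected to perform, not the score of any single hyperparameter combination.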

[Figure: nested CV diagram. Each outer split has an outer-train portion (tune here) and an outer-test fold (score). The inner CV loop runs on outer-train only and picks the best hyperparameters; the outer-test fold is never seen during tuning.]
Outer loop evaluates; inner loop tunes. The outer-test fold is held out from all tuning.

Quick check

Q1. Why is reporting the best GridSearchCV score as your model's performance optimistically biased?
Q2. What do the inner and outer loops of nested CV do?
Q3. When is a simple three-way (train/validation/test) split a fine alternative to nested CV?

Next

That rounds out honest evaluation. Next, trimming the inputs themselves — feature selection — and the unsupervised pillar (clustering, PCA).

Practice this in an interview

Why do you need nested cross-validation, and what problem does it solve over regular cross-validation?

Nested cross-validation separates hyperparameter tuning from performance estimation using an inner loop for model selection and an outer loop for evaluation. It solves the optimistic-bias problem: if you tune and evaluate on the same folds, the validation data leaks into model selection and your reported score overestimates real-world performance. The inner loop never touches the outer test fold, giving an unbiased estimate of the whole pipeline's generalization.

Walk me through how you'd select between competing models without fooling yourself with data leakage.

Split data into train, validation, and test sets (or use cross-validation), tune and compare models only on train/validation, and touch the test set exactly once at the end. Fit all preprocessing inside the cross-validation pipeline so transformers never see validation data, and for tuning plus honest evaluation use nested cross-validation. For time series, use forward-chaining splits to avoid leaking future information.
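As a rough sketch of two points in that answer, the snippet below keeps preprocessing inside a scikit-learn pipeline and uses `TimeSeriesSplit` for forward-chaining splits. The synthetic data, the Ridge model, and the split count are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))             # synthetic rows, assumed to be in time order
y = X[:, 0] + 0.1 * rng.normal(size=120)

# The scaler lives inside the pipeline, so it is fit on each training split only
# and never sees the rows it is later validated on.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))

splitter = TimeSeriesSplit(n_splits=4)
for train_idx, test_idx in splitter.split(X):
    # Forward chaining: every training row comes before every test row.
    assert train_idx.max() < test_idx.min()

scores = cross_val_score(model, X, y, cv=splitter)  # one R^2 score per split
print(scores)
```

Scaling `X` once up front and then cross-validating would let validation rows influence the scaler's mean and variance, which is exactly the kind of leak the pipeline avoids.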

What is k-fold cross-validation and when should you use it over a single train/validation split?

K-fold CV partitions the data into k roughly equal folds. For each fold in turn, it trains on the other k-1 folds and validates on that fold, then averages the k scores. It gives a lower-variance estimate of generalization error than a single split and is preferred when the dataset is small enough that a single held-out set would be too noisy or wasteful.
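The fold rotation can be sketched in plain Python. `k_fold_indices` is a hypothetical helper written for this example, not a library function; it uses contiguous, unshuffled folds to keep the logic visible, whereas in practice you would usually shuffle first or use a library splitter such as scikit-learn's `KFold`.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, validation) index lists for k contiguous folds."""
    # Spread the remainder so fold sizes differ by at most one.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop


folds = list(k_fold_indices(n_samples=10, k=3))
for train, validation in folds:
    print(len(train), validation)
```

With 10 samples and k=3 the validation folds have sizes 4, 3, and 3, and every index lands in exactly one validation fold.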

Why do we split data into train, validation, and test sets, and what are the typical proportions?

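The train set fits the model, the validation set tunes hyperparameters and guides model selection, and the held-out test set provides an unbiased estimate of final generalization error. Using the test set during development causes optimistic bias because the evaluation signal leaks into decisions. Common proportions range from about 60/20/20 to 80/10/10 (train/validation/test); with very large datasets, the validation and test fractions can be much smaller.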
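A three-way split can be sketched with two calls to scikit-learn's `train_test_split`. The 60/20/20 proportions and the placeholder data here are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)  # 100 placeholder rows
y = np.arange(100) % 2              # balanced placeholder labels

# First carve off the test set (20% of everything) and leave it alone.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Then split the remainder: 25% of the remaining 80% is 20% of the total.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest
)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

Tune and compare models using only the train and validation parts; the test part is scored once at the end.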
