Walk me through how you'd select between competing models without fooling yourself with data leakage.

Split data into train, validation, and test sets (or use cross-validation), tune and compare models only on train/validation, and touch the test set exactly once at the end. Fit all preprocessing inside the cross-validation pipeline so transformers never see validation data, and for tuning plus honest evaluation use nested cross-validation. For time series, use forward-chaining splits to avoid leaking future information.
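A minimal sketch of the two habits above, assuming scikit-learn and its bundled breast-cancer dataset. The choice of StandardScaler plus SVC and the toy 20-row series are illustrative assumptions, not part of the original answer. The scaler sits inside the pipeline, so it is refit on each training split, and TimeSeriesSplit produces forward-chaining folds.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# The scaler lives inside the pipeline, so every CV iteration refits it
# on that iteration's training folds only; validation rows never
# influence the scaling statistics.
model = make_pipeline(StandardScaler(), SVC())
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)
print(f"leak-free CV accuracy: {scores.mean():.3f}")

# Forward-chaining splits on a hypothetical time-ordered series:
# every training index comes before every test index.
series = np.arange(20).reshape(-1, 1)
splits = list(TimeSeriesSplit(n_splits=4).split(series))
for train_idx, test_idx in splits:
    print(f"train rows 0-{train_idx.max()}, test rows {test_idx.min()}-{test_idx.max()}")
```

Scaling the full dataset before splitting would be the leaky variant: the mean and variance would then carry information from the validation rows.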

Why do you need nested cross-validation, and what problem does it solve over regular cross-validation?

Nested cross-validation separates hyperparameter tuning from performance estimation: an inner loop handles model selection and an outer loop handles evaluation. It addresses optimistic bias: if you tune and evaluate on the same folds, information from the validation data leaks into model selection and the reported score overestimates real-world performance. Because the inner loop never touches the outer test fold, the outer scores give a nearly unbiased estimate of how the whole pipeline generalizes.

What is k-fold cross-validation and when should you use it over a single train/validation split?

K-fold CV partitions the data into k roughly equal folds. In each of k rounds it trains on k-1 folds and validates on the remaining one, so every fold serves as the validation set exactly once; the k scores are then averaged. This gives a lower-variance estimate of generalization error than a single split and is preferred when the dataset is small enough that a single held-out set would be too noisy or wasteful.
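The rotation can be written out by hand. This is a sketch assuming scikit-learn; the dataset, the logistic-regression pipeline, and k=5 are illustrative choices rather than anything the answer prescribes.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, val_idx in kf.split(X):
    # A fresh model per round, fit on k-1 folds only.
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X[train_idx], y[train_idx])
    # Score on the single fold held out this round.
    fold_scores.append(model.score(X[val_idx], y[val_idx]))

print("per-fold accuracy:", np.round(fold_scores, 3))
print(f"mean +/- std: {np.mean(fold_scores):.3f} +/- {np.std(fold_scores):.3f}")
```

The spread across folds is a rough indication of how noisy a single train/validation split of the same size could be.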

Why do we split data into train, validation, and test sets, and what are the typical proportions?

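The train set fits the model, the validation set tunes hyperparameters and guides model selection, and the held-out test set provides an unbiased estimate of final generalization error. Common proportions range from about 60/20/20 to 80/10/10 (train/validation/test); very large datasets can afford smaller validation and test fractions. Using the test set during development causes optimistic bias because the evaluation signal leaks into decisions.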
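One way to produce a three-way split is to call train_test_split twice. The sketch below assumes scikit-learn; the 60/20/20 proportions and the stratification are assumptions chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Step 1: carve off the test set (20% of all rows) and set it aside.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0)

# Step 2: split the remainder 75/25, giving roughly 60/20/20 overall.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)

print(len(X_train), len(X_val), len(X_test))
```

All tuning and model comparison would then use only the train and validation parts, with the test part scored once at the end.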

Model selection & nested CV — Machine Learning

Here’s a subtle way to fool yourself. You run GridSearchCV over 200 hyperparameter combinations, pick the best, and report its cross-validation score as your model’s performance. That number is optimistically biased — and the more combinations you tried, the more biased it is. The fix is nested cross-validation.

The leak: tuning and scoring on the same folds

Recall how cross-validation works: fold the data, train on some folds, validate on the rest, and rotate.

With five folds, each round trains on four folds and validates on the fifth; rotate through all five and average to get one CV score.

The problem: if you use that same CV loop to choose hyperparameters and to report the score, you’ve used the validation folds twice — once to pick the winner, once to grade it. With enough combinations, some configuration scores well on those particular folds by luck, and you report that lucky number. It’s a softer cousin of data leakage: the evaluation has seen the choices it should be judging.
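A small simulation can make the "lucky configuration" effect visible. This is a sketch under stated assumptions: the data are pure noise generated with NumPy, and each hypothetical "configuration" is just a random subset of three features, standing in for a hyperparameter setting. No configuration can truly beat about 50% accuracy here, so any gap between the best score and the average is selection luck.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
n_samples, n_features, n_configs = 120, 40, 50

# Pure noise: labels are independent of the features.
X = rng.normal(size=(n_samples, n_features))
y = rng.integers(0, 2, size=n_samples)

# The same folds are reused for every configuration, as in a flat search.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Each hypothetical configuration is a random subset of three features.
configs = [rng.choice(n_features, size=3, replace=False) for _ in range(n_configs)]
scores = np.array([
    cross_val_score(LogisticRegression(), X[:, cols], y, cv=cv).mean()
    for cols in configs
])

print(f"mean CV score over configs: {scores.mean():.3f}")
print(f"best CV score (what a flat search reports): {scores.max():.3f}")
```

Raising n_configs should tend to push the best score further above the mean, even though nothing about the data has changed.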

Nested CV: two loops

Nested cross-validation separates the two jobs into two loops:

Inner loop — for each outer training split, run a full CV search to tune the hyperparameters.
Outer loop — evaluate the tuned model on the held-out outer fold it never touched during tuning. Average those outer scores.

Because each outer fold’s score comes from a model tuned without seeing that fold, the average is an approximately unbiased estimate of how your whole tuning procedure performs, not of any single hyperparameter setting.

Outer loop evaluates; inner loop tunes. The outer-test fold is held out from all tuning.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1]}

inner = KFold(5, shuffle=True, random_state=1)
outer = KFold(5, shuffle=True, random_state=2)

search = GridSearchCV(SVC(), grid, cv=inner)        # inner loop: tunes
# Flat (biased): the score of the best inner config
search.fit(X, y)
print(f"flat   (optimistic): {search.best_score_:.3f}")

# Nested (honest): wrap the search in an outer CV
nested = cross_val_score(search, X, y, cv=outer)
print(f"nested (honest):     {nested.mean():.3f}")
print("\nThe flat score is usually higher — that gap is the selection bias.")
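To see what wrapping the search in an outer CV does, the outer loop can be unrolled by hand. This sketch reuses the same grid and fold settings as the listing above and should mirror its nested score; printing the per-fold best_params_ is an added illustration, not part of the original.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1]}

inner = KFold(5, shuffle=True, random_state=1)
outer = KFold(5, shuffle=True, random_state=2)

outer_scores = []
for train_idx, test_idx in outer.split(X):
    # Inner loop: tune using the outer training portion only.
    search = GridSearchCV(SVC(), grid, cv=inner)
    search.fit(X[train_idx], y[train_idx])
    # Outer loop: score the refit best model on the fold tuning never saw.
    outer_scores.append(search.score(X[test_idx], y[test_idx]))
    print(search.best_params_, f"{outer_scores[-1]:.3f}")

print(f"nested mean: {np.mean(outer_scores):.3f}")
```

The chosen hyperparameters may differ from one outer fold to the next. That is expected: the nested score grades the tuning procedure as a whole rather than one particular setting.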

In one breath

Tune over many combinations and report the winner’s own CV score, and you flatter yourself — the more you tried, the more that number is luck on those folds.
Nested CV splits the work: an inner loop tunes, an outer loop scores on a fold the tuning never saw.
Average the outer-fold scores and you get an honest estimate of the whole tuning procedure, free of selection bias.
It is expensive — reach for it on small data, when comparing model families, or when publishing a number someone will trust.
On big data, a plain train/validation/test split is cheaper and just as honest: never let the data that grades you also choose you.

Quick check

Q1. Why is reporting the best GridSearchCV score as your model's performance optimistically biased?

Q2. What do the inner and outer loops of nested CV do?

Q3. When is a simple three-way (train/validation/test) split a fine alternative to nested CV?

That rounds out honest evaluation. Next, trimming the inputs themselves — feature selection — and the unsupervised pillar (clustering, PCA).
