Cross-Validation: k-fold, LOO, Stratified
One train/test split is a coin flip. Cross-validation rotates the validation set so every sample is tested once — averaging out the luck.
What you'll learn
- Why a single train/test split gives a noisy, luck-dependent score
- k-fold CV: split into k folds, train on k−1, validate on 1, rotate, average — that is k models
- Leave-one-out (LOO) is k = n, so the number of folds equals the number of training samples
- Stratified CV preserves class proportions — essential for imbalanced data
Before you start
You split your data 80/20, train, and score 0.84. You reshuffle the split, retrain, and now score 0.79. Which number is the truth? Neither: a single train/test split is noisy, because the test set is just one small random sample and your score depends on which rows happened to land in it. Cross-validation reduces that luck by validating on every row once and averaging the per-fold scores.
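To see that noise in isolation, here is a minimal sketch using only the Python standard library. It makes a simplifying assumption: the model is held fixed and only the choice of test rows changes, so the per-row outcomes in `correct` are simulated, not taken from a real model. The names `correct` and `holdout_score` are hypothetical.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

n_rows = 500
# Simulated outcomes of one fixed model on every row: 1 = correct, 0 = wrong.
# The 0.8 success rate is an assumption made purely for illustration.
correct = [1 if rng.random() < 0.8 else 0 for _ in range(n_rows)]

def holdout_score(outcomes, test_fraction, rng):
    """Accuracy measured on one randomly chosen test subset."""
    test_size = int(len(outcomes) * test_fraction)
    test_rows = rng.sample(range(len(outcomes)), test_size)
    return sum(outcomes[i] for i in test_rows) / test_size

# Five different 20% test sets drawn from the same data.
scores = [holdout_score(correct, 0.2, rng) for _ in range(5)]
print("single-split scores:", [round(s, 2) for s in scores])
print("accuracy over all rows:", sum(correct) / n_rows)
```

Each printed score comes from the same model and the same data. Only the rows that landed in the test set differ.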
k-fold cross-validation
Split the data into k folds of (nearly) equal size. Hold out one fold as the validation set,
train on the other k−1, and record the score. Then rotate: each fold gets a turn as
the validation set exactly once. You train k separate models and average their
k scores for one stable estimate (plus a standard deviation for spread).
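The rotation described above can be sketched in plain Python. This is an illustrative sketch, not a library implementation: `k_fold_indices`, `cross_validate`, `fit_majority`, and `accuracy` are hypothetical names, the toy model just predicts the majority class, and the rows are assumed to be in random order already (no shuffling is done here).

```python
from statistics import mean, stdev

def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) for k contiguous folds over range(n)."""
    base, extra = divmod(n, k)  # fold sizes differ by at most one row
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, val_idx
        start += size

def cross_validate(xs, ys, k, fit, score):
    """Train k models, one per held-out fold, and summarise their scores."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(score(model, [xs[i] for i in val_idx], [ys[i] for i in val_idx]))
    return mean(scores), stdev(scores), scores

# Toy "model": always predict the most common training label.
def fit_majority(train_x, train_y):
    return max(set(train_y), key=train_y.count)

def accuracy(model, val_x, val_y):
    return sum(y == model for y in val_y) / len(val_y)

ys = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]  # made-up labels for illustration
xs = list(range(len(ys)))            # features are unused by this toy model

avg, spread, scores = cross_validate(xs, ys, 5, fit_majority, accuracy)
print("per-fold scores:", scores)
print("mean:", avg, "std:", spread)
```

Every row index appears in exactly one validation fold, and the loop fits one model per fold, so k = 5 means five fits.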
Leave-one-out (LOO): the extreme case
Push k all the way up to n, the number of training samples. Now each fold holds
exactly one sample: you train on n−1 rows and validate on the single left-out
row, then repeat for every row. So the number of folds — and the number of models
you train — equals the number of training samples. LOO uses the most data per model
(only one row held out), but it pays for that by fitting n models.
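As a small sketch of the k = n case, the generator below produces one split per training row. The name `leave_one_out` and the tiny size n = 6 are hypothetical, chosen only to keep the output readable.

```python
def leave_one_out(n):
    """Yield (train_idx, val_idx) with exactly one validation row per split."""
    for held_out in range(n):
        train_idx = [i for i in range(n) if i != held_out]
        yield train_idx, [held_out]

n = 6  # hypothetical tiny training set
splits = list(leave_one_out(n))

print("models to fit:", len(splits))  # one per training row
for train_idx, val_idx in splits:
    print("train on", train_idx, "validate on", val_idx)
```

The number of splits equals n, which is why LOO becomes expensive as the training set grows.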
Stratified cross-validation
Plain k-fold shuffles rows blindly. On imbalanced data (say 5% positives) a random fold might end up with very few or even zero positive examples, which makes metrics such as recall or AUC unstable or undefined for that fold. Stratified CV fixes this by preserving the class proportions in every fold: if the full set is 5% positive, each fold is held to ~5% positive too. For classification, especially imbalanced classification, stratified k-fold is the standard choice.
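One simple way to get that behaviour, sketched here under assumptions, is to deal the rows of each class into the folds round-robin. `stratified_folds` is a hypothetical helper, not a library function, and it skips the shuffling a real implementation would normally offer.

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Assign row indices to k folds, dealing each class out round-robin."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Spreading each class separately keeps its share similar in every fold.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

# Hypothetical dataset: 100 rows, 5% positive.
labels = [1] * 5 + [0] * 95
folds = stratified_folds(labels, 5)

for number, fold in enumerate(folds):
    positives = sum(labels[i] for i in fold)
    print(f"fold {number}: {len(fold)} rows, {positives} positive")
```

With these labels, contiguous unshuffled folds would put all five positives into the first fold. Dealing per class gives every fold one positive out of 20 rows, matching the 5% rate of the full set.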
How GATE asks this
The signature NAT (numerical answer type) question gives you a dataset size and a CV scheme and asks for the number
of iterations (models). The trap is the held-out test set: GATE DA 2026 gave 1000
samples with 100 held out as a test set, then asked how many iterations LOOCV runs
on the remainder. LOO trains one model per training sample, and there are 1000 − 100 = 900 of those — so the answer is 900, not 1000. The MCQ variant
describes a 5% positive dataset and asks which scheme to use and what to report:
stratified CV, scored with AUC rather than plain accuracy.
Worked example
A dataset has 1000 samples. You set aside 100 as a held-out test set and run leave-one-out cross-validation on the rest. How many iterations does LOOCV run?
Training set size after the holdout: 1000 − 100 = 900. LOO sets k = n, training one
model per training sample. So it runs 900 iterations — one model per left-out row.
If you instead used plain 5-fold CV on those 900, you would train only 5
models, each validating on 900 / 5 = 180 rows.
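The same arithmetic, written out as a short check (the variable names are illustrative):

```python
total_samples = 1000
test_holdout = 100

# Cross-validation only ever sees the rows left after the holdout.
n_train = total_samples - test_holdout

loo_iterations = n_train           # LOO: one model per training row
k = 5
val_rows_per_fold = n_train // k   # plain 5-fold on the same 900 rows

print(n_train, loo_iterations, val_rows_per_fold)  # 900 900 180
```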
Quick check
Practice this in an interview
K-fold CV partitions data into k equal folds, trains on k-1 and validates on the remaining fold k times, then averages the k scores. It gives a lower-variance estimate of generalization error than a single split and is preferred when the dataset is small enough that a single held-out set would be too noisy or wasteful.
Stratified k-fold ensures each fold has the same class-label proportions as the full dataset. It is necessary for imbalanced classification because standard random k-fold can produce folds where a minority class is entirely absent, making per-fold metrics undefined or severely misleading.
Standard k-fold randomly shuffles data, so a validation fold can contain timestamps earlier than the training fold — training on the future to predict the past. Time-series CV uses walk-forward (expanding-window or sliding-window) splits that always validate on data strictly after the training window.
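A minimal sketch of the expanding-window variant, assuming the rows are already sorted by time. The helper name `expanding_window_splits` and its parameters are hypothetical.

```python
def expanding_window_splits(n, n_splits, val_size):
    """Yield (train_idx, val_idx) where validation always follows training in time."""
    first_val_start = n - n_splits * val_size
    if first_val_start <= 0:
        raise ValueError("not enough rows for the requested splits")
    for split in range(n_splits):
        val_start = first_val_start + split * val_size
        train_idx = list(range(val_start))                     # everything earlier
        val_idx = list(range(val_start, val_start + val_size))  # the next block
        yield train_idx, val_idx

# Hypothetical series of 10 time-ordered rows, 3 splits, 2 validation rows each.
for train_idx, val_idx in expanding_window_splits(10, 3, 2):
    print("train on", train_idx, "validate on", val_idx)
```

In every split the largest training index is smaller than the smallest validation index, so the model never trains on rows from after the period it is validated on.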
The train set fits the model, the validation set tunes hyperparameters and guides model selection, and the held-out test set provides an unbiased estimate of final generalization error. Using the test set during development causes optimistic bias because the evaluation signal leaks into decisions.