Cross-validation: a score you didn't overfit to

In 2017 a Kaggle team submitted a model with a public leaderboard score of 0.971. After the competition ended and the private leaderboard was scored on held-out data the team had never seen, the number fell to 0.803. A 17-point gap. The model had not failed in any obvious way. No bugs. No missing features. The evaluation procedure had simply been telling a flattering lie — one repetition at a time, as the team tuned against the same test distribution until they and the test set were no longer strangers.

This is the fundamental problem that cross-validation exists to solve: the moment you use a test set to make a decision, it stops being a test set.

What a single split actually gives you

Split your data 80/20. Train on 80 percent, measure accuracy on the 20 percent. You get a number — say, 0.84. What does that number mean?

It means: on this particular random draw of 20 percent of your rows, under this particular random seed, the model scored 0.84. It is a point estimate with an unknown standard error. On a dataset of 500 examples your test set is 100 rows. Under the normal approximation, the 95-percent confidence interval on a proportion of 0.84 measured over 100 Bernoulli trials is roughly plus or minus 7 percentage points. That interval swallows most of the performance differences practitioners argue about.
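A quick way to sanity-check that interval is the normal approximation to the binomial. The sketch below assumes that approximation is adequate at this sample size (exact intervals differ slightly), and the helper name ci_half_width is made up for illustration. The width depends on the accuracy itself: about 7 points at 0.84 and nearly 10 points at 0.50.

```python
import math


def ci_half_width(accuracy, n_test, z=1.96):
    """Half-width of a normal-approximation confidence interval for a proportion.

    z=1.96 corresponds to a 95-percent interval.
    """
    standard_error = math.sqrt(accuracy * (1.0 - accuracy) / n_test)
    return z * standard_error


# Same 100-row test set, two different underlying accuracies.
for accuracy in (0.84, 0.50):
    half_width = ci_half_width(accuracy, n_test=100)
    print(f"accuracy={accuracy:.2f}, n_test=100: +/- {half_width:.3f}")
```

Quadrupling the test set only halves the interval, which is why small datasets make single-split scores so hard to trust.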

The test set is also a biased judge. If the 20 percent happened to land slightly more of the easy examples (cleaner labels, more central-cluster points, fewer noisy outliers), the score looks better than the model deserves. The reverse is equally possible. You have no way to know which you got. One split gives you one sample from a distribution of possible scores, and you are reporting it as if it were the distribution.

There is a second, slower problem. Every time you look at the test score and adjust something — change a hyperparameter, add a feature, swap an algorithm — you are implicitly optimizing against that split. The test set is leaking its preferences back into your choices. Run enough experiments and your final model is no longer evaluated on unseen data; it is evaluated on data it has seen indirectly, through your decision loop. This is selection bias with extra steps.
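The decision-loop effect can be reproduced with no real model at all. The toy simulation below is only an illustration under stated assumptions: the labels are coin flips, each of 50 hypothetical "candidates" is a set of random guesses (so its true accuracy is 0.5), and the candidate that looks best on a 100-row test set is then rescored on fresh data.

```python
import random

rng = random.Random(0)
N_TEST, N_FRESH, N_CANDIDATES = 100, 10_000, 50


def accuracy(predictions, labels):
    """Fraction of positions where prediction and label agree."""
    return sum(p == t for p, t in zip(predictions, labels)) / len(labels)


# Labels are pure noise, so no candidate can truly beat 0.5.
test_labels = [rng.randint(0, 1) for _ in range(N_TEST)]
fresh_labels = [rng.randint(0, 1) for _ in range(N_FRESH)]

best_test_score = -1.0
fresh_score_of_best = None
for _ in range(N_CANDIDATES):
    # Each candidate guesses at random on both the test set and the fresh set.
    test_predictions = [rng.randint(0, 1) for _ in range(N_TEST)]
    fresh_predictions = [rng.randint(0, 1) for _ in range(N_FRESH)]
    test_score = accuracy(test_predictions, test_labels)
    if test_score > best_test_score:
        # Keep whichever candidate looks best on the reused test set.
        best_test_score = test_score
        fresh_score_of_best = accuracy(fresh_predictions, fresh_labels)

print(f"best score on the reused test set: {best_test_score:.2f}")
print(f"same candidate on fresh data:      {fresh_score_of_best:.2f}")
```

The winner's test score sits above 0.5 purely because it was selected on that split; on fresh data it falls back toward chance.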

The k-fold idea

The insight in k-fold cross-validation is elementary but consequential: if one test split is a noisy estimate, run k of them and average.

Partition your data into k equal chunks — call them folds. For each fold, train on the remaining k-1 folds and score on the held-out one. You now have k scores from k non-overlapping test sets that together cover every row exactly once. Average them. That average is a better estimate of generalization than any single split, and the standard deviation across the k scores tells you something a single split cannot: how stable the estimate is.

Each round tests on a different fold; together they cover the full dataset exactly once. The variance across scores is as informative as the mean.
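The partitioning step can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the function name k_fold_indices is hypothetical, the folds are contiguous, and it assumes the rows have already been shuffled if their order carries no meaning.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_indices, test_indices) for k contiguous folds.

    Fold sizes differ by at most one row when n_rows is not divisible by k.
    """
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, test
        start += size


splits = list(k_fold_indices(n_rows=10, k=5))
for train, test in splits:
    print("train:", train, "test:", test)

# Every row appears in exactly one test fold.
all_test_rows = sorted(i for _, test in splits for i in test)
print("covered once:", all_test_rows == list(range(10)))
```

In practice you would train and score a model inside the loop and collect the k scores; the index bookkeeping is the only part shown here.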

The standard deviation is the part people underuse. If your five scores are 0.81, 0.84, 0.83, 0.79, 0.85 — mean 0.824, std 0.022 — that tells you the model is reasonably stable. If your five scores are 0.62, 0.97, 0.78, 0.91, 0.84 — same mean (0.82), std 0.12 — you have a model that is wildly sensitive to which rows it sees. The mean is the same. The models are not.
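The two sets of scores above can be checked directly. The figures quoted match the population standard deviation (dividing by k); the sample version (dividing by k-1) comes out slightly larger, at about 0.024 and 0.135. The variable names below are illustrative.

```python
import statistics

stable = [0.81, 0.84, 0.83, 0.79, 0.85]
erratic = [0.62, 0.97, 0.78, 0.91, 0.84]

for name, scores in (("stable", stable), ("erratic", erratic)):
    mean = statistics.fmean(scores)
    population_std = statistics.pstdev(scores)  # divides by k
    sample_std = statistics.stdev(scores)       # divides by k - 1
    print(f"{name}: mean={mean:.3f}, "
          f"population std={population_std:.3f}, sample std={sample_std:.3f}")
```

Whichever convention you pick, report it alongside the mean so readers can compare like with like.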

A high standard deviation is usually a symptom of one of three things: too little data in each fold, a model that is very high-variance (deep trees, high-degree polynomials, unregularized neural nets), or genuine heterogeneity in the data that the partition accidentally surfaces. All three are diagnostic signals worth chasing, not noise to be averaged away.

The leakage trap that swallows honest estimates

This is the part that quietly ruins more CV setups than any other mistake, and it is subtle enough that experienced practitioners get it wrong.

Suppose your pipeline is: impute missing values using column means, scale features to zero mean and unit variance, fit a model. You compute the column means on all rows before the fold split, then scale all rows using those means and standard deviations, then split into folds, then train and score.

You have just leaked. The scaling statistics — means and standard deviations — were computed using information from the test fold. The test fold contributed its values to the scaler’s fit. When the model sees the test fold, it is not encountering truly unseen data; the test fold’s distribution has already seeped into the preprocessing. On a large dataset this effect is small. On a 200-row dataset it is not, and even on large datasets it is a methodological lie that will misrepresent performance on genuinely new data.

The correct setup is mechanical but non-negotiable: every preprocessing step that learns anything from data (imputers, scalers, encoders, feature selectors, PCA) must be fit on the training folds only, then applied to the test fold. Nothing may look at the test fold before the scoring moment. In sklearn this is exactly what a Pipeline object enforces, provided the whole pipeline, rather than already-transformed data, is what you hand to the cross-validation routine. It is not a convenience feature; it is the thing that makes the score mean what you claim it means.

The intuition is simple: in production, when a new row arrives, you will apply a scaler that was fit on training data. The CV score should simulate exactly that situation, not a situation where the scaler has already seen the new row.
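A leak-free version of the impute, scale, fit pipeline described above might look like the following sketch. The synthetic dataset, the 5-percent missingness, and the choice of logistic regression are all assumptions made for illustration; the point is only that the imputer and scaler live inside the object that gets cross-validated.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic dataset with some values knocked out at random.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

# Imputer and scaler are pipeline steps, so each is refit on the
# training folds of every split and only applied to the test fold.
pipeline = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)

scores = cross_val_score(pipeline, X, y, cv=5)
print(f"mean={scores.mean():.3f}, std={scores.std():.3f}")
```

Calling the imputer or scaler on the full matrix first and passing the transformed array to cross_val_score would reproduce the leak, even though the code would look almost identical.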

How many folds?

The canonical answer is 5 or 10, and the reasoning behind that range is worth understanding rather than memorizing.

With k=2, each model is trained on half the data. That is a substantially smaller dataset than what you will deploy with, so the score systematically underestimates the performance of the final model (which will be trained on all the data). The estimate has high bias.

With k=n — leave-one-out cross-validation, where each test set is a single example — each model is trained on almost all the data, so the bias problem disappears. But the n models’ scores are highly correlated, because their training sets differ by only one row. Correlated estimates average into an estimate with low bias but high variance. Leave-one-out is also expensive: n model fits instead of k.

k=5 and k=10 sit in a sweet spot between those extremes. They use 80 or 90 percent of the data for training in each fold, limiting the training-set-size bias; and their training sets differ from one another enough to give reasonably diverse estimates, limiting the correlation and thus the variance of the average. (The test folds themselves never overlap; it is the training sets that share rows.)

The general rule: if your dataset is large (tens of thousands of rows), k=5 is sufficient — the folds are big enough to be representative. If your dataset is small (hundreds of rows), k=10 or even leave-one-out is worth the compute cost because every row matters.
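The tradeoff can be made concrete with a little arithmetic: for each k, how many rows does each model train on, and how much of one training set is shared with another fold's training set? The sketch below assumes a 200-row dataset that divides evenly into k folds; fold_geometry is a hypothetical helper name.

```python
def fold_geometry(n_rows, k):
    """Return (training rows per fold, fraction of a training set shared
    with the training set of any other fold). Assumes k divides n_rows."""
    test_size = n_rows / k
    train_size = n_rows - test_size
    # Two different folds' training sets share everything except
    # the two test folds involved.
    shared_rows = n_rows - 2 * test_size
    return train_size, shared_rows / train_size


for label, k in (("k=2", 2), ("k=5", 5), ("k=10", 10), ("leave-one-out", 200)):
    train_size, overlap = fold_geometry(200, k)
    print(f"{label:>13}: train rows={train_size:.0f}, overlap={overlap:.3f}")
```

At k=2 the training sets share nothing but are only half the data; at leave-one-out they are nearly full size but almost identical to each other.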

When standard k-fold breaks

Two common situations make vanilla k-fold give wrong answers, and both require a structural modification.

Imbalanced classes. If 5 percent of your rows are the positive class, a random fold partition might put all of them in two folds and none in the other three. Scoring on a fold that contains no positives at all gives you a trivially high accuracy (predict negative always) and tells you nothing about whether the model learned the positive class. Stratified k-fold preserves the class proportion in each fold. It is a simple mechanical fix with a large impact on estimate reliability.
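The difference is easy to see by counting positives per test fold. The example below uses a deliberately extreme, assumed arrangement (100 rows, the 5 positives sorted to the end, no shuffling) so that the plain partition fails in the most visible way.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# 95 negatives followed by 5 positives: a 5-percent positive class.
y = np.array([0] * 95 + [1] * 5)
X = np.zeros((len(y), 1))  # features are irrelevant to how rows are split

plain = [int(y[test].sum()) for _, test in KFold(n_splits=5).split(X)]
stratified = [
    int(y[test].sum()) for _, test in StratifiedKFold(n_splits=5).split(X, y)
]

print("positives per test fold, plain KFold:     ", plain)
print("positives per test fold, StratifiedKFold: ", stratified)
```

Shuffling the plain partition makes the all-in-one-fold outcome less likely but does not rule out empty folds; stratifying removes the possibility.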

Time series. The IID assumption (that rows are independent, identically distributed draws from the same population) is violated the moment your data has temporal order. With shuffled folds, rows from the future bleed into training; the model learns to exploit patterns that will not exist when you actually deploy it; the score looks optimistic because the test fold is not really unseen. Time-series cross-validation (also called walk-forward validation) enforces temporal order: the training set for each fold contains only rows that precede the test rows in time. The test set always comes after. You pay a cost: early folds have tiny training sets, so the first few scores are noisy. But the average is honest in a way that shuffled k-fold is not.
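A walk-forward split with an expanding training window can be sketched as follows. The function name and its parameters are hypothetical, and the rows are assumed to be sorted by time already.

```python
def walk_forward_splits(n_rows, n_splits, test_size):
    """Yield (train_indices, test_indices) with an expanding training window.

    Rows are assumed to be in chronological order; every training index
    precedes every test index in the same split.
    """
    first_test_start = n_rows - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough rows for the requested splits")
    for i in range(n_splits):
        test_start = first_test_start + i * test_size
        train = list(range(0, test_start))
        test = list(range(test_start, test_start + test_size))
        yield train, test


splits = list(walk_forward_splits(n_rows=10, n_splits=4, test_size=2))
for train, test in splits:
    print("train:", train, "test:", test)
```

The first split trains on only two rows, which is the noisy-early-folds cost in miniature. scikit-learn's TimeSeriesSplit provides a similar expanding-window splitter.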

The deeper principle is that the fold structure must mirror the deployment structure. When a model goes to production, what kind of data will it have seen at training time, and what kind will it encounter? The CV setup should replicate that boundary exactly.

The outer loop people forget

Cross-validation is about estimating generalization. The moment the same score is also used to select hyperparameters, it starts to drift optimistic for the same reason a single test set does: you are making decisions against it, and the configuration that maximizes it was selected partly because it was lucky on this particular partition.

The correct approach — nested cross-validation — wraps hyperparameter tuning inside an outer loop. The outer loop has k folds that are never touched by the tuning process. Inside each outer fold, an inner cross-validation loop (on the training portion only) finds the best hyperparameters. The outer fold score is then measured using those hyperparameters on the outer test fold, which participated in neither tuning nor model fitting. The outer scores give you an unbiased estimate of how the best-tuned version of your pipeline generalizes.

Nested CV is expensive: if you have 5 outer folds and 5 inner folds, you are fitting 25 models for every hyperparameter configuration you try. Most practitioners skip it. That is a defensible engineering tradeoff on large datasets where the bias is small. On small datasets, skipping nested CV means your model-selection score is lying to you, and there is no easy way to know by how much.
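In scikit-learn, one common way to express nested cross-validation is to wrap a grid search (the inner loop) inside cross_val_score (the outer loop). The sketch below is illustrative only: the synthetic data, the logistic-regression model, and the four-value grid for C are assumptions, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
# make_pipeline names each step after its lowercased class name.
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)

# Inner loop: tuning sees only the training portion of each outer fold.
search = GridSearchCV(pipeline, param_grid, cv=inner_cv)

# Outer loop: each outer test fold scores a model whose hyperparameters
# were chosen without it.
outer_scores = cross_val_score(search, X, y, cv=outer_cv)
print(f"nested CV: mean={outer_scores.mean():.3f}, std={outer_scores.std():.3f}")
```

With 4 configurations, 5 inner folds, and 5 outer folds, this runs 100 tuning fits plus one refit per outer fold, which is the cost the paragraph above describes.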

What the score is actually estimating

There is a subtle point that most treatments skip: the CV score is an estimate of the expected performance of a model trained on a dataset of size (k-1)/k * n — not of a model trained on all n rows.

When you finalize and deploy, you train on all n rows. That model will generally outperform the CV models because it has seen more data. So the CV score is slightly pessimistic relative to your deployed model, particularly when n is small and the learning curve is still steep. This is a feature, not a bug — pessimism is the right direction for an evaluation procedure to err in.

The flip side: when someone reports a CV score on a 200-row dataset using 10-fold, each model trains on 180 rows. The deployed model trains on 200. If those 20 additional rows sit in a steep region of the learning curve, the deployed model may be meaningfully better. The CV score is still the best honest estimate you have, but it is an estimate of a subtly different quantity than the final model’s performance.

The habit worth building

The discipline of cross-validation is not primarily a statistical technique. It is a habit of intellectual honesty about what you know and what you are inferring.

A single test score is a fact about one partition of one dataset. A CV score is an estimate with a standard error you can actually report. A nested CV score is an estimate that has not been contaminated by the choices you made. Each step narrows the gap between what your score says and what your deployed model will do.

The Kaggle story at the top is not unusual. It happens in industry too, more quietly — in notebooks where the test set was peeked at once, twice, a dozen times, until the reported number was something worth showing in a meeting. The number is real. The confidence it warrants is not.

Cross-validation does not prevent you from overfitting; it prevents you from overfitting to your evaluation procedure. That is a different thing, and in practice it matters more.