What is k-fold cross-validation and when should you use it over a single train/validation split?
K-fold CV partitions the data into k roughly equal folds. For each fold in turn, it trains on the other k-1 folds and validates on the held-out fold, then averages the k scores. This gives a lower-variance estimate of generalization error than a single split, and it is preferred when the dataset is small enough that a single held-out set would be too noisy or would waste training data.
How to think about it
A single validation split has high variance: the score depends heavily on which examples end up in the validation set. K-fold reduces this variance by rotating the validation fold across the entire dataset, so every example is used for validation exactly once.
Procedure:
- Shuffle and partition n samples into k equal folds.
- For i = 1 … k: train on all folds except fold i, evaluate on fold i.
- Report mean ± std of the k scores.
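The procedure above can be written out by hand as a rough sketch. The synthetic dataset, the logistic-regression model, and the helper name manual_kfold_scores are assumptions made for illustration; none of them come from the original answer.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)


def manual_kfold_scores(X, y, k=5, seed=0):
    """Hypothetical helper: return one validation accuracy per fold."""
    # Step 1: shuffle and partition the sample indices into k folds.
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)
    scores = []
    # Step 2: train on k-1 folds, evaluate on the held-out fold.
    for train_idx, val_idx in kf.split(X):
        model = LogisticRegression(max_iter=1000)  # fresh model per fold
        model.fit(X[train_idx], y[train_idx])
        preds = model.predict(X[val_idx])
        scores.append(accuracy_score(y[val_idx], preds))
    return np.array(scores)


# Step 3: report mean and standard deviation of the k scores.
scores = manual_kfold_scores(X, y, k=5)
print(f"Accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

A new model is created inside the loop so that nothing learned on one fold carries over to the next.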
Choosing k:
- k = 5 or k = 10 are standard defaults; empirical studies show 10-fold balances bias and variance well.
- k = n (leave-one-out CV, LOOCV) is nearly unbiased but computationally expensive, and its score estimate can have high variance.
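To make the cost side of this trade-off concrete, the sketch below counts how many model fits each choice of k requires and what fraction of the data each fit trains on. The 100-row placeholder array is an assumption; only its row count matters here.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

# Placeholder features (assumption): only the number of rows is used.
X = np.zeros((100, 3))
n = X.shape[0]

# Number of model fits required by each scheme.
fits = {
    "5-fold": KFold(n_splits=5).get_n_splits(X),
    "10-fold": KFold(n_splits=10).get_n_splits(X),
    "LOOCV": LeaveOneOut().get_n_splits(X),  # one fit per sample
}

# Each fit trains on (k - 1) / k of the data.
train_fraction = {name: (k - 1) / k for name, k in fits.items()}

for name in fits:
    print(f"{name}: {fits[name]} fits, "
          f"{train_fraction[name]:.0%} of data per fit")
```

Larger k means each model sees more of the data, but the number of fits grows with it, reaching n fits for LOOCV.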
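from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary-classification data as a stand-in; substitute your own X, y.
X, y = make_classification(n_samples=500, random_state=0)

# 5-fold CV; with a classifier, an integer cv uses stratified folds.
scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0),
    X, y, cv=5, scoring="roc_auc",
)
print(f"AUC: {scores.mean():.3f} +/- {scores.std():.3f}")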
When to prefer a single split: very large datasets (CV is computationally prohibitive), or when the pipeline has stateful preprocessing that is expensive to re-fit k times. Even then, keep the test set strictly held out.
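For the single-split case, one possible arrangement is to carve off the test set first and then split the remainder into training and validation sets. The 60/20/20 proportions and the synthetic data below are assumptions for illustration, not a recommendation from the original answer.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic data stands in for a large real dataset (assumption).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# First set aside the test set; it is not touched during model selection.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Then split the remainder: 0.25 of the remaining 80% is 20% of the total.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=0
)

print(len(X_train), len(X_val), len(X_test))
```

Model selection then uses only the training and validation sets; the test set is scored once at the end.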