datarekha
Machine Learning · Easy · Asked at Google, Amazon, Meta

Why do we split data into train, validation, and test sets, and what are the typical proportions?

The short answer

The train set fits the model, the validation set tunes hyperparameters and guides model selection, and the held-out test set provides an unbiased estimate of final generalization error. Using the test set during development causes optimistic bias because the evaluation signal leaks into decisions.

How to think about it

Three-way splitting enforces an information barrier between model development and final evaluation.

Train set — the model sees these examples and updates its parameters against them. Typically 60–80% of data.

Validation set — used to compare candidate models, tune hyperparameters (learning rate, tree depth, regularization strength), and apply early stopping. The model never trains on it, but the practitioner makes decisions based on it, so it indirectly shapes the final model. Typically 10–20%.

Test set — touched exactly once, after all development is frozen. Its sole purpose is reporting the generalization metric. Typically 10–20%.
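The three roles above can be sketched end to end. This is a minimal illustration, not a recommended pipeline: the synthetic dataset from `make_classification`, the `LogisticRegression` model, and the candidate values of `C` are all assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (an assumption for illustration).
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# 60 / 20 / 20 split: hold out the test set first, then split the remainder.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=0, stratify=y_temp
)

# Train set: fit each candidate. Validation set: compare candidates.
candidates = [0.01, 0.1, 1.0, 10.0]  # regularization strengths to try
val_scores = {}
for C in candidates:
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_train, y_train)
    val_scores[C] = model.score(X_val, y_val)

best_C = max(val_scores, key=val_scores.get)

# Test set: used once, after the choice of C is frozen.
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)
print(f"best C = {best_C}, test accuracy = {test_accuracy:.3f}")
```

Note that the test set plays no part in choosing `best_C`. If it did, the printed accuracy would no longer be a clean estimate.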

Common splits by dataset size:

| Dataset size | Typical split (train / validation / test) |
| --- | --- |
| Small (thousands) | 60 / 20 / 20 |
| Medium (hundreds of thousands) | 70 / 15 / 15 |
| Large (millions+) | 98 / 1 / 1 (absolute count matters more than percentage) |
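
from sklearn.model_selection import train_test_split

# X (features) and y (class labels) are assumed to be defined already.
# Step 1: carve out 15% of the full dataset as the test set.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42, stratify=y
)
# Step 2: take validation from the remaining 85%. Here test_size is a
# fraction of X_temp, so 15% of the total is 0.15 / 0.85 (about 0.176).
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.15 / 0.85, random_state=42, stratify=y_temp
)
# Result: roughly 70 / 15 / 15 (train / validation / test)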
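
In a two-step split, the second `test_size` is a fraction of the remainder, not of the full dataset, so the overall proportions take a short calculation. The sketch below works through that arithmetic. The helper names `second_stage_fraction` and `overall_fractions` are hypothetical and not part of scikit-learn.

```python
def second_stage_fraction(val_frac: float, test_frac: float) -> float:
    """Fraction of the post-test remainder that equals val_frac of the full data."""
    if val_frac <= 0 or test_frac <= 0 or val_frac + test_frac >= 1:
        raise ValueError("val_frac and test_frac must be positive and sum to less than 1")
    return val_frac / (1.0 - test_frac)


def overall_fractions(first_test_size: float, second_test_size: float):
    """Overall (train, val, test) fractions produced by a two-step split."""
    test = first_test_size
    val = (1.0 - first_test_size) * second_test_size
    train = 1.0 - test - val
    return train, val, test


# Taking 18% of the remaining 85% gives about 15.3% of the full dataset.
train, val, test = overall_fractions(0.15, 0.18)
print(f"{train:.3f} / {val:.3f} / {test:.3f}")  # 0.697 / 0.153 / 0.150

# For validation to be exactly 15% of the total, the second fraction is 0.15 / 0.85.
print(round(second_stage_fraction(0.15, 0.15), 4))  # 0.1765
```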

Why not just train/test? Every time you peek at test-set performance and adjust something, the test set becomes a de facto validation set and you no longer have a clean held-out estimate. This is called test-set peeking and leads to overly optimistic reported performance.
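
A small simulation illustrates the optimistic bias. This is a sketch under artificial assumptions: the labels are coin flips that do not depend on the features, so no classifier can truly beat 50% accuracy, and the "candidate models" are random linear classifiers rather than anything trained. Picking the candidate with the best test-set score still reports a number well above chance, while the same candidate on fresh data falls back to about 50%.

```python
import numpy as np

rng = np.random.default_rng(0)
n_test, n_features, n_candidates = 200, 10, 500


def accuracy(w, X, y):
    """Accuracy of the linear rule 'predict 1 when X @ w > 0'."""
    return float(((X @ w > 0).astype(int) == y).mean())


# Labels are independent of the features: true accuracy is 50% for any model.
X_test = rng.normal(size=(n_test, n_features))
y_test = rng.integers(0, 2, size=n_test)

# Candidate "models": random weight vectors, standing in for many tuning attempts.
W = rng.normal(size=(n_candidates, n_features))
test_acc = np.array([accuracy(w, X_test, y_test) for w in W])

# Peeking: choose the candidate by its test-set score, then report that score.
best_w = W[test_acc.argmax()]
peeked_accuracy = float(test_acc.max())

# The same candidate evaluated on fresh data it was never selected on.
X_fresh = rng.normal(size=(20_000, n_features))
y_fresh = rng.integers(0, 2, size=20_000)
fresh_accuracy = accuracy(best_w, X_fresh, y_fresh)

print(f"reported after peeking: {peeked_accuracy:.3f}")
print(f"on fresh data:          {fresh_accuracy:.3f}")
```

The gap between the two printed numbers comes entirely from selecting on the evaluation set. A separate validation set absorbs that selection effect so the test score stays clean.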
