datarekha

Walk me through how you'd select between competing models without fooling yourself with data leakage.

The short answer

Split data into train, validation, and test sets (or use cross-validation), tune and compare models only on train/validation, and touch the test set exactly once at the end. Fit all preprocessing inside the cross-validation pipeline so transformers never see validation data, and for tuning plus honest evaluation use nested cross-validation. For time series, use forward-chaining splits to avoid leaking future information.

How to think about it

The crisp answer

Keep a strict separation between data used to build/tune models and data used to judge them. Train and tune on training/validation folds, compare candidates there, and reserve a test set you touch exactly once for the final, unbiased number. Put all preprocessing inside the CV pipeline so nothing leaks.

The procedure

  1. Split: train / validation / test, or k-fold CV plus a held-out test set.
  2. Tune each candidate’s hyperparameters using only train/validation (cross-validation).
  3. Compare models on validation performance with a metric tied to the business goal.
  4. Evaluate the single chosen model on the test set once.
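The four steps above can be sketched with scikit-learn. This is a minimal illustration, not a prescribed recipe: the synthetic dataset, the two candidate models, their parameter grids, and the ROC AUC metric are all assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Step 1: hold out a test set that is not used again until the very end.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Step 2: tune each candidate with cross-validation on the development data only.
# The candidates and grids here are hypothetical examples.
candidates = {
    "logreg": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    "forest": (
        RandomForestClassifier(n_estimators=50, random_state=0),
        {"max_depth": [3, None]},
    ),
}
searches = {}
for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, cv=5, scoring="roc_auc")
    search.fit(X_dev, y_dev)
    searches[name] = search

# Step 3: compare candidates on their cross-validated (validation) scores.
best_name = max(searches, key=lambda n: searches[n].best_score_)

# Step 4: score the single chosen model on the test set, once.
test_score = searches[best_name].score(X_test, y_test)
print(best_name, round(test_score, 3))
```

The test split is created before any tuning and appears only in the last two lines, which is the property the procedure is meant to guarantee.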

Avoiding leakage

The most common leak is fitting scalers, imputers, or target encoders on the whole dataset before splitting. Fit these inside each fold via a pipeline, so their statistics come from the training portion only. Other leaks: features computed using future or target-derived information, duplicate rows spanning train and test, and reusing the test set to make decisions.
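As a sketch of keeping preprocessing inside the folds, the example below wraps an imputer, a scaler, and a model in a scikit-learn pipeline and hands the whole pipeline to `cross_val_score`. The synthetic data, the choice of transformers, and the accuracy metric are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Leaky pattern (avoid): calling StandardScaler().fit_transform(X) before
# splitting lets validation rows influence the scaler's mean and variance.

# Safer pattern: the transformers live inside the estimator being validated.
pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)

# cross_val_score refits a fresh copy of the whole pipeline on each training
# fold, so the imputer and scaler never see that fold's validation rows.
scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
print(scores)
```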

When tuning and evaluating together

If you both tune and want an honest estimate from the same data, use nested cross-validation so model selection never sees the outer test fold. For time series, use forward-chaining (expanding-window) splits so you never train on data that comes after your validation period.
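Both ideas can be sketched briefly in scikit-learn. In the nested setup, a `GridSearchCV` (inner loop) is itself the estimator passed to `cross_val_score` (outer loop). `TimeSeriesSplit` then shows what forward-chaining folds look like. The data, the grid, and the fold counts are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    GridSearchCV,
    KFold,
    TimeSeriesSplit,
    cross_val_score,
)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
# make_pipeline names each step after its lowercased class name.
param_grid = {"logisticregression__C": [0.1, 1.0, 10.0]}

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)  # tuning
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # evaluation

# Inner loop: hyperparameter search. Outer loop: each outer fold reruns the
# search on its own training part, so selection never sees the outer test fold.
tuned = GridSearchCV(pipeline, param_grid, cv=inner_cv)
nested_scores = cross_val_score(tuned, X, y, cv=outer_cv)
print(nested_scores.mean())

# Forward-chaining splits: rows are assumed to be sorted by time.
ordered = np.arange(20).reshape(-1, 1)
splits = list(TimeSeriesSplit(n_splits=4).split(ordered))
for train_idx, val_idx in splits:
    # Every training index precedes every validation index.
    print(train_idx.max(), "<", val_idx.min())
```

The nested score estimates the performance of the whole tune-then-fit procedure rather than of one fixed hyperparameter setting.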

The common trap

“Peeking”: checking test performance, tweaking the model, and rechecking. Each round silently turns the test set into a validation set and inflates results, so decide the metric and protocol up front.

Follow-up: “How would you compare two models’ scores?” Use the same CV folds for both and look at the distribution and variance of scores across folds, not a single point estimate.
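One way to compare two models on identical folds is sketched below. Both models are scored with the same seeded splitter, and the per-fold differences are summarized alongside the means. The two models, the fold count, and the synthetic data are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# A fixed random_state means both models are scored on identical folds.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

scores_a = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
scores_b = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

# Paired per-fold differences: fold i of A is compared with fold i of B.
diffs = scores_a - scores_b

print(f"A: {scores_a.mean():.3f} +/- {scores_a.std(ddof=1):.3f}")
print(f"B: {scores_b.mean():.3f} +/- {scores_b.std(ddof=1):.3f}")
print(f"A - B per fold: {diffs.mean():.3f} +/- {diffs.std(ddof=1):.3f}")
print("folds where A wins:", int((diffs > 0).sum()), "of", len(diffs))
```

Fold scores share training data and are therefore not independent, so the spread here is a rough guide to stability rather than a formal significance test.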
