Why can't you shuffle a time series before splitting into train and test sets?
Shuffling destroys temporal order, so the model trains on observations that postdate the ones it is evaluated on, which is a direct information leak. Time series observations are serially correlated: past values help predict future ones, so a random split scatters near-identical neighboring points across train and test, and the test score no longer measures forecasting ability.
How to think about it
Keep the answer tight: shuffling breaks causality and leaks the future. Interviewers want to hear you name the exact mechanism, not just say “order matters.”
What goes wrong
A time series is a sequence in which the observation at time t depends on the observations at t-1, t-2, and so on. That dependence is the signal you are trying to learn. When you shuffle:
- Training rows include timestamps that come after test rows. The model has effectively seen the future during training.
- Rolling statistics, lags, and any other feature derived from prior rows, if computed after the shuffle, are built from arbitrary rows rather than the preceding time steps, so their values are meaningless.
- Test-set metrics look great offline because the test rows are surrounded by training rows from the same period, but the model degrades in production, where time flows forward and no future rows are available.
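The failure modes above can be made concrete with a small sketch. It assumes a hypothetical series of 100 time steps and an 80/20 split; the variable names and the fixed seed are illustrative only. It counts training rows that postdate the earliest test row, and checks how many "previous row" lags still point at t-1 after shuffling.

```python
import random

# Hypothetical series: 100 observations indexed by time step 0..99.
timestamps = list(range(100))

# Shuffle, then take an 80/20 split (the pattern to avoid).
rng = random.Random(0)  # fixed seed so the run is repeatable
shuffled = timestamps[:]
rng.shuffle(shuffled)
bad_train, bad_test = shuffled[:80], shuffled[80:]

# Training rows later than the earliest test row are "future" rows.
leaked = [t for t in bad_train if t > min(bad_test)]

# A "previous row" lag taken on the shuffled order rarely points at t-1.
broken_lags = sum(
    1 for prev, cur in zip(shuffled, shuffled[1:]) if cur - prev != 1
)

# Chronological split for comparison: first 80 steps train, last 20 test.
good_train, good_test = timestamps[:80], timestamps[80:]
good_leaked = [t for t in good_train if t > min(good_test)]

print(f"shuffled: {len(leaked)} of {len(bad_train)} training rows postdate the earliest test row")
print(f"shuffled: {broken_lags} of {len(shuffled) - 1} row-to-row lags are not t-1")
print(f"chronological: {len(good_leaked)} leaked rows")
```

In the chronological split the leaked count is zero by construction, which is the property the next section relies on.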
The correct split
Place the split at a single cutoff point. Everything before it is train; everything after is test.
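import pandas as pd
# Load the data and sort by date so row order matches time order.
df = pd.read_csv("sales.csv", parse_dates=["date"], index_col="date").sort_index()
cutoff = pd.Timestamp("2023-12-31")
# Label slices such as df.loc[:cutoff] and df.loc[cutoff:] include both endpoints,
# so they would put the cutoff rows in both sets. Boolean masks keep the sets disjoint.
train = df.loc[df.index <= cutoff]  # up to and including the cutoff
test = df.loc[df.index > cutoff]  # strictly after the cutoff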
For hyperparameter tuning, extend this to walk-forward (expanding-window) cross-validation so each fold’s validation set is always in the future relative to its training set.
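A minimal sketch of expanding-window splitting, assuming the rows are already sorted by time. The helper `walk_forward_splits` and its parameters are hypothetical names written for this illustration; scikit-learn's `TimeSeriesSplit` offers the same kind of split if you prefer a library implementation.

```python
def walk_forward_splits(n_samples, n_folds, test_size):
    """Yield (train_idx, val_idx) pairs with an expanding training window.

    Hypothetical helper: assumes rows are already sorted by time.
    """
    first_val_start = n_samples - n_folds * test_size
    if first_val_start <= 0:
        raise ValueError("not enough samples for the requested folds")
    for fold in range(n_folds):
        val_start = first_val_start + fold * test_size
        val_end = val_start + test_size
        # Train on everything before the validation window, never after it.
        yield list(range(val_start)), list(range(val_start, val_end))


splits = list(walk_forward_splits(n_samples=12, n_folds=3, test_size=2))
for train_idx, val_idx in splits:
    print(f"train 0..{train_idx[-1]} -> validate {val_idx}")
```

Each fold trains on a longer history than the one before it, and every validation index is larger than every training index in the same fold.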
Why random k-fold is doubly wrong
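Random k-fold hurts in two ways. First, every fold's training split contains observations that come after some of its validation observations, so the model sees the future. Second, because the rows are scattered at random, almost every validation point has immediate neighbors (t-1 or t+1) in the training split, and serial correlation lets the model interpolate between them instead of forecasting. Both effects inflate apparent performance.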
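To see how often random folds sit next to their own training data, here is a rough sketch under assumed settings (100 time steps, 5 folds, a fixed seed). The fold assignment is written by hand for illustration and is not any library's implementation.

```python
import random

n_samples, n_folds = 100, 5
rng = random.Random(0)  # fixed seed so the run is repeatable
order = list(range(n_samples))
rng.shuffle(order)

# Deal the shuffled time steps into k folds, as a random k-fold would.
folds = [sorted(order[i::n_folds]) for i in range(n_folds)]

results = []
for k, val in enumerate(folds):
    val_set = set(val)
    train_set = set(range(n_samples)) - val_set
    # Training rows that come after the earliest validation row.
    future_rows = sum(1 for t in train_set if t > min(val))
    # Validation rows with an immediate neighbor (t-1 or t+1) in training.
    with_neighbor = sum(
        1 for t in val if (t - 1) in train_set or (t + 1) in train_set
    )
    results.append((future_rows, with_neighbor, len(val)))
    print(f"fold {k}: {future_rows} future training rows, "
          f"{with_neighbor}/{len(val)} validation rows adjacent to training rows")
```

With walk-forward splits the first count is zero in every fold, and only the first validation row of each window touches the training data.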