How do you engineer lag and rolling features for a time series ML model, and what leakage risks arise?
Lag features shift the target or exogenous variables backward by k steps so the model sees past values as inputs; rolling features (rolling mean, std, min, max) summarise a window of past observations. Both must be computed strictly on data available at forecast time: any feature that incorporates the current target value, or information from future rows, leaks the target and inflates metrics.
How to think about it
Cover the mechanics of creating lag/rolling features correctly, the leakage failure mode, and how to prevent it. This is one of the most practically important time series questions for ML roles.
Lag features
A lag-k feature for target y at time t is simply y[t-k]. It gives the model direct access to past observations as predictors.
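import pandas as pd
# sales_series: the observed target, assumed sorted by time at a regular frequency
df = pd.DataFrame({"y": sales_series})
# Lags: lag_k at row t holds y[t-k]
for k in [1, 2, 3, 7, 14]:
    df[f"lag_{k}"] = df["y"].shift(k)
# Drop the leading rows where at least one lag is still NaN
df.dropna(inplace=True)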
After shift(k), row t holds the value from t-k in its lag_k column, so no future information is present. The dropna removes the leading rows where any lag is undefined: the first 14 rows here, because the largest lag is 14.
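To check the alignment, here is a minimal sketch on a toy five-day series. The helper name add_lag_features and the toy values are illustrative assumptions, not part of any library. It also assumes the rows are sorted by time at a regular frequency, because shift(k) moves values by k rows, not by k calendar periods.

```python
import pandas as pd


def add_lag_features(df, col, lags):
    """Hypothetical helper: return a copy of df with one lag_k column per k in lags."""
    out = df.copy()
    for k in lags:
        # Row t receives the value that was observed k rows earlier
        out[f"lag_{k}"] = out[col].shift(k)
    # Leading rows without a full set of lags are dropped
    return out.dropna()


toy = pd.DataFrame(
    {"y": [10.0, 11.0, 12.0, 13.0, 14.0]},
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)
lagged = add_lag_features(toy, "y", [1, 2])
print(lagged)
```

Each surviving row pairs y[t] only with earlier values. If dates are missing from the index, row-based shifting silently misaligns the lags, so reindexing to a complete date range first is a common safeguard.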
Rolling (window) features
Rolling statistics summarise a window of past values. The window must be anchored to the past only — use closed="left" or manually shift before rolling.
# Correct: shift first, then roll — window is [t-w, t-1]
df["roll_mean_7"] = df["y"].shift(1).rolling(7).mean()
df["roll_std_7"] = df["y"].shift(1).rolling(7).std()
df["roll_max_7"] = df["y"].shift(1).rolling(7).max()
Without the shift(1), the rolling window at row t includes the current value y[t], which is the target — a direct label leak.
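The difference is easy to see on a small series. The sketch below compares the two safe constructions with the leaky one; the values and the window size w = 3 are made up for illustration, and closed="left" on a fixed-size window assumes a reasonably recent pandas release.

```python
import numpy as np
import pandas as pd

y = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
w = 3

# Safe: shift first, so the window at row t covers y[t-w] .. y[t-1]
safe_shift = y.shift(1).rolling(w).mean()

# Safe: same window, expressed by excluding the right (current) endpoint
safe_closed = y.rolling(w, closed="left").mean()

# Leaky: the default window at row t covers y[t-w+1] .. y[t], including the target
leaky = y.rolling(w).mean()

comparison = pd.DataFrame(
    {"y": y, "safe_shift": safe_shift, "safe_closed": safe_closed, "leaky": leaky}
)
print(comparison)
```

At row 3 (y = 4.0), both safe versions average 1, 2 and 3, while the leaky version averages 2, 3 and 4 and therefore already contains the value being predicted.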
Leakage taxonomy for time series features
| Feature type | Leak risk | Safe construction |
|---|---|---|
| lag_k (k ≥ 1) | None if shift is applied | df["y"].shift(k) |
| rolling mean without shift | Direct label leak | df["y"].shift(1).rolling(w).mean() |
| future exogenous (e.g., tomorrow’s weather) | Look-ahead leak | Only use if genuinely available at forecast time |
| target-encoded group stats on full data | Cross-row leak | Compute inside each fold |
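The last row of the table can be sketched as follows. This is a minimal illustration that assumes a single time-based split; the column names (t, store, y), the cutoff, and the values are all hypothetical. The group statistic is fitted on the training rows only and then mapped onto the validation rows.

```python
import pandas as pd

df = pd.DataFrame({
    "t": [0, 1, 2, 3, 4, 5, 6, 7],
    "store": ["a", "b", "a", "b", "a", "b", "a", "b"],
    "y": [1.0, 10.0, 3.0, 20.0, 5.0, 30.0, 100.0, 200.0],
})

cutoff = 6  # rows with t < cutoff are training data, the rest are validation
train = df[df["t"] < cutoff].copy()
valid = df[df["t"] >= cutoff].copy()

# Safe: per-store mean learned from the training period only
store_mean = train.groupby("store")["y"].mean()
valid["store_mean"] = valid["store"].map(store_mean)

# Leaky: per-store mean over all rows, so validation targets shape the feature
leaky_mean = df.groupby("store")["y"].transform("mean")

print(valid)
print(leaky_mean.iloc[6:])
```

With several folds, the same fit-on-train, apply-to-validation step would be repeated inside each fold, so no validation target ever contributes to its own feature.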
Choosing which lags to include
For daily data, start with lag 1 (recent memory) plus lags 7 and 14 (weekly pattern). Use the partial autocorrelation function (PACF) of the target to identify statistically significant lags. For tree-based models (XGBoost, LightGBM), feature importance and SHAP values can prune uninformative lag columns afterward.
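A PACF screen might look like the sketch below. It estimates the PACF at lag k as the last coefficient of an OLS regression of y[t] on its first k lags, and flags lags outside the approximate 95% band of ±1.96/√n. The function name pacf_ols and the simulated AR(1) series are assumptions made for illustration; in practice a library routine such as statsmodels' pacf would normally be used instead of this hand-rolled version.

```python
import numpy as np


def pacf_ols(y, max_lag):
    """Hypothetical helper: PACF at lag k is the coefficient on y[t-k]
    when y[t] is regressed on y[t-1], ..., y[t-k]."""
    y = np.asarray(y, dtype=float)
    y = y - y.mean()
    n = len(y)
    values = []
    for k in range(1, max_lag + 1):
        # Column j holds y[t-j] for t = k .. n-1
        X = np.column_stack([y[k - j : n - j] for j in range(1, k + 1)])
        coef, *_ = np.linalg.lstsq(X, y[k:], rcond=None)
        values.append(coef[-1])
    return np.array(values)


# Simulated AR(1) series, y[t] = 0.7 * y[t-1] + noise (illustrative data only)
rng = np.random.default_rng(0)
n = 2000
noise = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.7 * y[t - 1] + noise[t]

pacf_vals = pacf_ols(y, max_lag=10)
band = 1.96 / np.sqrt(n)  # approximate 95% band under a white-noise null
significant = [k for k, v in enumerate(pacf_vals, start=1) if abs(v) > band]
print(significant)
```

For an AR(1) process the lag-1 partial autocorrelation should be close to the true coefficient and higher lags should mostly fall inside the band, although roughly one lag in twenty can cross it by chance. Candidate lags found this way are a starting point and should still be validated out of sample.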