datarekha
Time Series · Medium · Asked at Uber, Airbnb, Amazon, DoorDash

How do you engineer lag and rolling features for a time series ML model, and what leakage risks arise?

The short answer

Lag features shift the target or exogenous variables by k steps so that each row carries values from k steps earlier as inputs; rolling features (rolling mean, std, min, max) summarise a window of past observations. Both must be computed strictly from data available at prediction time: any feature built from the current or a future value of the target leaks the label and inflates offline metrics.

How to think about it

Cover the mechanics of creating lag/rolling features correctly, the leakage failure mode, and how to prevent it. This is one of the most practically important time series questions for ML roles.

Lag features

A lag-k feature for target y at time t is simply y[t-k]. It gives the model direct access to past observations as predictors.

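import pandas as pd

# sales_series: the target in chronological order, one row per period.
# Placeholder values are used here so the snippet runs on its own.
sales_series = pd.Series(range(100, 130), dtype="float64")
df = pd.DataFrame({"y": sales_series})

# Lag-k column: row t holds y[t-k]
for k in [1, 2, 3, 7, 14]:
    df[f"lag_{k}"] = df["y"].shift(k)

# Drop the warm-up rows (the first 14 here) that lack a full lag history
df = df.dropna()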

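After shift(k), row t contains the value from t-k in its lag column, so no future information is present. The dropna removes the first 14 rows, a count set by the largest lag, because not every lag exists there yet. It also drops any later row with a missing value in any column, so check for genuine gaps in y before relying on it.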
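One caveat worth checking before trusting a lag column: shift(k) moves values by k rows, not by k time periods. The sketch below uses made-up daily sales with one date missing to show how a gap silently misaligns a lag, and how reindexing to a regular daily calendar first avoids that. The dates, values, and variable names are illustrative assumptions, not part of the original example.

```python
import pandas as pd

# Hypothetical daily sales; 2024-01-04 is missing from the index.
dates = pd.to_datetime(
    ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-05", "2024-01-06"]
)
sales = pd.Series([10.0, 11.0, 12.0, 14.0, 15.0], index=dates, name="y")

# Row-based shift: the Jan 5 row receives the Jan 3 value (two days back).
naive_lag_1 = sales.shift(1)

# Reindex to a complete daily calendar first; the missing day becomes NaN.
regular = sales.asfreq("D")
calendar_lag_1 = regular.shift(1)

jan_5 = pd.Timestamp("2024-01-05")
print(naive_lag_1.loc[jan_5])     # the Jan 3 value, mislabelled as "yesterday"
print(calendar_lag_1.loc[jan_5])  # NaN, because Jan 4 was never observed
```

Whether to impute the resulting NaN or drop the row is a modelling decision; the point is only that the lag now refers to the calendar day it claims to.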

Rolling (window) features

Rolling statistics summarise a window of past values. The window must cover the past only, ending at t-1: either pass closed="left" to rolling() or shift by one step before rolling.

# Correct: shift first, then roll — window is [t-w, t-1]
df["roll_mean_7"] = df["y"].shift(1).rolling(7).mean()
df["roll_std_7"]  = df["y"].shift(1).rolling(7).std()
df["roll_max_7"]  = df["y"].shift(1).rolling(7).max()

Without the shift(1), the rolling window at row t includes the current value y[t], which is the target — a direct label leak.
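A quick way to convince yourself of this is to perturb the target at one row and see which features move. The sketch below does that on synthetic data and also checks that shift(1).rolling(w) and rolling(w, closed="left") yield the same values. The series, the helper name rolling_means, and the perturbation size are assumptions made for illustration; closed="left" on a fixed integer window assumes a reasonably recent pandas (1.2 or later).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
y = pd.Series(rng.normal(100.0, 10.0, size=30))  # synthetic target

def rolling_means(series, window=7):
    """Return (leaky, shifted, closed_left) rolling means of a series."""
    leaky = series.rolling(window).mean()                        # window ends at t
    shifted = series.shift(1).rolling(window).mean()             # window ends at t-1
    closed_left = series.rolling(window, closed="left").mean()   # window ends at t-1
    return leaky, shifted, closed_left

leaky, shifted, closed_left = rolling_means(y)

# The two safe constructions agree, including where they are NaN.
same = bool(np.allclose(shifted, closed_left, equal_nan=True))

# Change the target at the last row only, then recompute the features.
y_changed = y.copy()
y_changed.iloc[-1] += 1000.0
leaky_2, shifted_2, _ = rolling_means(y_changed)

leaky_moved = bool(leaky_2.iloc[-1] != leaky.iloc[-1])     # feature saw y[t]
safe_moved = bool(shifted_2.iloc[-1] != shifted.iloc[-1])  # feature ignored y[t]
print(same, leaky_moved, safe_moved)
```

A feature that changes when only the current target changes cannot be available at prediction time, which makes this perturbation check a cheap unit test for any feature pipeline.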

Leakage taxonomy for time series features

| Feature type | Leak risk | Safe construction |
| --- | --- | --- |
| lag_k (k ≥ 1) | None if shift is applied and k is at least the forecast horizon | df["y"].shift(k) |
| rolling mean without shift | Direct label leak | df["y"].shift(1).rolling(w).mean() |
| future exogenous (e.g., tomorrow’s weather) | Look-ahead leak | Only use if genuinely available at forecast time |
| target-encoded group stats on full data | Cross-row leak | Compute inside each fold, from that fold’s training rows only |
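The last row of the table is the easiest to get wrong, so here is a minimal sketch of fold-local target encoding under a time-ordered split. The toy data, the column names, and the helper add_store_mean are hypothetical; a real pipeline would repeat this for every fold.

```python
import pandas as pd

# Hypothetical data: daily sales for two stores, in chronological order.
df = pd.DataFrame({
    "day":   [1, 1, 2, 2, 3, 3, 4, 4],
    "store": ["a", "b", "a", "b", "a", "b", "a", "b"],
    "y":     [10.0, 50.0, 12.0, 54.0, 14.0, 58.0, 100.0, 200.0],
})

def add_store_mean(frame, cutoff_day):
    """Encode each store by its mean target over days <= cutoff_day only."""
    train = frame[frame["day"] <= cutoff_day]
    valid = frame[frame["day"] > cutoff_day].copy()
    store_mean = train.groupby("store")["y"].mean()
    valid["store_mean"] = valid["store"].map(store_mean)
    return valid

# One fold: train on days 1-3, validate on day 4.
valid = add_store_mean(df, cutoff_day=3)

# Leaky alternative for comparison: the mean is taken over all days,
# so the day-4 targets (100, 200) feed into their own feature.
leaky_mean = df.groupby("store")["y"].mean()

print(valid[["store", "store_mean"]])
print(leaky_mean)
```

In this toy data the fold-local encodings are 12 and 54, while the full-data means are pulled up to 34 and 90.5 by the very validation targets the model is supposed to predict.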

Choosing which lags to include

Start with lag 1, lag 7, lag 14 for daily data (recent memory + weekly pattern). Use the PACF of the target to identify statistically significant lags. For tree-based models (XGBoost, LightGBM), feature importance and SHAP values can prune uninformative lag columns afterward.
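To make the PACF step concrete, the sketch below estimates partial autocorrelations with plain NumPy by regressing y[t] on its first k lags and reading off the last coefficient, then flags lags outside the usual large-sample band of ±1.96/√n. In practice a library routine (for example the pacf function in statsmodels) would normally be used; the helper pacf_ols and the simulated weekly series are assumptions for illustration only.

```python
import numpy as np

def pacf_ols(y, max_lag):
    """Estimate partial autocorrelations by regressing y[t] on its first k lags.

    The PACF at lag k is taken as the coefficient on y[t-k] in that regression.
    """
    y = np.asarray(y, dtype=float)
    y = y - y.mean()
    out = []
    for k in range(1, max_lag + 1):
        # Columns are y[t-1], ..., y[t-k] for t = k .. n-1.
        X = np.column_stack([y[k - j : len(y) - j] for j in range(1, k + 1)])
        coef, *_ = np.linalg.lstsq(X, y[k:], rcond=None)
        out.append(coef[-1])
    return np.array(out)

# Simulated daily series with a weekly dependence: y[t] = 0.8 * y[t-7] + noise.
rng = np.random.default_rng(42)
n = 2000
noise = rng.normal(size=n)
y = np.zeros(n)
for t in range(n):
    y[t] = noise[t] + (0.8 * y[t - 7] if t >= 7 else 0.0)

pacf = pacf_ols(y, max_lag=14)
band = 1.96 / np.sqrt(n)  # approximate 95% band under a white-noise null
significant = [k for k, v in enumerate(pacf, start=1) if abs(v) > band]
print(significant)
```

With roughly 5% of lags expected to cross the band by chance, treat the flagged list as candidates to validate with a time-ordered backtest rather than as a final feature set.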
