Time Series | Medium | Asked at Uber, Airbnb, Amazon, DoorDash

How do you engineer lag and rolling features for a time series ML model, and what leakage risks arise?

For: Data Scientist, ML Engineer, Data Analyst, AI / LLM Engineer

The short answer

Lag features shift the target or exogenous variables backward by k steps so the model sees past values as inputs; rolling features (rolling mean, std, min, max) summarise a window of past observations. Both must be computed strictly on past data — any feature that incorporates information from the current or future rows leaks the target and inflates metrics.

How to think about it

Cover the mechanics of creating lag/rolling features correctly, the leakage failure mode, and how to prevent it. This is one of the most practically important time series questions for ML roles.

Lag features

A lag-k feature for target y at time t is simply y[t-k]. It gives the model direct access to past observations as predictors.

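import numpy as np
import pandas as pd

# sales_series: daily sales sorted by date; synthetic placeholder so the snippet runs
sales_series = pd.Series(np.arange(60.0), index=pd.date_range("2024-01-01", periods=60))
df = pd.DataFrame({"y": sales_series})

# Lags: shift(k) moves values down k rows, so row t holds y[t-k]
for k in [1, 2, 3, 7, 14]:
    df[f"lag_{k}"] = df["y"].shift(k)

# Drop the warm-up rows where the longest lag is still undefined
df.dropna(inplace=True)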

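After shift(k), row t contains the value from t-k in its lag column, so no future information is present. The dropna removes the first 14 rows, where the longest lag (14) does not exist yet. Note that shift is positional: it assumes the rows are sorted by time and evenly spaced, so sort the data and fill any missing dates before shifting.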
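A quick way to check the "no future information" claim is a perturbation test: change the target from time t onward and confirm that the features for rows up to t stay the same. The following is a minimal sketch on synthetic data; the helper name make_lags, the cut point t = 30, and the random series are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

def make_lags(y: pd.Series, lags=(1, 2, 3, 7, 14)) -> pd.DataFrame:
    """Build lag columns with shift(k); hypothetical helper for this sketch."""
    return pd.DataFrame({f"lag_{k}": y.shift(k) for k in lags})

# Synthetic daily series (assumption: stands in for real sales data)
rng = np.random.default_rng(0)
y = pd.Series(
    rng.normal(size=50),
    index=pd.date_range("2024-01-01", periods=50, freq="D"),
)

t = 30
y_perturbed = y.copy()
y_perturbed.iloc[t:] += 100.0  # alter the current row and everything after it

# Features for rows 0..t must not change if they only use values before t
safe_unchanged = make_lags(y).iloc[: t + 1].equals(
    make_lags(y_perturbed).iloc[: t + 1]
)

# A deliberately leaky "lag 0" feature is just y[t], so it does change
leaky_unchanged = make_lags(y, lags=(0,)).iloc[: t + 1].equals(
    make_lags(y_perturbed, lags=(0,)).iloc[: t + 1]
)

print("lag features unchanged:", safe_unchanged)
print("lag-0 feature unchanged:", leaky_unchanged)
```

The same check can be pointed at any feature function: if the features at or before row t move when only y[t] and later values are altered, the construction is using information it should not have.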

Rolling (window) features

Rolling statistics summarise a window of past values. The window must be anchored to the past only — use closed="left" or manually shift before rolling.

# Correct: shift first, then roll — window is [t-w, t-1]
df["roll_mean_7"] = df["y"].shift(1).rolling(7).mean()
df["roll_std_7"]  = df["y"].shift(1).rolling(7).std()
df["roll_max_7"]  = df["y"].shift(1).rolling(7).max()

Without the shift(1), the rolling window at row t includes the current value y[t], which is the target — a direct label leak.
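The two safe constructions mentioned above (shift then roll, or closed="left") can be compared side by side with the leaky version. This is a small sketch on toy data where y[t] = t, so the window contents are easy to verify by hand; it assumes a pandas version that accepts closed on fixed-size windows (1.2 or later).

```python
import numpy as np
import pandas as pd

# Toy series (assumption): y[t] = t, so each window mean is easy to check
y = pd.Series(np.arange(20, dtype=float))

# Safe: shift first, then roll. Window at row t covers y[t-7] .. y[t-1]
shifted = y.shift(1).rolling(7).mean()

# Safe: same window expressed with closed="left", which excludes row t
closed_left = y.rolling(7, closed="left").mean()

# Leaky: default window at row t covers y[t-6] .. y[t], including the target
leaky = y.rolling(7).mean()

# The two safe variants agree everywhere, including the NaN warm-up rows
same = np.allclose(shifted, closed_left, equal_nan=True)

# At t = 10 the safe mean uses y[3..9]; the leaky mean uses y[4..10]
safe_at_10 = shifted.iloc[10]
leaky_at_10 = leaky.iloc[10]
print(same, safe_at_10, leaky_at_10)
```

Both safe variants leave the first seven rows as NaN, because a full window of seven strictly past values only exists from row 7 onward.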

Leakage taxonomy for time series features

Feature type	Leak risk	Safe construction
lag_k (k ≥ 1)	None if shift is applied	df["y"].shift(k)
rolling mean without shift	Direct label leak	df["y"].shift(1).rolling(w).mean()
future exogenous (e.g., tomorrow’s weather)	Look-ahead leak	Only use if genuinely available at forecast time
target-encoded group stats on full data	Cross-row leak	Compute inside each fold
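For the last row of the table, "compute inside each fold" can be sketched as follows: group means are learned from the training rows only and then mapped onto the validation rows. The column names (date, store, y), the cutoff date, the toy values, and the helper encode_fold are all assumptions for illustration.

```python
import pandas as pd

# Toy panel (assumption): daily rows for a few stores
df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=8, freq="D"),
    "store": ["a", "b", "a", "b", "a", "b", "a", "c"],
    "y": [10.0, 20.0, 12.0, 22.0, 14.0, 24.0, 100.0, 50.0],
})

def encode_fold(train, valid, group_col="store", target_col="y"):
    """Map training-only group means onto validation rows (hypothetical helper)."""
    group_means = train.groupby(group_col)[target_col].mean()
    fallback = train[target_col].mean()  # for groups unseen in training
    return valid[group_col].map(group_means).fillna(fallback)

# Time-ordered split: everything before the cutoff is training data
cutoff = pd.Timestamp("2024-01-07")
train = df[df["date"] < cutoff]
valid = df[df["date"] >= cutoff]

valid_encoded = encode_fold(train, valid)

# Leaky alternative for comparison: group means over the full data,
# which lets store "a"'s validation value of 100 into its own feature
leaky_encoded = df.groupby("store")["y"].transform("mean")[valid.index]

print(valid_encoded.tolist())
print(leaky_encoded.tolist())
```

In this toy data the fold-safe encoding for store "a" is the mean of its three training values, while the full-data version is pulled upward by the validation target itself. Store "c" never appears in training, so it falls back to the overall training mean.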

Choosing which lags to include

For daily data, start with lags 1, 7, and 14 (recent memory plus the weekly pattern). Use the partial autocorrelation function (PACF) of the target to identify statistically significant lags. For tree-based models (XGBoost, LightGBM), feature importance and SHAP values can then guide pruning of uninformative lag columns.
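One way to screen lags with the PACF, without extra dependencies, is the regression definition: the partial autocorrelation at lag k is the last coefficient of an ordinary least squares fit of y[t] on y[t-1], ..., y[t-k]. The sketch below assumes NumPy only; the helper name pacf_ols and the synthetic AR(1) series are illustrative, and the 1.96/sqrt(n) band is the usual large-sample approximation rather than an exact test. Libraries such as statsmodels provide a ready-made pacf function that can be used instead.

```python
import numpy as np

def pacf_ols(y, max_lag):
    """Estimate the PACF by OLS (hypothetical helper for this sketch).

    For each k, regress y[t] on y[t-1..t-k] and keep the coefficient
    on y[t-k], which is the lag-k partial autocorrelation estimate.
    """
    y = np.asarray(y, dtype=float)
    y = y - y.mean()
    n = len(y)
    values = []
    for k in range(1, max_lag + 1):
        # Column j holds y[t-j] for t = k .. n-1
        X = np.column_stack([y[k - j : n - j] for j in range(1, k + 1)])
        target = y[k:]
        coefs, *_ = np.linalg.lstsq(X, target, rcond=None)
        values.append(coefs[-1])
    return np.array(values)

# Synthetic AR(1) series with coefficient 0.7 (assumption: stand-in for the target)
rng = np.random.default_rng(0)
n = 2000
noise = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.7 * y[t - 1] + noise[t]

pacf_vals = pacf_ols(y, max_lag=14)

# Approximate 95% band under the null of no partial autocorrelation
threshold = 1.96 / np.sqrt(n)
significant_lags = [
    k for k, v in enumerate(pacf_vals, start=1) if abs(v) > threshold
]
print(significant_lags)
```

Lags that clear the band are candidates for lag features; because roughly 1 in 20 lags will cross it by chance, it is worth confirming the selection on a time-ordered validation split.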
