How do you engineer lag and rolling features for a time series ML model, and what leakage risks arise?

Lag features shift the target or exogenous variables backward by k steps so the model sees past values as inputs; rolling features (rolling mean, std, min, max) summarise a window of past observations. Both must be computed strictly on past data — any feature that incorporates information from the current or future rows leaks the target and inflates metrics.

How do you extract useful features from datetime columns for a machine learning model?

Raw timestamps are meaningless to most models. Useful features extracted from a datetime column include calendar components (hour, day of week, month, quarter, year), cyclical encodings of periodic components (sin/cos of hour or day-of-week), lag and rolling-window aggregates, time-since-event features, and business-calendar flags like is_weekend or is_holiday.
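
The cyclical encoding mentioned above can be sketched as follows. This is a minimal illustration, assuming a hypothetical 14-day daily index and a frame named `feat`; the point is that sin/cos places Sunday (6) next to Monday (0), which the raw integer does not.

```python
import numpy as np
import pandas as pd

# Hypothetical 14-day daily index starting on a Monday (illustration only)
dates = pd.date_range("2024-01-01", periods=14, freq="D")
feat = pd.DataFrame(index=dates)

# Raw calendar component: 0=Monday ... 6=Sunday
feat["dow"] = feat.index.day_of_week

# Cyclical encoding: map the 7-day cycle onto a unit circle
angle = 2 * np.pi * feat["dow"] / 7
feat["dow_sin"] = np.sin(angle)
feat["dow_cos"] = np.cos(angle)

print(feat.head(7))
```

On the circle, the distance from Sunday to Monday equals the distance from Monday to Tuesday, whereas the raw integers put Sunday six units away from Monday.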

Which models require feature scaling and which don't, and why?

Distance-based, variance-based, and gradient-based models (KNN, K-means, SVM, PCA, regularized linear/logistic regression, neural networks) need scaling because they are sensitive to feature magnitudes. Tree-based models (decision trees, random forests, gradient boosting) are unaffected by monotonic rescaling of a feature because they split on per-feature thresholds, so only the ordering of values matters. Standardization and min-max scaling are the usual choices, fit on training data only.
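
"Fit on training data only" can be sketched with plain NumPy. The matrix `X` and the 4/2 split are hypothetical values chosen for illustration; a library scaler would follow the same pattern of learning statistics on the training slice and reusing them on the test slice.

```python
import numpy as np

# Hypothetical feature matrix, rows in time order, columns on very different scales
X = np.array([
    [1.0, 1000.0],
    [2.0, 1500.0],
    [3.0, 1200.0],
    [4.0, 1800.0],
    [5.0, 2100.0],
    [6.0, 2500.0],
])
X_train, X_test = X[:4], X[4:]

# Learn the scaling statistics from the training rows only
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# Apply the SAME statistics to both slices
X_train_s = (X_train - mu) / sigma
X_test_s = (X_test - mu) / sigma

print(X_train_s.mean(axis=0), X_test_s.mean(axis=0))
```

The scaled test slice will generally not have zero mean. That is expected: recomputing the statistics on the test rows would let test information influence preprocessing.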

What is feature engineering, and can you walk through how you'd engineer features to improve a model?

Feature engineering is creating, transforming, or selecting input variables so a model can capture patterns more easily. Common techniques include scaling, encoding categoricals, binning, interaction and ratio features, date/time decomposition, and domain-derived aggregates. It often matters more than the choice of algorithm because models can only learn from the signal present in their inputs.

Lag & rolling features — Time Series

Every model so far (ARIMA, SARIMA, ETS, Prophet) was purpose-built for time series. The last lesson ended on the obvious itch: what about the general-purpose regressors (XGBoost, LightGBM) that win so many tabular competitions? They can’t read a raw series; they need a flat (X, y) table. This lesson is the bridge that builds it.

Why bother? The gradient-boosting toolbox

ARIMA and its relatives are purpose-built for time series, but they are parametric models with a narrow hypothesis class. Gradient-boosted trees — XGBoost, LightGBM, CatBoost — handle non-linear interactions, mixed feature types, and large datasets gracefully. The catch is that they expect a plain (X, y) table with i.i.d. rows. Feature engineering is the bridge that gets you there.

The core idea: one row per target time step

Suppose you have a daily series y_0, y_1, ..., y_T. To predict y_t you construct a row whose columns are derived exclusively from times before t. You stack these rows into a feature matrix X and a target vector y, then train any regressor.

Figure (sliding window, one-step-ahead forecast): a window of width 3 slides over the series. Each position yields one feature row (lag_3, lag_2, lag_1) and one target (the next step).
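
The width-3 sliding window can be sketched in a few lines. This is an illustration, not the lesson's own implementation: it assumes NumPy 1.20 or later for `sliding_window_view`, and the short array `y` simply reuses the first seven values of the sales series from the runnable example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# First seven values of the lesson's sales series
y = np.array([120, 135, 128, 142, 150, 148, 130])
width = 3

# All windows of `width` consecutive values; the last window is dropped
# because there is no "next step" after it to use as a target
X = sliding_window_view(y, width)[:-1]   # columns: lag_3, lag_2, lag_1
target = y[width:]                       # the value right after each window

for row, t in zip(X, target):
    print(row, "->", t)
```

The first row is (120, 135, 128) with target 142: three past values in, the next value out.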

Lag features

A lag feature is simply the value of the series at some earlier time. For a series y stored in a pandas Series or DataFrame column, .shift(k) shifts the values down by k rows so that row t holds y[t-k].

Common lag choices:

Lag 1 — yesterday’s value; captures immediate autocorrelation.
Lag 7 — same day last week; captures weekly seasonality.
Lag 365 — same day last year; captures annual seasonality.

You are not limited to these; domain knowledge guides which lags matter.
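
When domain knowledge is thin, autocorrelation at candidate lags is one way to shortlist them. The sketch below uses a synthetic series (a hypothetical weekly pattern repeated for eight weeks plus small noise) and `Series.autocorr`; the candidate lags are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical synthetic data: one weekly pattern tiled 8 times, plus small noise
rng = np.random.default_rng(0)
weekly = np.array([100, 105, 110, 108, 120, 150, 140], dtype=float)
y = pd.Series(
    np.tile(weekly, 8) + rng.normal(0, 2, size=56),
    index=pd.date_range("2024-01-01", periods=56, freq="D"),
)

# Correlation between the series and a copy of itself shifted by k days
acf = {k: y.autocorr(lag=k) for k in (1, 3, 7, 14)}
for k, r in acf.items():
    print(f"lag {k:>2}: {r:+.3f}")
```

For a series with a strong weekly cycle, the lag-7 value should stand well above lag 1 and lag 3, which is the signal that a lag_7 feature is worth adding.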

Rolling-window statistics

A rolling window computes an aggregate (mean, std, min, max) over the most recent k observations. This gives the model a sense of local trend and volatility.

In pandas, .rolling(k) looks at the current row and the k-1 rows before it — meaning, without a shift, it includes the current target in the window. That is leakage (covered below). The safe pattern is .shift(1).rolling(k): shift the series by one step first, so the window sits entirely in the past relative to each row’s target.
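
The difference between the two patterns is easy to see on a tiny series. This is a minimal sketch with five hypothetical values; `leaky` and `safe` are illustrative names.

```python
import pandas as pd

y = pd.Series([120, 135, 128, 142, 150], name="y")

# Unshifted: the window at row t covers y[t-2], y[t-1], y[t] (includes the target)
leaky = y.rolling(3).mean()

# Shifted first: the window at row t covers y[t-3], y[t-2], y[t-1] (past only)
safe = y.shift(1).rolling(3).mean()

print(pd.DataFrame({"y": y, "leaky": leaky, "safe": safe}))
```

At the last row, `leaky` averages 128, 142 and 150 (the target itself is in the window), while `safe` averages 135, 128 and 142. The cost of safety is one extra NaN row at the start.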

Expanding windows

An expanding window (.expanding()) grows from the first observation to the current row, computing a cumulative statistic. It is useful for stable long-run means or medians, particularly when the series is stationary. The same shift rule applies: use .shift(1).expanding() so the statistic for each row excludes that row’s own target.

Calendar features

Calendar features come for free from the timestamp index. Common ones:

day_of_week (0=Monday, 6=Sunday)
month
is_weekend (bool)
is_holiday (bool, from a holiday calendar)

They capture seasonality that lags alone may not fully represent. A tree model can learn that Sundays in December behave differently from Sundays in July without you specifying the interaction explicitly.
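
A sketch of the four calendar features, including the holiday flag. The `holidays` list here is hypothetical and hand-written for illustration; in practice it would come from a holiday calendar for the relevant region.

```python
import pandas as pd

# One week of daily timestamps; 2024-12-23 is a Monday
dates = pd.date_range("2024-12-23", periods=7, freq="D")
cal = pd.DataFrame(index=dates)

# Hypothetical holiday list (assumption, not a real calendar lookup)
holidays = pd.to_datetime(["2024-12-25", "2024-12-26"])

cal["day_of_week"] = cal.index.day_of_week      # 0=Monday, 6=Sunday
cal["month"] = cal.index.month
cal["is_weekend"] = cal.index.day_of_week >= 5  # Saturday or Sunday
cal["is_holiday"] = cal.index.isin(holidays)

print(cal)
```

None of these columns depend on the target, so they need no shifting: the calendar for a future date is known in advance.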

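The critical rule: no leakage

A feature in the row for time t must be computable from data available strictly before t; it must never touch y[t] or any later value. In practice that means three habits: call .shift(1) before any .rolling() or .expanding() aggregate on the target, never use centered windows (they reach into future rows), and split train and test in time order rather than shuffling. A leaky feature inflates validation metrics and then fails in production, where the value it relied on does not exist yet.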
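
One way to check a feature for leakage is to change a future value and see whether earlier feature rows move. The sketch below applies that test to a centered window and to the shifted trailing window; the five-value series and the replacement value 999 are hypothetical.

```python
import pandas as pd

y = pd.Series([120, 135, 128, 142, 150], name="y")

# Same series, but with the LAST observation changed
y_changed = y.copy()
y_changed.iloc[-1] = 999

# Centered window: row t averages y[t-1], y[t], y[t+1]
centered_before = y.rolling(3, center=True).mean()
centered_after = y_changed.rolling(3, center=True).mean()

# Shifted trailing window: row t averages y[t-3], y[t-2], y[t-1]
trailing_before = y.shift(1).rolling(3).mean()
trailing_after = y_changed.shift(1).rolling(3).mean()

print("centered row 3:", centered_before.iloc[3], "->", centered_after.iloc[3])
print("trailing unchanged:", trailing_before.equals(trailing_after))
```

The centered feature at row 3 changes when row 4 changes, so it depends on the future. The shifted trailing feature is identical in both runs, which is the behaviour a leak-free feature should show.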

Build a feature table: runnable example

import pandas as pd

# Small deterministic daily series (hard-coded, so no random seed is needed)
dates = pd.date_range("2024-01-01", periods=14, freq="D")
y = pd.Series(
    [120, 135, 128, 142, 150, 148, 130, 125, 138, 145, 152, 160, 155, 162],
    index=dates,
    name="sales",
)

df = pd.DataFrame({"y": y})

# --- Lag features ---
df["lag_1"]   = df["y"].shift(1)   # yesterday
df["lag_7"]   = df["y"].shift(7)   # same day last week

# --- Rolling features (shift FIRST to avoid leakage) ---
shifted = df["y"].shift(1)
df["roll_mean_3"] = shifted.rolling(3).mean()
df["roll_std_3"]  = shifted.rolling(3).std()
df["roll_max_3"]  = shifted.rolling(3).max()

# --- Expanding mean (also shifted) ---
df["expand_mean"] = shifted.expanding().mean()

# --- Calendar features ---
df["day_of_week"] = df.index.day_of_week   # 0=Mon
df["month"]       = df.index.month
df["is_weekend"]  = (df.index.day_of_week >= 5).astype(int)

# Drop rows where lags are NaN (not enough history)
df_clean = df.dropna()

print("Feature table (first 5 rows after dropping NaN):")
print(df_clean.head().to_string())
print(f"\nShape: {df_clean.shape}  (rows x columns)")
print("\nColumn list:", df_clean.columns.tolist())

Feature table (first 5 rows after dropping NaN):
              y  lag_1  lag_7  roll_mean_3  roll_std_3  roll_max_3  expand_mean  day_of_week  month  is_weekend
2024-01-08  125  130.0  120.0   142.666667   11.015141       150.0   136.142857            0      1           0
2024-01-09  138  125.0  135.0   134.333333   12.096832       148.0   134.750000            1      1           0
2024-01-10  145  138.0  128.0   131.000000    6.557439       138.0   135.111111            2      1           0
2024-01-11  152  145.0  142.0   136.000000   10.148892       145.0   136.100000            3      1           0
2024-01-12  160  152.0  150.0   145.000000    7.000000       152.0   137.545455            4      1           0

Shape: (7, 10)  (rows x columns)

Column list: ['y', 'lag_1', 'lag_7', 'roll_mean_3', 'roll_std_3', 'roll_max_3', 'expand_mean', 'day_of_week', 'month', 'is_weekend']

The table shows each day as a row. The y column is the target; every other column is a feature built from past values only — note lag_1 on 2024-01-08 holds 130 (the previous day), and roll_mean_3 averages the three days before each row, never the row itself. After dropping rows with insufficient history (the first 7 days are lost to lag_7, leaving 7 rows), you have a clean (X, y) table. Pass X = df_clean.drop(columns="y") and y = df_clean["y"] to any scikit-learn-compatible estimator.

Handing the table to a model (static overview)

In production you would:

Split with a time-based cutoff — never shuffle (see the prerequisite lesson).
Fit XGBRegressor or LGBMRegressor on the training slice.
At inference time, construct the feature row for the next timestamp using only the history available up to that moment, then call .predict().
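
The three steps above can be sketched end to end. To stay dependency-free, ordinary least squares via `np.linalg.lstsq` stands in for XGBRegressor or LGBMRegressor; that swap, the cutoff date, and the helper `design` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2024-01-01", periods=14, freq="D")
y = pd.Series(
    [120, 135, 128, 142, 150, 148, 130, 125, 138, 145, 152, 160, 155, 162],
    index=dates, name="y",
)

df = pd.DataFrame({"y": y})
df["lag_1"] = df["y"].shift(1)
df["roll_mean_3"] = df["y"].shift(1).rolling(3).mean()
df = df.dropna()

# 1. Time-based cutoff: earlier rows train, later rows test, no shuffling
cutoff = pd.Timestamp("2024-01-11")   # hypothetical cutoff date
train, test = df[df.index < cutoff], df[df.index >= cutoff]

def design(frame):
    """Feature matrix with an intercept column (hypothetical helper)."""
    X = frame[["lag_1", "roll_mean_3"]].to_numpy(dtype=float)
    return np.column_stack([np.ones(len(X)), X])

# 2. Fit on the training slice (least squares as a stand-in for a boosted model)
coef, *_ = np.linalg.lstsq(design(train), train["y"].to_numpy(dtype=float), rcond=None)
mae = np.mean(np.abs(design(test) @ coef - test["y"].to_numpy()))

# 3. Inference: build the next day's feature row from history only, then predict
next_day = y.index[-1] + pd.Timedelta(days=1)
next_row = np.array([1.0, y.iloc[-1], y.iloc[-3:].mean()])
next_pred = float(next_row @ coef)

print(f"test MAE: {mae:.2f}")
print(f"forecast for {next_day.date()}: {next_pred:.1f}")
```

With a real boosted model, only step 2 changes: the fit call and the predict call replace the least-squares lines, while the split and the feature-row construction stay the same.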

Multi-step forecasting (predicting several steps ahead) requires iteratively appending each prediction to the history before computing the next row’s features — or training separate models for each horizon.
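
The iterative (recursive) strategy can be sketched as a loop. `predict_one` is a hypothetical stand-in for a fitted model's predict call: it is a fixed blend of two features, not a trained model, and exists only so the loop runs on its own.

```python
import numpy as np

history = [120, 135, 128, 142, 150, 148, 130, 125, 138, 145, 152, 160, 155, 162]

def predict_one(lag_1, roll_mean_3):
    """Hypothetical stand-in for model.predict on a single feature row."""
    return 0.5 * lag_1 + 0.5 * roll_mean_3

horizon = 3
extended = list(history)   # copy, so the observed history stays untouched
forecasts = []

for _ in range(horizon):
    # Features come from everything known so far, including earlier predictions
    lag_1 = extended[-1]
    roll_mean_3 = float(np.mean(extended[-3:]))
    y_hat = predict_one(lag_1, roll_mean_3)
    forecasts.append(y_hat)
    extended.append(y_hat)   # the prediction becomes "history" for the next step

print(forecasts)
```

From the second step on, the features are built partly from predictions rather than observations, so errors can compound with the horizon. That is the usual argument for the alternative of one model per horizon.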

What you can do now

You have a repeatable recipe:

Choose lag distances guided by domain knowledge and autocorrelation plots.
Choose rolling window widths (short for local trends, long for slow seasonality).
Add calendar features from the index.
Always .shift(1) before any aggregate that touches the target column.
Drop NaN rows, split in time order, and train any tabular model.
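
The recipe can be wrapped in one reusable function. This is a sketch under stated assumptions: `make_features`, its parameter names, and its defaults are hypothetical, and it expects a Series with a DatetimeIndex.

```python
import pandas as pd

def make_features(y, lags=(1, 7), windows=(3,)):
    """Build a lag / rolling / calendar feature table (hypothetical helper)."""
    df = pd.DataFrame({"y": y})

    # Lag features
    for k in lags:
        df[f"lag_{k}"] = y.shift(k)

    # Rolling features: shift first so every window lies strictly in the past
    past = y.shift(1)
    for w in windows:
        df[f"roll_mean_{w}"] = past.rolling(w).mean()
        df[f"roll_std_{w}"] = past.rolling(w).std()

    # Calendar features from the index
    df["day_of_week"] = df.index.day_of_week
    df["month"] = df.index.month
    df["is_weekend"] = (df.index.day_of_week >= 5).astype(int)

    # Drop warm-up rows that lack enough history
    return df.dropna()

dates = pd.date_range("2024-01-01", periods=14, freq="D")
y = pd.Series(
    [120, 135, 128, 142, 150, 148, 130, 125, 138, 145, 152, 160, 155, 162],
    index=dates, name="sales",
)

table = make_features(y)
X, target = table.drop(columns="y"), table["y"]
print(table.shape)
```

Keeping the shift inside the function means callers cannot forget it, and the same function can rebuild features at inference time from whatever history is available.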

In one breath

Reframe forecasting as supervised learning: each row’s target is the next value, and its features are built only from the past, so any tabular model (XGBoost, LightGBM, linear) can forecast. The toolkit is three families — lag features (.shift(k): lag 1 = yesterday, lag 7 = last week, lag 365 = last year), rolling-window stats (mean/std/min/max over a trailing window for local trend and volatility), and calendar features (day-of-week, month, is_weekend, is_holiday) free from the index. The one rule that governs everything is no leakage: a feature at time t must never touch y[t] or later — so always .shift(1) before any .rolling()/.expanding(), never use centered windows, and split in time order, never shuffle. Build the table, drop the NaN warm-up rows, and hand it to any regressor.

Practice

Quick check

Q1. You compute `df['y'].rolling(7).mean()` without shifting and use it as a feature to predict `df['y']`. What is the problem?

Q2. A daily retail series shows strong weekly patterns. Which pair of lag features best captures this?

Q3. You train an XGBoost model on a feature table built from lag and rolling features and it scores well on the held-out test set. A colleague suggests adding the 7-day centered rolling mean (window centered on the current day) to capture smoother trends. Should you add it?

A question to carry forward

You now have an embarrassment of riches: ARIMA, SARIMA, ETS, Prophet, and now any tabular model fed a lag-and-rolling feature table — five different ways to forecast the same series. But that abundance hides a sharper question. Which one is actually best? And notice how easy it’s been, this whole section, to fool yourself: a shuffled split here, an unshifted rolling mean there, and validation scores look brilliant while production collapses.

So the final question of the time-series section is the one that decides whether any of this works: how do you measure a forecaster honestly — choosing the right error metric and a validation scheme that never lets the future leak into the past? The last lesson, evaluating forecasts (walk-forward), covers MAE vs RMSE vs MAPE, why ordinary k-fold is forbidden, and the rolling walk-forward backtest that mirrors how your model will really be used.
