datarekha

Lag & rolling features

How to turn a time series into a plain feature table so any regressor — XGBoost, LightGBM, linear models — can forecast the next value.

9 min read · Intermediate · Time Series · Lesson 13 of 14

What you'll learn

  • Lag features: encoding the recent past as plain numeric columns
  • Rolling-window stats: mean, std, min, max over a trailing window — shifted to prevent leakage
  • Calendar features: day-of-week, month, is_weekend, is_holiday as free signal

Before you start

Why bother? The gradient-boosting toolbox

ARIMA and its relatives are purpose-built for time series, but they are parametric models with a narrow hypothesis class. Gradient-boosted trees — XGBoost, LightGBM, CatBoost — handle non-linear interactions, mixed feature types, and large datasets gracefully. The catch is that they expect a plain (X, y) table with i.i.d. rows. Feature engineering is the bridge that gets you there.

The core idea: one row per target time step

Suppose you have a daily series y_0, y_1, ..., y_T. To predict y_t you construct a row whose columns are derived exclusively from times before t. You stack these rows into a feature matrix X and a target vector y, then train any regressor.

The diagram below shows the mechanics for a window of size 3 producing a one-step-ahead forecast.

Diagram: time series y₀ y₁ y₂ y₃ y₄ y₅, with y₅ as the target and the window covering y₂, y₃, y₄. Feature row for t=5: lag_3=y₂, lag_2=y₃, lag_1=y₄ → predict y₅. The sliding window moves one step right for each new row.

A window of width 3 slides over the series. Each position yields one feature row (lag_3, lag_2, lag_1) and one target (the next step).
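
The stacking step can be sketched in a few lines of NumPy. This is a minimal illustration, not the lesson's own code: the helper name make_windows and the toy values are assumptions chosen to mirror the diagram (six observations, window of width 3).

```python
import numpy as np

def make_windows(series, width=3):
    """Stack one feature row per target: `width` past values -> the next value."""
    values = np.asarray(series, dtype=float)
    rows, targets = [], []
    for t in range(width, len(values)):
        rows.append(values[t - width:t])  # y[t-width] ... y[t-1], all strictly before t
        targets.append(values[t])         # the value the row is meant to predict
    return np.array(rows), np.array(targets)

series = [10, 11, 13, 12, 15, 18]  # toy values standing in for y_0 ... y_5
X, y = make_windows(series, width=3)

print(X)  # columns read left to right as lag_3, lag_2, lag_1
print(y)
```

The last row of X holds (y₂, y₃, y₄) and its target is y₅, which matches the feature row in the diagram.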

Lag features

A lag feature is simply the value of the series at some earlier time. For a series y stored in a pandas Series or DataFrame column, .shift(k) shifts the values down by k rows so that row t holds y[t-k].

Common lag choices:

  • Lag 1 — yesterday’s value; captures immediate autocorrelation.
  • Lag 7 — same day last week; captures weekly seasonality.
  • Lag 365 — same day last year; captures annual seasonality.

You are not limited to these; domain knowledge guides which lags matter.
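
As a small sketch of .shift(k) in practice, assuming a made-up ten-day series and the hypothetical column names lag_1 and lag_7:

```python
import pandas as pd

# Ten days of made-up data: y = 100, 101, ..., 109.
idx = pd.date_range("2024-01-01", periods=10, freq="D")
df = pd.DataFrame({"y": range(100, 110)}, index=idx)

# Row t of lag_k holds y[t-k]; the first k rows have no history and become NaN.
for k in (1, 7):
    df[f"lag_{k}"] = df["y"].shift(k)

print(df)
```

The NaN rows at the top are the price of each lag: a lag of k costs k rows of usable training data.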

Rolling-window statistics

A rolling window computes an aggregate (mean, std, min, max) over the most recent k observations. This gives the model a sense of local trend and volatility.

In pandas, .rolling(k) looks at the current row and the k-1 rows before it — meaning, without a shift, it includes the current target in the window. That is leakage (covered below). The safe pattern is .shift(1).rolling(k): shift the series by one step first, so the window sits entirely in the past relative to each row’s target.
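
A minimal sketch of the shift-then-roll pattern, assuming a toy series and illustrative column names (roll_mean_3 and friends are not from the lesson):

```python
import pandas as pd

idx = pd.date_range("2024-01-01", periods=8, freq="D")
df = pd.DataFrame({"y": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]}, index=idx)

# Shift first, so the window for row t ends at t-1 and never sees y[t].
past = df["y"].shift(1)
window = past.rolling(3)

df["roll_mean_3"] = window.mean()
df["roll_std_3"] = window.std()
df["roll_min_3"] = window.min()
df["roll_max_3"] = window.max()

print(df)
```

On the fourth row (y = 4) the window covers 1, 2, 3, so the rolling mean is 2 and the rolling max is 3: the current value never enters its own features.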

Expanding windows

An expanding window (.expanding()) grows from the first observation to the current row, computing a cumulative statistic. It is useful for stable long-run means or medians, particularly when the series is stationary. The same shift rule applies.
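
A short sketch of a shifted expanding mean, assuming a four-value toy series:

```python
import pandas as pd

y = pd.Series([2.0, 4.0, 6.0, 8.0], name="y")

# Shift first, then expand: row t averages y[0] ... y[t-1] only.
expanding_mean = y.shift(1).expanding().mean()

print(pd.DataFrame({"y": y, "expanding_mean": expanding_mean}))
```

The first row has no history and stays NaN; after that each row sees the mean of everything strictly before it (2, then 3, then 4).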

Calendar features

Calendar features come for free from the timestamp index. Common ones:

  • day_of_week (0=Monday, 6=Sunday)
  • month
  • is_weekend (bool)
  • is_holiday (bool, from a holiday calendar)

They capture seasonality that lags alone may not fully represent. A tree model can learn that Sundays in December behave differently from Sundays in July without you specifying the interaction explicitly.
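
A minimal sketch of deriving these columns from a DatetimeIndex. The one-entry holiday list is a hypothetical stand-in for a real holiday calendar, and the date range is chosen only for illustration:

```python
import pandas as pd

# One week of daily timestamps, Monday 23 Dec 2024 through Sunday 29 Dec 2024.
idx = pd.date_range("2024-12-23", periods=7, freq="D")
df = pd.DataFrame(index=idx)

# Hypothetical holiday list; in practice this would come from a holiday calendar.
holidays = pd.to_datetime(["2024-12-25"])

df["day_of_week"] = df.index.dayofweek      # 0=Monday, 6=Sunday
df["month"] = df.index.month
df["is_weekend"] = df.index.dayofweek >= 5  # Saturday or Sunday
df["is_holiday"] = df.index.isin(holidays)

print(df)
```

Because these columns depend only on the timestamp, they are known in advance for any future date and need no shifting.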

The critical rule: no leakage
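
The rule itself is short: every feature in the row for time t may use only values observed before t. The sketch below shows what goes wrong otherwise, using a made-up series with a spike on the last day.

```python
import pandas as pd

# Flat series with a spike on the final day.
y = pd.Series([10.0, 10.0, 10.0, 10.0, 100.0], name="y")

leaky = y.rolling(3).mean()          # window includes the current row's own y
safe = y.shift(1).rolling(3).mean()  # window ends at the previous row

print(pd.DataFrame({"y": y, "leaky": leaky, "safe": safe}))
# On the last row: leaky is 40.0 (the spike is already inside the feature), safe is 10.0.
```

The leaky column moves with the target on the same row, so a model trained on it looks accurate in evaluation but cannot reproduce that accuracy at inference time, when the current value is not yet known.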

Build a feature table: runnable example
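
A minimal sketch of such a table, assuming a synthetic daily series (trend, weekly cycle, noise) and illustrative column names; only lag_7, y and df_clean are names the text below relies on.

```python
import numpy as np
import pandas as pd

# Synthetic daily series (an assumption for illustration): trend + weekly cycle + noise.
rng = np.random.default_rng(0)
idx = pd.date_range("2024-01-01", periods=60, freq="D")
t = np.arange(len(idx))
values = 50 + 0.3 * t + 5 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 1, len(idx))
df = pd.DataFrame({"y": values}, index=idx)

# Lag features: row t holds y[t-k].
df["lag_1"] = df["y"].shift(1)
df["lag_7"] = df["y"].shift(7)

# Rolling stats over the 7 days before each row (shift first to avoid leakage).
past = df["y"].shift(1)
df["roll_mean_7"] = past.rolling(7).mean()
df["roll_std_7"] = past.rolling(7).std()

# Calendar features straight from the index.
df["day_of_week"] = df.index.dayofweek
df["is_weekend"] = df.index.dayofweek >= 5

# Drop rows that lack enough history for every feature.
df_clean = df.dropna()
print(df_clean.head())
```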

The printed table shows each day as a row. The y column is the target; every other column is a feature built from past values only. After dropping rows with insufficient history (the first 7 days are lost to lag_7), you have a clean (X, y) table. Pass X = df_clean.drop(columns="y") and y = df_clean["y"] to any scikit-learn-compatible estimator.

Handing the table to a model (static overview)

In production you would:

  1. Split with a time-based cutoff — never shuffle (see the prerequisite lesson).
  2. Fit XGBRegressor or LGBMRegressor on the training slice.
  3. At inference time, construct the feature row for the next timestamp using only the history available up to that moment, then call .predict().
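
The three steps can be sketched as follows. This is an illustration under stated assumptions: the data is synthetic, the cutoff date and feature names are made up, and scikit-learn's GradientBoostingRegressor stands in for XGBRegressor or LGBMRegressor, which offer the same fit/predict interface.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor  # stand-in for XGBRegressor / LGBMRegressor

# Synthetic daily series with a weekly cycle (assumption for illustration).
rng = np.random.default_rng(0)
idx = pd.date_range("2024-01-01", periods=200, freq="D")
t = np.arange(len(idx))
df = pd.DataFrame({"y": 50 + 5 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 1, len(idx))}, index=idx)
df["lag_1"] = df["y"].shift(1)
df["lag_7"] = df["y"].shift(7)
df["roll_mean_7"] = df["y"].shift(1).rolling(7).mean()
df["day_of_week"] = df.index.dayofweek
df_clean = df.dropna()
features = ["lag_1", "lag_7", "roll_mean_7", "day_of_week"]

# 1. Time-based cutoff: rows before the cutoff train, rows from the cutoff on test. No shuffling.
cutoff = pd.Timestamp("2024-06-01")
train = df_clean[df_clean.index < cutoff]
test = df_clean[df_clean.index >= cutoff]

# 2. Fit on the training slice and score on the later slice.
model = GradientBoostingRegressor(random_state=0)
model.fit(train[features], train["y"])
mae = float(np.mean(np.abs(model.predict(test[features]) - test["y"])))

# 3. Inference: build the row for the next timestamp from history only.
history = df["y"]
next_ts = history.index[-1] + pd.Timedelta(days=1)
next_row = pd.DataFrame(
    {
        "lag_1": [history.iloc[-1]],               # y[T]
        "lag_7": [history.iloc[-7]],               # y[T-6], i.e. 7 days before T+1
        "roll_mean_7": [history.iloc[-7:].mean()], # mean of the last 7 observed days
        "day_of_week": [next_ts.dayofweek],
    },
    index=[next_ts],
)
next_pred = float(model.predict(next_row[features])[0])
print(f"test MAE: {mae:.2f}, forecast for {next_ts.date()}: {next_pred:.2f}")
```

Note that the inference row reproduces exactly what .shift and .shift(1).rolling would have produced for that timestamp; any mismatch between training-time and inference-time feature construction is a common source of silent errors.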

Multi-step forecasting (predicting several steps ahead) requires iteratively appending each prediction to the history before computing the next row’s features — or training separate models for each horizon.
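
The iterative option can be sketched as below. The helper names make_lag_matrix and recursive_forecast are hypothetical, the series is a perfectly linear toy example, and LinearRegression is used only to keep the sketch small; any fitted regressor with a predict method would slot in.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def make_lag_matrix(values, n_lags):
    """One row per target: the n_lags values before it."""
    X = np.array([values[t - n_lags:t] for t in range(n_lags, len(values))])
    y = np.array(values[n_lags:])
    return X, y

def recursive_forecast(model, history, n_lags, steps):
    """Predict one step, append the prediction to the window, repeat."""
    window = list(history[-n_lags:])
    preds = []
    for _ in range(steps):
        next_val = float(model.predict(np.array([window]))[0])
        preds.append(next_val)
        window = window[1:] + [next_val]  # the prediction becomes the newest lag
    return preds

history = [float(v) for v in range(1, 21)]  # 1, 2, ..., 20: a toy linear series
n_lags = 3
X, y = make_lag_matrix(history, n_lags)
model = LinearRegression().fit(X, y)

forecast = recursive_forecast(model, history, n_lags, steps=3)
print(forecast)
```

Because each step feeds on earlier predictions, errors can compound as the horizon grows; training a separate model per horizon avoids that at the cost of more models to maintain.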

What you can do now

You have a repeatable recipe:

  1. Choose lag distances guided by domain knowledge and autocorrelation plots.
  2. Choose rolling window widths (short for local trends, long for slow seasonality).
  3. Add calendar features from the index.
  4. Always .shift(1) before any aggregate that touches the target column.
  5. Drop NaN rows, split in time order, and train any tabular model.

Quick check

Q1. You compute `df['y'].rolling(7).mean()` without shifting and use it as a feature to predict `df['y']`. What is the problem?
Q2. A daily retail series shows strong weekly patterns. Which pair of lag features best captures this?
Q3. You train an XGBoost model on a feature table built from lag and rolling features and it scores well on the held-out test set. A colleague suggests adding the 7-day centered rolling mean (window centered on the current day) to capture smoother trends. Should you add it?

Practice this in an interview

How do you engineer lag and rolling features for a time series ML model, and what leakage risks arise?

Lag features shift the target or exogenous variables backward by k steps so the model sees past values as inputs; rolling features (rolling mean, std, min, max) summarise a window of past observations. Both must be computed strictly on past data — any feature that incorporates information from the current or future rows leaks the target and inflates metrics.

How do you extract useful features from datetime columns for a machine learning model?

Raw timestamps are meaningless to most models. Useful features extracted from a datetime column include calendar components (hour, day of week, month, quarter, year), cyclical encodings of periodic components (sin/cos of hour or day-of-week), lag and rolling-window aggregates, time-since-event features, and business-calendar flags like is_weekend or is_holiday.

How do you handle skewed features in a machine learning dataset, and why does skew matter?

Right-skewed features (long tail on the right) concentrate most values at the low end of their range while a few extreme values pull the mean up, which distorts distance-based models and linear regression. Common fixes are log, square-root, or Box-Cox transformations that compress the tail and bring the distribution closer to normal, which helps models converge and reduces the undue influence of large values.

What regularisation mechanisms does XGBoost add on top of standard gradient boosting?

XGBoost adds L1 (alpha) and L2 (lambda) regularisation on leaf weights directly into the objective function, a minimum child weight that prevents splits on sparse sub-groups, a tree complexity penalty (gamma) that requires a minimum gain before a split is accepted, and column and row subsampling analogous to random forests.
