Lag & rolling features
How to turn a time series into a plain feature table so any regressor — XGBoost, LightGBM, linear models — can forecast the next value.
What you'll learn
- Lag features: encoding the recent past as plain numeric columns
- Rolling-window stats: mean, std, min, max over a trailing window — shifted to prevent leakage
- Calendar features: day-of-week, month, is_weekend, is_holiday as free signal
Before you start
Why bother? The gradient-boosting toolbox
ARIMA and its relatives are purpose-built for time series, but they are parametric models with a narrow hypothesis class. Gradient-boosted trees — XGBoost, LightGBM, CatBoost — handle non-linear interactions, mixed feature types, and large datasets gracefully. The catch is that they expect a plain (X, y) table with i.i.d. rows. Feature engineering is the bridge that gets you there.
The core idea: one row per forecast horizon
Suppose you have a daily series y_0, y_1, ..., y_T. To predict y_t you construct a row whose columns are derived exclusively from times before t. You stack these rows into a feature matrix X and a target vector y, then train any regressor.
Figure (sliding window, one-step-ahead forecast): a window of width 3 slides over the series. Each position yields one feature row (lag_3, lag_2, lag_1) and one target (the next step).
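The sliding-window construction can be sketched in a few lines of NumPy. This is a minimal illustration, not the lesson's original code; the helper name `make_window_table` and the toy numbers are assumptions made up for the example.

```python
import numpy as np

def make_window_table(series, width=3):
    """Turn a 1-D series into (X, y): each row holds `width` past values,
    and y is the value that immediately follows the window."""
    values = np.asarray(series, dtype=float)
    n_rows = len(values) - width
    if n_rows <= 0:
        raise ValueError("series must be longer than the window width")
    # Row i covers values[i : i + width]; columns run oldest (lag_3) to newest (lag_1).
    X = np.stack([values[i : i + width] for i in range(n_rows)])
    # The target for row i is the first value after its window.
    y = values[width:]
    return X, y

# Toy series (made-up numbers).
X_win, y_win = make_window_table([10, 11, 13, 12, 15, 16], width=3)
print(X_win)
print(y_win)
```

Six observations with a width of 3 give three rows: the first row is (10, 11, 13) with target 12.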
Lag features
A lag feature is simply the value of the series at some earlier time. For a series y stored in a pandas Series or DataFrame column, .shift(k) shifts the values down by k rows so that row t holds y[t-k].
Common lag choices:
- Lag 1 — yesterday’s value; captures immediate autocorrelation.
- Lag 7 — same day last week; captures weekly seasonality.
- Lag 365 — same day last year; captures annual seasonality.
You are not limited to these; domain knowledge guides which lags matter.
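As a small sketch of `.shift(k)` in practice, the snippet below adds lag 1 and lag 7 to a toy daily frame. The frame name `df_lags` and the values are made up for illustration; lag 365 is left out only because ten rows of data cannot fill it.

```python
import pandas as pd

# Ten days of made-up data: 100, 101, ..., 109.
dates = pd.date_range("2024-01-01", periods=10, freq="D")
df_lags = pd.DataFrame({"y": list(range(100, 110))}, index=dates)

# Each lag column holds the value of y from k rows (here: k days) earlier.
for k in (1, 7):
    df_lags[f"lag_{k}"] = df_lags["y"].shift(k)

print(df_lags)
```

The first k rows of each lag column are NaN because no history exists that far back; those rows are usually dropped before training.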
Rolling-window statistics
A rolling window computes an aggregate (mean, std, min, max) over the most recent k observations. This gives the model a sense of local trend and volatility.
In pandas, .rolling(k) looks at the current row and the k-1 rows before it — meaning, without a shift, it includes the current target in the window. That is leakage (covered below). The safe pattern is .shift(1).rolling(k): shift the series by one step first, so the window sits entirely in the past relative to each row’s target.
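To make the difference concrete, here is a minimal comparison of the leaky and the safe pattern on a toy series. The variable names are illustrative only.

```python
import pandas as pd

y_roll = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Leaky: the window ends at the current row, so it contains the target itself.
leaky_mean = y_roll.rolling(3).mean()

# Safe: shift first, so the window ends at the previous row.
safe_mean = y_roll.shift(1).rolling(3).mean()

compare = pd.DataFrame(
    {"y": y_roll, "leaky_mean_3": leaky_mean, "safe_mean_3": safe_mean}
)
print(compare)
```

At the row where y is 4, the leaky mean averages (2, 3, 4) while the safe mean averages (1, 2, 3). The same `.shift(1).rolling(k)` prefix works for `.std()`, `.min()` and `.max()`.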
Expanding windows
An expanding window (.expanding()) grows from the first observation to the current row, computing a cumulative statistic. It is useful for stable long-run means or medians, particularly when the series is stationary. The same shift rule applies.
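A short sketch of a shifted expanding mean, using made-up numbers:

```python
import pandas as pd

y_exp = pd.Series([4.0, 6.0, 8.0, 2.0])

# Shift first so each row's cumulative mean covers only earlier observations.
expanding_mean = y_exp.shift(1).expanding().mean()
print(expanding_mean)
```

The first row is NaN (no history yet); after that each row holds the mean of everything strictly before it, so the last row is the mean of 4, 6 and 8.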
Calendar features
Calendar features come for free from the timestamp index. Common ones:
- day_of_week (0=Monday, 6=Sunday)
- month
- is_weekend (bool)
- is_holiday (bool, from a holiday calendar)
They capture seasonality that lags alone may not fully represent. A tree model can learn that Sundays in December behave differently from Sundays in July without you specifying the interaction explicitly.
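These columns can be derived from a `DatetimeIndex` roughly as follows. The holiday list here is a hypothetical one-entry stand-in; in practice it would come from whichever holiday calendar applies to your data.

```python
import pandas as pd

dates = pd.date_range("2024-12-23", periods=7, freq="D")
cal = pd.DataFrame(index=dates)

cal["day_of_week"] = cal.index.dayofweek        # 0=Monday, 6=Sunday
cal["month"] = cal.index.month
cal["is_weekend"] = cal.index.dayofweek >= 5    # Saturday or Sunday

# Hypothetical holiday list, for illustration only.
holidays = pd.to_datetime(["2024-12-25"])
cal["is_holiday"] = cal.index.isin(holidays)

print(cal)
```

Because these features depend only on the timestamp, they are known in advance for future dates and need no shifting.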
The critical rule: no leakage
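One way to make the rule testable is a perturbation check: the features in row t must use only values from before t, so changing y[t] should leave that row's features untouched. The sketch below is an assumption about how such a check could look; `build_features` and `leaks_target` are hypothetical helper names, not part of pandas.

```python
import numpy as np
import pandas as pd

def build_features(y, leaky=False):
    """Rolling features over 3 rows; the safe variant shifts by one step first."""
    src = y if leaky else y.shift(1)
    return pd.DataFrame(
        {"roll_mean_3": src.rolling(3).mean(), "roll_max_3": src.rolling(3).max()}
    )

def leaks_target(y, t, leaky=False):
    """Return True if changing y[t] changes the features of row t."""
    base = build_features(y, leaky).iloc[t]
    bumped = y.copy()
    bumped.iloc[t] += 1000.0
    changed = build_features(bumped, leaky).iloc[t]
    return not np.allclose(base, changed, equal_nan=True)

y_chk = pd.Series(np.arange(10, dtype=float))
print(leaks_target(y_chk, t=5, leaky=False))  # False: row 5 only sees rows 2..4
print(leaks_target(y_chk, t=5, leaky=True))   # True: row 5 sees its own target
```

A check like this can sit in a unit test next to the feature pipeline, so that a forgotten shift is caught before it inflates validation metrics.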
Build a feature table: runnable example
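The following is a minimal sketch of such a table, assuming a synthetic daily series with a weekly pattern. The data, the seed, and the particular column choices are illustrative assumptions; only `y`, `lag_7` and `df_clean` are names taken from the surrounding text.

```python
import numpy as np
import pandas as pd

# Synthetic daily series: weekly cycle plus noise (made-up data).
rng = np.random.default_rng(0)
dates = pd.date_range("2024-01-01", periods=21, freq="D")
values = 100 + 10 * np.sin(2 * np.pi * np.arange(21) / 7) + rng.normal(0, 1, 21)
df = pd.DataFrame({"y": values}, index=dates)

# Lag features.
df["lag_1"] = df["y"].shift(1)
df["lag_7"] = df["y"].shift(7)

# Rolling statistics over the previous 7 days; shift(1) keeps the target out.
past = df["y"].shift(1)
df["roll_mean_7"] = past.rolling(7).mean()
df["roll_std_7"] = past.rolling(7).std()
df["roll_min_7"] = past.rolling(7).min()
df["roll_max_7"] = past.rolling(7).max()

# Calendar features from the index.
df["day_of_week"] = df.index.dayofweek
df["is_weekend"] = df.index.dayofweek >= 5

# Drop rows whose history is incomplete.
df_clean = df.dropna()
print(df_clean.head())

X = df_clean.drop(columns="y")
y = df_clean["y"]
```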
The printed table shows each day as a row. The y column is the target; every other column is a feature built from past values only. After dropping rows with insufficient history (the first 7 days are lost to lag_7), you have a clean (X, y) table. Pass X = df_clean.drop(columns="y") and y = df_clean["y"] to any scikit-learn-compatible estimator.
Handing the table to a model (static overview)
In production you would:
- Split with a time-based cutoff — never shuffle (see the prerequisite lesson).
- Fit XGBRegressor or LGBMRegressor on the training slice.
- At inference time, construct the feature row for the next timestamp using only the history available up to that moment, then call .predict().
Multi-step forecasting (predicting several steps ahead) requires iteratively appending each prediction to the history before computing the next row’s features — or training separate models for each horizon.
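The iterative approach can be sketched as below, using lag-only features and a plain linear model so the behaviour is easy to follow. `make_lag_rows` and `recursive_forecast` are hypothetical helper names, and the perfectly linear toy series is chosen only to make the result predictable.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def make_lag_rows(values, n_lags):
    """Rows of n_lags consecutive values, each paired with the value that follows."""
    X = np.stack([values[i : i + n_lags] for i in range(len(values) - n_lags)])
    return X, values[n_lags:]

def recursive_forecast(model, history, n_lags, steps):
    """Predict `steps` values ahead, feeding each prediction back in as history."""
    window = list(history[-n_lags:])
    preds = []
    for _ in range(steps):
        x = np.asarray(window[-n_lags:], dtype=float).reshape(1, -1)
        y_hat = float(model.predict(x)[0])
        preds.append(y_hat)
        window.append(y_hat)  # the prediction becomes part of the history
    return preds

# Toy series 1, 2, ..., 20 (made-up data with a pure linear trend).
values = np.arange(1.0, 21.0)
X_rec, y_rec = make_lag_rows(values, n_lags=3)
model_rec = LinearRegression().fit(X_rec, y_rec)

preds_rec = recursive_forecast(model_rec, values, n_lags=3, steps=3)
print(preds_rec)
```

On real data, errors compound because later steps are built on earlier predictions; that is the usual argument for training a separate model per horizon instead.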
What you can do now
You have a repeatable recipe:
- Choose lag distances guided by domain knowledge and autocorrelation plots.
- Choose rolling window widths (short for local trends, long for slow seasonality).
- Add calendar features from the index.
- Always .shift(1) before any aggregate that touches the target column.
- Drop NaN rows, split in time order, and train any tabular model.
Practice this in an interview
Lag features shift the target or exogenous variables backward by k steps so the model sees past values as inputs; rolling features (rolling mean, std, min, max) summarise a window of past observations. Both must be computed strictly on past data: any feature that incorporates information from the current or future rows leaks the target and inflates metrics.
Raw timestamps are meaningless to most models. Useful features extracted from a datetime column include calendar components (hour, day of week, month, quarter, year), cyclical encodings of periodic components (sin/cos of hour or day-of-week), lag and rolling-window aggregates, time-since-event features, and business-calendar flags like is_weekend or is_holiday.
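As a small illustration of a cyclical encoding, the sketch below maps day-of-week onto a circle so that Sunday (6) and Monday (0) end up adjacent. The column names `dow_sin` and `dow_cos` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2024-01-01", periods=7, freq="D")  # Monday through Sunday
enc = pd.DataFrame(index=idx)
enc["day_of_week"] = enc.index.dayofweek

# Map the 7-day cycle onto the unit circle.
angle = 2 * np.pi * enc["day_of_week"] / 7
enc["dow_sin"] = np.sin(angle)
enc["dow_cos"] = np.cos(angle)

print(enc)
```

Tree models can usually work with the raw integer, so this encoding tends to matter more for linear and distance-based models.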
Right-skewed features (long tail on the right) concentrate most values near zero while a few extreme values pull the mean up, which distorts distance-based models and linear regression. Common fixes are log, square-root, or Box-Cox transformations that compress the tail and make the distribution closer to normal, improving model convergence and reducing the undue influence of large values.
XGBoost adds L1 (alpha) and L2 (lambda) regularisation on leaf weights directly into the objective function, a minimum child weight that prevents splits on sparse sub-groups, a tree complexity penalty (gamma) that requires a minimum gain before a split is accepted, and column and row subsampling analogous to random forests.