datarekha

SARIMAX (exogenous regressors)

Add holiday flags, promo spend, price, and other known external drivers to a seasonal ARIMA model — and learn why forecasting with them requires knowing the future values of those drivers.

9 min read Advanced Time Series Lesson 9 of 14

What you'll learn

  • What the X in SARIMAX means: exogenous regressors are extra input columns, not lags of the target
  • Model form: SARIMA captures autocorrelation in the residual after regressing on the exog variables
  • Practical API: statsmodels SARIMAX, fitting with exog, and passing future exog to .forecast()

The X in SARIMAX

SARIMAX is Seasonal AutoRegressive Integrated Moving Average with eXogenous regressors. Everything up to the X is the SARIMA you already know. The X adds one new idea:

An exogenous variable (from the Greek for “originating outside”) is a variable that influences your target series but is not itself predicted by the model. A holiday flag, promotional spend, a competitor’s price, or outdoor temperature are all exogenous to your sales. You observe or plan them ahead of time; they are inputs, not outputs.

Contrast this with the endogenous variable — the series you are trying to forecast (sales, in this example). The ARIMA terms model the endogenous variable’s own past. The exogenous terms model the additive lift or drag that each external driver provides on top of that.

How the model is structured

At its core SARIMAX combines two ideas in a single fit:

  1. Regression on exog. Each exogenous column gets its own coefficient, exactly like ordinary linear regression. The model computes a weighted sum of the external signals and subtracts it from the target.

  2. SARIMA on the residual. After removing the exog contribution, whatever is left is modelled with the full seasonal ARIMA structure: AR lags, differencing, MA error terms, and their seasonal counterparts at period s.

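In words: the SARIMA part explains the autocorrelation that remains once the known external drivers are accounted for. The two pieces are estimated simultaneously in a single maximum-likelihood optimisation, so the regression coefficients are not distorted by autocorrelation left in the residual, and the SARIMA terms are not forced to explain effects that the drivers already account for.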
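
The two steps can be written compactly as a regression with SARIMA errors. The notation is introduced here as an assumption rather than taken from the lesson: x_t is the vector of exog values at time t, β holds the regression coefficients, u_t is the regression residual, L is the lag operator, s is the seasonal period, and ε_t is white noise. This matches the form the statsmodels documentation gives for SARIMAX under its default settings.

```latex
\begin{aligned}
y_t &= \beta^{\top} x_t + u_t \\
\phi_p(L)\,\Phi_P(L^{s})\,(1-L)^{d}\,(1-L^{s})^{D}\,u_t
  &= \theta_q(L)\,\Theta_Q(L^{s})\,\varepsilon_t
\end{aligned}
```

The first line is step 1 (regression on exog); the second line is step 2 (SARIMA on the residual u_t).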

This means you can think of SARIMAX as answering two questions at once:

  • How much does each external driver move the series? (regression coefficients)
  • What pattern does the series follow on its own, after those drivers are removed? (SARIMA structure)

When exogenous regressors help

Adding exog variables is worthwhile when:

  • You have known external drivers that repeat or can be planned (holidays, promotions, scheduled price changes, weather forecasts).
  • Those drivers cause systematic deviations that your SARIMA residuals currently flag as unexplained spikes.
  • You can supply future values of the drivers at forecast time — either because they are known (a public holiday calendar) or because you have a reliable separate forecast for them (a 7-day weather forecast).

Exog variables do not help when the driver is itself unpredictable, when you do not know its future value, or when its relationship with the target is highly nonlinear (in which case tree-based or neural models may be more appropriate).
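
For a driver that is known in advance, such as a public holiday calendar, the exog column can be built for past and future dates in one pass and then split. The sketch below uses pandas only; the holiday dates, the weekly grid, and the helper week_has_holiday are hypothetical and exist purely for illustration.

```python
import pandas as pd

# Hypothetical holiday dates, for illustration only
holidays = pd.to_datetime(["2024-12-25", "2025-01-01"])

# Weekly grid covering history plus the forecast horizon (weeks ending Sunday)
weeks = pd.date_range("2024-12-01", periods=8, freq="W-SUN")


def week_has_holiday(week_end, holiday_dates):
    """Return 1 if any holiday falls in the 7 days ending on week_end."""
    week_start = week_end - pd.Timedelta(days=6)
    in_week = (holiday_dates >= week_start) & (holiday_dates <= week_end)
    return int(in_week.any())


exog = pd.DataFrame(
    {"is_holiday": [week_has_holiday(w, holidays) for w in weeks]},
    index=weeks,
)

# First 5 weeks are used for fitting, the remaining 3 are the future rows
X_train, X_future = exog.iloc[:5], exog.iloc[5:]
print(exog)
```

Building both halves from the same function keeps the column definition identical between training and forecasting, which is easy to get wrong when the future rows are typed in by hand.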

The big trap: you must supply future exog

Diagram: external regressors feeding into the forecast. Inputs: holiday flag (known calendar), promo spend (planned budget), and past sales (endogenous lags). All three feed SARIMAX (regression + SARIMA on the residual), which outputs the forecast. Future exog is required at forecast time.

External regressors (holiday flag, promo spend) and the series’ own past both feed into SARIMAX. At forecast time the future rows of those regressors must be supplied.

Fitting SARIMAX in Python

The snippet below shows the complete workflow: build exogenous feature columns, fit the model, then pass the future rows of those same columns when forecasting.

import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# --- Load target series and build exog columns ---
df = pd.read_csv("weekly_sales.csv", index_col="week", parse_dates=True)

# Exogenous feature matrix (aligned to the training index)
X_train = df[["is_holiday", "promo_spend"]]
y_train = df["sales"]

# --- Fit SARIMAX ---
# order=(p,d,q) for the non-seasonal part
# seasonal_order=(P,D,Q,s) for the seasonal part; here s=52 for weekly data
model = SARIMAX(
    y_train,
    exog=X_train,
    order=(1, 1, 1),
    seasonal_order=(1, 1, 1, 52),
    enforce_stationarity=False,
    enforce_invertibility=False,
)
result = model.fit(disp=False)
print(result.summary())

# --- Forecast the next 8 weeks ---
# You MUST provide X_future: 8 rows, same columns as X_train
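# The future index starts one week AFTER the last training date
future_index = pd.date_range(
    df.index[-1] + pd.Timedelta(weeks=1), periods=8, freq="7D"
)
X_future = pd.DataFrame({
    "is_holiday": [0, 0, 1, 0, 0, 0, 1, 0],   # two holidays in forecast window
    "promo_spend": [5000, 5000, 8000, 5000, 5000, 5000, 9000, 5000],
}, index=future_index)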

forecast = result.forecast(steps=8, exog=X_future)
print(forecast)

Reading the summary output

After fitting, result.summary() shows:

  • Regression coefficients for each exog column (labelled by column name). A positive coefficient on is_holiday means holidays lift sales on average; the magnitude is in the same units as your target.
  • AR, MA, seasonal AR, seasonal MA coefficients for the SARIMA part — these operate on the residual after the exog contribution is removed.
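  • AIC / BIC for model selection. Compare candidate orders by fitting multiple models and choosing the lowest AIC, holding the exog columns and the differencing orders (d, D) constant; likelihoods from models with different differencing are not directly comparable.
  • Ljung-Box Q in the diagnostics. As always, the residuals should be white noise: a Prob(Q) well above 0.05 means the test finds no evidence of leftover autocorrelation.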

Diagnosing the model

result.plot_diagnostics(figsize=(12, 8))

Inspect the residual ACF panel (the correlogram). If you still see systematic spikes, especially at the seasonal period, the SARIMA order needs adjusting. If a spike that was present without the exog columns disappears once you add them, the exog variables are explaining structure that the SARIMA terms alone were missing, which is a sign they are genuinely helping.

Choosing your exogenous variables

Not every variable you might imagine deserves a slot in the exog matrix. A practical checklist:

  • Is it known or forecastable ahead of time? A national holiday calendar is fixed years in advance. The spot price of a commodity might require a separate model to forecast, which introduces compounding error.
  • Is its effect stable over time? If the relationship between promo spend and sales changes every year, a fixed linear coefficient will misfit.
  • Does it reduce residual autocorrelation? Run the model with and without the candidate variable and compare the residual ACF and AIC. If neither improves, the variable is not adding useful signal beyond what the SARIMA terms already capture.
  • Is it collinear with seasonal terms? A variable that fires in the same period every year (every December in monthly data with s=12, for example) repeats the pattern that the seasonal terms and seasonal differencing already capture. The model will still fit, but the coefficient will be unstable and hard to interpret.

Putting it all together

The mental model to carry forward:

  • Exogenous regressors are input columns you supply — holiday flags, promo spend, temperature, price — that the model treats as known drivers rather than things to forecast from the series itself.
  • SARIMAX = linear regression on exog + SARIMA on the remaining autocorrelation, estimated together.
  • Forecasting requires future exog rows. Only include regressors whose future values you can reliably obtain.
  • Use AIC and residual ACF to confirm the exog variables are actually helping, not just adding noise.

Quick check

Q1. What distinguishes an exogenous variable from the endogenous variable in a SARIMAX model?
Q2. You fit SARIMAX(1,1,1)(1,1,1,52) with a 'promo_spend' exog column and call result.forecast(steps=4). The call raises an error about missing exog. What went wrong and how do you fix it?
Q3. A data scientist adds daily temperature as an exog variable to a SARIMAX model for ice-cream sales. AIC improves and the residual ACF looks clean. Three months later the model performs poorly in production. Which of the following is the most likely root cause?

Practice this in an interview

What is the difference between ARIMA and SARIMA, and when do you use each?

ARIMA(p,d,q) models non-seasonal series by combining autoregression, differencing, and a moving average of errors. SARIMA extends it with a second set of seasonal parameters (P,D,Q,s) that operate at the seasonal lag s, handling periodic patterns that ARIMA alone cannot capture.

What is a VAR model, and when would you use it instead of a univariate ARIMA?

A Vector Autoregression (VAR) model extends the univariate autoregressive (AR) model to multiple time series simultaneously: each variable is regressed on its own past values and the past values of all other variables in the system. Use VAR when the series have mutual predictive relationships (Granger causality) and you want to model those interactions; ARIMA is sufficient when one series can be forecast in isolation.

When would you choose Prophet over ARIMA for a forecasting problem?

Prophet is a curve-fitting model that decomposes the series into trend, seasonality, and holidays; it handles missing data, multiple seasonalities, and non-uniform time grids with minimal tuning and is accessible to non-statisticians. ARIMA is a statistical model based on autocorrelation structure; it is more appropriate when the series is short, noise is small, and you need principled uncertainty intervals from an explicit stochastic process.

What are trend, seasonality, and residual in time series decomposition, and how do you extract them?

Decomposition separates a series into a trend component (long-run direction), a seasonal component (periodic, fixed-period pattern), and a residual (everything left over). Additive decomposition sums the three; multiplicative decomposition multiplies them, which is appropriate when seasonal swings grow with the level.
