Stationarity, differencing, and why ARIMA needs a flat series

Stock analysts who built linear-trend models on 2019 retail sales walked into 2020 carrying the wrong weapon. The series they thought they understood had a hidden assumption embedded in every coefficient: that the statistical properties of the data — its average level, its volatility, the way one week relates to the next — would stay roughly stable over time. When that assumption broke, forecasts did not just miss; they missed in confident, systematic ways. That buried assumption is stationarity, and understanding it is the price of admission to rigorous time series forecasting.

What stationarity actually means

A time series is (weakly) stationary when three properties hold for every point in time:

The mean is constant — the series fluctuates around a stable center rather than drifting upward or downward.
The variance is constant — the amplitude of fluctuations does not grow or shrink over time.
The autocovariance between any two observations depends only on the lag between them, not on where in time those observations fall.

That third property is subtler than it sounds. It means the relationship between one Monday and the following Monday is the same whether you are looking at January or July: the dependence structure does not change with where you are on the calendar. When that holds, a model trained on the first half of your data is genuinely applicable to the second half.
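One way to see the constant-variance condition is to simulate many independent realizations of a process and measure the spread across them at different time points. The sketch below is illustrative only: the seed, the number of paths, and the two time points are arbitrary choices, and white noise and a random walk stand in for a stationary and a non-stationary series.

```python
import numpy as np

rng = np.random.default_rng(0)
n_paths, n_steps = 2000, 200

# Each row is one independent realization of the process.
noise = rng.normal(size=(n_paths, n_steps))   # white noise
walk = noise.cumsum(axis=1)                   # random walk built from the same shocks

# Variance across realizations at an early and a late time point.
early, late = 49, 199
noise_var = noise[:, [early, late]].var(axis=0)
walk_var = walk[:, [early, late]].var(axis=0)

print("white noise variance at t=50 and t=200:", noise_var.round(2))
print("random walk variance at t=50 and t=200:", walk_var.round(2))
```

The white-noise variance should sit near 1 at both points, while the random-walk variance grows roughly in proportion to elapsed time, which is exactly what the second condition forbids.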

Most real-world economic and business series violate at least one of these conditions right out of the database. Monthly revenue usually climbs over years (non-constant mean). Volatility in financial returns clusters: calm periods are followed by calm periods, and turbulent periods by turbulent ones (non-constant variance). Holiday retail data has December spikes that change the autocovariance structure depending on time of year (season-dependent autocovariance). See the glossary for a concise definition of each term if any of these feel unfamiliar.

Why ARIMA — and most classical models — need stationarity

The ARIMA model is built from two stationarity-assuming components: an autoregressive (AR) part and a moving-average (MA) part. The AR part regresses the current value on its own past values. The MA part regresses the current value on past forecast errors. Both of those regressions implicitly assume the coefficients are stable over time — which is only plausible if the series is stationary.
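To make the AR part concrete, here is a minimal sketch that simulates a stationary AR(1) process and recovers its coefficient by regressing each value on the previous one. The coefficient 0.6, the sample size, and the seed are illustrative assumptions, and the no-intercept regression is a simplification that works because the simulated series has mean zero.

```python
import numpy as np

rng = np.random.default_rng(1)
phi_true, n = 0.6, 5000

# Simulate a stationary AR(1): y_t = phi * y_{t-1} + eps_t
eps = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = phi_true * y[t - 1] + eps[t]

# Regress y_t on y_{t-1} (no intercept, since the simulated mean is zero).
lagged, current = y[:-1], y[1:]
phi_hat = (lagged @ current) / (lagged @ lagged)

print(f"true phi = {phi_true}, estimated phi = {phi_hat:.3f}")
```

Because the process is stationary, a single coefficient describes the whole sample, and the estimate lands close to the true value.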

Train an AR model on a trending series and the model will learn to chase the trend rather than the structure around the trend. Roll it forward one period and it will extrapolate the trend blindly, with no mechanism to detect or correct when the trend changes slope or reverses. The statistical machinery works; the forecast is just answering the wrong question.

Stationarity is also what makes the algebra of model estimation well-behaved. When the series has a unit root or a variance that grows without bound, the least-squares and maximum-likelihood estimators used to fit AR and MA terms no longer follow their standard asymptotic distributions. The confidence intervals you compute for your coefficients are no longer reliable. The whole inferential superstructure rests on a cracked foundation.

Trend and seasonality as the two main offenders

A linear trend adds a time-dependent increment to the mean at every step. A seasonal pattern adds a periodic, time-dependent component to the mean at every cycle. Both destroy the constant-mean requirement. Strong seasonality can additionally change the autocovariance structure because the correlation between observations that are one lag apart differs depending on whether that lag crosses a seasonal boundary.

The figure below contrasts a synthetic monthly series with both trend and seasonality against its differenced counterpart — already you can see the qualitative difference between a series that is going somewhere and one that fluctuates around a fixed level.

A synthetic monthly series with a rising trend and mild seasonality (left) versus the same series after first differencing (right). The differenced series fluctuates around a stable mean near zero.

Differencing: modelling change rather than level

The first difference of a series replaces each value with the change since the previous period:

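import pandas as pd

# df is assumed to be a DataFrame with a numeric column "y" in time order
df["y_diff"] = df["y"].diff()        # y_t - y_{t-1}; the first row becomes NaN
df["y_diff2"] = df["y_diff"].diff()  # second difference; the first two rows are NaN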

Why does this work? A linear trend means y_t = alpha + beta*t + noise_t. The first difference is y_t - y_{t-1} = beta + (noise_t - noise_{t-1}). The deterministic slope beta becomes a constant that disappears into the mean of the differenced series. If noise_t is itself stationary, the differenced series is stationary. First differencing removes linear trends; second differencing removes quadratic trends. Applying differencing more than twice almost never improves things in practice.
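The algebra above can be checked numerically. This sketch uses noise-free trends so the result is exact; the values of alpha, beta, and gamma are made up for illustration.

```python
import numpy as np

t = np.arange(100, dtype=float)
alpha, beta, gamma = 5.0, 0.3, 0.02   # illustrative values only

linear = alpha + beta * t
quadratic = alpha + beta * t + gamma * t**2

d1_linear = np.diff(linear)              # constant: beta
d1_quadratic = np.diff(quadratic)        # still trending: beta + gamma * (2t - 1)
d2_quadratic = np.diff(quadratic, n=2)   # constant: 2 * gamma

print("first difference of linear trend:", d1_linear[:3].round(4))
print("first difference of quadratic trend:", d1_quadratic[:3].round(4))
print("second difference of quadratic trend:", d2_quadratic[:3].round(4))
```

One difference flattens the linear trend to the constant beta, while the quadratic trend needs a second pass before it collapses to the constant 2 * gamma.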

Seasonal differencing removes a periodic pattern of period s by computing y_t - y_{t-s}. For monthly data with annual seasonality, that is y_t - y_{t-12}. This strips away the seasonal level shift for each calendar month, leaving only non-seasonal variation. In the SARIMA notation the seasonal differencing order is denoted D and is kept separate from the non-seasonal d.
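A small sketch of seasonal differencing with pandas, assuming a noise-free monthly series built from a linear trend plus a fixed month-of-year effect. The monthly effects, the slope, and the start date are invented for illustration.

```python
import numpy as np
import pandas as pd

months = pd.date_range("2015-01-01", periods=72, freq="MS")
t = np.arange(72)
# Made-up month-of-year effects, with a large December spike.
seasonal_profile = np.array([0, -1, 1, 2, 3, 4, 5, 4, 2, 1, 3, 9], dtype=float)
beta = 0.5

y = pd.Series(beta * t + seasonal_profile[t % 12], index=months)

seasonal_diff = y.diff(12).dropna()    # y_t - y_{t-12}
both = y.diff(12).diff().dropna()      # seasonal difference, then first difference

print(seasonal_diff.head(3))
print(both.head(3))
```

The seasonal difference cancels the monthly effects and leaves the constant 12 * beta contributed by the trend; the additional first difference removes that constant as well.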

The deeper intuition is this: levels encode cumulative history; changes encode local dynamics. A random walk has levels that drift arbitrarily far but changes that are pure white noise. Models trained on changes generalize because each change is drawn from the same distribution regardless of where in the walk you are. Models trained on levels get confused because the level today is correlated with the entire past trajectory of the series.
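The contrast between levels and changes can be sketched on a simulated random walk. The seed and sample size are arbitrary, and lag1_corr is a hypothetical helper defined here only for the illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
eps = rng.normal(size=5000)
walk = np.cumsum(eps)       # levels: each value carries the whole shock history
changes = np.diff(walk)     # changes: recovers the individual shocks

def lag1_corr(x):
    """Correlation between a series and itself shifted by one step."""
    return np.corrcoef(x[:-1], x[1:])[0, 1]

print(f"lag-1 correlation of levels:  {lag1_corr(walk):.3f}")
print(f"lag-1 correlation of changes: {lag1_corr(changes):.3f}")
```

Levels are almost perfectly correlated with their own past, while the changes behave like independent draws.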

Testing for stationarity: ADF and KPSS

Never rely on visual inspection alone. Two complementary tests together give you a reliable diagnosis.

Augmented Dickey-Fuller (ADF) tests whether the series has a unit root. A unit root process is one where shocks accumulate permanently — the simplest example being the random walk y_t = y_{t-1} + epsilon_t. The ADF null hypothesis is that the series has a unit root (is non-stationary). Rejecting the null (small p-value) is evidence of stationarity.
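What "shocks accumulate permanently" means can be sketched by tracing a single unit shock through y_t = phi * y_{t-1} + epsilon_t with all later shocks set to zero. The function name shock_response and the two coefficients are illustrative choices.

```python
import numpy as np

def shock_response(phi, steps):
    """Path of y after one unit shock at t=0, with no further shocks."""
    y = np.zeros(steps)
    y[0] = 1.0
    for t in range(1, steps):
        y[t] = phi * y[t - 1]
    return y

stationary = shock_response(0.5, 50)   # effect decays geometrically
unit_root = shock_response(1.0, 50)    # effect never decays

print("phi = 0.5, effect after 49 steps:", stationary[-1])
print("phi = 1.0, effect after 49 steps:", unit_root[-1])
```

With phi below 1 in absolute value the shock fades away; with phi equal to 1 it stays in the level forever, which is the behavior the ADF test is designed to detect.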

KPSS (Kwiatkowski-Phillips-Schmidt-Shin) tests the opposite direction. Its null hypothesis is that the series is stationary around a deterministic trend or level. Rejecting the null (small p-value) is evidence of non-stationarity.

from statsmodels.tsa.stattools import adfuller, kpss

result_adf = adfuller(series, autolag="AIC")
print(f"ADF stat: {result_adf[0]:.4f}, p-value: {result_adf[1]:.4f}")

result_kpss = kpss(series, regression="c", nlags="auto")
print(f"KPSS stat: {result_kpss[0]:.4f}, p-value: {result_kpss[1]:.4f}")

A p-value below 0.05 on ADF rejects the unit-root null, which is evidence of stationarity. A p-value below 0.05 on KPSS rejects the stationarity null, which is evidence of non-stationarity.

Run both on the raw series; if you need to difference, run both again on the differenced series to confirm the transformation achieved its goal.
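Because the two tests have opposite null hypotheses, their p-values combine into four cases. The helper below is a hypothetical sketch of that decision table; the function name, the 0.05 threshold, and the wording of the labels are assumptions, not part of statsmodels.

```python
def interpret(adf_p, kpss_p, alpha=0.05):
    """Combine ADF and KPSS p-values into a rough diagnosis."""
    adf_rejects = adf_p < alpha     # evidence against a unit root
    kpss_rejects = kpss_p < alpha   # evidence against stationarity

    if adf_rejects and not kpss_rejects:
        return "stationary"
    if not adf_rejects and kpss_rejects:
        return "non-stationary: difference and re-test"
    if adf_rejects and kpss_rejects:
        return "conflict: check for breaks or a near-unit root"
    return "inconclusive: sample may be too short"

print(interpret(0.001, 0.10))   # both tests agree on stationarity
print(interpret(0.600, 0.01))   # both tests agree on non-stationarity
```

The inputs would be the second element of each result tuple returned by adfuller and kpss.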

The ARIMA(p,d,q) anatomy

The three parameters of an ARIMA model each handle a distinct structural piece of the series.

The three components of ARIMA(p,d,q): AR captures autocorrelation from p past values, I removes trend via d-order differencing, and MA absorbs q past forecast errors. Together they produce a forecast on the stationary, differenced series.

d (the integration order) is the number of times you must difference the series to achieve stationarity. This is what the ADF/KPSS battery determines. d=0 means the raw series is already stationary; d=1 is the most common case (one round of first differencing); d=2 is occasionally needed for strongly curved trends.

p (the AR order) controls how many past values of the differenced series feed into the current prediction. This models persistence in the stationary residuals — the autoregressive structure that remains after removing trend and seasonality. The partial autocorrelation function (PACF) of the differenced series is the diagnostic tool for choosing p.

q (the MA order) controls how many past forecast errors feed into the current prediction. This models how quickly the series recovers from shocks. The autocorrelation function (ACF) of the differenced series guides the choice of q.
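To illustrate why the ACF guides the choice of q, the sketch below simulates an MA(1) process and computes its sample autocorrelations with plain NumPy. The function sample_acf is a hypothetical helper written for this example, and theta, the sample size, and the seed are arbitrary.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation at lags 0..max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = x @ x
    lags = [(x[:-k] @ x[k:]) / denom for k in range(1, max_lag + 1)]
    return np.array([1.0] + lags)

rng = np.random.default_rng(3)
theta, n = 0.5, 20000
eps = rng.normal(size=n + 1)
ma1 = eps[1:] + theta * eps[:-1]   # MA(1): y_t = eps_t + theta * eps_{t-1}

acf = sample_acf(ma1, 5)
print("sample ACF, lags 0 to 5:", acf.round(3))
print("theoretical lag-1 value:", theta / (1 + theta**2))
```

The autocorrelation is clearly non-zero at lag 1 and close to zero from lag 2 onward. That sharp cut-off after lag q is the signature to look for when reading an ACF plot.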

Over-differencing: a real risk

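Differencing a series that is already stationary does harm rather than good: it inflates the variance and injects artificial negative autocorrelation that the MA terms then have to absorb. A practical guard: after each round of differencing, compute the standard deviation of the result. If it is larger than it was before that round, stop and step back one order, because you have over-differenced. Running ADF and KPSS after every round catches the same problem, since they will already have reported stationarity one differencing step earlier.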
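The standard-deviation guard can be sketched on a simulated random walk, a series that needs exactly one difference. The seed and length are arbitrary, and picking the order with the smallest spread is a rule of thumb rather than a formal test.

```python
import numpy as np

rng = np.random.default_rng(4)
walk = np.cumsum(rng.normal(size=2000))   # needs exactly one difference

# Standard deviation after 0, 1, and 2 rounds of differencing.
stds = {d: np.diff(walk, n=d).std() for d in range(3)}
best_d = min(stds, key=stds.get)

for d, s in stds.items():
    print(f"d={d}: std = {s:.3f}")
print("order with the smallest spread:", best_d)
```

The spread drops sharply from d=0 to d=1 and then rises again at d=2, which is the warning sign that the second difference was one too many.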

Log and Box-Cox transforms for variance stabilization

Differencing handles non-constant mean. Non-constant variance requires a separate transformation applied before differencing.

The log transform is the most common choice. If the series is strictly positive and its variance grows roughly proportionally to its level (heteroscedasticity that scales with the series), log(y_t) compresses the high-level, high-variance region and expands the low-level, low-variance region until variance is approximately constant.
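A quick sketch of the effect, assuming a synthetic series with exponential growth and multiplicative noise. The growth rate, noise scale, and the helper half_std_ratio are all invented for the illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1000
t = np.arange(n)
# Exponential growth with multiplicative noise, so the spread scales with the level.
y = np.exp(0.005 * t + 0.1 * rng.normal(size=n))

def half_std_ratio(x):
    """Std of period-to-period changes: second half divided by first half."""
    changes = np.diff(x)
    mid = len(changes) // 2
    return changes[mid:].std() / changes[:mid].std()

raw_ratio = half_std_ratio(y)
log_ratio = half_std_ratio(np.log(y))

print(f"raw series, late/early spread: {raw_ratio:.2f}")
print(f"log series, late/early spread: {log_ratio:.2f}")
```

On the raw scale the fluctuations in the second half are many times larger than in the first; after taking logs the ratio sits close to 1.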

The Box-Cox family generalizes this:

from scipy.stats import boxcox
import numpy as np

y_transformed, lam = boxcox(series)
print(f"Optimal lambda: {lam:.3f}")
# lambda near 0 => log transform
# lambda near 0.5 => square-root transform
# lambda near 1 => no transform needed

Apply the transform, verify variance is stable across time windows, then apply differencing. Both transformations must be inverted when you convert forecasts back to the original scale — a step that is easy to forget and catastrophic to omit.
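The inversion step for Box-Cox can be sketched as a round trip using scipy.special.inv_boxcox. The input here is simulated, strictly positive data; the distribution parameters and seed are arbitrary.

```python
import numpy as np
from scipy.special import inv_boxcox
from scipy.stats import boxcox

rng = np.random.default_rng(6)
# Strictly positive, right-skewed data (Box-Cox requires positive values).
y = np.exp(rng.normal(loc=3.0, scale=0.5, size=500))

y_bc, lam = boxcox(y)            # lambda chosen by maximum likelihood
y_back = inv_boxcox(y_bc, lam)   # undo the transform with the same lambda

print(f"fitted lambda: {lam:.3f}")
print("round trip recovers the data:", np.allclose(y_back, y))
```

The fitted lambda has to be stored alongside the model, because the same value is needed to bring forecasts back to the original scale.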

The practical recipe: plot → test → difference → re-test → model → invert

Working through a real forecasting problem with stationarity in mind follows six steps:

Plot the raw series. Look for trend, changing variance, and obvious seasonality. This tells you what transformations to expect, not which ones to apply.
Apply a variance-stabilizing transform if the spread of the series grows over time. Log is the default choice for positive data with multiplicative seasonality.
Run ADF and KPSS on the (possibly transformed) series. If ADF fails to reject (large p-value) and KPSS rejects (small p-value), proceed to differencing.
Difference once. For seasonal data with period s, combine first differencing and seasonal differencing: diff(series, 1) then diff(result, s) — or diff(series, s) first if the seasonal component is dominant.
Re-run ADF and KPSS on the differenced series. Both should now confirm stationarity. If not, difference again, but verify you are not over-differencing.
Fit the ARIMA on the stationary series. Choose p from the PACF, q from the ACF. Evaluate on a held-out window using MAE or RMSE. When generating predictions, invert all transforms in the reverse order of application — difference inversion first, then Box-Cox or log inversion.

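from statsmodels.tsa.arima.model import ARIMA

# series_stationary: the once-differenced series; p and q chosen from the PACF and ACF
model = ARIMA(series_stationary, order=(p, 0, q))
result = model.fit()
forecast_diff = result.forecast(steps=12)

# Invert differencing: cumulative sum from last known value
# (if a log or Box-Cox transform was applied first, take last_val from the
# transformed series and invert that transform afterwards)
last_val = series_original.iloc[-1]
forecast_original = last_val + forecast_diff.cumsum()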

Notice d=0 in the ARIMA call above: the differencing was done manually before fitting so that the inversion logic is explicit and auditable. Some practitioners let ARIMA handle differencing internally (d=1); statsmodels does support this, but folding the steps together makes the inversion harder to inspect.
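When a log transform precedes the differencing, the inversion has two stages. The sketch below uses a made-up positive series and hypothetical forecast values for the differenced log series; no model is fitted, so only the order of the inversion is demonstrated.

```python
import numpy as np

y = np.array([100.0, 104.0, 110.0, 108.0, 115.0, 123.0])   # made-up positive series

log_y = np.log(y)
z = np.diff(log_y)   # what the model would be trained on

# Pretend these are forecasts of z for the next three periods (hypothetical values).
z_forecast = np.array([0.02, 0.01, 0.03])

# Step 1: undo the differencing on the log scale, anchored at the last observed log level.
log_forecast = log_y[-1] + np.cumsum(z_forecast)
# Step 2: undo the log.
y_forecast = np.exp(log_forecast)

# Sanity check: the same two steps applied to z itself must reproduce the history.
y_rebuilt = np.exp(log_y[0] + np.cumsum(z))

print("forecast on the original scale:", y_forecast.round(2))
print("history reproduced:", np.allclose(y_rebuilt, y[1:]))
```

Rebuilding the known history from its own differences is a cheap test that the inversion code is anchored at the right value and applied in the right order.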

Frequently asked questions

Does stationarity mean the series looks flat?

Not quite. A stationary series can oscillate, spike, and autocorrelate — it just cannot drift systematically upward or downward, and its variance cannot grow or shrink over time. White noise is trivially stationary. A mean-reverting oscillation is stationary. A slow upward drift is not.

What if ADF and KPSS disagree?

Disagreement (ADF rejects but KPSS also rejects, or neither rejects) usually signals a near-unit-root situation, a structural break, or a short sample. Inspect the series visually and check for level shifts. With a structural break present, consider segmenting the series or using a model that accommodates breaks rather than forcing a single differencing order.

Can I use differencing on any frequency of data — daily, weekly, monthly?

Yes. The lag used for seasonal differencing should match the periodicity of the dominant seasonal cycle: s=7 for day-of-week patterns in daily data, s=12 for month-of-year patterns in monthly data, s=4 for quarterly data. Multiple seasonal periods (daily data with both weekly and annual patterns) require more sophisticated approaches such as TBATS or STL decomposition before ARIMA fitting.

Does ARIMA handle all forms of non-stationarity?

ARIMA handles stochastic trends (unit roots) and deterministic polynomial trends through differencing. It does not handle explosive processes (variance growing without bound faster than a random walk), long-memory processes (where autocorrelations decay very slowly), or conditional heteroscedasticity (variance that changes over time). For the last of these, ARCH/GARCH models are the standard tool. Always match the model class to the actual structure in the data.