How do you choose p, d, and q for an ARIMA model?
Choose d by differencing until the ADF test rejects a unit root (the series looks stationary); choose a candidate p from the PACF cutoff and a candidate q from the ACF cutoff on the differenced series; then confirm with AIC or BIC to guard against over-fitting. In practice, an automated grid search over a small range of candidates, scored with information criteria, is more reliable than visual inspection alone.
How to think about it
Walk through the three steps in order: d first (stationarity test), then candidate p and q (PACF/ACF plots), then confirmation (AIC/BIC). Interviewers want a principled workflow, not “I just try a few values.”
Step 1 — choose d (differencing order)
Run the ADF test on the raw series. Its null hypothesis is a unit root (non-stationarity), so a p-value above 0.05 means non-stationarity cannot be rejected: apply first-order differencing and retest. Repeat until the test rejects the null. Most economic and business series need d=0 or d=1; d=2 is rare and risks over-differencing.
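from statsmodels.tsa.stattools import adfuller

def find_d(series, max_d=2):
    """Return (d, differenced series) for the smallest d that passes the ADF test."""
    current = series.dropna()
    for d in range(max_d + 1):
        p_val = adfuller(current)[1]  # index 1 of the result is the p-value
        if p_val < 0.05:  # unit-root null rejected: treat as stationary
            return d, current
        if d < max_d:
            # Difference once more, but never beyond max_d
            current = current.diff().dropna()
    # No order passed the test; return the series differenced max_d times
    return max_d, current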
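The over-differencing risk mentioned above can be made concrete: differencing a series that is already stationary tends to push its lag-1 autocorrelation toward -0.5. The following is a minimal sketch under stated assumptions, not part of the workflow itself: it uses only the standard library, a simulated white-noise series, and hypothetical helper names (`difference`, `lag1_autocorr`).

```python
import random

def difference(x):
    """First difference: x[t] - x[t-1]."""
    return [b - a for a, b in zip(x, x[1:])]

def lag1_autocorr(x):
    """Sample lag-1 autocorrelation of a sequence."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    return sum(dev[t] * dev[t - 1] for t in range(1, n)) / sum(v * v for v in dev)

random.seed(0)
# Simulated white noise: already stationary, so d=0 would be correct here
noise = [random.gauss(0, 1) for _ in range(2000)]

r1_raw = lag1_autocorr(noise)
r1_diff = lag1_autocorr(difference(noise))
print(f"lag-1 autocorrelation, raw series: {r1_raw:.2f}")
print(f"lag-1 autocorrelation, after one unnecessary difference: {r1_diff:.2f}")
```

A strongly negative lag-1 spike near -0.5 on the differenced series is a common hint that d is one step too high.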
Step 2 — choose p and q (visual heuristic)
On the stationary (differenced) series:
- Plot the PACF: the last lag with a spike outside the confidence band, before the spikes drop inside it, gives a candidate p.
- Plot the ACF: the last lag with a spike outside the confidence band, before the spikes drop inside it, gives a candidate q.
These are starting candidates, not hard answers.
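The cutoff rule can be sketched numerically. The example below is illustrative only: it computes the sample ACF and PACF in pure Python (the PACF via the Durbin-Levinson recursion) and reads the cutoff against an approximate 95% band of ±1.96/√n, which is a large-sample white-noise assumption; plotting libraries may compute the band slightly differently. The helper names (`sample_acf`, `sample_pacf`, `cutoff_lag`) and the simulated AR(1) series are hypothetical.

```python
import math
import random

def sample_acf(x, nlags):
    """Sample autocorrelations r_0..r_nlags."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(v * v for v in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
            for k in range(nlags + 1)]

def sample_pacf(x, nlags):
    """Partial autocorrelations via the Durbin-Levinson recursion."""
    r = sample_acf(x, nlags)
    pacf, phi_prev = [1.0], []
    for k in range(1, nlags + 1):
        num = r[k] - sum(phi_prev[j] * r[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(phi_prev[j] * r[j + 1] for j in range(k - 1))
        phi_kk = num / den
        # Update the order-k AR coefficients from the order-(k-1) ones
        phi_prev = [phi_prev[j] - phi_kk * phi_prev[k - 2 - j]
                    for j in range(k - 1)] + [phi_kk]
        pacf.append(phi_kk)
    return pacf

def cutoff_lag(values, n_obs, z=1.96):
    """Last lag in the initial run of spikes outside the +/- z/sqrt(n) band."""
    band = z / math.sqrt(n_obs)
    lag = 0
    for k in range(1, len(values)):
        if abs(values[k]) <= band:
            break
        lag = k
    return lag

random.seed(1)
# Simulated AR(1) series with coefficient 0.7 (an assumption for the demo)
x = [0.0]
for _ in range(999):
    x.append(0.7 * x[-1] + random.gauss(0, 1))

acf_vals = sample_acf(x, 10)
pacf_vals = sample_pacf(x, 10)
candidate_p = cutoff_lag(pacf_vals, len(x))
candidate_q = cutoff_lag(acf_vals, len(x))
print("candidate p from PACF:", candidate_p)
print("candidate q from ACF:", candidate_q)
```

For an AR process the ACF decays gradually instead of cutting off, so the ACF-based candidate here tends to be large and should not be taken at face value. That is one reason the visual candidates are only a starting point.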
Step 3 — confirm with AIC/BIC
Fit models over a small grid (e.g., p, q ∈ {0,1,2}) and pick the one with the lowest AIC (prefers fit) or BIC (penalises complexity more).
import itertools
import warnings

from statsmodels.tsa.arima.model import ARIMA

# Assumes `train` is the training series and `d` is the order found in Step 1.
best_aic, best_order = float("inf"), None
for p, q in itertools.product(range(3), repeat=2):
    try:
        with warnings.catch_warnings():
            warnings.simplefilter("ignore")  # silence convergence warnings
            aic = ARIMA(train, order=(p, d, q)).fit().aic
    except Exception:
        continue  # skip orders that fail to fit
    if aic < best_aic:
        best_aic, best_order = aic, (p, d, q)

print("Best order:", best_order, "AIC:", round(best_aic, 2))
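To see why AIC and BIC can disagree in Step 3, it helps to compute both by hand. The sketch below uses the standard definitions AIC = 2k − 2 ln L and BIC = k ln(n) − 2 ln L, where k is the number of estimated parameters and n the number of observations. The log-likelihoods and parameter counts are invented for illustration, not results from any real fit.

```python
import math

def aic(log_lik, k):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    """Bayesian information criterion: k ln(n) - 2 ln L."""
    return k * math.log(n) - 2 * log_lik

# Hypothetical fits on n = 200 observations (numbers invented for illustration)
n = 200
small = {"log_lik": -500.0, "k": 3}  # e.g. two ARMA coefficients plus a variance
large = {"log_lik": -497.0, "k": 5}  # e.g. four ARMA coefficients plus a variance

for name, m in [("small", small), ("large", large)]:
    print(name,
          "AIC:", round(aic(m["log_lik"], m["k"]), 2),
          "BIC:", round(bic(m["log_lik"], m["k"], n), 2))
```

With these made-up numbers, AIC favours the larger model while BIC favours the smaller one, because BIC's per-parameter penalty ln(n) exceeds AIC's penalty of 2 once n is 8 or more.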
Common pitfalls
| Mistake | Effect |
|---|---|
| Choosing d without testing | Under- or over-differencing; spurious or noisy series |
| Reading ACF/PACF on the raw (non-stationary) series | Misleading plots — all lags appear correlated |
| Using only AIC, ignoring residual diagnostics | Model fits history but residuals are still autocorrelated |
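The last pitfall can be checked with a Ljung-Box test on the fitted model's residuals. In practice one would likely use `acorr_ljungbox` from `statsmodels.stats.diagnostic`; the standard-library sketch below only makes the statistic explicit. It assumes a 5% significance level, degrees of freedom equal to the number of lags minus p + q, and a small hard-coded table of chi-square critical values. The helper names and the simulated "bad" residuals are hypothetical.

```python
import random

# 5% chi-square critical values for a few degrees of freedom (standard table values)
CHI2_CRIT_5PCT = {8: 15.507, 9: 16.919, 10: 18.307}

def ljung_box_q(resid, lags):
    """Ljung-Box Q statistic: n(n+2) * sum_k r_k^2 / (n - k)."""
    n = len(resid)
    mean = sum(resid) / n
    dev = [v - mean for v in resid]
    c0 = sum(v * v for v in dev)
    total = 0.0
    for k in range(1, lags + 1):
        r_k = sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
        total += r_k * r_k / (n - k)
    return n * (n + 2) * total

def residuals_look_white(resid, lags=10, n_arma_params=0):
    """True if Q is below the 5% critical value with lags - (p + q) degrees of freedom."""
    df = lags - n_arma_params
    return ljung_box_q(resid, lags) < CHI2_CRIT_5PCT[df]

random.seed(2)
# Hypothetical "bad" residuals: leftover AR(1) structure the model failed to capture
bad = [0.0]
for _ in range(499):
    bad.append(0.6 * bad[-1] + random.gauss(0, 1))

print("Q statistic:", round(ljung_box_q(bad, 10), 1))
print("passes Ljung-Box at 5%?", residuals_look_white(bad, lags=10, n_arma_params=2))
```

If the residuals fail this check, the selected order has left structure unexplained, and it is worth revisiting p and q even when the AIC looked best.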