Moving average (MA)
How MA(q) models the lingering effect of past shocks — and why it is nothing like a rolling-average smoother.
What you'll learn
- MA(q) formula: y_t equals the mean plus the current shock plus a weighted sum of the q previous shocks
- Why MA captures short-lived surprise effects while AR captures level persistence
- ACF cuts off after lag q: the signature diagnostic that identifies an MA process
The MA(q) model in plain language
An autoregressive model says: today’s value is a weighted sum of recent past values plus noise. If sales were high yesterday they pull today’s sales up.
A moving-average model says something completely different: today’s value is the long-run mean plus the current surprise plus a weighted sum of recent past surprises, not past levels. A surprise, also called a shock or innovation, is the gap between what actually happened and what the model expected. Formally it is the white-noise error term, often written e_t. White noise means zero mean, constant variance, and no autocorrelation across time; a common, stronger assumption is that each e_t is an independent draw from the same distribution.
The MA(q) model:
y_t = mu + e_t + theta_1 * e_(t-1) + theta_2 * e_(t-2) + ... + theta_q * e_(t-q)
- mu is the unconditional mean of the series.
- e_t is the current shock (white noise, unknown until time t).
- e_(t-1) through e_(t-q) are the q most recent past shocks.
- theta_1 through theta_q are the MA coefficients (the weights on past shocks).
The order q is the “memory” of the model: how many past shocks still influence the present.
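The formula translates almost directly into NumPy. What follows is a minimal sketch, not a reference implementation: the helper name `simulate_ma` and its arguments are hypothetical, and the shocks are assumed to be Gaussian, which the MA model itself does not require.

```python
import numpy as np

def simulate_ma(thetas, n, mu=0.0, sigma=1.0, seed=0):
    """Simulate n observations of an MA(q) process (hypothetical helper).

    thetas: sequence [theta_1, ..., theta_q] of weights on past shocks.
    Assumption: Gaussian white-noise shocks with standard deviation sigma.
    """
    thetas = np.asarray(thetas, dtype=float)
    q = len(thetas)
    rng = np.random.default_rng(seed)
    # Draw q extra shocks so the first observation has a full shock history.
    e = rng.normal(0.0, sigma, size=n + q)
    # Weights on e_t, e_(t-1), ..., e_(t-q); the current shock has weight 1.
    weights = np.concatenate(([1.0], thetas))
    # Convolution computes sum_j weights[j] * e[t - j]; "valid" keeps only
    # the positions where all q past shocks are available.
    y = mu + np.convolve(e, weights, mode="valid")
    return y, e

# MA(2) with theta_1 = 0.8, theta_2 = 0.3 and mean 10
y, e = simulate_ma([0.8, 0.3], n=500, mu=10.0)
```

Here `y[0]` is built from `e[2]`, `e[1]`, and `e[0]`, so each observation depends on exactly q + 1 shocks and nothing else.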
Why shocks, not levels?
Think of a central bank making an unexpected interest-rate announcement. That surprise ripples through markets for a day or two, then fades. The series does not stay permanently elevated — it absorbs the shock and returns to baseline. That is an MA signature: a short, sharp, finite-duration effect.
Compare that to AR, where a high value today raises the expected value tomorrow, which in turn raises the one after that, and so on. AR captures persistence in levels — effects can linger for many steps. MA captures persistence of surprises — effects are finite and die out after exactly q lags.
This distinction matters because the real world has both: a shock can push a series away from its mean (MA) while the series simultaneously tends to drift back toward it (AR). Combining the two gives ARMA, and ARIMA adds differencing on top to handle non-stationary series.
The big naming confusion
How MA and AR look in a diagram
The diagram below contrasts the two feedback structures. In AR, past values of y loop back as inputs. In MA, past shocks (e) feed forward into the current value — there is no feedback from past y at all.
Top: MA(1) — only past shocks (e) feed into y_t; there is no recurrence on past y. Bottom: AR(1) shown for contrast — past values of y loop back. The two structures are genuinely different, not just reparameterisations.
The ACF signature: cuts off at lag q
Every model leaves a fingerprint in the autocorrelation function (ACF) — the correlation between y_t and y_(t-k) at various lags k.
For an MA(q) process the math works out cleanly: y_t shares error terms with y_(t-1) through y_(t-q) (they overlap in their shock windows), but y_t shares no error terms at all with y_(t-q-1) or anything earlier. That means:
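- The theoretical ACF can be non-zero at lags 1 through q, and is non-zero at lag q itself whenever theta_q is non-zero.
- The theoretical ACF is exactly zero at lag q+1 and beyond; in a finite sample the estimated ACF is only approximately zero there.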
This sharp cutoff is the fingerprint. If you plot an ACF and see significant bars only for the first q lags, suspect an MA(q) process.
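The overlap argument can be checked numerically. The sketch below computes the theoretical autocorrelations of an MA(q) process from its coefficients by summing the products of weights on shared shocks; the function name `ma_theoretical_acf` is a hypothetical helper introduced here for illustration.

```python
import numpy as np

def ma_theoretical_acf(thetas, max_lag):
    """Theoretical autocorrelations rho_0..rho_max_lag of an MA(q) process."""
    # Weights on e_t, e_(t-1), ..., e_(t-q); the current shock has weight 1.
    psi = np.concatenate(([1.0], np.asarray(thetas, dtype=float)))
    q = len(psi) - 1
    gamma0 = np.sum(psi ** 2)  # variance of y_t divided by the shock variance
    rho = np.zeros(max_lag + 1)
    for k in range(max_lag + 1):
        if k <= q:
            # y_t and y_(t-k) share the shocks e_(t-k), ..., e_(t-q).
            rho[k] = np.sum(psi[k:] * psi[: q + 1 - k]) / gamma0
        # For k > q there are no shared shocks, so rho[k] stays exactly 0.
    return rho

rho = ma_theoretical_acf([0.8], max_lag=4)
# rho is approximately [1, 0.488, 0, 0, 0]
```

For MA(1) with theta_1 = 0.8 the only non-trivial value is at lag 1, equal to 0.8 / (1 + 0.8^2).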
Contrast with AR: AR’s PACF cuts off sharply at lag p, while its ACF decays gradually. MA and AR are mirror images in the ACF/PACF diagnostic table.
| Model | ACF | PACF |
|---|---|---|
| AR(p) | Decays gradually (tails off) | Cuts off after lag p |
| MA(q) | Cuts off after lag q | Decays gradually (tails off) |
| ARMA(p,q) | Tails off | Tails off |
Short memory is a feature, not a bug
An MA process has short memory by construction. No matter how large the theta coefficients are, the effect of any shock vanishes after exactly q steps. This makes MA models appropriate when shocks are transient — news events, measurement errors, one-off disruptions that are absorbed quickly.
AR models, by contrast, can carry a shock much further depending on their coefficients. In a stationary AR(1) the effect of a shock after k steps is phi^k, which shrinks geometrically but never reaches exactly zero, so a near-unit-root AR(1) with phi close to 1 transmits shocks forward for many steps. Choosing between AR and MA (or combining them in ARMA/ARIMA) is largely a question of how fast the series recovers from a push.
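One way to see the difference is to trace a single one-unit shock through each model. The following is an illustrative sketch; the function names and the example values (theta_1 = 0.8, phi = 0.95) are assumptions chosen for the comparison, not part of any library.

```python
import numpy as np

def impulse_response_ma(thetas, horizon):
    """Effect of a one-unit shock at time t on y_(t+h) under MA(q), h = 0..horizon-1."""
    weights = np.concatenate(([1.0], np.asarray(thetas, dtype=float)))
    response = np.zeros(horizon)
    m = min(horizon, len(weights))
    response[:m] = weights[:m]  # 1, theta_1, ..., theta_q, then exactly zero
    return response

def impulse_response_ar1(phi, horizon):
    """Effect of a one-unit shock at time t on y_(t+h) under AR(1): phi ** h."""
    return phi ** np.arange(horizon)

ma = impulse_response_ma([0.8], horizon=10)   # gone after 1 step
ar = impulse_response_ar1(0.95, horizon=10)   # still sizeable after 9 steps

for h in range(10):
    print(f"h={h}: MA(1) {ma[h]:.3f}   AR(1) {ar[h]:.3f}")
```

The MA(1) column is zero from h = 2 onward, while the AR(1) column only shrinks by 5% per step.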
Simulating and visualising MA(1)
The playground below simulates 300 observations from an MA(1) process (y_t = e_t + 0.8 * e_(t-1)), plots the series, and shows the sample ACF so you can see the cutoff qualitatively.
In the ACF panel, apart from the trivial lag-0 bar (always equal to 1), you should see one prominent bar at lag 1 that stands clearly above the confidence band, with all subsequent bars clustering near zero. That is the MA(1) fingerprint. For theta_1 = 0.8 the theoretical lag-1 autocorrelation is 0.8 / (1 + 0.8^2), about 0.49. If you change theta1 to a negative value (say -0.8) the lag-1 bar flips to negative, but the cutoff pattern stays the same.
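The same experiment can be sketched with NumPy alone. This is an assumption-laden stand-in rather than the playground's actual code: the seed is arbitrary, `sample_acf` is a hypothetical helper, and the band 1.96 / sqrt(n) is the usual approximate 95% band under a white-noise null.

```python
import numpy as np

def sample_acf(y, max_lag):
    """Sample autocorrelations at lags 0..max_lag (hypothetical helper)."""
    y = np.asarray(y, dtype=float)
    d = y - y.mean()
    denom = np.sum(d ** 2)
    return np.array(
        [np.sum(d[k:] * d[: len(d) - k]) / denom for k in range(max_lag + 1)]
    )

rng = np.random.default_rng(42)      # arbitrary seed, for reproducibility only
n, theta1 = 300, 0.8
e = rng.normal(0.0, 1.0, size=n + 1)
y = e[1:] + theta1 * e[:-1]          # y_t = e_t + 0.8 * e_(t-1)

acf_hat = sample_acf(y, max_lag=10)
band = 1.96 / np.sqrt(n)             # approximate 95% band

for k in range(1, 11):
    flag = "outside band" if abs(acf_hat[k]) > band else ""
    print(f"lag {k:2d}: {acf_hat[k]: .3f} {flag}")
```

Because the band is a 95% band, an occasional higher lag can poke outside it by chance; the pattern to look for is a large lag-1 value followed by small, unsystematic ones.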
Fitting an MA model with statsmodels
The ARIMA class in statsmodels fits any combination of AR, differencing, and MA orders. To fit a pure MA(q) you set the AR order to 0 and the differencing order to 0:
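```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Simulate 300 observations of y_t = e_t + 0.8 * e_(t-1)
np.random.seed(0)
n = 300
theta1 = 0.8
shocks = np.random.normal(0, 1, size=n + 1)  # one extra draw for the initial shock
y = shocks[1:] + theta1 * shocks[:-1]

# Fit MA(1): order=(AR=0, I=0, MA=1)
model = ARIMA(y, order=(0, 0, 1))
result = model.fit()
print(result.summary())

# The fitted MA coefficient should be close to 0.8.
# y is a NumPy array, so result.params is a plain array without labels;
# pair it with the parameter names to look the coefficient up by name.
params = dict(zip(model.param_names, result.params))
print("Fitted theta_1:", params["ma.L1"])
```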
MA in the context of ARIMA
ARIMA(p, d, q) combines three ideas:
- p — how many past values to include (AR part).
- d — how many times to difference to achieve stationarity (I part).
- q — how many past shocks to include (MA part).
A pure MA(q) is the special case ARIMA(0, 0, q). In practice, many real series need both AR and MA terms because levels persist (AR) and shocks reverberate for a few steps (MA). The combined model handles both at once, which is why ARIMA is the default workhorse for univariate time series forecasting.