datarekha

Moving average (MA)

How MA(q) models the lingering effect of past shocks — and why it is nothing like a rolling-average smoother.

8 min read Intermediate Time Series Lesson 6 of 14

What you'll learn

  • MA(q) formula: y_t equals the mean plus the current shock plus a weighted sum of the q previous shocks
  • Why MA captures short-lived surprise effects while AR captures level persistence
  • The ACF cuts off after lag q: the signature diagnostic that identifies an MA process


The MA(q) model in plain language

An autoregressive model says: today’s value is a weighted sum of recent past values plus noise. If sales were high yesterday they pull today’s sales up.

A moving-average model says something completely different: today’s value is the long-run mean plus a weighted sum of recent past surprises, not levels. A surprise, also called a shock or innovation, is the gap between what actually happened and what the model expected. Formally it is the white-noise error term, often written e_t. White noise means zero mean, constant variance, and no autocorrelation across time; in the simplest case each e_t is an independent draw from the same distribution.

The MA(q) model:

y_t = mu + e_t + theta_1 * e_(t-1) + theta_2 * e_(t-2) + ... + theta_q * e_(t-q)
  • mu is the unconditional mean of the series.
  • e_t is the current shock (white noise, unknown until time t).
  • e_(t-1) through e_(t-q) are the q most recent past shocks.
  • theta_1 through theta_q are the MA coefficients (the weights on past shocks).

The order q is the “memory” of the model: how many past shocks still influence the present.
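The formula translates almost line for line into code. The following is a minimal sketch using NumPy; the function name `simulate_ma`, the Gaussian unit-variance shocks, and the example values (mu = 10, theta_1 = 0.6, theta_2 = 0.3) are illustrative assumptions, not part of the lesson's own material.

```python
import numpy as np

def simulate_ma(mu, thetas, n, seed=0):
    """Simulate n observations of y_t = mu + e_t + sum_j thetas[j-1] * e_(t-j)."""
    q = len(thetas)
    rng = np.random.default_rng(seed)
    # Draw n + q shocks so the first observation already has q past shocks available.
    e = rng.normal(0.0, 1.0, size=n + q)
    weights = np.concatenate(([1.0], thetas))  # weight 1 on e_t, theta_j on e_(t-j)
    y = np.empty(n)
    for t in range(n):
        # window = [e_t, e_(t-1), ..., e_(t-q)], using an index shifted by q
        window = e[t + q - np.arange(q + 1)]
        y[t] = mu + weights @ window
    return y, e

y, e = simulate_ma(mu=10.0, thetas=[0.6, 0.3], n=500)
print(y[:5])
```

The function returns the shocks alongside the series only so the construction can be checked by hand. With real data the shocks are never observed, which is what makes fitting an MA model harder than fitting an AR model.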

Why shocks, not levels?

Think of a central bank making an unexpected interest-rate announcement. That surprise ripples through markets for a day or two, then fades. The series does not stay permanently elevated — it absorbs the shock and returns to baseline. That is an MA signature: a short, sharp, finite-duration effect.

Compare that to AR, where a high value today raises the expected value tomorrow, which in turn raises the one after that, and so on. AR captures persistence in levels — effects can linger for many steps. MA captures persistence of surprises — effects are finite and die out after exactly q lags.
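One way to see the difference is to trace a single unit shock through each model and record its effect on later values (the impulse response). This is a sketch under assumed coefficients (theta = [0.6, 0.3] for the MA(2), phi = 0.8 for the AR(1)); the helper names are hypothetical.

```python
import numpy as np

def ma_impulse_response(thetas, horizon):
    """Effect on y_(t+k) of a unit shock at time t, for k = 0..horizon, in an MA(q)."""
    response = np.zeros(horizon + 1)
    weights = np.concatenate(([1.0], thetas))
    m = min(len(weights), horizon + 1)
    response[:m] = weights[:m]  # everything past lag q stays exactly zero
    return response

def ar1_impulse_response(phi, horizon):
    """Effect on y_(t+k) of a unit shock at time t in an AR(1): phi ** k."""
    return phi ** np.arange(horizon + 1)

print(ma_impulse_response([0.6, 0.3], horizon=6))
print(ar1_impulse_response(0.8, horizon=6))
```

The MA(2) response is zero from lag 3 onward, while the AR(1) response shrinks geometrically but stays positive at every lag.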

This distinction matters because the real world has both: a shock can push a series away from its mean (MA) while the series simultaneously tends to drift back toward it (AR). Combining the two gives ARMA, and ARIMA adds differencing on top.

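The big naming confusion

Despite the name, an MA(q) model does not average anything. A rolling (moving) average smoother replaces each point with the mean of the last k observed values of y; it is a descriptive filter applied to data you already have. An MA(q) model is a statistical model that expresses y_t in terms of unobserved past shocks. The two share a name and little else.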
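The contrast between a rolling-average smoother and an MA model is easiest to see side by side in code. The following is a minimal sketch using NumPy; the series `observed`, the 7-point window, and theta1 = 0.8 are illustrative assumptions, not values from a real dataset.

```python
import numpy as np

rng = np.random.default_rng(1)
observed = rng.normal(100.0, 5.0, size=30)  # stand-in for any observed series

# Rolling-average smoother: average the last `window` OBSERVED values.
window = 7
smoothed = np.convolve(observed, np.ones(window) / window, mode="valid")

# MA(1) model: y is built from SHOCKS, which are never observed directly in practice.
theta1 = 0.8
shocks = rng.normal(0.0, 1.0, size=31)
ma_series = shocks[1:] + theta1 * shocks[:-1]

print(len(smoothed))   # one smoothed point per full 7-value window
print(len(ma_series))  # one observation per consecutive pair of shocks
```

The smoother takes data as input and returns a smoothed version of the same data. The MA model describes how the data are assumed to have been generated from noise.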

How MA and AR look in a diagram

The diagram below contrasts the two feedback structures. In AR, past values of y loop back as inputs. In MA, past shocks (e) feed forward into the current value — there is no feedback from past y at all.

[Diagram: MA(1) and AR(1) feedback structures. MA(1): e_(t-1) and e_t feed forward into y_t with weights θ₁ and 1, with no loop back to y_(t-1). AR(1), for contrast: y_(t-1) and e_t feed into y_t with weights phi₁ and 1.]

Top: MA(1), where only shocks (e) feed into y_t and there is no recurrence on past y. Bottom: AR(1), shown for contrast, where past values of y loop back. For finite orders the two structures are genuinely different, not just reparameterisations: an invertible MA(1) can only be rewritten as an AR of infinite order.

The ACF signature: cuts off at lag q

Every model leaves a fingerprint in the autocorrelation function (ACF) — the correlation between y_t and y_(t-k) at various lags k.

For an MA(q) process the math works out cleanly: y_t shares error terms with y_(t-1) through y_(t-q) (they overlap in their shock windows), but y_t shares no error terms at all with y_(t-q-1) or anything earlier. That means:

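  • The ACF can be non-zero at lags 1 through q (and is always non-zero at lag q itself when theta_q is non-zero).
  • The theoretical ACF is exactly zero at lag q+1 and beyond; in a sample ACF those bars are only approximately zero and should fall inside the confidence band.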

This sharp cutoff is the fingerprint. If you plot an ACF and see significant bars only for the first q lags, suspect an MA(q) process.
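The overlap argument gives a closed form: with theta_0 = 1, the lag-k autocorrelation is the sum of theta_j * theta_(j+k) over the shared shocks, divided by the sum of squared weights. The sketch below computes it with NumPy; the function name `ma_theoretical_acf` is a hypothetical helper and the coefficient 0.8 is an assumed example value.

```python
import numpy as np

def ma_theoretical_acf(thetas, max_lag):
    """Theoretical autocorrelations rho_0..rho_max_lag of an MA(q) process."""
    w = np.concatenate(([1.0], thetas))  # theta_0 = 1 is the weight on e_t
    q = len(w) - 1
    gamma0 = np.sum(w ** 2)  # variance of y, in units of the shock variance
    rho = np.zeros(max_lag + 1)
    for k in range(max_lag + 1):
        if k <= q:
            # Shocks shared by y_t and y_(t-k) contribute w[j] * w[j + k].
            rho[k] = np.sum(w[: q + 1 - k] * w[k:]) / gamma0
        # For k > q there are no shared shocks, so rho[k] stays 0.
    return rho

# MA(1) with theta_1 = 0.8: lag 1 is 0.8 / 1.64 (about 0.488), lags 2+ are exactly 0.
print(ma_theoretical_acf([0.8], max_lag=4))
```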

Contrast with AR: AR’s PACF cuts off sharply at lag p, while its ACF decays gradually. MA and AR are mirror images in the ACF/PACF diagnostic table.

Model       ACF                            PACF
AR(p)       Decays gradually (tails off)   Cuts off after lag p
MA(q)       Cuts off after lag q           Decays gradually (tails off)
ARMA(p,q)   Tails off                      Tails off
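The MA(q) row of the table can be checked numerically. The sketch below starts from the theoretical ACF of an MA(1) with an assumed theta_1 = 0.8 and derives the PACF by solving the Yule-Walker equations at each order; `pacf_from_acf` is a hypothetical helper written for this illustration, not a library function.

```python
import numpy as np

def pacf_from_acf(rho, max_lag):
    """Partial autocorrelations at lags 1..max_lag from autocorrelations rho[0..max_lag]."""
    pacf = []
    for k in range(1, max_lag + 1):
        # Yule-Walker system of order k: R @ phi = r, with R[i, j] = rho[|i - j|].
        idx = np.abs(np.subtract.outer(np.arange(k), np.arange(k)))
        R = rho[idx]
        r = rho[1 : k + 1]
        phi = np.linalg.solve(R, r)
        pacf.append(phi[-1])  # the last coefficient is the lag-k partial autocorrelation
    return np.array(pacf)

theta1 = 0.8
max_lag = 6
rho = np.zeros(max_lag + 1)
rho[0] = 1.0
rho[1] = theta1 / (1 + theta1 ** 2)  # MA(1): only lag 1 is non-zero

print(np.round(rho[1:], 3))                       # ACF: cuts off after lag 1
print(np.round(pacf_from_acf(rho, max_lag), 3))   # PACF: shrinks gradually, alternating in sign
```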

Short memory is a feature, not a bug

An MA process has short memory by construction. No matter how large the theta coefficients are, the effect of any shock vanishes after exactly q steps. This makes MA models appropriate when shocks are transient — news events, measurement errors, one-off disruptions that are absorbed quickly.

AR models, by contrast, can be far more persistent depending on their coefficients: the effect of a shock in an AR(1) decays geometrically as phi^k and never reaches exactly zero, so a near-unit-root AR(1) with phi close to 1 transmits shocks forward for many steps. Choosing between AR and MA (or combining them in ARMA/ARIMA) is largely a question of how fast the series recovers from a push.
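To put rough numbers on "many steps", the sketch below counts how long a unit shock in an AR(1) takes to shrink below a tolerance. The 1% tolerance, the helper name `steps_until_faded`, and the example phi values are assumptions chosen for illustration.

```python
def steps_until_faded(phi, tol=0.01, max_steps=100_000):
    """Smallest k at which an AR(1) unit shock's effect, |phi| ** k, drops below tol."""
    k = 0
    while abs(phi) ** k >= tol:
        k += 1
        if k > max_steps:
            raise ValueError("shock does not fade: |phi| must be below 1")
    return k

for phi in (0.5, 0.9, 0.99):
    print(phi, steps_until_faded(phi))
```

For an MA(q) the analogous answer is q + 1 whatever the coefficients are, because the effect is exactly zero once the shock leaves the window.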

Simulating and visualising MA(1)

Simulate 300 observations from an MA(1) process (y_t = e_t + 0.8 * e_(t-1)), plot the series, and inspect the sample ACF to see the cutoff qualitatively.

In the ACF panel you should see one prominent bar at lag 1 that stands clearly above the confidence band, with all subsequent bars clustering near zero. That is the MA(1) fingerprint. If you change theta1 to a negative value (say -0.8) the lag-1 bar flips to negative, but the cutoff pattern stays the same.
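The same experiment can be reproduced without a plotting library. The sketch below simulates the MA(1) series with NumPy, computes the sample ACF by hand, and flags lags that fall outside an approximate 95% band of 1.96 / sqrt(n). The seed, the helper `sample_acf`, and the text output in place of a chart are choices made for this illustration.

```python
import numpy as np

def sample_acf(y, max_lag):
    """Sample autocorrelations at lags 0..max_lag (the common biased estimator)."""
    y = np.asarray(y, dtype=float)
    d = y - y.mean()
    denom = np.sum(d ** 2)
    return np.array([np.sum(d[k:] * d[: len(d) - k]) / denom for k in range(max_lag + 1)])

rng = np.random.default_rng(42)
n, theta1 = 300, 0.8
e = rng.normal(0.0, 1.0, size=n + 1)
y = e[1:] + theta1 * e[:-1]  # y_t = e_t + 0.8 * e_(t-1)

acf = sample_acf(y, max_lag=10)
band = 1.96 / np.sqrt(n)  # approximate 95% band under a white-noise null
for k in range(1, 11):
    flag = "outside band" if abs(acf[k]) > band else ""
    print(f"lag {k:2d}: {acf[k]:+.3f} {flag}")
```

Because the band is a 95% interval, an occasional higher lag can cross it by chance. The pattern to look for is one clearly dominant bar at lag 1, not perfect zeros afterwards.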

Fitting an MA model with statsmodels

The ARIMA class in statsmodels fits any combination of AR, differencing, and MA orders. To fit a pure MA(q) you set the AR order to 0 and the differencing order to 0:

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Simulate 300 observations of y_t = e_t + 0.8 * e_(t-1)
np.random.seed(0)
n = 300
theta1 = 0.8
shocks = np.random.normal(0, 1, size=n + 1)
y = shocks[1:] + theta1 * shocks[:-1]

# Fit MA(1): order=(AR=0, I=0, MA=1)
model = ARIMA(y, order=(0, 0, 1))
result = model.fit()

print(result.summary())
# y is a NumPy array, so result.params is a plain array without string labels;
# pair it with param_names to look up the MA coefficient by name.
params = dict(zip(result.param_names, result.params))
# The fitted MA coefficient should be close to 0.8
print("Fitted theta_1:", params["ma.L1"])
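A rough cross-check on the fitted coefficient is to invert the MA(1) relationship rho_1 = theta_1 / (1 + theta_1^2) and see what theta_1 a given lag-1 autocorrelation implies. This method-of-moments shortcut is not what statsmodels does internally (it fits by maximum likelihood); it is only a sanity check, and `theta_from_lag1_acf` is a hypothetical helper.

```python
import numpy as np

def theta_from_lag1_acf(rho1):
    """Invertible MA(1) coefficient implied by rho1 = theta / (1 + theta**2)."""
    if rho1 == 0:
        return 0.0
    if abs(rho1) > 0.5:
        raise ValueError("an MA(1) cannot produce |rho1| > 0.5")
    # Solve rho1 * theta**2 - theta + rho1 = 0 and keep the root with |theta| <= 1.
    return (1 - np.sqrt(1 - 4 * rho1 ** 2)) / (2 * rho1)

# The theoretical lag-1 autocorrelation for theta_1 = 0.8 maps back to roughly 0.8.
print(theta_from_lag1_acf(0.8 / 1.64))
```

Feeding in a sample lag-1 autocorrelation instead of the theoretical one gives a quick estimate that should land in the same neighbourhood as the fitted value, though it is noisier than the maximum-likelihood fit.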

MA in the context of ARIMA

ARIMA(p, d, q) combines three ideas:

  • p — how many past values to include (AR part).
  • d — how many times to difference to achieve stationarity (I part).
  • q — how many past shocks to include (MA part).

A pure MA(q) is the special case ARIMA(0, 0, q). In practice, many real series need both AR and MA terms because levels persist (AR) and shocks reverberate for a few steps (MA). The combined model handles both at once, which is one reason ARIMA is a widely used baseline for univariate time series forecasting.
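To make the special-case relationship concrete, the sketch below simulates an ARMA(1,1) with no differencing (d = 0) and shows that setting the AR coefficient to zero recovers the pure MA(1). The function name `simulate_arma11`, the seed, and the coefficients 0.6 and 0.8 are assumptions chosen for illustration.

```python
import numpy as np

def simulate_arma11(phi, theta, shocks, mu=0.0):
    """y_t - mu = phi * (y_(t-1) - mu) + e_t + theta * e_(t-1), starting from the mean."""
    dev = np.zeros(len(shocks))  # deviation of y from its mean
    for t in range(1, len(shocks)):
        dev[t] = phi * dev[t - 1] + shocks[t] + theta * shocks[t - 1]
    return mu + dev[1:]  # drop the start-up value

rng = np.random.default_rng(7)
shocks = rng.normal(0.0, 1.0, size=301)

arma = simulate_arma11(phi=0.6, theta=0.8, shocks=shocks)     # ARIMA(1, 0, 1)
pure_ma = simulate_arma11(phi=0.0, theta=0.8, shocks=shocks)  # ARIMA(0, 0, 1)

# With phi = 0 the recursion collapses to the MA(1) formula.
print(np.allclose(pure_ma, shocks[1:] + 0.8 * shocks[:-1]))
```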


Quick check

Q1. In an MA(2) model, for how many lags after a shock does that shock continue to directly influence future values of y?
Q2. You compute the sample ACF of a stationary time series and observe significant autocorrelations at lags 1 and 2, with all lags 3 and above near zero. The PACF decays gradually. What model does this suggest?
Q3. A retail analyst uses a 7-day rolling average to smooth daily sales before presenting a trend chart to leadership. A data scientist colleague says this is 'just an MA model.' Is the colleague right?

Practice this in an interview

When would you use MAPE versus MASE to evaluate a forecast, and what are the failure modes of each?

MAPE (Mean Absolute Percentage Error) is intuitive and scale-free but breaks when actuals are near zero and is asymmetric: for non-negative data an under-forecast can cost at most 100%, while an over-forecast is unbounded, so it penalises over-forecasts more heavily. MASE (Mean Absolute Scaled Error) addresses both issues by scaling errors against the in-sample error of a naive (or seasonal naive) benchmark, making it usable even with zero values and comparable across series with different scales.

How do you read ACF and PACF plots, and what do they tell you about AR and MA orders?

The ACF measures correlation between a series and its own lags including indirect effects; the PACF strips out those indirect effects to show direct correlation at each lag. A cut-off in the PACF after lag p signals an AR(p) process; a cut-off in the ACF after lag q signals an MA(q) process.

What is regression to the mean, and why does it fool analysts into seeing treatment effects that do not exist?

Regression to the mean is the statistical tendency for extreme measurements to be followed by measurements closer to the population mean, purely due to random noise — not because of any intervention. Analysts who intervene after observing an extreme value and then observe improvement often incorrectly attribute the recovery to their action.

What does the Adam optimizer do, and what problem does it solve over SGD?

Adam combines momentum (exponential moving average of gradients) with RMSProp-style adaptive per-parameter learning rates (exponential moving average of squared gradients). This means parameters with consistently large gradients get smaller effective steps, and sparse or small-gradient parameters get larger steps. The result is typically fast-converging and less sensitive to learning-rate tuning than vanilla SGD, although Adam still has hyperparameters of its own (learning rate, beta1, beta2, epsilon).
