When would you use MAPE versus MASE to evaluate a forecast, and what are the failure modes of each?

MAPE (Mean Absolute Percentage Error) is intuitive and scale-free, but it breaks when actuals are at or near zero, and it penalises over-forecasts more heavily than under-forecasts, which favours models that forecast low. MASE (Mean Absolute Scaled Error) avoids both problems by scaling errors against the in-sample error of a naive (or seasonal-naive) benchmark, so it stays defined when actuals are zero and is comparable across series with different scales.

How do you decide if a new model is actually better in production?

Offline metrics often don't predict business impact, so you run a controlled online experiment: split live traffic between the current champion and the new challenger and compare a pre-registered business metric with a statistical significance test. You size the test for adequate power, watch guardrail metrics like latency and errors, and only ship if the lift is statistically and practically significant. Variance-reduction techniques like CUPED let you reach significance faster.

What is walk-forward validation, and why is it the correct cross-validation strategy for time series?

Walk-forward validation (also called time-series cross-validation or expanding-window CV) creates successive train/test folds where each fold's test set is always strictly in the future relative to its training set. It mimics real deployment — you fit on what you knew then and evaluate on what happened next — unlike random k-fold, which lets future data contaminate training.

How do you monitor a model when ground-truth labels are delayed or never arrive?

When true labels are unavailable or arrive weeks late, you monitor leading indicators instead: input distribution drift, output score distribution shift, proxy business metrics, and inter-model disagreement. These act as early-warning signals before any labelled evaluation becomes possible.

Evaluating forecasts (walk-forward) — Time Series

The section gave you five ways to forecast — ARIMA, SARIMA, ETS, Prophet, gradient-boosted features — and left the deciding question open: which is best? It also kept warning how easy it is to fool yourself with a shuffled split or a leaked feature. This final lesson is the honest measuring stick: the right error metric, a baseline you must beat, and a validation scheme that never lets the future leak into the past.

Why naive metrics on random splits are meaningless

Imagine reporting that your model achieves 94 % accuracy on a test set — then revealing you shuffled all observations and used random 5-fold cross-validation. A colleague who knows time series would immediately ask: “Did your training data include observations from after some of the test observations?” If the answer is yes, you leaked the future into the past, and the metric is worthless.

This is not a minor technicality. A model that “saw” future data during training can appear to forecast brilliantly while failing completely on real deployments. Fixing this requires a different validation strategy — walk-forward backtesting — covered later in this lesson.

The second trap is reporting a metric with no comparison. An error of 42 units sounds precise, but if a child with no domain knowledge could achieve an error of 41 by guessing “tomorrow equals today,” you have not demonstrated any skill. You need a baseline to beat.

Error metrics

All four standard metrics below measure how far your forecasts stray from the true values. Each captures a different aspect of forecast quality.

MAE — Mean Absolute Error

MAE averages the absolute differences between predictions and actuals:

MAE = mean(|actual − forecast|)

MAE is in the same units as your data. It weights every error in proportion to its size, with no extra penalty for large ones. That makes it robust to occasional large misses: one spectacular outlier does not dominate the score.

RMSE — Root Mean Squared Error

RMSE squares each error before averaging, then takes the square root:

RMSE = sqrt(mean((actual − forecast)²))

Because squaring amplifies large errors, RMSE penalizes big misses more than MAE does. Use RMSE when large errors are disproportionately costly in your domain (inventory shortfalls, safety margins). RMSE is always greater than or equal to MAE; the two are equal only when every error has the same magnitude, and the gap widens as the error sizes become more uneven.
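A minimal sketch of the difference, using made-up numbers: both forecasts below miss by a total of 8 units, but one spreads the error evenly and the other concentrates it in a single point. The function names mae and rmse and the data are illustrative assumptions, not part of the lesson's code.

```python
import math

def mae(actual, forecast):
    """Mean absolute error: average of |actual - forecast|."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

def rmse(actual, forecast):
    """Root mean squared error: square root of the average squared error."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))

# Hypothetical data: both forecasts have a total absolute error of 8.
actual = [10.0, 10.0, 10.0, 10.0]
even_misses = [12.0, 8.0, 12.0, 8.0]     # four misses of 2 units each
one_big_miss = [10.0, 10.0, 10.0, 18.0]  # a single miss of 8 units

print(mae(actual, even_misses), rmse(actual, even_misses))    # 2.0 2.0
print(mae(actual, one_big_miss), rmse(actual, one_big_miss))  # 2.0 4.0
```

MAE cannot tell the two forecasts apart, while RMSE doubles for the one with the concentrated miss.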

MAPE — Mean Absolute Percentage Error

MAPE expresses errors as a fraction of the actual values:

MAPE = mean(|actual − forecast| / |actual|) × 100

MAPE is scale-free and looks intuitive (“my forecasts are off by 8 % on average”), which makes it popular in business reporting. Two serious limitations:

Division by zero: if any actual value is zero, MAPE is undefined, and if any is merely close to zero, it explodes.
Asymmetry: for a non-negative forecast, an under-forecast can never score worse than 100 %, while an over-forecast has no upper limit, so selecting models by MAPE favours those that forecast low.

Use MAPE only when all actual values are comfortably above zero and the asymmetry is acceptable.
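Both failure modes are easy to reproduce. The sketch below uses a hypothetical mape helper and invented numbers; it is an illustration under those assumptions, not a library function.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent.

    Raises ZeroDivisionError if any actual value is exactly zero.
    """
    return 100 * sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast)) / len(actual)

# Near-zero actual: both points miss by 1 unit, but the point with an
# actual of 0.01 contributes a 10,000 % error and swamps the average.
print(mape([100.0, 0.01], [101.0, 1.01]))

# Asymmetry: with a non-negative forecast, under-forecasting is capped.
print(mape([100.0], [0.0]))    # 100.0, the worst possible under-forecast
print(mape([100.0], [400.0]))  # 300.0, over-forecasts have no ceiling
```

A model that systematically forecasts low can therefore look better under MAPE than one that is equally wrong in the other direction on a multiplicative scale.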

MASE — Mean Absolute Scaled Error

MASE is the one metric specifically designed for time series comparison. It divides your model’s MAE by the MAE of the naive in-sample forecast (one-step shift):

MASE = MAE(model) / MAE(naive one-step-ahead on training set)

A MASE < 1 means your model beats the naive baseline on average. A MASE > 1 means even last period’s value would have been a better forecast. MASE handles zeros gracefully (the denominator is a training-set average that is zero only for a constant series), is scale-free, and is symmetric. It is widely used in the forecasting literature, including as one of the accuracy measures in the M4 competition.
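The formula can be sketched in a few lines. The helper name mase and the small series are assumptions chosen so the arithmetic is easy to check by hand; the denominator is the one-step naive MAE on the training set, as in the definition above.

```python
def mase(train, actual, forecast):
    """Model MAE divided by the in-sample MAE of the one-step naive forecast.

    Raises ZeroDivisionError if the training series is constant.
    """
    model_mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    naive_mae = sum(abs(train[i] - train[i - 1]) for i in range(1, len(train))) / (len(train) - 1)
    return model_mae / naive_mae

train = [10.0, 12.0, 11.0, 13.0, 12.0]  # naive one-step errors: 2, 1, 2, 1 -> MAE 1.5
actual = [14.0, 13.0]
forecast = [13.0, 13.5]                 # model errors: 1, 0.5 -> MAE 0.75

print(mase(train, actual, forecast))    # 0.5
```

A value of 0.5 says the model's average miss is half the size of the naive forecast's typical one-step miss on the training data.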

The baselines you must beat

Before trusting any model, compute two baselines:

Naive forecast — the forecast for the next step is simply the most recent observed value. For a series y[t], the naive forecast for y[t+1] is y[t]. This is the cheapest possible forecast and the minimum bar any model should clear.

Seasonal-naive forecast — the forecast for the next step is the value from exactly one season ago. For monthly data with annual seasonality, the forecast for January 2026 is the actual January 2025 value. Seasonal-naive is often surprisingly hard to beat.

If your model does not outperform both baselines across a walk-forward backtest, it has not demonstrated genuine skill.
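Both baselines are one-liners. In the sketch below, the function names, the season_length parameter, and the quarterly series are illustrative assumptions.

```python
def naive_forecast(history):
    """Forecast the next step as the most recent observed value."""
    return history[-1]

def seasonal_naive_forecast(history, season_length):
    """Forecast the next step as the value observed one full season earlier."""
    return history[-season_length]

# Hypothetical quarterly series covering two years (season_length = 4).
history = [20.0, 35.0, 50.0, 30.0,   # year 1: Q1..Q4
           22.0, 37.0, 53.0, 31.0]   # year 2: Q1..Q4

print(naive_forecast(history))              # 31.0, last quarter's value
print(seasonal_naive_forecast(history, 4))  # 22.0, the same quarter one year ago
```

On a strongly seasonal series like this one, the seasonal-naive forecast for the next Q1 is much closer to the earlier Q1 values than the plain naive forecast is.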

Walk-forward (rolling-origin) backtesting

Walk-forward backtesting — also called rolling-origin or time-series cross-validation — is the time-respecting alternative to random K-fold. The idea is simple:

Fix a minimum training size (the “initial window”).
Train the model on all data up to time t.
Forecast one step (or several steps) ahead.
Record the error.
Advance t by one period and repeat.

This mimics exactly how you will use the model in production: always training on the past, predicting the future.

There are two window strategies:

Expanding window — the training set grows with each fold. Every new fold adds the most recent observation to training. The model eventually sees the full history.
Sliding window — the training set stays a fixed length. The oldest observations drop off as newer ones enter. Useful when you suspect older data is less relevant (e.g., after a structural break).

[Figure: expanding-window walk-forward. The training block grows by one observation each fold; the single next point is the test target.]

Each row is one fold. Training always ends before the test point. No future information ever enters training.
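The fold layout for both window strategies can be printed as text. This is a sketch under assumptions: walk_forward_folds and its window parameter are hypothetical names, and the series has only 8 points so the rows fit on screen. "T" marks a training point, "?" the test point, and "." an unused point.

```python
def walk_forward_folds(n_obs, min_train, window=None):
    """Yield (train_indices, test_index) pairs in time order.

    window=None gives an expanding window; an integer gives a sliding
    window of that fixed length.
    """
    for t in range(min_train, n_obs):
        start = 0 if window is None else max(0, t - window)
        yield range(start, t), t

def show(folds, n_obs):
    """Print one row per fold: T = train, ? = test, . = unused."""
    for train_idx, test_idx in folds:
        print("".join("T" if i in train_idx else "?" if i == test_idx else "."
                      for i in range(n_obs)))

print("Expanding window:")
show(walk_forward_folds(n_obs=8, min_train=4), 8)
print("Sliding window (length 4):")
show(walk_forward_folds(n_obs=8, min_train=4, window=4), 8)
```

In every row the "?" sits to the right of all the "T" marks, which is the property that rules out leakage.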

Worked example: walk-forward backtest with naive baseline

The code below generates a synthetic time series, runs an expanding-window walk-forward loop using only the naive forecast (last observed value), computes MAE at each fold, and prints the overall score. Study the loop structure — this is the pattern you will reuse with any forecast model by replacing the single line that computes the prediction.

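import numpy as np

# Synthetic series: linear trend plus Gaussian noise (seeded for reproducibility)
rng = np.random.default_rng(42)
n = 60
trend = np.linspace(0, 10, n)
noise = rng.normal(0, 1.5, n)
y = trend + noise

min_train = 20  # size of the initial training window
errors = []

# Expanding-window walk-forward loop. Note: the range stops at t = n - 2,
# so the final observation y[n - 1] is never used as a test point.
for t in range(min_train, n - 1):
    train = y[:t]   # everything strictly before time t
    actual = y[t]   # the single next point is the test target

    naive_forecast = train[-1]  # swap in any model's prediction here

    error = abs(actual - naive_forecast)
    errors.append(error)

# Aggregate the per-fold absolute errors into one score
mae = np.mean(errors)
print(f"Walk-forward folds evaluated: {len(errors)}")
print(f"Naive baseline MAE:           {mae:.3f}")
print()
print("Per-fold absolute errors (first 10):")
for i, e in enumerate(errors[:10]):
    print(f"  Fold {i+1:2d}: |error| = {e:.3f}")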

Walk-forward folds evaluated: 39
Naive baseline MAE:           1.122

Per-fold absolute errors (first 10):
  Fold  1: |error| = 0.033
  Fold  2: |error| = 0.575
  Fold  3: |error| = 3.025
  Fold  4: |error| = 1.896
  Fold  5: |error| = 0.241
  Fold  6: |error| = 0.284
  Fold  7: |error| = 1.496
  Fold  8: |error| = 0.081
  Fold  9: |error| = 0.240
  Fold 10: |error| = 0.197

The naive baseline scores an MAE of 1.122 across 39 expanding-window folds — and that is the number any real model now has to beat. The loop expands the training window one step at a time (exactly the expanding-window strategy from the diagram); replacing naive_forecast = train[-1] with any model’s prediction gives you a fair backtest on identical folds.

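To compare a second model, run the same loop and compare MAEs. Dividing the model’s MAE by the naive MAE printed above gives a relative MAE on identical folds: below 1 means the model beats naive out of sample. That ratio is closely related to MASE but not the same thing; MASE, as defined earlier, divides by the naive one-step MAE computed on the training set.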
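One way to organise that comparison is to wrap the loop in a function that takes the forecast rule as an argument. The sketch below is an assumption-laden illustration: backtest, naive, and drift are hypothetical names, the drift rule (last value plus the average historical step) stands in for "any second model", and the series is generated with the standard library rather than the NumPy generator used above, so the numbers will differ from the printed output. Unlike the loop above, this version evaluates through the final observation.

```python
import random

def backtest(y, min_train, forecast_fn):
    """Expanding-window walk-forward MAE for a one-step forecast function."""
    errors = []
    for t in range(min_train, len(y)):
        train = y[:t]  # everything strictly before time t
        errors.append(abs(y[t] - forecast_fn(train)))
    return sum(errors) / len(errors)

def naive(train):
    """Baseline: tomorrow equals today."""
    return train[-1]

def drift(train):
    """Hypothetical second model: last value plus the average past step."""
    return train[-1] + (train[-1] - train[0]) / (len(train) - 1)

random.seed(0)
y = [0.2 * t + random.gauss(0, 1.5) for t in range(60)]  # synthetic trend + noise

naive_mae = backtest(y, 20, naive)
drift_mae = backtest(y, 20, drift)
print(f"Naive MAE:                {naive_mae:.3f}")
print(f"Drift MAE:                {drift_mae:.3f}")
print(f"MAE ratio (drift / naive): {drift_mae / naive_mae:.3f}")
```

Because both rules are scored by the same function on the same folds, the ratio is a like-for-like comparison.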

Putting it together: a reporting checklist

When you share forecast results, include:

Which metric(s) you used and why.
Your model’s score and the naive baseline’s score on the same folds.
Which validation strategy you used (walk-forward expanding / sliding, minimum training window size, forecast horizon).
MASE if the audience is technical — it summarizes the skill comparison in a single number.

A result that passes all four points is an honest, reproducible evaluation.

In one breath

A forecast metric means nothing without two things: a baseline and a time-respecting split. Pick a metric for the job: MAE (same units, robust), RMSE (punishes big misses), MAPE (scale-free % but explodes near zero and is asymmetric), or MASE (model MAE ÷ in-sample naive MAE; below 1 = beats naive, scale-free and symmetric, a strong default for time series). Then beat the naive (“tomorrow = today”) and seasonal-naive (“= same period last season”) baselines, or you’ve shown no skill. And never use random K-fold, because it leaks the future into training. Use walk-forward backtesting instead: train on the past, forecast one step, record the error, advance, repeat, with an expanding or sliding window, exactly mirroring production.

Practice

Quick check


Q1. You are evaluating two models. Model A has MAE = 18. Model B has MAE = 22. A colleague says Model A is better. What critical information is missing before you can agree?

Q2. A data scientist applies 5-fold cross-validation by shuffling the time series randomly before splitting. What is the core problem with this approach?

Q3. You build a demand forecast for a retail product. Your model's walk-forward MAE is 31 units. The naive baseline MAE is 28 units. The seasonal-naive baseline MAE is 19 units. What should you conclude?

A question to carry forward

That closes the time-series section. Step all the way back and notice what every model in it shared: they predicted the next value of one sequence over time — tomorrow’s sales, next month’s demand. Powerful, but a narrow shape of problem.

Yet some of the highest-leverage prediction in production isn’t about time at all. When Netflix decides what to show you, it isn’t forecasting a number forward — it’s matching one person, out of millions, to one item, out of millions, by taste. So the question that opens the next section is: how do you predict not “what comes next in time,” but “what would this specific person love that they haven’t seen yet?” The next section is recommender systems — and it begins, fittingly, with a baseline you already know to demand: the dumb popularity list any real recommender has to beat.
