datarekha

Evaluating forecasts (walk-forward)

How to honestly measure whether your forecast is any good — and prove it beats just guessing yesterday's value.

9 min read Intermediate Time Series Lesson 14 of 14

What you'll learn

  • MAE, RMSE, MAPE, and MASE: what each metric captures and when to prefer one over another
  • Walk-forward (rolling-origin) backtesting: the only time-respecting way to validate a forecast
  • Why every reported metric needs a naive baseline — and how to construct one in three lines of code

Why naive metrics on random splits are meaningless

Imagine reporting that your model achieves 94 % accuracy on a test set — then revealing you shuffled all observations and used random 5-fold cross-validation. A colleague who knows time series would immediately ask: “Did your training data include observations from after some of the test observations?” If the answer is yes, you leaked the future into the past, and the metric is worthless.

This is not a minor technicality. A model that “saw” future data during training can appear to forecast brilliantly while failing completely on real deployments. Fixing this requires a different validation strategy — walk-forward backtesting — covered later in this lesson.
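The leak can be made concrete with index positions alone. The sketch below is a hand-rolled illustration, not a library API: `shuffled_kfold` and `leaks_future` are hypothetical helper names, and the 20-observation series and 5 folds are arbitrary choices. It checks whether any training index falls after the earliest test index.

```python
import random

def shuffled_kfold(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs from a shuffled k-fold split of range(n)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(j for f in folds[:i] + folds[i + 1:] for j in f)
        yield train, test

def leaks_future(train_idx, test_idx):
    """True if any training observation comes after the earliest test observation."""
    return max(train_idx) > min(test_idx)

n_obs = 20

# Shuffled 5-fold: training positions are scattered around the test positions.
leaky = [leaks_future(train, test) for train, test in shuffled_kfold(n_obs, k=5)]
print("shuffled folds that leak:", sum(leaky), "of", len(leaky))

# Time-ordered split: train on the first 15 points, test on the last 5.
cut = 15
ordered_train, ordered_test = list(range(cut)), list(range(cut, n_obs))
print("ordered split leaks:", leaks_future(ordered_train, ordered_test))  # False
```

At most one shuffled fold can avoid the leak (the one whose test set happens to be exactly the last four positions), so at least four of the five folds train on data from after a test point.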

The second trap is reporting a metric with no comparison. An error of 42 units sounds precise, but if a child with no domain knowledge could achieve an error of 41 by guessing “tomorrow equals today,” you have not demonstrated any skill. You need a baseline to beat.

Error metrics

All four standard metrics below measure how far your forecasts stray from the true values. Each captures a different aspect of forecast quality.

MAE — Mean Absolute Error

MAE averages the absolute differences between predictions and actuals:

MAE = mean(|actual − forecast|)

MAE is in the same units as your data. Each error counts in proportion to its size, with no extra penalty for large ones. That makes MAE robust to occasional large misses: one spectacular outlier does not dominate the score. The flip side is that MAE understates rare but costly errors.
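A minimal sketch of the formula in plain Python. The function name `mae` and the four sample values are illustrative assumptions, not data from the lesson.

```python
def mae(actual, forecast):
    """Mean absolute error, in the same units as the data."""
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

actual = [100, 102, 98, 105]     # hypothetical observed values
forecast = [101, 100, 99, 101]   # hypothetical predictions

# Absolute errors are 1, 2, 1, 4, so the mean is 8 / 4.
print(mae(actual, forecast))  # 2.0
```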

RMSE — Root Mean Squared Error

RMSE squares each error before averaging, then takes the square root:

RMSE = sqrt(mean((actual − forecast)²))

Because squaring amplifies large errors, RMSE penalizes big misses more than MAE does. Use RMSE when large errors are disproportionately costly in your domain (inventory shortfalls, safety margins). RMSE is always greater than or equal to MAE. The two are equal only when every error has the same magnitude, and the gap widens as the error magnitudes become more uneven.
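The difference from MAE is easiest to see on two forecasts with the same total absolute error. In the sketch below (function names and numbers are illustrative assumptions), one forecast misses by 2 on every point and the other misses by 8 on a single point.

```python
import math

def mae(actual, forecast):
    """Mean absolute error."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

def rmse(actual, forecast):
    """Root mean squared error: square, average, then take the square root."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))

actual = [10, 10, 10, 10]
even_misses = [12, 8, 12, 8]      # four errors of size 2
one_big_miss = [10, 10, 10, 18]   # a single error of size 8

print(mae(actual, even_misses), rmse(actual, even_misses))    # 2.0 2.0
print(mae(actual, one_big_miss), rmse(actual, one_big_miss))  # 2.0 4.0
```

MAE cannot tell the two forecasts apart. RMSE doubles for the one with the concentrated miss.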

MAPE — Mean Absolute Percentage Error

MAPE expresses errors as a fraction of the actual values:

MAPE = mean(|actual − forecast| / |actual|) × 100

MAPE is scale-free and looks intuitive (“my forecasts are off by 8 % on average”), which makes it popular in business reporting. Two serious limitations:

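  1. Division by zero: if any actual value is zero, MAPE is undefined, and if any actual value is near zero, that one term dominates the average.
  2. Asymmetry: for a non-negative quantity, an under-forecast can never contribute more than 100 % (a forecast of zero), while an over-forecast has no upper limit. Minimizing MAPE therefore favors models that systematically forecast too low, which biases model selection.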

Use MAPE only when all actual values are comfortably above zero and the asymmetry is acceptable.
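A short sketch of MAPE with an explicit guard for zero actuals. The function name `mape`, the choice to raise `ValueError`, and the sample numbers are assumptions made for illustration. The last two calls show how differently the metric treats extreme misses in each direction for a non-negative quantity.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent. Undefined if any actual is zero."""
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    return 100 * sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast)) / len(actual)

# Two forecasts, each off by 10 % of the actual value.
print(round(mape([100, 200], [110, 180]), 2))  # 10.0

# For an actual of 10, the worst possible non-negative under-forecast scores 100 %,
# while an over-forecast can score arbitrarily high.
print(round(mape([10], [0]), 2))   # 100.0
print(round(mape([10], [50]), 2))  # 400.0

# A zero actual is rejected rather than silently producing inf or nan.
try:
    mape([0, 5], [1, 5])
except ValueError as err:
    print(err)
```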

MASE — Mean Absolute Scaled Error

MASE is the one metric specifically designed for time series comparison. It divides your model’s MAE by the MAE of the naive in-sample forecast (one-step shift):

MASE = MAE(model) / MAE(naive one-step-ahead on training set)

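A MASE < 1 means your model's average error is smaller than the naive forecast's average one-step error on the training set. A MASE > 1 means it is larger, so simply carrying the last period's value forward would likely have done better. MASE is scale-free, does not blow up when individual actuals are zero (the denominator is a training-set average, which is zero only for a constant series), and penalizes over- and under-forecasts equally. It is a standard accuracy measure in the forecasting literature and was one of the two measures used to rank entries in the M4 competition.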
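The definition can be sketched as follows, assuming a non-seasonal series and a one-step naive scale. The function names and the small series are illustrative, not taken from the lesson.

```python
def mae(actual, forecast):
    """Mean absolute error."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

def mase(train, actual, forecast):
    """Model MAE on the test points divided by the in-sample one-step naive MAE."""
    # In-sample naive errors: how far each training value is from the one before it.
    naive_errors = [abs(train[i] - train[i - 1]) for i in range(1, len(train))]
    scale = sum(naive_errors) / len(naive_errors)
    if scale == 0:
        raise ValueError("training series is constant, so MASE is undefined")
    return mae(actual, forecast) / scale

train = [10, 12, 11, 13, 12]   # one-step changes: 2, 1, 2, 1 -> scale 1.5
actual = [14, 13]              # held-out values
forecast = [13, 13.5]          # model errors: 1 and 0.5 -> MAE 0.75

print(mase(train, actual, forecast))  # 0.5
```

A value of 0.5 says the model's average miss is half the size of the typical one-step change in the training data.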

The baselines you must beat

Before trusting any model, compute two baselines:

Naive forecast — the forecast for the next step is simply the most recent observed value. For a series y[t], the naive forecast for y[t+1] is y[t]. This is the cheapest possible forecast and the minimum bar any model should clear.

Seasonal-naive forecast — the forecast for the next step is the value from exactly one season ago. For monthly data with annual seasonality, the forecast for January 2026 is the actual January 2025 value. Seasonal-naive is often surprisingly hard to beat.

If your model does not outperform both baselines across a walk-forward backtest, it has not demonstrated genuine skill.
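Both baselines are one-line lookups into the history. A minimal sketch, assuming the history is a plain list ordered oldest to newest; the function names and the quarterly sample values are hypothetical.

```python
def naive(history):
    """Forecast the next step as the most recent observed value."""
    return history[-1]

def seasonal_naive(history, season_length):
    """Forecast the next step as the value observed one full season earlier."""
    if len(history) < season_length:
        raise ValueError("need at least one full season of history")
    return history[-season_length]

# Two years of hypothetical quarterly data (season_length = 4).
history = [10, 20, 30, 40, 11, 21, 31, 41]

print(naive(history))              # 41: last quarter's value
print(seasonal_naive(history, 4))  # 11: the same quarter one year ago
```

On a strongly seasonal series like this one, the next value is far more likely to sit near 11 than near 41, which is why the seasonal-naive baseline is the harder one to beat.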

Walk-forward (rolling-origin) backtesting

Walk-forward backtesting — also called rolling-origin or time-series cross-validation — is the time-respecting alternative to random K-fold. The idea is simple:

  1. Fix a minimum training size (the “initial window”).
  2. Train the model on all data up to time t.
  3. Forecast one step (or several steps) ahead.
  4. Record the error.
  5. Advance t by one period and repeat.

This mimics exactly how you will use the model in production: always training on the past, predicting the future.

There are two window strategies:

  • Expanding window — the training set grows with each fold. Every new fold adds the most recent observation to training. The model eventually sees the full history.
  • Sliding window — the training set stays a fixed length. The oldest observations drop off as newer ones enter. Useful when you suspect older data is less relevant (e.g., after a structural break).
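The two strategies differ only in where each training window starts. The generator below is a sketch under assumed names (`walk_forward_folds`, `initial_window`, `horizon`, `window` are all hypothetical). It yields index ranges rather than data, so the same folds can be reused across models.

```python
def walk_forward_folds(n_obs, initial_window, horizon=1, window=None):
    """Yield (train_indices, test_indices) for each fold.

    window=None gives an expanding window; an integer gives a sliding
    window of at most that many observations.
    """
    for end in range(initial_window, n_obs - horizon + 1):
        start = 0 if window is None else max(0, end - window)
        yield range(start, end), range(end, end + horizon)

expanding = [(list(tr), list(te)) for tr, te in walk_forward_folds(6, 3)]
sliding = [(list(tr), list(te)) for tr, te in walk_forward_folds(6, 3, window=3)]

for train_idx, test_idx in expanding:
    print("expanding", train_idx, "->", test_idx)
# expanding [0, 1, 2] -> [3]
# expanding [0, 1, 2, 3] -> [4]
# expanding [0, 1, 2, 3, 4] -> [5]

for train_idx, test_idx in sliding:
    print("sliding  ", train_idx, "->", test_idx)
# sliding   [0, 1, 2] -> [3]
# sliding   [1, 2, 3] -> [4]
# sliding   [2, 3, 4] -> [5]
```

In both cases every test index is strictly greater than every training index in its fold.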

Expanding-window diagram

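[Diagram: folds 1 to 4 stacked as rows along a left-to-right time axis. In each row a training block, which grows from fold to fold, is followed by a single one-step test point.]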

Expanding-window walk-forward: the training block grows by one observation each fold; the single next point is the test target.

Each row is one fold. Training always ends before the test point. No future information ever enters training.

Runnable example: walk-forward backtest with naive baseline

The code below generates a synthetic time series, runs an expanding-window walk-forward loop using only the naive forecast (last observed value), computes MAE at each fold, and prints the overall score. Study the loop structure — this is the pattern you will reuse with any forecast model by replacing the single line that computes the prediction.
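A minimal sketch of such a loop, using only the standard library. The series itself (linear trend plus a 12-step sine wave plus Gaussian noise), the seed, the 60 observations, and the initial window of 24 are illustrative assumptions rather than values from the lesson.

```python
import math
import random

random.seed(42)  # fixed seed so the synthetic series is reproducible

# Synthetic series: linear trend + 12-step seasonality + Gaussian noise.
n_obs = 60
series = [
    50 + 0.5 * t + 10 * math.sin(2 * math.pi * t / 12) + random.gauss(0, 2)
    for t in range(n_obs)
]

initial_window = 24   # minimum number of training observations
fold_errors = []

for t in range(initial_window, n_obs):
    train = series[:t]            # everything strictly before the test point
    actual = series[t]            # the single next observation
    naive_forecast = train[-1]    # swap this line for any model's prediction
    fold_errors.append(abs(actual - naive_forecast))

naive_mae = sum(fold_errors) / len(fold_errors)
print(f"Folds: {len(fold_errors)}")
print(f"Naive walk-forward MAE: {naive_mae:.2f}")
```

The slice `series[:t]` stops one position before the test point, so no fold ever trains on the value it is asked to predict.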

The loop expands the training window one step at a time — exactly the expanding-window strategy from the diagram. Replacing naive_forecast = train[-1] with any model’s prediction gives you a fair backtest for that model, with identical folds for fair comparison.

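To compare a second model, run the same loop with that model's prediction and compare the two MAEs on identical folds. Dividing the model's MAE by the naive MAE from the same folds gives a relative MAE: a value below 1 means the model beats the naive baseline. For MASE as defined earlier, divide instead by the naive one-step MAE computed on the training set.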
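One way to keep the folds identical is to pass the prediction rule into a shared loop. The sketch below makes several assumptions for illustration: the helper name `walk_forward_mae`, a noise-free series with a season length of 4, and seasonal-naive standing in for the second model. The ratio it prints is the model's MAE relative to the naive MAE on the same out-of-sample folds.

```python
def walk_forward_mae(series, initial_window, predict):
    """Expanding-window, one-step-ahead MAE for any rule mapping history -> forecast."""
    errors = []
    for t in range(initial_window, len(series)):
        train = series[:t]
        errors.append(abs(series[t] - predict(train)))
    return sum(errors) / len(errors)

# Hypothetical series: a trend of +1 per step plus a repeating 4-step pattern.
pattern = [0, 5, 10, 5]
series = [t + pattern[t % 4] for t in range(24)]

naive_mae = walk_forward_mae(series, 8, lambda train: train[-1])
seasonal_mae = walk_forward_mae(series, 8, lambda train: train[-4])
relative_mae = seasonal_mae / naive_mae

print(naive_mae, seasonal_mae, relative_mae)  # 5.0 4.0 0.8
```

Because both calls iterate over the same 16 folds, the 0.8 ratio reflects the prediction rules alone and not a difference in test points.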

Putting it together: a reporting checklist

When you share forecast results, include:

  1. Which metric(s) you used and why.
  2. Your model’s score and the naive baseline’s score on the same folds.
  3. Which validation strategy you used (walk-forward expanding / sliding, minimum training window size, forecast horizon).
  4. MASE if the audience is technical — it summarizes the skill comparison in a single number.

A result that passes all four points is an honest, reproducible evaluation.


Quick check

Q1. You are evaluating two models. Model A has MAE = 18. Model B has MAE = 22. A colleague says Model A is better. What critical information is missing before you can agree?
Q2. A data scientist applies 5-fold cross-validation by shuffling the time series randomly before splitting. What is the core problem with this approach?
Q3. You build a demand forecast for a retail product. Your model's walk-forward MAE is 31 units. The naive baseline MAE is 28 units. The seasonal-naive baseline MAE is 19 units. What should you conclude?

Practice this in an interview

When would you use MAPE versus MASE to evaluate a forecast, and what are the failure modes of each?

MAPE (Mean Absolute Percentage Error) is intuitive and scale-free but breaks when actuals are at or near zero, and it penalizes over-forecasts more heavily than under-forecasts, which rewards models that forecast too low. MASE (Mean Absolute Scaled Error) avoids both issues by scaling errors against the in-sample error of a naive (or, for seasonal data, seasonal-naive) benchmark, making it valid even with zero values and comparable across series with different scales.

What is walk-forward validation, and why is it the correct cross-validation strategy for time series?

Walk-forward validation (also called rolling-origin or time-series cross-validation) creates successive train/test folds where each fold's test set is always strictly in the future relative to its training set. It mimics real deployment: you fit on what you knew then and evaluate on what happened next. Random k-fold, by contrast, lets future data contaminate training.

How do you monitor a model when ground-truth labels are delayed or never arrive?

When true labels are unavailable or arrive weeks late, you monitor leading indicators instead: input distribution drift, output score distribution shift, proxy business metrics, and inter-model disagreement. These act as early-warning signals before any labelled evaluation becomes possible.

Tell me about a time you were wrong about something at work.

Being wrong and recognizing it quickly is a signal of strong analytical judgment — not weakness. The best answers name a specific, consequential mistake, explain how you discovered you were wrong, describe what you did about it, and connect it to a habit or process you changed as a result.
