Evaluating forecasts (walk-forward)
How to honestly measure whether your forecast is any good — and prove it beats just guessing yesterday's value.
What you'll learn
- MAE, RMSE, MAPE, and MASE: what each metric captures and when to prefer one over another
- Walk-forward (rolling-origin) backtesting: the only time-respecting way to validate a forecast
- Why every reported metric needs a naive baseline — and how to construct one in three lines of code
Why naive metrics on random splits are meaningless
Imagine reporting that your model achieves 94 % accuracy on a test set — then revealing you shuffled all observations and used random 5-fold cross-validation. A colleague who knows time series would immediately ask: “Did your training data include observations from after some of the test observations?” If the answer is yes, you leaked the future into the past, and the metric is worthless.
This is not a minor technicality. A model that “saw” future data during training can appear to forecast brilliantly while failing completely on real deployments. Fixing this requires a different validation strategy — walk-forward backtesting — covered later in this lesson.
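The leak is easy to see on indices alone. The sketch below is a minimal illustration in plain Python, not a full cross-validation routine: it assumes 100 observations indexed in time order, builds a random 5-fold split by hand, and counts the folds in which some training observation is later in time than a test observation. The names `leaky_folds` and `ordered_leak` are illustrative.

```python
import random

n = 100                       # observations, indexed in time order 0..99
indices = list(range(n))

rng = random.Random(0)        # fixed seed so the run is repeatable
shuffled = indices[:]
rng.shuffle(shuffled)

# Random 5-fold split: each fold's test set is a random 20 % of the timeline.
k = 5
fold_size = n // k
leaky_folds = 0
for i in range(k):
    test = shuffled[i * fold_size:(i + 1) * fold_size]
    train = shuffled[:i * fold_size] + shuffled[(i + 1) * fold_size:]
    # Leakage: some training observation is later in time than a test observation.
    if max(train) > min(test):
        leaky_folds += 1

# Time-ordered split: train on the first 80 points, test on the last 20.
ordered_train, ordered_test = indices[:80], indices[80:]
ordered_leak = max(ordered_train) > min(ordered_test)

print(f"random folds with leakage: {leaky_folds} of {k}")
print(f"time-ordered split leaks:  {ordered_leak}")
```

A random fold avoids leakage only if its test set happens to be exactly the last 20 points, which is why shuffled splits are effectively always contaminated.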
The second trap is reporting a metric with no comparison. An error of 42 units sounds precise, but if a child with no domain knowledge could achieve an error of 41 by guessing “tomorrow equals today,” you have not demonstrated any skill. You need a baseline to beat.
Error metrics
All four standard metrics below measure how far your forecasts stray from the true values. Each captures a different aspect of forecast quality.
MAE — Mean Absolute Error
MAE averages the absolute differences between predictions and actuals:
MAE = mean(|actual − forecast|)
MAE is in the same units as your data and weights every unit of error equally, whether it comes from one large miss or many small ones. That makes it less sensitive to occasional outliers than RMSE, but it also means a single costly miss can go unnoticed in the score.
RMSE — Root Mean Squared Error
RMSE squares each error before averaging, then takes the square root:
RMSE = sqrt(mean((actual − forecast)²))
Because squaring amplifies large errors, RMSE penalizes big misses more than MAE does. Use RMSE when large errors are disproportionately costly in your domain (inventory shortfalls, safety margins). RMSE is always greater than or equal to MAE; the two are equal only when every error has the same magnitude, and the gap widens as the error magnitudes become more uneven.
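A small worked comparison makes the difference concrete. This is a sketch in plain Python with made-up numbers: two forecasts have the same total absolute error, but one spreads it evenly and the other concentrates it in a single miss. The helper names `mae` and `rmse` are illustrative.

```python
import math

def mae(actual, forecast):
    """Mean absolute error."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

def rmse(actual, forecast):
    """Root mean squared error."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))

actual = [100, 100, 100, 100]
even_misses = [90, 110, 90, 110]     # errors: 10, 10, 10, 10
one_big_miss = [100, 100, 100, 140]  # errors: 0, 0, 0, 40

print(mae(actual, even_misses), rmse(actual, even_misses))     # 10.0 10.0
print(mae(actual, one_big_miss), rmse(actual, one_big_miss))   # 10.0 20.0
```

MAE cannot tell the two forecasts apart; RMSE doubles for the one with the single large miss.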
MAPE — Mean Absolute Percentage Error
MAPE expresses errors as a fraction of the actual values:
MAPE = mean(|actual − forecast| / |actual|) × 100
MAPE is scale-free and looks intuitive (“my forecasts are off by 8 % on average”), which makes it popular in business reporting. Two serious limitations:
- Division by zero: MAPE is undefined when any actual value is zero, and it explodes when an actual is close to zero.
- Asymmetry: for positive actuals and non-negative forecasts, an under-forecast can never cost more than 100 %, while an over-forecast has no upper limit. Selecting models by MAPE therefore tends to favor those that forecast too low.
Use MAPE only when all actual values are comfortably above zero and the asymmetry is acceptable.
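Both problems show up on single-point examples. The following is a minimal sketch in plain Python with invented numbers; the `mape` helper is illustrative and simply refuses to run when an actual is zero.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent. Undefined if any actual is 0."""
    if any(a == 0 for a in actual):
        raise ZeroDivisionError("MAPE is undefined when an actual value is zero")
    return sum(100 * abs(a - f) / abs(a) for a, f in zip(actual, forecast)) / len(actual)

# The same 5-unit miss looks very different depending on the level of the actual.
print(mape([100], [105]))   # 5.0
print(mape([1], [6]))       # 500.0

# For a positive actual and non-negative forecasts, an under-forecast costs
# at most 100 %, while an over-forecast has no upper limit.
print(mape([100], [0]))     # 100.0  (worst possible under-forecast)
print(mape([100], [400]))   # 300.0  (over-forecasts are unbounded)
```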
MASE — Mean Absolute Scaled Error
MASE is the one metric specifically designed for time series comparison. It divides your model’s MAE by the MAE of the naive in-sample forecast (one-step shift):
MASE = MAE(model) / MAE(naive one-step-ahead on training set)
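A MASE below 1 means your model's average error is smaller than the naive forecast's average one-step error on the training data; a MASE above 1 means that simply carrying the last period's value forward would have done better by that yardstick. MASE tolerates zeros in the actuals (the denominator is a training-set average that is zero only when the training series is constant), is scale-free, and penalizes over- and under-forecasts equally. It was one of the two headline accuracy measures of the M4 forecasting competition.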
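The formula translates directly into a few lines. This is a sketch in plain Python under the definition given above (one-step naive scaling on the training set); the function name `mase` and the toy numbers are illustrative.

```python
def mase(actual, forecast, train):
    """MAE of the forecast divided by the in-sample MAE of the one-step naive forecast."""
    model_mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    # In-sample naive errors: each training value predicted by the one before it.
    naive_errors = [abs(train[t] - train[t - 1]) for t in range(1, len(train))]
    scale = sum(naive_errors) / len(naive_errors)
    if scale == 0:
        raise ZeroDivisionError("training series is constant; MASE is undefined")
    return model_mae / scale

train = [10, 12, 11, 13, 12]      # naive in-sample errors: 2, 1, 2, 1 -> MAE 1.5
actual = [14, 13]
forecast = [13, 13]               # errors: 1, 0 -> MAE 0.5

print(round(mase(actual, forecast, train), 3))   # 0.333
```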
The baselines you must beat
Before trusting any model, compute two baselines:
Naive forecast — the forecast for the next step is simply the most recent observed value. For a series y[t], the naive forecast for y[t+1] is y[t]. This is the cheapest possible forecast and the minimum bar any model should clear.
Seasonal-naive forecast — the forecast for the next step is the value from exactly one season ago. For monthly data with annual seasonality, the forecast for January 2026 is the actual January 2025 value. Seasonal-naive is often surprisingly hard to beat.
If your model does not outperform both baselines across a walk-forward backtest, it has not demonstrated genuine skill.
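Both baselines are one-liners. The sketch below uses plain Python lists and two years of hypothetical monthly values; the function names `naive` and `seasonal_naive` are illustrative.

```python
def naive(history):
    """Next value = most recent observed value."""
    return history[-1]

def seasonal_naive(history, season_length):
    """Next value = value observed exactly one season ago."""
    if len(history) < season_length:
        raise ValueError("need at least one full season of history")
    return history[-season_length]

# Two years of hypothetical monthly values, January 2024 through December 2025.
monthly = [30, 28, 35, 40, 48, 55, 60, 58, 50, 42, 34, 31,
           32, 29, 37, 41, 50, 57, 63, 60, 52, 44, 36, 33]

# Forecasts for January 2026:
print(naive(monthly))                 # 33 (the December 2025 value)
print(seasonal_naive(monthly, 12))    # 32 (the January 2025 value)
```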
Walk-forward (rolling-origin) backtesting
Walk-forward backtesting — also called rolling-origin or time-series cross-validation — is the time-respecting alternative to random K-fold. The idea is simple:
- Fix a minimum training size (the “initial window”).
- Train the model on all data up to time t.
- Forecast one step (or several steps) ahead.
- Record the error.
- Advance t by one period and repeat.
This mimics exactly how you will use the model in production: always training on the past, predicting the future.
There are two window strategies:
- Expanding window — the training set grows with each fold. Every new fold adds the most recent observation to training. The model eventually sees the full history.
- Sliding window — the training set stays a fixed length. The oldest observations drop off as newer ones enter. Useful when you suspect older data is less relevant (e.g., after a structural break).
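The two strategies differ only in where each training window starts. A possible index generator, sketched in plain Python; the function name `walk_forward_splits` and its parameters are assumptions made for this illustration, not an existing library API.

```python
def walk_forward_splits(n_obs, initial_window, horizon=1, window="expanding"):
    """Yield (train_indices, test_indices) pairs in time order."""
    for t in range(initial_window, n_obs - horizon + 1):
        # Expanding: always start at 0. Sliding: keep the last initial_window points.
        start = 0 if window == "expanding" else t - initial_window
        yield range(start, t), range(t, t + horizon)

for train_idx, test_idx in walk_forward_splits(8, initial_window=4, window="sliding"):
    print(list(train_idx), "->", list(test_idx))
# [0, 1, 2, 3] -> [4]
# [1, 2, 3, 4] -> [5]
# [2, 3, 4, 5] -> [6]
# [3, 4, 5, 6] -> [7]
```

With `window="expanding"` the same call yields training ranges that all start at index 0 and grow by one point per fold.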
Expanding-window diagram
Expanding-window walk-forward: the training block grows by one observation each fold; the single next point is the test target.
Each row is one fold. Training always ends before the test point. No future information ever enters training.
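A text rendering of the same layout can be produced with a short loop. This sketch assumes 10 observations and an initial window of 5; `T` marks a training point, `x` the test point, and `.` a future point the fold never sees.

```python
n_obs, initial_window = 10, 5

rows = []
for t in range(initial_window, n_obs):
    # t training points, then one test point, then untouched future points.
    rows.append("T" * t + "x" + "." * (n_obs - t - 1))

for fold, row in enumerate(rows, start=1):
    print(f"fold {fold}: {row}")
# fold 1: TTTTTx....
# fold 2: TTTTTTx...
# fold 3: TTTTTTTx..
# fold 4: TTTTTTTTx.
# fold 5: TTTTTTTTTx
```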
Runnable example: walk-forward backtest with naive baseline
The code below generates a synthetic time series, runs an expanding-window walk-forward loop using only the naive forecast (last observed value), computes MAE at each fold, and prints the overall score. Study the loop structure — this is the pattern you will reuse with any forecast model by replacing the single line that computes the prediction.
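A minimal sketch of such a loop, in plain Python with no third-party libraries. The series shape (linear trend, period-12 seasonality, Gaussian noise), the seed, and the window size of 36 are assumptions chosen for illustration; only `train` and `naive_forecast` follow the names used in the text.

```python
import math
import random

rng = random.Random(42)   # fixed seed; the value 42 is arbitrary

# Synthetic series: linear trend + period-12 seasonality + Gaussian noise.
n = 120
series = [
    50 + 0.3 * t + 10 * math.sin(2 * math.pi * t / 12) + rng.gauss(0, 2)
    for t in range(n)
]

initial_window = 36       # minimum training size before the first forecast
abs_errors = []

for t in range(initial_window, n):
    train = series[:t]            # everything strictly before time t
    actual = series[t]            # the single next point is the test target

    naive_forecast = train[-1]    # swap this line for any model's prediction

    abs_errors.append(abs(actual - naive_forecast))

naive_mae = sum(abs_errors) / len(abs_errors)
print(f"folds: {len(abs_errors)}")
print(f"naive walk-forward MAE: {naive_mae:.3f}")
```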
The loop expands the training window one step at a time — exactly the expanding-window strategy from the diagram. Replacing naive_forecast = train[-1] with any model’s prediction gives you a fair backtest for that model, with identical folds for fair comparison.
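To compare a second model, run the same loop with that model's prediction and compare MAEs on the same folds. Dividing the model's MAE by the naive MAE from this loop gives a relative MAE (below 1 means the model beat naive out of sample). Strict MASE, as defined earlier, scales by the naive MAE on the training data instead.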
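As an illustration of a same-folds comparison, the sketch below scores the seasonal-naive forecast against the naive forecast in a single pass. It regenerates an assumed synthetic series (trend, period-12 seasonality, noise) so that it runs on its own; all names and parameters are illustrative, and which forecast wins depends on the series.

```python
import math
import random

rng = random.Random(42)
n, season, initial_window = 120, 12, 36
series = [50 + 0.3 * t + 10 * math.sin(2 * math.pi * t / season) + rng.gauss(0, 2)
          for t in range(n)]

naive_errors, seasonal_errors = [], []
for t in range(initial_window, n):
    train, actual = series[:t], series[t]
    naive_errors.append(abs(actual - train[-1]))            # last observed value
    seasonal_errors.append(abs(actual - train[-season]))    # value one season ago

naive_mae = sum(naive_errors) / len(naive_errors)
seasonal_mae = sum(seasonal_errors) / len(seasonal_errors)
relative_mae = seasonal_mae / naive_mae   # below 1 means seasonal-naive beat naive

print(f"naive MAE:          {naive_mae:.3f}")
print(f"seasonal-naive MAE: {seasonal_mae:.3f}")
print(f"relative MAE:       {relative_mae:.3f}")
```

Because both forecasts are scored inside the same loop, they see identical folds, which is what makes the ratio a fair comparison.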
Putting it together: a reporting checklist
When you share forecast results, include:
- Which metric(s) you used and why.
- Your model’s score and the naive baseline’s score on the same folds.
- Which validation strategy you used (walk-forward expanding / sliding, minimum training window size, forecast horizon).
- MASE if the audience is technical — it summarizes the skill comparison in a single number.
A result that passes all four points is an honest, reproducible evaluation.