Evaluating forecasts: MAE, RMSE, MAPE, and honest backtesting

You trained a model on two years of weekly sales data, held out the last eight weeks, generated predictions, and computed the error. Looks solid. You ship the model. Three months later, it is badly wrong on every peak season it encounters — the one scenario that actually costs you money.

The problem is almost never the algorithm. It is the evaluation. Forecasting has a peculiar trap that does not affect most machine-learning tasks: the future leaks into the present whenever you are careless about time. Add to that a zoo of accuracy metrics that each measure something subtly different, and you have a discipline where it is genuinely easy to convince yourself you have a good model when you do not.

This guide walks through the metrics you should know, what each one is actually measuring, and how to set up backtesting so that your error estimates reflect real-world performance. It builds on the time series fundamentals and the model-fitting details covered in ARIMA and ACF / PACF.

The five metrics worth knowing

MAE — mean absolute error

The most interpretable forecast metric. For a series of actuals y_t and point forecasts ŷ_t over n evaluation periods:

MAE = (1/n) * sum(|y_t - ŷ_t|)

MAE is in the same units as the series. If you are forecasting daily revenue in dollars, MAE is in dollars. It treats every error equally regardless of sign, so a forecast that is $100 too high is penalized identically to one that is $100 too low.

When to prefer it: MAE is the right default when the cost of an error grows in proportion to its size and you care about the typical magnitude of mistakes. It is robust: a handful of extreme values do not dominate the aggregate the way they do with squared-error metrics.

What it hides: because it averages absolute deviations, a single catastrophically wrong forecast on a peak period contributes only proportionally, not disproportionately. If getting a Black Friday demand forecast wrong by a factor of three is far worse than ten modest weekly misses, MAE will not reflect that.

RMSE — root mean squared error

RMSE = sqrt((1/n) * sum((y_t - ŷ_t)^2))

RMSE is also in the original units, but squaring errors before averaging means large errors matter much more than small ones. A forecast error of 200 contributes four times as much to the sum of squares as an error of 100.

When to prefer it: use RMSE when large misses are disproportionately costly — inventory stockouts, power grid imbalances, hospital staffing. RMSE implicitly says “I am willing to trade many small improvements to avoid one large failure.”

What it hides: outliers in the actual data (not errors in the forecast) can inflate RMSE substantially, making a perfectly reasonable model look worse than it is. It is also harder to explain to stakeholders than MAE.
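To make the contrast between MAE and RMSE concrete, here is a minimal sketch that scores two invented error profiles: ten steady misses of 10 units, and nine perfect forecasts followed by a single miss of 100. The function names `mae` and `rmse` and all the numbers are illustrative assumptions, not part of any particular library.

```python
import numpy as np

def mae(actual, forecast):
    """Mean absolute error, in the units of the series."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return np.mean(np.abs(actual - forecast))

def rmse(actual, forecast):
    """Root mean squared error, in the units of the series."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return np.sqrt(np.mean((actual - forecast) ** 2))

# Illustrative numbers only: actual demand is flat at 100 units.
actual = np.full(10, 100.0)
steady = actual + 10.0      # every forecast is 10 too high
spiky = actual.copy()
spiky[-1] += 100.0          # nine perfect forecasts, one miss of 100

print(mae(actual, steady), rmse(actual, steady))
print(mae(actual, spiky), rmse(actual, spiky))
```

Both profiles have an MAE of 10, but the RMSE of the second is about 31.6, roughly three times higher. Which of the two numbers better reflects your costs is the real choice between the metrics.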

MAPE — mean absolute percentage error

MAPE = (100/n) * sum(|y_t - ŷ_t| / |y_t|)

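Expressed as a percentage, which makes it superficially easy to interpret and compare across series with different scales. A 5% MAPE on weekly sales and a 5% MAPE on hourly electricity consumption look equivalent, and in relative terms they are. The weaknesses all come from the denominator: MAPE is undefined when an actual is zero, it explodes when actuals are close to zero, and it is asymmetric, because with non-negative forecasts an under-forecast can cost at most 100% while an over-forecast is unbounded.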

When to use it anyway: MAPE is genuinely useful for series that are consistently well above zero and where stakeholders need a percentage-based KPI. Retail forecasting for fast-moving consumer goods, for example, fits reasonably well.
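A short sketch shows how sensitive MAPE is to its denominator. The helper `mape` is a hypothetical name, and raising an error on zero actuals is one possible design choice rather than a standard; the numbers are invented.

```python
import numpy as np

def mape(actual, forecast):
    """Mean absolute percentage error, in percent. Undefined if any actual is 0."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    if np.any(actual == 0):
        raise ValueError("MAPE is undefined when an actual value is zero")
    return 100.0 * np.mean(np.abs(actual - forecast) / np.abs(actual))

# Same absolute miss of 50 units, different percentage penalty.
print(mape([100], [150]))   # 50% of the actual
print(mape([150], [100]))   # about 33.3% of the actual

# One small actual dominates the average (illustrative numbers).
print(mape([100, 100, 1], [110, 90, 6]))  # about 173.3, driven by the last period
```

In the last call, two of the three periods are off by 10% and the third is off by only 5 units, yet the average lands above 170% because that third actual is 1.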

sMAPE — symmetric MAPE

A common fix for MAPE’s asymmetry replaces the denominator with the average of the absolute actual and absolute forecast:

sMAPE = (100/n) * sum(2*|y_t - ŷ_t| / (|y_t| + |ŷ_t|))

This bounds the metric between 0% and 200% and reduces (but does not eliminate) the asymmetry. sMAPE is used in the M-competition benchmarks for this reason. However, it has its own quirk: the denominator changes with every forecast, so sMAPE is no longer a pure measure of forecast error on the original scale, and it can still behave strangely near zero.
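The bound and the remaining quirks are easy to see in a minimal sketch. The name `smape` is hypothetical, and treating a period where actual and forecast are both zero as zero error is an assumption made here to avoid dividing by zero; other conventions exist.

```python
import numpy as np

def smape(actual, forecast):
    """Symmetric MAPE in percent, bounded between 0 and 200."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    denom = np.abs(actual) + np.abs(forecast)
    # Assumption: a period where actual and forecast are both 0 counts as zero error.
    ratio = np.divide(2.0 * np.abs(actual - forecast), denom,
                      out=np.zeros_like(denom), where=denom != 0)
    return 100.0 * np.mean(ratio)

print(smape([100], [150]))  # over-forecast by 50: 40%
print(smape([100], [50]))   # under-forecast by 50: about 66.7%
print(smape([0], [5]))      # any nonzero forecast of a zero actual hits the 200% cap
```

The first two calls show the leftover asymmetry: for the same actual and the same absolute miss, the under-forecast is penalized more, the opposite direction from plain MAPE. The third shows the behavior near zero.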

MASE — mean absolute scaled error

MASE is the most rigorous of the group and the one most worth learning if you compare forecasts across multiple series or multiple models. The idea: scale errors by the in-sample MAE of a naive seasonal benchmark.

naive_mae = (1/(T-m)) * sum(|y_t - y_{t-m}|)   # summed over the training series, t = m+1..T; m = seasonal period
MASE = MAE_forecast / naive_mae

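A MASE < 1 means the model beats the seasonal naive forecast (repeat the value from one season earlier), as measured by that benchmark's in-sample one-step error. A MASE > 1 means it does not. Scale-free and interpretable, MASE can be averaged across series without the weighting problems that plague averaged MAPE. It handles zero actuals gracefully: the denominator is an average over the whole training series rather than an individual actual, so it is zero only when the training series never changes from one season to the next.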

When to prefer it: multi-series benchmarking, competitions, or any situation where you need a single number that summarizes performance across a heterogeneous portfolio of time series.
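As a sketch of the calculation, the function below scales the out-of-sample MAE by the in-sample MAE of the seasonal naive method. The name `mase`, its argument order, and the quarterly example series are all illustrative assumptions.

```python
import numpy as np

def mase(train, actual, forecast, m=1):
    """MAE of the forecast divided by the in-sample MAE of the seasonal naive method."""
    train = np.asarray(train, float)
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    if len(train) <= m:
        raise ValueError("training series must be longer than the seasonal period m")
    naive_mae = np.mean(np.abs(train[m:] - train[:-m]))  # averages len(train) - m terms
    if naive_mae == 0:
        raise ValueError("seasonal naive benchmark has zero in-sample error")
    return np.mean(np.abs(actual - forecast)) / naive_mae

# Illustrative quarterly series (m = 4): a repeating pattern that grows by 2 per year.
train = np.array([10, 20, 30, 40, 12, 22, 32, 42, 14, 24, 34, 44], float)
actual = np.array([16, 26, 36, 46], float)
forecast = np.array([15, 27, 36, 45], float)
print(mase(train, actual, forecast, m=4))  # 0.375
```

The seasonal naive method misses by 2 units on average in-sample, and the forecast misses by 0.75 on average out-of-sample, so the ratio is 0.375. Note that a zero in `actual` would cause no trouble here, because the denominator depends only on `train`.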

Metric comparison at a glance

Metric comparison: MAE, RMSE, MAPE, and MASE across four evaluation dimensions.

Why a single holdout lies to you

The natural first instinct when evaluating a forecast model is to cut off the last few months of data, train on everything before, forecast the holdout period, and compute errors. This is a reasonable starting point but a deeply unreliable final answer.

Consider what you are actually measuring. You have chosen one specific train/test split. The model is implicitly tuned (even informally) to whatever characteristics the training data has. The holdout period may happen to be unusually easy or unusually hard to forecast. Seasonal dynamics in the few weeks right before your cutoff can spill into forecast errors in ways that would not occur at a different cutoff date. And if you iterated the model at all — tried different specifications, changed hyperparameters, re-examined features — you have effectively peeked at the holdout multiple times and introduced leakage.

The correct alternative is rolling-origin backtesting, also called walk-forward validation or time-series cross-validation.

Rolling-origin backtesting

The idea is to simulate production repeatedly rather than once. Fix a forecast horizon h. Choose a minimum training window. Then:

1. Train the model on observations 1 through t.
2. Forecast periods t+1 through t+h.
3. Record the errors at each horizon step.
4. Advance the origin by one period (or k periods for efficiency) and repeat.
5. Average errors across all origins, separately for each horizon step.

This gives you not one error estimate but a distribution of error estimates across different slices of history, each producing forecasts under conditions roughly like those the deployed model will face.

Rolling-origin (walk-forward) backtesting: the training window expands with each origin, and forecast errors are averaged across all origins for each horizon step.

A concrete implementation in Python is straightforward:

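import numpy as np

def rolling_origin_mae(y, min_train, h, refit_fn, step=1):
    """
    y         : 1-D array of observations
    min_train : minimum number of training observations
    h         : forecast horizon
    refit_fn  : callable(train_array) -> forecast array of length h
    step      : number of periods to advance the origin each iteration
    """
    y = np.asarray(y, dtype=float)
    if len(y) < min_train + h:
        raise ValueError("series too short for min_train + h")
    errors = []
    # Expanding window: each origin trains on everything observed before it.
    for origin in range(min_train, len(y) - h + 1, step):
        train = y[:origin]
        actual = y[origin : origin + h]
        forecast = np.asarray(refit_fn(train), dtype=float)  # refit at every origin
        if forecast.shape != actual.shape:
            raise ValueError("refit_fn must return exactly h forecasts")
        errors.append(np.abs(actual - forecast))
    errors = np.array(errors)          # shape: (n_origins, h)
    return errors.mean(axis=0)         # MAE per horizon step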

A few implementation details that matter:

Refit vs. fixed parameters. In production, you will periodically refit the model as new data arrives. Your backtest should do the same. Parameters estimated once on the full series leak future information into every earlier origin and overstate accuracy; parameters frozen at the first training window go stale and misrepresent what a regularly refit model would achieve.
Minimum training length. Make it at least two to three full seasonal cycles. A seasonal ARIMA model fit on fewer than two cycles has too few repetitions of each season to estimate the seasonal terms reliably.
Skip-k origins. Refitting at every single period is slow. Advancing the origin k periods at a time (say, k=7 for weekly refits on a daily series) is usually a good tradeoff and closer to real operational cadence.

Horizon matters: report errors per step

It is tempting to average errors across all h forecast steps into a single number. Resist the temptation. Errors grow with horizon — a model that is excellent at h=1 but hopeless at h=12 is a very different beast from one that is mediocre but consistent across all steps. Decision-makers often care most about specific horizons (a retailer placing orders eight weeks out, a grid operator forecasting 24 hours ahead).

Report at minimum: MAE (or RMSE) at h=1, the midpoint, and h=max. Plot the full error curve if you can. It tells you whether error grows smoothly with horizon (typical of a well-behaved model) or spikes at specific horizons (often a sign of seasonal misspecification).
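A per-horizon report can be sketched in a few lines. The example below is deliberately artificial: the series is a pure linear trend and the model is a no-change forecast, so the forecast falls exactly one more unit behind at each step. The names `horizon_errors` and `naive_forecast` are hypothetical.

```python
import numpy as np

def horizon_errors(y, min_train, h, forecast_fn, step=1):
    """Absolute errors from a rolling-origin backtest, shape (n_origins, h)."""
    y = np.asarray(y, float)
    rows = []
    for origin in range(min_train, len(y) - h + 1, step):
        forecast = np.asarray(forecast_fn(y[:origin]), float)
        rows.append(np.abs(y[origin:origin + h] - forecast))
    return np.array(rows)

H = 6  # forecast horizon used throughout this example

def naive_forecast(train):
    """No-change forecast: repeat the last observed value H times."""
    return np.repeat(train[-1], H)

# Toy series with a steady upward trend of one unit per period.
y = np.arange(40, dtype=float)
errors = horizon_errors(y, min_train=10, h=H, forecast_fn=naive_forecast)
mae_by_step = errors.mean(axis=0)

# Report the first step, the midpoint, and the last step.
for idx in (0, len(mae_by_step) // 2, len(mae_by_step) - 1):
    print(f"h={idx + 1}: MAE={mae_by_step[idx]:.2f}")
```

Here the MAE is 1.00 at h=1, 4.00 at h=4, and 6.00 at h=6. Averaging all six steps into one number (3.5) would hide the fact that the last step is six times worse than the first.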

Prediction intervals and coverage

A point forecast without an uncertainty band is like a weather report that says “temperature: 22°C” with no indication of whether to expect 18°C or 35°C. Prediction intervals quantify what the model genuinely does not know.

A nominal 95% prediction interval is supposed to contain the actual outcome 95% of the time. Checking this, known as empirical coverage, is part of an honest evaluation. You compute it exactly as you would compute point-forecast errors in a rolling-origin backtest: for each origin and horizon, check whether the actual value fell inside the interval. Average across origins.

If a nominal 95% interval achieves 72% empirical coverage, your intervals are too narrow: the model is overconfident. If it achieves 99%, the intervals are too wide, which is honest but uninformative. The target is coverage near the nominal level with intervals as narrow as that allows.
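The coverage check itself is a one-liner once the backtest has collected actuals and interval bounds. The sketch below assumes all three are stored as arrays of shape (n_origins, h); the function name `empirical_coverage` and the data are invented for illustration.

```python
import numpy as np

def empirical_coverage(actual, lower, upper):
    """Share of actuals inside [lower, upper], per horizon step.

    All inputs have shape (n_origins, h), as collected in a backtest.
    """
    actual = np.asarray(actual, float)
    inside = (actual >= np.asarray(lower, float)) & (actual <= np.asarray(upper, float))
    return inside.mean(axis=0)

# Illustrative data: 4 origins, horizon 2, intervals of half-width 5 around the forecast.
forecast = np.array([[100, 100], [110, 110], [120, 120], [130, 130]], float)
actual = np.array([[102, 108], [109, 104], [126, 121], [131, 137]], float)
coverage = empirical_coverage(actual, forecast - 5, forecast + 5)
print(coverage)  # [0.75 0.25]
```

With only four origins these fractions are far too noisy to judge a real model, but the shape of the result is the point: coverage is reported per horizon step, and here it degrades sharply at the second step because the interval width does not grow with the horizon.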

The glossary has brief definitions of coverage, CRPS, and other probabilistic scoring rules if you want to go deeper on the distributional evaluation side.

Leakage in time-series backtests

Even when every split respects time order, backtests can still leak in subtle ways:

Feature leakage. If you standardize a feature using statistics computed on the full series (including test periods), the test is compromised. Compute normalization statistics on the training window only, at each origin.
Hyperparameter leakage. If you chose the ARIMA order or Prophet seasonality mode by looking at which performed best on the holdout, you tuned on the test set. Use a separate validation window (or nested rolling-origin CV) for hyperparameter selection.
Calendar features. Be careful with features like “is this a holiday” — future holidays are knowable and fine, but lags derived from future actuals are not.
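The feature-leakage case can be sketched as follows. The helper `standardize_at_origin` is a hypothetical name, and the feature values are invented so that a level shift appears only after the origin.

```python
import numpy as np

def standardize_at_origin(feature, origin):
    """Scale a feature using the mean and std of the training window only."""
    feature = np.asarray(feature, float)
    train = feature[:origin]
    mean, std = train.mean(), train.std()
    if std == 0:
        raise ValueError("feature is constant in the training window")
    return (feature - mean) / std

# Illustrative feature with a level shift that only appears after the origin.
feature = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 50.0, 60.0])
origin = 5

leaky = (feature - feature.mean()) / feature.std()   # statistics see the future
clean = standardize_at_origin(feature, origin)       # statistics stop at the origin

print(leaky[:origin].round(2))
print(clean[:origin].round(2))
```

In the leaky version, the training rows are squeezed toward a narrow band of negative values because the mean and standard deviation already reflect the later jump. The model would be trained on inputs it could never have seen in production. In a rolling-origin backtest, the clean version has to be recomputed at every origin.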

The interview prep section covers how these leakage patterns are tested in data science interviews, where forecasting case studies are common.

Putting it together: a practical checklist

When you next evaluate a time-series model, work through this:

Choose the primary metric based on the cost structure — RMSE if large misses are catastrophic, MAE for typical error tracking, MASE for cross-series comparison.
Avoid MAPE if any actuals are near or at zero. Use MASE or MAE instead; sMAPE softens the problem but can still misbehave near zero.
Set up rolling-origin backtesting with at least ten origins, and preferably twenty or more.
Refit the model at each origin using the same process as production.
Report errors per horizon step, not a single aggregate.
Check prediction interval coverage against the nominal levels.
Confirm no feature normalization, hyperparameter selection, or lag computation touched future data.

None of this is complicated, but it requires discipline. The single biggest source of over-optimistic forecast evaluations in production systems is not a subtle statistical error — it is someone who ran a single train/test split in 2022 and has been trusting the number ever since.

Frequently asked questions

Q: My MAPE is 8% and my colleague’s is 11% on the same series. Does that mean my model is better?

Not necessarily. First, check whether you computed MAPE on the same holdout windows: a single split for one and a rolling average for the other will not be comparable. Second, if your series has any near-zero observations, a single period with a very small actual can dominate either model's MAPE and swing the comparison in either direction. Use MASE or MAE on identical rolling-origin runs for a fair comparison.

Q: How many rolling origins do I need for a stable error estimate?

There is no universal threshold, but fewer than ten origins produce estimates with high variance, so you may be lucky or unlucky by a meaningful margin. For weekly data with a multi-year history, aim for at least 20 to 50 origins. For monthly data, 24 to 36 origins (two to three years of walk-forward) is a reasonable minimum. Watch the variance of the per-origin errors; if it is very high, you need more origins.

Q: Should I use an expanding window or a rolling fixed-length window for training?

For most production forecasting, an expanding window (all history up to the origin) is closer to reality — you would use all available data when refitting in production. A fixed-length window makes sense if you believe the data-generating process is non-stationary and older data is harmful (e.g., a business that pivoted strategy two years ago). Try both and compare; if they give very different error estimates, you have an important structural question to answer about your data.
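A small experiment makes the comparison concrete. The sketch below backtests a deliberately simple, hypothetical model (forecast the mean of the training window) on an invented series with a level shift, once with an expanding window and once with a fixed window of 10 observations. The `window` parameter and all names are assumptions for illustration.

```python
import numpy as np

def backtest_mae(y, min_train, h, forecast_fn, window=None):
    """Per-horizon backtest MAE; window=None expands, an integer keeps only recent history."""
    y = np.asarray(y, float)
    errors = []
    for origin in range(min_train, len(y) - h + 1):
        start = 0 if window is None else max(0, origin - window)
        forecast = np.asarray(forecast_fn(y[start:origin], h), float)
        errors.append(np.abs(y[origin:origin + h] - forecast))
    return np.mean(errors, axis=0)

def mean_forecast(train, h):
    """Hypothetical model: forecast the training-window mean for every step."""
    return np.full(h, train.mean())

# Illustrative series with a level shift from 10 to 20 at t = 30.
y = np.concatenate([np.full(30, 10.0), np.full(30, 20.0)])
expanding = backtest_mae(y, min_train=20, h=1, forecast_fn=mean_forecast)
fixed = backtest_mae(y, min_train=20, h=1, forecast_fn=mean_forecast, window=10)
print(expanding, fixed)
```

On this series the fixed window recovers from the shift within ten periods, while the expanding window keeps averaging in pre-shift history and never fully catches up, so its MAE is several times larger. On a stable series the ranking can easily reverse, because the expanding window has more data to estimate from.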

Q: MASE references a naive seasonal benchmark. What seasonal period should I use?

Use the period that matches the dominant seasonality in your data — 7 for daily data with weekly cycles, 12 for monthly data with annual cycles, 52 for weekly data with annual cycles. If there is no meaningful seasonality, use m=1, which reduces the denominator to the MAE of a random walk (no-change forecast). The key is consistency: use the same m across all models you are comparing.