The forecasting baselines that quietly beat fancy models

Here is a pattern that repeats itself across data teams at almost every scale: the team inherits a forecasting problem, someone opens a Jupyter notebook, installs Prophet or reaches for an LSTM, tunes it for two weeks, and then ships it to production. The model achieves a mean absolute error of, say, 47 units. Leadership nods approvingly. The number sounds small. Nobody checks what a naive model would have achieved.

Two months later, a data scientist joins, runs a naive forecast in ten minutes, and gets 44 units.

This post is about that gap—how it happens, why it is embarrassingly common, and what to do instead. If you are just starting with time series, read why time series is different from other prediction problems first. If you already know your ARIMA from your ETS, then this post is about the discipline layer that goes on top of the model knowledge.

The baseline is the denominator

Forecasting accuracy numbers are almost meaningless in isolation. A MAE of 47 units sounds good or bad depending entirely on how much variation is in the series. The canonical solution to this problem is the Mean Absolute Scaled Error (MASE), defined as your model’s MAE divided by the in-sample MAE of the naive one-step-ahead forecast. A MASE of 1.0 means your model is exactly as accurate as the simplest possible benchmark. A MASE below 1.0 means you are beating the naive forecast. A MASE above 1.0 means the naive forecast is better than your model.

The formula is deliberately simple: it asks only one question—does your model beat the trivial thing?
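To make the ratio concrete, here is a minimal worked example in Python. The numbers are invented for illustration, and `mase_nonseasonal` is a hypothetical helper name rather than a library function.

```python
import numpy as np

def mase_nonseasonal(actual, forecast, insample):
    """MAE of the forecast divided by the in-sample MAE of the one-step naive forecast."""
    actual, forecast, insample = (
        np.asarray(a, dtype=float) for a in (actual, forecast, insample)
    )
    # In-sample naive errors: each training value minus the one before it.
    scale = np.mean(np.abs(np.diff(insample)))
    return np.mean(np.abs(actual - forecast)) / scale

# Illustrative data (invented): a training history, then a two-step holdout.
train = [10, 12, 11, 13, 12]     # naive in-sample errors: 2, 1, 2, 1 -> MAE 1.5
holdout = [14, 13]
model_forecast = [13.0, 13.5]    # absolute errors: 1.0, 0.5 -> MAE 0.75

print(mase_nonseasonal(holdout, model_forecast, train))  # 0.75 / 1.5 = 0.5
```

A result of 0.5 would mean the model's holdout error is half the typical one-step naive error on the training data.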

When you frame every evaluation this way, it becomes very hard to hide behind impressive-sounding absolute error numbers. This is why baselines are not a preliminary step you skip on the way to the real work. They are the real work. Everything else is a candidate for replacing them.

The canonical baselines, explained honestly

There are five baselines worth knowing. Each one exploits a different structural property of the series. Knowing which property your data has is how you choose which baseline to beat first.

Naive forecast (last-value method)

The naive forecast sets the next predicted value equal to the most recently observed value. If the series was 42 yesterday, the forecast for today is 42. If that sounds absurdly simple, consider what it implies about the data: you are assuming that the best predictor of tomorrow is today, which is exactly the optimal forecast when the series is a random walk (each value equals the previous value plus unpredictable noise).

Many financial time series behave like random walks, or close enough that the naive forecast is genuinely hard to beat. Daily closing prices, exchange rates, and commodity spot prices have all been extensively studied and the literature consistently shows that even sophisticated models struggle to meaningfully outperform naive forecasts at short horizons. The reason is not that the models are bad—it is that the signal-to-noise ratio is low and the series genuinely has little predictable structure beyond its most recent value.

If your series is noisy, short, and shows no obvious trend or seasonal pattern, start here. Do not move on until something beats it.

Seasonal-naive forecast

The seasonal-naive forecast extends the naive idea to series with strong periodicity. Instead of copying the last value, it copies the value from the same point in the previous seasonal cycle. For a weekly seasonal series, the Monday forecast comes from last Monday’s value. For a monthly series, January’s forecast comes from last January.

This baseline is powerful whenever seasonality is the dominant feature of the series. Retail sales, electricity demand, website traffic, and flu incidence all have strong seasonal components, and the seasonal-naive forecast captures that structure without fitting a single parameter. It cannot model trend, and it cannot adapt to a changing seasonal shape, but on a clean strongly-seasonal series it is genuinely difficult to beat.

The practical test: decompose the series and look at the relative magnitudes of the seasonal and trend components. If seasonal amplitude dwarfs the trend, seasonal-naive is your benchmark.
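One rough way to run that comparison without a full decomposition library is sketched below. It is a crude stand-in for a proper decomposition, assuming an additive series with a known cycle length; `seasonal_vs_trend` is a hypothetical helper name, and because each cycle is only centred on its own mean, any trend within a cycle leaks into the seasonal estimate.

```python
import numpy as np

def seasonal_vs_trend(series, period):
    """Return (seasonal amplitude, mean absolute level change per cycle)."""
    y = np.asarray(series, dtype=float)
    n_cycles = len(y) // period
    if n_cycles < 2:
        raise ValueError("need at least two complete cycles")
    # Keep the most recent complete cycles and lay them out as (cycles, period).
    cycles = y[len(y) - n_cycles * period:].reshape(n_cycles, period)
    cycle_means = cycles.mean(axis=1)                       # crude trend: level per cycle
    seasonal = (cycles - cycle_means[:, None]).mean(axis=0)  # average shape within a cycle
    amplitude = seasonal.max() - seasonal.min()
    level_change = np.mean(np.abs(np.diff(cycle_means)))
    return amplitude, level_change

# Invented example: strong quarterly pattern, level rising by 1 per cycle.
pattern = np.array([0.0, 10.0, 0.0, -10.0])
series = np.concatenate([pattern + level for level in (100.0, 101.0, 102.0)])
print(seasonal_vs_trend(series, period=4))  # amplitude 20, level change 1
```

Here the seasonal amplitude is twenty times the per-cycle level change, which is the situation where seasonal-naive is the natural benchmark.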

Drift method

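The drift method is the naive forecast plus a trend correction. It extrapolates by adding the average change per period observed over the training data, which works out to the slope of the straight line joining the first and last observations. If the series has risen by an average of 3 units per week over the past year, the drift forecast for next week is the most recent value plus 3, and the forecast for two weeks out is the most recent value plus 6.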

Drift is the right baseline when the series has a clear monotonic trend and no meaningful seasonality. It is a surprisingly strong benchmark for slowly-trending series like annual population figures, long-run revenue growth, or gradual degradation curves in equipment monitoring. It fails when the trend is nonlinear or when the rate of change varies—but if neither condition holds, drift is a hard target to beat cheaply.

Simple and moving averages

A simple moving average forecast sets the next predicted value equal to the mean of the last N observations. The choice of N controls how much smoothing you apply. A small N makes the forecast reactive but noisy. A large N makes it stable but slow to adapt.

Moving averages are most useful as a baseline when the series is stationary—no trend, no seasonality, just noise around a level. They are also the natural baseline for intermittent demand, where many periods are zero and occasional spikes make other methods unstable.

The exponentially weighted moving average (EWMA) is a natural extension: instead of equal weights, it gives more weight to recent observations. EWMA is worth including in your baseline suite because it often outperforms a simple moving average at the cost of a single smoothing parameter, without requiring any assumption about trend or seasonality.
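Both averaging baselines fit in a few lines. The sketch below is one plain reading of the definitions above; the function names, the `window` and `alpha` parameters, and the choice to initialise the EWMA level at the first observation are assumptions made for illustration.

```python
import numpy as np

def moving_average_forecast(series, window, h=1):
    """Flat h-step forecast equal to the mean of the last `window` observations."""
    y = np.asarray(series, dtype=float)
    return np.full(h, y[-window:].mean())

def ewma_forecast(series, alpha, h=1):
    """Flat h-step forecast equal to the exponentially weighted level.

    alpha in (0, 1]: larger values weight recent observations more heavily.
    """
    y = np.asarray(series, dtype=float)
    level = y[0]  # assumption: start the recursion at the first observation
    for value in y[1:]:
        level = alpha * value + (1 - alpha) * level
    return np.full(h, level)

history = [1.0, 2.0, 3.0, 4.0]
print(moving_average_forecast(history, window=2, h=2))  # [3.5 3.5]
print(ewma_forecast(history, alpha=0.5, h=2))           # [3.125 3.125]
```

Both forecasts are flat over the horizon, which is why they suit series that hover around a level and struggle with trend.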

Seasonal-naive with drift

The final canonical baseline combines the two strongest ideas: seasonal-naive for the seasonal component plus a trend correction. This handles series that are both trending and seasonal, which covers a large fraction of real business metrics—monthly revenue that grows every year while still peaking in Q4, daily call-centre volume that grows year-over-year while dipping on weekends.

This is usually the hardest baseline to beat and should be your default on any series that shows both trend and seasonality before you reach for Holt-Winters, Prophet, or anything more complex.
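There is more than one way to combine the two ideas, and the text above does not pin down a formula. The sketch below shows one possible definition, assumed here for illustration: copy the value from the same position in the last cycle, then add the average cycle-over-cycle change once for every cycle the forecast reaches ahead. The function name is hypothetical.

```python
import numpy as np

def seasonal_naive_drift_forecast(series, period, h=1):
    """Seasonal-naive forecast plus the mean cycle-over-cycle change (one assumed variant)."""
    y = np.asarray(series, dtype=float)
    if len(y) <= period:
        raise ValueError("need more than one full cycle of history")
    # Average change between each observation and the one a full cycle earlier.
    per_cycle_change = np.mean(y[period:] - y[:-period])
    steps = np.arange(h)
    base = y[-period + (steps % period)]   # seasonal-naive part (negative indices)
    cycles_ahead = steps // period + 1     # 1 for the first cycle ahead, 2 for the next
    return base + cycles_ahead * per_cycle_change

# Invented example: quarterly pattern whose level rises by 4 per cycle.
history = [0, 10, 0, -10, 4, 14, 4, -6]
print(seasonal_naive_drift_forecast(history, period=4, h=5))  # [ 8. 18.  8. -2. 12.]
```

Estimating the trend from same-season differences keeps the seasonal swing from contaminating the slope, which a first-to-last drift estimate would not guarantee.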

Diagram: what the baselines look like on an actual series

The figure below shows a stylised monthly series with a moderate upward trend and clear annual seasonality. The three forecasts shown are the naive forecast (flat line extending the last value), the seasonal-naive forecast (copies last year’s values), and the drift method (extends the trend linearly). The seasonal-naive and drift forecasts both outperform naive, but in different ways: seasonal-naive tracks the seasonal shape while drift tracks the level growth.

Three baseline forecasts from the same origin point. Seasonal-naive tracks the historical seasonal shape; drift extrapolates the average trend; naive holds the last value flat. None uses any model parameters.

Diagram: MASE as a model skill bar

The figure below shows a stylised comparison of MASE scores across five hypothetical forecasting approaches on the same dataset. The dashed line at 1.0 marks the naive forecast boundary—anything above it performs worse than the naive forecast, anything below it adds genuine predictive value.

Stylised MASE scores across five methods on the same seasonal series. The LSTM scores above 1.0, meaning the naive forecast outperforms it. Seasonal-naive+drift achieves the best MASE with zero learned parameters.

The no-free-lunch reality in forecasting

The no-free-lunch theorem, applied to forecasting, means that no single method wins across all series types. A model that wins on a trending series with stable seasonality may lose badly on a stationary noisy series or an intermittent demand series. This is why the M-competitions are instructive not as a ranking of individual methods but as evidence that the structure of the series—trend, seasonality, noise, length, frequency—determines what kind of model is appropriate.

The practical implication is that you cannot pick your model before you pick your baseline, because your baseline encodes the structural properties of the data. If your data is a random walk, the naive forecast is your baseline and anything that beats it on a proper holdout is adding genuine value. If your data is seasonal, the seasonal-naive is your baseline. If it is both, the seasonal-naive plus drift is the minimum bar.

Skipping this step is the most common forecasting mistake in industry. It is not that practitioners do not know the methods—most do. It is that the step feels unimportant because it takes ten minutes and produces something embarrassingly simple. The complexity bias pulls toward the interesting model, not the correct one.

How to read your data before picking a baseline

Before running any model, spend ten minutes on four questions.

Does the series have a trend? Plot the series and its rolling mean over a long window (at least two or three seasonal cycles). If the rolling mean is clearly not flat, you need a baseline that handles trend—drift, or seasonal-naive with drift. A trend that is unstable or nonlinear is a signal that linear trend models will underperform, and you should be especially cautious about drift on a series with structural breaks.

Does the series have seasonality? Plot the autocorrelation function (ACF) and look for peaks at the seasonal lag. If the ACF shows significant spikes at lags that correspond to a regular period (for example 7 for daily data with a weekly cycle, 12 for monthly data, or 52 for weekly data with an annual cycle), the series is seasonal and the seasonal-naive is your minimum baseline. If the ACF decays slowly without clear seasonal peaks, seasonality may not be the dominant feature.

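What is the noise level? Compute the coefficient of variation (standard deviation divided by mean). High-noise, low-signal series (intermittent demand, certain financial series) favour simpler baselines because complex models overfit noise. If CV > 1, treat the series with caution and lean toward simpler baselines. Note that the CV is only meaningful for positive-valued series with a roughly stable level: a strong trend inflates the standard deviation, and a mean near zero makes the ratio unstable.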

How long is the series? Short series cannot reliably estimate the parameters of complex models. As a rough rule of thumb, if you have fewer than two full seasonal cycles of training data, the seasonal-naive forecast is likely to be competitive with anything that tries to estimate a seasonal pattern from scratch.
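The four checks can be roughed out numerically before any plotting. The sketch below is a minimal version under stated assumptions: the candidate seasonal period is known in advance, the rolling window is two cycles long, and the series has at least that many points. `autocorrelation` and `quick_diagnostics` are hypothetical helper names.

```python
import numpy as np

def autocorrelation(series, lag):
    """Sample autocorrelation at a single positive lag."""
    y = np.asarray(series, dtype=float)
    d = y - y.mean()
    return np.sum(d[lag:] * d[:-lag]) / np.sum(d * d)

def quick_diagnostics(series, period):
    """Rough numeric answers to the trend, seasonality, noise and length questions."""
    y = np.asarray(series, dtype=float)
    window = 2 * period
    if len(y) < window:
        raise ValueError("need at least two full cycles for the rolling mean")
    # Rolling mean over two cycles; "valid" keeps only fully covered windows.
    rolling_mean = np.convolve(y, np.ones(window) / window, mode="valid")
    return {
        "rolling_mean_change": rolling_mean[-1] - rolling_mean[0],  # trend
        "acf_at_seasonal_lag": autocorrelation(y, period),          # seasonality
        "coefficient_of_variation": y.std() / y.mean(),             # noise
        "full_cycles": len(y) // period,                            # length
    }

# Invented example: six cycles of a pure quarterly pattern around a level of 100.
series = np.tile([0.0, 10.0, 0.0, -10.0], 6) + 100.0
print(quick_diagnostics(series, period=4))
```

On this example the rolling mean does not move and the autocorrelation at lag 4 is high, which points to seasonal-naive as the first baseline. The numbers are a prompt for looking at the plots, not a replacement for them.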

See the time series overview for guidance on decomposition and the tools available for each of these diagnostic steps.

The discipline: ship the baseline first

The discipline that separates strong forecasting teams from weak ones is not knowledge of advanced methods—it is the procedural commitment to establishing and documenting the baseline before anything else is built.

This means the following in practice. Write the baseline as code, not as a mental benchmark. Run it on the same train/test split you intend to use for your real model. Record its MASE, MAE, and any other metrics your stakeholders care about. Treat those numbers as the performance floor. Only then build anything more complex, and only promote a model to production if it clears that floor on a proper out-of-sample backtest.

The backtest design matters as much as the model. Walk-forward validation—where you retrain the model at each point in time using only the data available at that point—produces holdout errors that actually reflect production performance. A single train/test split is a starting point, not a production evaluation.
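A walk-forward backtest can be sketched as an expanding-window loop. The version below is a minimal illustration, assuming a forecaster is any function that takes the training history and a horizon and returns that many forecasts; `walk_forward_mae`, `naive` and the `initial_train` parameter are hypothetical names.

```python
import numpy as np

def walk_forward_mae(series, forecaster, initial_train, h=1):
    """Average holdout MAE over every forecast origin, using only past data each time."""
    y = np.asarray(series, dtype=float)
    errors = []
    for origin in range(initial_train, len(y) - h + 1):
        train = y[:origin]                # everything known at the forecast origin
        forecast = forecaster(train, h)   # "retrain" and forecast h steps ahead
        actual = y[origin:origin + h]
        errors.append(np.mean(np.abs(actual - forecast)))
    return float(np.mean(errors))

def naive(train, h):
    """Naive baseline: repeat the last observed value."""
    return np.full(h, train[-1])

series = [1, 2, 3, 4, 5, 6]
print(walk_forward_mae(series, naive, initial_train=3))  # 1.0
```

Running the baseline and the candidate model through the same loop gives errors that are directly comparable, because both see exactly the same information at each origin.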

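import numpy as np

def naive_forecast(series, h=1):
    """Return h-step naive forecast (last value repeated)."""
    series = np.asarray(series, dtype=float)
    return np.full(h, series[-1])

def seasonal_naive_forecast(series, period, h=1):
    """Return h-step seasonal naive forecast (last full cycle repeated)."""
    series = np.asarray(series, dtype=float)
    # Step i copies the value at the same position in the last observed cycle.
    return np.array([series[-(period - (i % period))] for i in range(h)])

def drift_forecast(series, h=1):
    """Return h-step drift forecast (naive + average trend)."""
    series = np.asarray(series, dtype=float)
    # The mean of the first differences simplifies to (last - first) / (n - 1).
    slope = (series[-1] - series[0]) / (len(series) - 1)
    return series[-1] + slope * np.arange(1, h + 1)

def mase(actual, forecast, insample, period=1):
    """Mean Absolute Scaled Error.

    period=1 scales by in-sample naive errors; period=m scales by
    in-sample seasonal-naive errors with cycle length m.
    """
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    insample = np.asarray(insample, dtype=float)
    # Lag-`period` differences. Note that np.diff(insample, n=period) would take
    # the period-th ORDER difference instead, which is wrong for period > 1.
    naive_errors = np.abs(insample[period:] - insample[:-period])
    scale = np.mean(naive_errors)
    return np.mean(np.abs(actual - forecast)) / scale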

The functions above are not production code—they are reference implementations that make the definitions concrete. A real production baseline should handle missing values, irregular timestamps, and edge cases around the start of the series. But the logic should be this simple.

Frequently asked questions

Q: Is the naive forecast really competitive with ARIMA on financial data?

On daily or higher-frequency financial price series, the naive forecast (last value repeated) is competitive with and often better than ARIMA on short horizons. This is consistent with the efficient market hypothesis—if prices followed a predictable pattern, that pattern would be arbitraged away. ARIMA tends to add more value on volume, volatility, and macroeconomic series where mean-reversion or autocorrelation structure is present and stable. Always verify on your specific series.

Q: Should I use MASE or MAPE for comparing forecasts to baselines?

MAPE (Mean Absolute Percentage Error) is widely used but has a known flaw: it is undefined or unstable when the actual values pass through or near zero, and it is asymmetric in a way that penalises over-forecasts more than under-forecasts. MASE does not have these problems and scales naturally to any series. Use MASE as your primary metric for baseline comparison. MAPE is fine as a communication metric for stakeholders who find percentages intuitive, but it should not drive model selection decisions.
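The near-zero instability is easy to reproduce. In the sketch below the data are invented and `mape` is a hypothetical helper; the forecast misses by exactly one unit at both points, yet the point with the small actual value dominates the score.

```python
import numpy as np

def mape(actual, forecast):
    """Mean absolute percentage error, in percent. Undefined if any actual is 0."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return 100.0 * np.mean(np.abs((actual - forecast) / actual))

actual = [100.0, 0.5]
forecast = [101.0, 1.5]   # same absolute miss of 1 unit at both points

# Percentage errors are 1% and 200%, so the average is about 100.5%.
print(mape(actual, forecast))
```

The mean absolute error on the same two points is 1 unit, so any scaled-error metric built on it treats both misses equally.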

Q: My complex model beats the baseline on training data but not on the holdout. What is happening?

Overfitting, almost certainly. Complex models have many parameters and will fit the training data well by construction. The baseline, having zero learned parameters, does not overfit. The gap between training and holdout performance is your overfitting penalty. The fix is more regularisation, less model complexity, or a longer training window—but most importantly, trust the holdout and not the training performance when making model selection decisions.

Q: At what point is it actually worth building a more complex model?

When three things hold: a clean holdout evaluation shows an out-of-sample MASE that is below 1 and below that of your best baseline, the improvement is large enough to matter to the business decision downstream, and the maintenance cost of the more complex model is justified by that improvement. If the MASE improvement is marginal (say, 0.95 for the complex model versus 0.98 for the baseline), a careful cost-benefit analysis often favours keeping the simpler model in production. Complexity has operational costs: retraining pipelines, monitoring for distributional shift, debugging when the model behaves unexpectedly. A simple baseline that performs almost as well is frequently the right production choice.