Always beat the baseline first

A startup once shipped a fraud-detection model to production, celebrated a 96.2 percent accuracy, and then spent three months wondering why the fraud team said it was useless. The answer was embarrassingly simple: 96.1 percent of transactions in the dataset were legitimate. Predicting “not fraud” for every single row — no model, no features, no GPU, just one line of Python — would have matched the production system almost exactly. The model had learned, at enormous cost, to be barely better than nothing.

That story is not unusual. It is the default outcome when teams skip the most important step in applied machine learning: building the baseline first.

What a baseline actually is

A baseline is the dumbest predictor you can construct without looking at the features — the number or label that requires zero modeling skill to produce. For a regression (predicting a continuous value like house price or revenue), it is the mean of the training labels. For a binary classification (fraud or not, churn or not), it is the majority class — always predict the more common outcome. For a time series, it is the last observed value, or a naive seasonal repeat. For a ranking problem, it is the existing business rule or alphabetical order.

The baseline does not use features. It does not have weights. It cannot overfit. It is the floor, and your model must beat it by enough to be worth the complexity it adds.
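The baselines described above can be sketched in a few lines each. This is a minimal illustration using only the standard library; the function names and the toy data are assumptions made for the example, not part of any particular library.

```python
from collections import Counter
from statistics import mean


def mean_baseline(train_labels):
    """Regression: predict the mean of the training labels for every row."""
    return mean(train_labels)


def majority_class_baseline(train_labels):
    """Classification: predict the most common training label for every row."""
    return Counter(train_labels).most_common(1)[0][0]


def last_value_baseline(series):
    """Time series: predict that the next value equals the last observed one."""
    return series[-1]


def seasonal_naive_baseline(series, season_length):
    """Time series: predict the value observed exactly one season earlier."""
    return series[-season_length]


# Toy data, invented purely for illustration.
house_prices = [200_000, 250_000, 300_000]
fraud_labels = [0, 0, 0, 0, 1]
weekly_sales = [10, 12, 30, 11, 13, 31]  # repeats with a period of 3

print(mean_baseline(house_prices))
print(majority_class_baseline(fraud_labels))
print(last_value_baseline(weekly_sales))
print(seasonal_naive_baseline(weekly_sales, season_length=3))
```

None of these functions accepts a feature matrix, which is the point: the prediction depends only on the labels or the history of the target itself.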

The embarrassingly strong majority-class baseline

Consider a medical screening dataset where 92 percent of patients do not have a rare condition. A junior team trains a gradient-boosted classifier — a powerful ensemble method that builds many decision trees sequentially, each correcting the errors of the last — and achieves 91.4 percent accuracy. They are disappointed. The baseline, predicting “negative” for every patient, achieves 92 percent accuracy. The model is not just failing to help; it is strictly worse than no model at all.

This is not pathological. It is common in any domain with class imbalance (a dataset where one outcome is far more frequent than another): fraud detection, medical diagnosis, equipment failure, customer churn, content moderation. In each of these, the negative class is almost always overwhelmingly dominant.

The resolution is not to throw out the model. It is to recognize that accuracy is the wrong metric here, and that the baseline just told you that in five seconds. Once you switch to precision and recall, or to the F1 score (the harmonic mean of precision and recall, ranging from 0 to 1), the majority-class baseline collapses: it never predicts the positive class, so its recall and F1 on that class are zero, and the real gap becomes visible. The baseline did not embarrass the model; it sharpened the question.
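To make the contrast concrete, here is a small sketch that scores an always-negative predictor on a synthetic dataset with the same 92/8 split as the screening example. The data and the helper names are assumptions for illustration; the metric definitions are the standard ones.

```python
def accuracy(y_true, y_pred):
    """Fraction of rows where the prediction matches the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the positive class (0.0 when undefined)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Synthetic labels: 8 positive patients, 92 negative ones.
y_true = [1] * 8 + [0] * 92
# Majority-class baseline: predict "negative" for everyone.
y_pred = [0] * 100

print(accuracy(y_true, y_pred))             # 0.92
print(precision_recall_f1(y_true, y_pred))  # (0.0, 0.0, 0.0)
```

The same predictor looks strong on accuracy and useless on recall and F1, which is exactly the signal that accuracy cannot separate skill from class frequency here.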

A model scoring 0.914 against a baseline of 0.903 has earned only 0.011 in real lift. That gap is the only number worth defending.

Why the baseline is a leak detector

There is a second reason to run the baseline before anything else, and it is more important than the evaluation signal: data leakage.

Leakage (when information from the future or from the label itself bleeds into the training features) inflates scores catastrophically and silently. A model trained on a leaky dataset can score 0.99 on a held-out test set and 0.51 in production. The baseline cannot be leaked into: it uses no features, so it produces the same prediction regardless of what columns exist in the dataset. So if your full model outperforms the baseline by forty absolute points on a problem that should be difficult, that gap is a red flag, not a trophy. Something in the feature set already contains the answer.

Senior practitioners have seen this enough times that a “suspiciously high gap over baseline” is now a standard sanity check in feature-engineering reviews. If the model is dramatically better than the baseline in cross-validation but collapses on real traffic, the autopsy almost always finds leakage. The baseline would have told you to look harder before you shipped.
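One way such a sanity check might look in a review script is sketched below. The function name and the `max_plausible_gap` parameter are hypothetical; the 0.40 threshold simply mirrors the "forty absolute points" mentioned above, and the scores are invented. A sensible threshold depends on how hard the problem is expected to be.

```python
def looks_suspicious(model_score, baseline_score, max_plausible_gap):
    """Flag a gap over the baseline that is larger than the team thinks plausible.

    A True result is not proof of leakage; it is a prompt to audit the features.
    """
    return (model_score - baseline_score) > max_plausible_gap


# Invented scores for illustration.
print(looks_suspicious(0.99, 0.55, max_plausible_gap=0.40))    # True
print(looks_suspicious(0.914, 0.903, max_plausible_gap=0.40))  # False
```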

The baseline is a forcing function for the right metric

Establishing a baseline is not just arithmetic. It is the act of picking your measure of success before you start optimizing. This sounds obvious, but in practice it almost never happens by default.

Teams reach for accuracy because it is the metric classification libraries report by default. Accuracy looks good on imbalanced problems because the majority class dominates it. Once you build the baseline and see that it achieves 93 percent accuracy without learning anything, you are forced — immediately, before writing a single line of model code — to choose a metric that actually distinguishes skill from luck. That might be the area under the precision-recall curve (a summary of how precision and recall trade off at all decision thresholds), or Cohen’s kappa (which corrects accuracy for the expected agreement due to chance), or a business-specific cost function that weights false negatives differently from false positives.

The baseline did not just give you a comparison point. It invalidated the metric you were about to use and sent you to find a better one. That is worth more than a week of hyperparameter tuning (adjusting the settings that control how a model trains, such as learning rate or tree depth) that would have optimized a number disconnected from actual value.
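Cohen's kappa makes the point numerically: a predictor that only exploits class frequency gets no credit. The sketch below implements the standard formula, kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected by chance. The synthetic 93/7 dataset and the choice to return 0.0 in the degenerate case are assumptions of this example.

```python
from collections import Counter


def cohens_kappa(y_true, y_pred):
    """Cohen's kappa: agreement between labels and predictions, corrected for chance."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: probability both pick the same class independently.
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1.0:
        # Only one class on both sides; kappa is undefined. This sketch returns 0.0.
        return 0.0
    return (observed - expected) / (1 - expected)


# Synthetic labels: 93 negatives, 7 positives.
y_true = [0] * 93 + [1] * 7
always_negative = [0] * 100

print(cohens_kappa(y_true, always_negative))  # 0.0, despite 93 percent accuracy
print(cohens_kappa(y_true, y_true))           # 1.0 for a perfect predictor
```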

What “good” means gets redefined

Here is the uncomfortable truth about model evaluation in industry: absolute scores mean almost nothing without the baseline. A 0.91 F1 score on a credit default prediction problem might be mediocre (if the naive predictor also achieves 0.88) or extraordinary (if the problem is genuinely hard and the naive predictor is at 0.50). The number alone cannot tell you which.

The baseline redefines “good” as “good relative to what requires no skill.” That is the only definition of good that matters in practice. When a model goes to a business stakeholder, the question they are really asking is: compared to what we could do without this system, how much better is it? The baseline answers that question exactly.

This redefinition has a side effect that practitioners rarely mention: it makes progress visible. Teams that skip the baseline often feel stuck because they cannot tell if moving from 0.89 to 0.91 F1 is meaningful. If the baseline is at 0.87, that two-point jump doubled the real lift, from 0.02 to 0.04. If the baseline is at 0.905, the same jump only moved the model from 0.015 below the floor to 0.005 above it, which is almost no useful signal. The absolute numbers are identical; the interpretations are very different.
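The arithmetic for the two scenarios is short enough to check directly. This sketch uses the F1 values and baselines quoted in the paragraph above; the function name is an assumption.

```python
def lift_over_baseline(score, baseline):
    """Absolute lift: how far a score sits above (or below) the baseline."""
    return score - baseline


old_f1, new_f1 = 0.89, 0.91

for baseline in (0.87, 0.905):
    before = lift_over_baseline(old_f1, baseline)
    after = lift_over_baseline(new_f1, baseline)
    print(f"baseline={baseline}: lift {before:+.3f} -> {after:+.3f}")

# baseline=0.87: lift +0.020 -> +0.040
# baseline=0.905: lift -0.015 -> +0.005
```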

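Four problem types, four baselines: the mean of the training labels for regression, the majority class for binary classification, the last observed value or a naive seasonal repeat for time series, and the existing business rule or alphabetical order for ranking. Build the baseline before any model for the same problem.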

The 80/1 rule

Here is a less discussed phenomenon: for a surprising number of real-world problems, the baseline is not just useful for calibration — it is directly shippable.

In demand forecasting for retail, the “same week last year” naive baseline beats expensive deep-learning forecasters on roughly 40 percent of SKUs (stock-keeping units — individual product variants tracked in inventory), because those products have clean seasonal patterns and the incremental modeling complexity adds noise rather than signal. In recommender systems (algorithms that suggest content or products to users), the “most popular items” baseline outperforms collaborative filtering on cold-start users — those with no history — because there is no personal signal to work with yet.

In each of these cases, the team that shipped the baseline first is already generating value while the model is still being trained. The team that skipped the baseline is shipping a complex system on day thirty that the business cannot distinguish from the baseline they never built.

The ratio is not always 80/1 (read here as roughly 80 percent of the value for 1 percent of the effort), but the structure is consistent: the baseline costs a trivial amount of engineering time and delivers a significant fraction of the production value. The model delivers the remaining fraction, and that fraction is worth pursuing, but it is smaller than most teams estimate, and it comes later than most plans assume.
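As an illustration of how little code a shippable baseline can need, here is a sketch of the "most popular items" recommender mentioned above. The function name, the `(user, item)` tuple format, and the interaction log are all assumptions made for the example.

```python
from collections import Counter


def most_popular_items(interactions, k):
    """Return the k items with the most interactions, most popular first.

    `interactions` is an iterable of (user_id, item_id) pairs. The result is
    the same for every user, so it works for users with no history at all.
    """
    counts = Counter(item for _user, item in interactions)
    return [item for item, _count in counts.most_common(k)]


# Invented interaction log for illustration.
log = [
    ("u1", "a"), ("u2", "a"), ("u3", "a"),
    ("u1", "b"), ("u2", "b"),
    ("u3", "c"),
]

print(most_popular_items(log, k=2))  # ['a', 'b']
```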

How senior engineers use this in practice

The pattern at strong ML teams is not mysterious. The first artifact in any modeling project is a baseline results table: a spreadsheet or notebook cell that records the baseline metric, the metric type, the evaluation split (the held-out portion of data used to measure performance), and the date it was produced. Every subsequent model must beat that number by a threshold agreed upon before training begins.
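A row of such a results table could be represented roughly as follows. This is a sketch, not a prescribed schema: the class name, field names, and every value in the example record are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class BaselineRecord:
    """One row of a baseline results table, fixed before any model is trained."""
    metric_name: str         # e.g. "f1" or "pr_auc"
    metric_value: float      # the baseline's score on that metric
    evaluation_split: str    # which held-out split produced the score
    produced_on: date        # when the baseline was measured
    required_margin: float   # gap a model must reach, agreed in advance

    def is_beaten_by(self, model_score: float) -> bool:
        """True when the model clears the baseline by the agreed margin."""
        return model_score - self.metric_value >= self.required_margin


# Hypothetical record for illustration.
record = BaselineRecord(
    metric_name="f1",
    metric_value=0.87,
    evaluation_split="holdout",
    produced_on=date(2024, 1, 15),
    required_margin=0.02,
)

print(record.is_beaten_by(0.91))  # True: lift of 0.04 clears the 0.02 margin
print(record.is_beaten_by(0.88))  # False: lift of 0.01 does not
```

Making the record immutable (`frozen=True`) is a deliberate choice in this sketch: the floor should not move once training has started.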

That threshold is the conversation no one wants to have early and everyone wants to have late. “Significantly better” means different things to the model team and to the product team. To the model team, a 2 percent F1 improvement is a real result with statistical significance. To the product team, a 2 percent F1 improvement from a model that takes 200 milliseconds per inference and costs 15 dollars per thousand predictions might not move a business metric. Establishing the baseline early surfaces that conversation before any model weights exist.

This is why the baseline-first principle is not a modeling guideline. It is a communication tool. It translates “our model is good” into “our model produces X additional correct predictions per thousand that the current system misses, which, at our transaction volume, corresponds to Y recovered revenue per week.” That is the sentence that gets a model deployed.
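The X and Y in that sentence are simple arithmetic once the baseline exists. The sketch below assumes accuracy as the metric and uses invented figures for weekly volume and value per correct prediction; both function names are hypothetical.

```python
def extra_correct_per_thousand(model_accuracy, baseline_accuracy):
    """X: additional correct predictions per 1,000 rows over the baseline."""
    return (model_accuracy - baseline_accuracy) * 1000


def weekly_value(extra_per_thousand, weekly_volume, value_per_correct):
    """Y: value per week implied by X at a given volume and unit value."""
    return extra_per_thousand / 1000 * weekly_volume * value_per_correct


# Invented inputs for illustration only.
x = extra_correct_per_thousand(model_accuracy=0.914, baseline_accuracy=0.903)
y = weekly_value(x, weekly_volume=200_000, value_per_correct=5.0)

print(f"X = {x:.1f} extra correct predictions per thousand")
print(f"Y = {y:,.0f} recovered per week")
```

With these assumed inputs, X comes out at 11 per thousand and Y at 11,000 per week; the real work is agreeing on the volume and unit value with the product team.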

The failure mode that looks like success

There is one scenario where teams skip the baseline and appear to succeed: the problem is hard, the dataset is balanced, the baseline is genuinely weak, and the model correctly earns a large gap. In that scenario, skipping the baseline feels fine because the model looks good on every cut.

The cost appears later, when the model is replaced. The successor team starts from scratch trying to understand what the previous system was actually doing, whether the reported numbers were real, and what the improvement trajectory has been over time. The baseline was never established, so there is no historical floor, no way to know whether successive model versions were actually improving on anything meaningful, and no defensible answer to “how much better is this than doing nothing.”

Every model in a production system eventually becomes the baseline for the next one. If you never wrote down the floor at the beginning, the entire lineage of improvement is undocumented and unverifiable.

What this actually demands of you

Building the baseline first sounds easy, and technically it is: five lines of Python, at most. The hard part is the organizational commitment it implies. It means agreeing on a metric before you like what the numbers say. It means publishing a result that looks unimpressive and explaining why it is the right starting point. It means having the conversation about “is the gap actually worth it” before you have invested months in a model you want to defend.

Most ML practitioners have built the baseline after the model, as a retroactive sanity check. That sequence defeats the purpose. The baseline only forces the right decisions when it is built first, before any model exists to anchor your judgment about what counts as progress.

The teams that do this consistently — that put baseline metrics in their project kick-off documents, that block model deployment on beating the baseline by a predefined margin, that maintain a living results table from day one — are the teams whose models actually ship and stay in production. Not because their models are more sophisticated. Because they are honest about what the models are actually doing.

A model that beats a strong baseline by a small margin, measured correctly, on the right metric, is more valuable than a model that beats a weak or absent baseline by an enormous margin. The first number is a fact. The second is a story.

Build the baseline first. Then earn the gap.