Accuracy lies on imbalanced data

A bank’s data science team spent six weeks training a neural network on twelve months of transaction data. They tuned hyperparameters, added dropout, tried three different optimizers. The final model scored 99.2 percent accuracy on the holdout set. They shipped it, proud and slightly smug.

The fraud team filed a complaint within two weeks. The model was catching almost nothing. After investigation, the post-mortem contained one sentence that made everyone feel sick: “Out of 2.4 million transactions, 99.1 percent were legitimate. The model learned to say ‘not fraud’ every single time.”

A trivially stupid rule (refuse to flag anything, ever) would have scored 99.1 percent accuracy. The six-week model had earned 0.1 percentage points of actual lift, while nearly every real fraud case slid through undetected. Accuracy had not measured performance. It had measured the imbalance of the dataset and handed it back to them dressed up as a score.

The trick accuracy plays

Accuracy is the fraction of all predictions that were correct. When the data is balanced — roughly equal numbers of positives and negatives — that fraction is informative. When one class dominates, accuracy becomes a measure of how often the majority class appears, dressed up as model quality.

This is not a failure of arithmetic. The formula is correct. The problem is that accuracy treats every mistake as equally costly and every correct prediction as equally valuable. On a dataset where 99 out of 100 rows are negative, getting those 99 right contributes 99 percent to the score regardless of what you do with the one positive. The positive is a rounding error inside the accuracy calculation, even though catching it might be the entire business purpose of the model.
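The arithmetic above can be checked with a minimal sketch. The function name and the 99-to-1 toy dataset are illustrative assumptions, not anything taken from the bank's system; the point is only that a majority-class baseline is worth computing before any accuracy figure is trusted.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of a 'model' that always predicts the most common label."""
    counts = Counter(labels)
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(labels)


# Hypothetical dataset: 99 negatives (0) and 1 positive (1).
labels = [0] * 99 + [1]
baseline = majority_baseline_accuracy(labels)
print(f"majority-class baseline accuracy: {baseline:.2f}")
```

Any model whose accuracy sits close to this baseline has learned little beyond the class distribution.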

The confusion matrix exists to break that aggregation apart.

The confusion matrix: four cells, one honest picture

The confusion matrix (a 2x2 table that counts every possible combination of prediction and ground truth) is the bedrock of classification evaluation. Every metric discussed here (accuracy, precision, recall, F1) is derived from it.

For binary classification with a positive class (fraud, disease, failure) and a negative class (legitimate, healthy, operating normally), the four cells are:

True Positive (TP): model said positive, it was positive. A caught fraud.
False Positive (FP): model said positive, it was negative. A legitimate transaction wrongly flagged.
False Negative (FN): model said negative, it was positive. A fraud that slipped through.
True Negative (TN): model said negative, it was negative. A clean transaction correctly ignored.

Accuracy is (TP + TN) / (TP + TN + FP + FN). On a dataset with 99,000 negatives and 1,000 positives, a model that predicts negative for every row has TP = 0, FP = 0, FN = 1,000, TN = 99,000. Accuracy: 99,000 / 100,000 = 99%. Frauds caught: zero.

Figure: confusion matrix of the always-negative model. Every fraud escapes. The 99% accuracy score is technically correct and operationally meaningless.

Looking at the four cells instead of the aggregate score, the model’s uselessness is immediately obvious. That is what the confusion matrix is for: it refuses to let the majority class hide the minority class.
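The four cells for the always-negative example can be counted directly. This is a minimal sketch; the function name and the 0/1 label encoding are assumptions made for illustration, and the dataset simply reproduces the 99,000 negatives and 1,000 positives used above.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels encoded as 1 (positive) and 0 (negative)."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 1:
            fn += 1
        else:
            tn += 1
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


# 1,000 positives and 99,000 negatives; the model predicts negative for every row.
y_true = [1] * 1_000 + [0] * 99_000
y_pred = [0] * len(y_true)

cells = confusion_counts(y_true, y_pred)
accuracy = (cells["TP"] + cells["TN"]) / len(y_true)
print(cells, f"accuracy={accuracy:.2%}")
```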

Precision: the cost of a false alarm

Precision is the fraction of flagged cases that were actually positive. In terms of the matrix: TP / (TP + FP).

Think of it from the fraud investigator’s desk. Every flagged transaction lands in their queue for review. If precision is low, the queue is full of legitimate customers who are annoyed that their card was declined or their account was frozen. High false-positive rates burn investigator time, erode customer trust, and in medical screening they expose healthy people to follow-up procedures that carry their own risks.

The always-negative model is sometimes credited with perfect precision, since it never raises a false alarm, but that is a convention rather than a calculation. With zero positive predictions, the denominator TP + FP is zero and precision is undefined. Models that do flag something pay for every false alarm with lower precision. A model that flags only the twenty transactions it is most certain are fraud might achieve precision of 0.95, meaning 19 of its 20 flags are true frauds. Whether that is good enough depends entirely on what the false alarm costs.
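The zero-denominator case is easy to overlook in code. The sketch below returns None when nothing was flagged, which is one possible convention and an assumption of this example; libraries differ here (scikit-learn's precision_score, for instance, exposes a zero_division parameter for this case).

```python
def precision(tp, fp):
    """TP / (TP + FP), or None when the model flagged nothing at all."""
    flagged = tp + fp
    if flagged == 0:
        return None  # undefined: there are no flags to be right or wrong about
    return tp / flagged


# Always-negative model: no flags, so precision is undefined.
print(precision(tp=0, fp=0))

# Hypothetical cautious model: 19 true frauds among 20 flags.
print(precision(tp=19, fp=1))
```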

Recall: the cost of a miss

Recall (also called sensitivity or the true positive rate) is the fraction of actual positives that the model found. In matrix terms: TP / (TP + FN).

This is the number that matters most when a miss is catastrophic. In cancer screening, a false negative means a tumor grows undetected for another year. In fraud detection, every false negative is money lost and a compromised customer. The always-negative model achieves 0 recall, which is the mathematical statement that it catches nothing at all.

Recall is also what the 99% accuracy number was hiding. No amount of high accuracy can compensate for zero recall on the positive class, because the entire purpose of the model is to find positives.
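Putting recall next to accuracy for the always-negative model makes the gap concrete. This is a small sketch with illustrative function names, reusing the 99,000 / 1,000 split from the confusion matrix example.

```python
def recall(tp, fn):
    """TP / (TP + FN), or None when the data contains no actual positives."""
    actual_positives = tp + fn
    if actual_positives == 0:
        return None
    return tp / actual_positives


def accuracy(tp, fp, fn, tn):
    """(TP + TN) / all predictions."""
    return (tp + tn) / (tp + fp + fn + tn)


# Always-negative model on 1,000 positives and 99,000 negatives.
print(f"accuracy = {accuracy(tp=0, fp=0, fn=1_000, tn=99_000):.2f}")
print(f"recall   = {recall(tp=0, fn=1_000):.2f}")
```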

The tradeoff: why you cannot just maximize both

Precision and recall are in fundamental tension, and understanding why is more useful than memorizing the fact.

Most classifiers rely on a threshold. Probabilistic models (logistic regression, gradient-boosted trees, neural networks) output a score between 0 and 1 representing their confidence that the case is positive. You then choose a cutoff: anything above it gets flagged as positive. That cutoff is not discovered by training. It is chosen by you, after training.

Lower the threshold and you flag more things. More of the real frauds get caught (recall rises) but more legitimate transactions get swept up too (precision falls). Raise the threshold and precision improves — the flags you do raise are more reliable — but you miss more real cases and recall falls.
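The effect of moving the cutoff can be seen on a toy example. The ten scores and labels below are made up for illustration, and the sketch flags a case when its score is at or above the threshold (an assumption; a strict comparison would work equally well).

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when every score >= threshold is flagged as positive."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return precision, recall


# Hypothetical model scores and true labels (1 = fraud, 0 = legitimate).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]

for threshold in (0.85, 0.50, 0.15):
    p, r = precision_recall_at(scores, labels, threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, lowering the threshold from 0.85 to 0.15 takes recall from 0.50 to 1.00 while precision drops from 1.00 to 0.50. Real score distributions are noisier, and precision need not fall at every single step, but the overall direction is the same.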

This is not a bug or a limitation. It is the structure of the problem. The optimal threshold depends on the relative cost of false positives and false negatives in your specific domain, and those costs are almost never equal. A fraud model at a consumer bank probably tolerates some false alarms (a customer can call and get the flag lifted) but cannot tolerate high false-negative rates on large transactions. A cancer screening model probably optimizes heavily for recall, aiming to miss no one, and accepts a higher false-positive rate because a confirmatory follow-up test, while not free of cost or risk, is usually far less harmful than a missed tumor. These are business decisions, not statistical ones.

Figure: precision and recall as the threshold moves. Moving the threshold left (lower) catches more positives (recall rises) at the cost of more false alarms (precision falls). The chosen threshold encodes a business judgment about which error costs more.
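The business judgment behind a threshold can be made explicit by attaching a cost to each kind of error and picking the cutoff with the lowest total. The sketch below is one simple way to do that; the scores, labels, candidate thresholds, and cost figures are all hypothetical.

```python
def total_cost(scores, labels, threshold, cost_fp, cost_fn):
    """Total cost of the errors made when every score >= threshold is flagged."""
    pairs = list(zip(scores, labels))
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    return fp * cost_fp + fn * cost_fn


def cheapest_threshold(scores, labels, candidates, cost_fp, cost_fn):
    """Candidate threshold with the lowest total error cost on this data."""
    return min(
        candidates,
        key=lambda t: total_cost(scores, labels, t, cost_fp, cost_fn),
    )


# Hypothetical model scores and true labels (1 = positive, 0 = negative).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]
candidates = [0.85, 0.50, 0.15]

# Misses cost ten times as much as false alarms: the low threshold wins.
miss_averse = cheapest_threshold(scores, labels, candidates, cost_fp=1, cost_fn=10)

# False alarms cost ten times as much as misses: the high threshold wins.
alarm_averse = cheapest_threshold(scores, labels, candidates, cost_fp=10, cost_fn=1)

print(miss_averse, alarm_averse)
```

The model and its scores are identical in both runs. Only the cost assumptions change, and the preferred threshold changes with them.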

F1: a single number that respects the tradeoff

The F1 score is the harmonic mean of precision and recall: 2 * (precision * recall) / (precision + recall). The harmonic mean (as opposed to the arithmetic mean) has a useful property: it punishes extreme values. A model with precision 1.0 and recall 0.0 gets an F1 of 0, not 0.5. This is the right behavior — a model that catches nothing is useless regardless of how pure its empty flag list is.
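The difference between the two means is easiest to see side by side. A minimal sketch with illustrative function names; returning 0.0 when both inputs are zero is an assumed convention to avoid dividing by zero.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def arithmetic_mean(precision, recall):
    """Plain average, shown only for comparison."""
    return (precision + recall) / 2


# The degenerate case from the text: precision 1.0, recall 0.0.
print(f1_score(1.0, 0.0), arithmetic_mean(1.0, 0.0))

# A balanced model scores the same under both means.
print(f1_score(0.5, 0.5), arithmetic_mean(0.5, 0.5))
```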

F1 is a reasonable single summary when you have no strong prior about which error costs more. It is not perfect — it weights precision and recall equally, which is rarely true in practice — but it is a vast improvement over accuracy on imbalanced data, and it is the default metric to reach for when someone asks “so how good is the classifier?”

There is a generalization called F-beta where beta is a parameter: at beta = 2 you weight recall twice as heavily as precision, which suits the fraud case; at beta = 0.5 you weight precision more heavily, which suits cases where false alarms are expensive. The formula is (1 + beta^2) * (precision * recall) / ((beta^2 * precision) + recall). The underlying intuition is the same: tune the weight of the two errors to reflect your domain’s actual cost structure.
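The F-beta formula translates directly into code. In this sketch the two models being compared are hypothetical: one favors recall, the other favors precision, and they tie on F1.

```python
def fbeta_score(precision, recall, beta):
    """(1 + beta^2) * P * R / (beta^2 * P + R), with 0.0 when the denominator is zero."""
    beta_sq = beta ** 2
    denominator = beta_sq * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta_sq) * precision * recall / denominator


# Two hypothetical models with mirrored strengths.
recall_heavy = {"precision": 0.5, "recall": 1.0}
precision_heavy = {"precision": 1.0, "recall": 0.5}

for beta in (0.5, 1.0, 2.0):
    a = fbeta_score(recall_heavy["precision"], recall_heavy["recall"], beta)
    b = fbeta_score(precision_heavy["precision"], precision_heavy["recall"], beta)
    print(f"beta={beta}: recall-heavy={a:.3f}  precision-heavy={b:.3f}")
```

At beta = 1 the two models are indistinguishable. At beta = 2 the recall-heavy model comes out ahead, and at beta = 0.5 the ranking flips.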

What this means for how you build models

The metric you optimize shapes what the model learns. Most training frameworks expect a differentiable loss function, and precision, recall, and F1 are computed from hard counts, so they cannot be used directly as gradient-based training losses. But you can use them as evaluation metrics during hyperparameter search and model selection, and that is what matters: the criterion by which you choose between candidate models.
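How much the selection criterion matters shows up as soon as two criteria disagree. In the sketch below, the candidate names and their validation-set confusion counts are invented for illustration. F-beta is computed from counts using (1 + beta^2) * TP / ((1 + beta^2) * TP + beta^2 * FN + FP), which is algebraically the same as the precision/recall form.

```python
def accuracy(cells):
    """(TP + TN) / all predictions."""
    return (cells["TP"] + cells["TN"]) / sum(cells.values())


def f_beta(cells, beta=2.0):
    """F-beta from confusion-matrix counts (0.0 when there is nothing to score)."""
    beta_sq = beta ** 2
    numerator = (1 + beta_sq) * cells["TP"]
    denominator = numerator + beta_sq * cells["FN"] + cells["FP"]
    return numerator / denominator if denominator else 0.0


# Hypothetical validation results for three candidate models
# on 1,000 positives and 99,000 negatives.
candidates = {
    "always_negative": {"TP": 0, "FP": 0, "FN": 1_000, "TN": 99_000},
    "candidate_a": {"TP": 800, "FP": 1_600, "FN": 200, "TN": 97_400},
    "candidate_b": {"TP": 300, "FP": 30, "FN": 700, "TN": 98_970},
}

best_by_accuracy = max(candidates, key=lambda name: accuracy(candidates[name]))
best_by_f2 = max(candidates, key=lambda name: f_beta(candidates[name], beta=2.0))

print("selected by accuracy:", best_by_accuracy)
print("selected by F2:      ", best_by_f2)
```

Selecting by accuracy picks the candidate that misses 70 percent of the positives. Selecting by F2 picks the one that catches 80 percent of them, even though it has the lowest accuracy of the three.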

This is why reporting accuracy on an imbalanced dataset in a model card or a product demo is not just a naive mistake. It is a communication failure. The number looks good, it is not wrong in the narrow sense, and it will mislead everyone who reads it unless they already know to be suspicious. Senior practitioners instinctively ask “what is the class distribution?” every time they see an accuracy number on a classification task. That skepticism is earned.

The confusion matrix is what grounds the conversation. Four cells, plainly labeled, cannot hide behind aggregate arithmetic. TP = 0 is readable by a non-technical stakeholder. “The model caught zero fraud cases” is a sentence anyone can understand.

Start there. Every other metric is a weighted summary of those four cells, designed for a specific cost structure in a specific domain. Once you know the cells, you can choose the summary that matches the decision you are actually making — rather than the one that makes the model look most impressive.

The bank from the opening of this piece eventually rebuilt the system. They set recall as the primary metric, accepted a precision of about 0.61 (roughly three in five flags were genuine fraud), and reduced fraudulent losses by 74 percent in the first quarter. The accuracy of the new system was 98.3 percent, lower than the model that caught nothing. Nobody cared.