Machine Learning · Medium · Asked at Google, Amazon, Microsoft, LinkedIn

What is the F1 score, why use the harmonic mean, and when is it the wrong metric?

For: Data Scientist, ML Engineer, Data Analyst, AI / LLM Engineer

The short answer

F1 is the harmonic mean of precision and recall: 2PR/(P+R). The harmonic mean penalises a large gap between the two: a model with 1.0 precision and 0.01 recall gets F1 ≈ 0.02, not 0.505. F1 can mislead when positives are very rare, because it is computed at a single threshold and ignores true negatives, or when the costs of false positives and false negatives differ sharply. In those cases F-beta, PR-AUC, or a cost-weighted metric is more appropriate.

How to think about it

Cover the formula, the harmonic-mean motivation, the generalisation to F-beta, and the cases where F1 misleads.

The formula

F1 = 2 * Precision * Recall / (Precision + Recall)

Equivalently: F1 = 2TP / (2TP + FP + FN).
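The two forms of the formula can be checked against each other with a minimal sketch. The function names and the counts (80 true positives, 20 false positives, 40 false negatives) are made up for illustration, and returning 0.0 when a denominator is zero is an assumed convention, not part of the definition.

```python
def f1_from_pr(tp: int, fp: int, fn: int) -> float:
    """F1 via precision and recall: 2PR / (P + R)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0  # assumed convention for the undefined case
    return 2 * precision * recall / (precision + recall)


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 directly from confusion-matrix counts: 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0  # assumed convention


# Hypothetical counts: precision = 0.8, recall = 80/120
tp, fp, fn = 80, 20, 40
print(f"via P and R:  {f1_from_pr(tp, fp, fn):.4f}")
print(f"via counts:   {f1_from_counts(tp, fp, fn):.4f}")  # both are 160/220 ≈ 0.7273
```

Note that true negatives never appear in either form, which is why F1 says nothing about how well the negative class is handled.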

Why harmonic mean, not arithmetic mean?

The arithmetic mean of precision and recall, (P + R) / 2, rewards a model that nails one metric while completely ignoring the other. A model with precision = 1.0 and recall = 0.01 has an arithmetic mean of 0.505, suggesting it’s decent. The harmonic mean gives about 0.02 (2 × 1.0 × 0.01 / 1.01 ≈ 0.0198), correctly signalling that the model is nearly useless.

The harmonic mean never exceeds the arithmetic mean and is at most twice the smaller of the two values, so the weaker of precision and recall caps the score. That’s the intent: both precision and recall must be simultaneously high for F1 to be high.
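As a rough illustration of how the two means diverge, the sketch below compares them on a few hypothetical precision/recall pairs. The pairs and function names are invented for the example.

```python
def arithmetic_mean(p: float, r: float) -> float:
    return (p + r) / 2


def harmonic_mean(p: float, r: float) -> float:
    # Defined as 0.0 when both inputs are 0 (assumed convention).
    return 2 * p * r / (p + r) if p + r else 0.0


# Hypothetical (precision, recall) pairs, from very lopsided to balanced.
pairs = [(1.0, 0.01), (0.9, 0.5), (0.7, 0.7)]

for p, r in pairs:
    am = arithmetic_mean(p, r)
    hm = harmonic_mean(p, r)
    # The harmonic mean is bounded above by twice the smaller value.
    assert hm <= 2 * min(p, r) + 1e-12
    print(f"P={p:.2f} R={r:.2f}  arithmetic={am:.4f}  harmonic={hm:.4f}")
```

The gap between the two means shrinks as precision and recall approach each other, and vanishes when they are equal.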

F-beta: weighting one direction

When the costs are asymmetric, use F-beta:

F-beta = (1 + beta²) * P * R / (beta² * P + R)

beta = 1 → standard F1, equal weight.
beta = 2 → recall counts twice as much (good for medical screening, fraud detection).
beta = 0.5 → precision counts twice as much (good for spam filtering, precision-first retrieval).
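The effect of beta is easiest to see on two models that F1 cannot tell apart. The sketch below uses two hypothetical models with mirrored precision and recall; the function name `f_beta` and the numbers are assumptions for illustration only.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    b2 = beta * beta
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0


# Two hypothetical models with mirrored precision and recall.
models = {
    "precise_model": (0.9, 0.5),   # few false positives, misses many positives
    "sensitive_model": (0.5, 0.9), # catches most positives, more false alarms
}

for name, (p, r) in models.items():
    scores = {beta: f_beta(p, r, beta) for beta in (0.5, 1.0, 2.0)}
    print(name, {beta: round(score, 4) for beta, score in scores.items()})
```

Both models get the same F1 (about 0.643), but F2 prefers the sensitive model and F0.5 prefers the precise one, which is the behaviour you want when one kind of error is more expensive.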

When F1 is the wrong metric

Situation	Better metric
Severe class imbalance (less than 1% positives)	PR-AUC, or F1 computed per class and macro-averaged
Need threshold-free comparison	PR-AUC or ROC-AUC
Cost of FP and FN differ a lot	F-beta, or explicit cost matrix
Multi-class with unequal class sizes	Macro or weighted F1, not micro
True negatives matter (e.g. content moderation both directions)	Matthews Correlation Coefficient (MCC)
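The last row of the table can be illustrated with a small sketch: two hypothetical confusion matrices share the same TP, FP, and FN and differ only in true negatives. F1 is identical for both, while MCC is not. The counts and function names are invented for the example, and returning 0.0 for a zero denominator is an assumed convention.

```python
import math


def f1(tp: int, fp: int, fn: int) -> float:
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical counts: same TP/FP/FN, very different true negatives.
tp, fp, fn = 90, 10, 10
for tn in (0, 890):
    print(f"TN={tn:>3}  F1={f1(tp, fp, fn):.4f}  MCC={mcc(tp, fp, fn, tn):.4f}")
# F1 stays at 0.9 in both cases; MCC moves from -0.1 to about 0.889.
```

With no true negatives the classifier gets every actual negative wrong, yet F1 still reports 0.9; MCC exposes the difference because it uses all four cells.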

Macro vs. micro F1

In multi-class problems:

Micro F1 pools all TP, FP, FN across classes before computing, so it is dominated by the largest class. In single-label multi-class problems it equals accuracy.
Macro F1 computes F1 per class then averages — treats every class equally regardless of size.
Weighted F1 weights per-class F1 by support — a compromise for imbalanced multi-class.
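The three averaging schemes can be sketched from scratch on a toy imbalanced dataset. The labels, predictions, and helper names below are hypothetical, and scoring a class with no true or predicted positives as 0.0 is an assumed convention.

```python
from collections import Counter


def f1(tp: int, fp: int, fn: int) -> float:
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0  # assumed convention


def per_class_counts(y_true, y_pred):
    """One-vs-rest TP/FP/FN for every label seen in truth or predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    counts = {c: {"tp": 0, "fp": 0, "fn": 0} for c in labels}
    for t, p in zip(y_true, y_pred):
        if t == p:
            counts[t]["tp"] += 1
        else:
            counts[p]["fp"] += 1  # predicted class gains a false positive
            counts[t]["fn"] += 1  # true class gains a false negative
    return counts


def averaged_f1(y_true, y_pred):
    counts = per_class_counts(y_true, y_pred)
    support = Counter(y_true)  # number of true examples per class
    per_class = {c: f1(**k) for c, k in counts.items()}
    micro = f1(
        sum(k["tp"] for k in counts.values()),
        sum(k["fp"] for k in counts.values()),
        sum(k["fn"] for k in counts.values()),
    )
    macro = sum(per_class.values()) / len(per_class)
    weighted = sum(per_class[c] * support[c] for c in per_class) / len(y_true)
    return micro, macro, weighted


# Toy data: class "a" dominates; the single "b" example is misclassified.
y_true = ["a"] * 8 + ["b", "c"]
y_pred = ["a"] * 8 + ["a", "c"]

micro, macro, weighted = averaged_f1(y_true, y_pred)
print(f"micro={micro:.4f}  macro={macro:.4f}  weighted={weighted:.4f}")
# micro = 0.9 (same as accuracy), macro ≈ 0.647, weighted ≈ 0.853
```

The complete failure on class "b" barely moves micro and weighted F1 but pulls macro F1 down sharply, which is why macro is the usual choice when rare classes matter.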
