datarekha

Confusion Matrix, Precision, Recall, ROC

Accuracy hides failure on imbalanced data. The confusion matrix splits errors into the two kinds that matter, and precision, recall, and F1 read straight off it; ROC and AUC follow from how its cells shift as the decision threshold moves.

9 min read Intermediate GATE DA Lesson 84 of 122

What you'll learn

  • The confusion matrix: TP, FP, FN, TN — the four cells every metric is built from
  • accuracy = (TP+TN)/total; precision = TP/(TP+FP); recall = TP/(TP+FN)
  • F1 as the harmonic mean of precision and recall, and why accuracy misleads on imbalanced data
  • ROC curve (TPR vs FPR) and AUC as a threshold-free ranking score

Before you start

A spam filter that calls everything “not spam” is 99% accurate when 99% of mail is real, and completely useless. A single accuracy number cannot tell you which kind of mistake the model makes. The confusion matrix can: it splits every prediction into four cells. Precision, recall, and F1 read directly off those cells, and ROC and AUC describe how the cells change as the decision threshold moves.
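
As a quick numerical illustration of the spam example, the sketch below assumes a hypothetical inbox of 1,000 messages, 10 of them spam (the 99%/1% split from the text), and a filter that always answers “not spam”. The counts and variable names are made up for illustration.

```python
# Hypothetical inbox: 1,000 messages, 1% spam (numbers chosen for illustration).
labels = [1] * 10 + [0] * 990        # 1 = spam, 0 = real mail
predictions = [0] * len(labels)      # the filter always says "not spam"

# Accuracy: fraction of predictions that match the label.
correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

# How much spam did the filter actually catch?
spam_caught = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
spam_total = sum(labels)

print(f"accuracy    = {accuracy:.2f}")                  # accuracy    = 0.99
print(f"spam caught = {spam_caught} of {spam_total}")   # spam caught = 0 of 10
```

The accuracy looks excellent while the filter catches no spam at all, which is the failure a single number hides.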

The confusion matrix

Cross “what the model predicted” against “what was actually true.” Each of the four cells counts one outcome:

                      Actual Positive        Actual Negative
Predicted Positive    TP (true positive)     FP (false positive)
Predicted Negative    FN (false negative)    TN (true negative)

Diagonal = correct (TP, TN); off-diagonal = the two error types (FP, FN).
Every classification metric is just a ratio of these four counts.
  • TP — predicted positive, actually positive (a correct hit).
  • FP — predicted positive, actually negative (a false alarm).
  • FN — predicted negative, actually positive (a miss).
  • TN — predicted negative, actually negative (a correct rejection).
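
The four definitions above can be sketched as a simple tally. This is a minimal illustration, assuming binary labels encoded as 1 (positive) and 0 (negative); the function name `confusion_counts` and the toy labels are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for two equal-length sequences of binary labels."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1      # correct hit
        elif predicted == positive:
            fp += 1      # false alarm
        elif actual == positive:
            fn += 1      # miss
        else:
            tn += 1      # correct rejection
    return tp, fp, fn, tn


# Toy labels, made up for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(tp, fp, fn, tn)   # 3 1 1 3
```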

The four metrics

From those four counts:

accuracy  = (TP + TN) / total
precision = TP / (TP + FP)        of predicted-positive, how many right
recall    = TP / (TP + FN)        of actual-positive, how many caught
F1        = 2 · P · R / (P + R)
Precision and recall differ only in their denominator — predicted-positives vs actual-positives.
  • Accuracy = (TP + TN) / total — fraction of all predictions that were correct.
  • Precision = TP / (TP + FP) — of everything flagged positive, how many really were. High precision means few false alarms.
  • Recall (sensitivity, TPR) = TP / (TP + FN) — of everything actually positive, how many you caught. High recall means few misses.
  • F1 = 2 · P · R / (P + R) — the harmonic mean of precision and recall; it stays low unless both are high, so it summarises the trade-off in one number.
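
The four formulas translate directly into code. The sketch below is one possible helper, not a reference implementation: the name `classification_metrics` is hypothetical, and returning 0.0 when a denominator is zero is an assumed convention (other conventions exist). The example counts are made up.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from the four cell counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    # Assumed convention: an undefined ratio (zero denominator) is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: 100 samples in total.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name:9s} = {value:.3f}")
# accuracy  = 0.700
# precision = 0.800
# recall    = 0.667
# f1        = 0.727
```

Precision and recall share the numerator TP; only the denominator changes, which is why the two can diverge sharply on the same matrix.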

ROC and AUC

A classifier outputs a score; you pick a threshold to turn it into a yes/no. Sweep that threshold from strict to lenient and plot TPR (recall, TP/(TP+FN)) on the y-axis against FPR (FP/(FP+TN)) on the x-axis — that traced curve is the ROC curve. AUC (area under it) is the probability that the model ranks a random positive above a random negative: 0.5 is random guessing, 1.0 is perfect. Because AUC is threshold-free, it is a clean single-number comparison of two models’ ranking ability.
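
To make the threshold sweep concrete, the sketch below traces the ROC points for a tiny made-up set of scores and computes AUC two ways: as the trapezoidal area under the curve, and as the fraction of (positive, negative) pairs that the scores rank correctly, with ties counted as half. The function names and data are hypothetical, and the code assumes both classes are present.

```python
def roc_points(y_true, scores):
    """Sweep the threshold from strict to lenient; return a list of (FPR, TPR)."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]   # strictest threshold: nothing is flagged positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(s >= threshold and y == 1 for s, y in zip(scores, y_true))
        fp = sum(s >= threshold and y == 0 for s, y in zip(scores, y_true))
        points.append((fp / neg, tp / pos))
    return points


def auc_trapezoid(points):
    """Area under the piecewise-linear curve through the ROC points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def auc_pairwise(y_true, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up scores: three positives, three negatives.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

print(f"{auc_trapezoid(roc_points(y_true, scores)):.3f}")   # 0.889
print(f"{auc_pairwise(y_true, scores):.3f}")                # 0.889
```

Both routes give 8/9 here: of the nine positive-negative pairs, only the pair (0.6, 0.7) is ranked the wrong way round.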

How GATE asks this

The standard NAT (numerical answer type) question hands you a populated confusion matrix and asks for one metric (precision, recall, F1, or accuracy) to two or three decimals. The recipe never varies: read off TP, FP, FN, TN, then plug into the ratio. The MSQ (multiple select question) variant builds the same counts from a word problem and asks which statements hold: GATE DA 2026 (Q47) gave one class 20 items and the other 10, with a handful misclassified each way, then asked you to compare the two classes’ accuracy, precision, and recall. Either way you must have the precision-vs-recall denominators straight, and know that accuracy misleads under class imbalance.

Worked example

A binary classifier on 30 samples produces TP = 8, FP = 2, FN = 6, TN = 14. Compute accuracy, precision, recall, and F1.

Check the total first: 8 + 2 + 6 + 14 = 30. Then:

accuracy  = (TP + TN) / total = (8 + 14) / 30 = 22/30  ≈ 0.733
precision = TP / (TP + FP)    = 8 / (8 + 2)   = 8/10   = 0.800
recall    = TP / (TP + FN)    = 8 / (8 + 6)   = 8/14   ≈ 0.571
F1        = 2·P·R / (P + R)   = 2·0.8·0.571 / (0.8 + 0.571)
          = 0.914 / 1.371                              ≈ 0.667

So accuracy ≈ 0.733, precision = 0.80, recall ≈ 0.571, F1 ≈ 0.667. Note precision (0.80) beats recall (0.571): the model is cautious — when it says positive it is usually right, but it still misses 6 of the 14 actual positives.
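
The arithmetic above can be double-checked with exact fractions. The sketch below uses Python's standard `fractions` module and also confirms the equivalent count form F1 = 2·TP / (2·TP + FP + FN), which follows from substituting the precision and recall ratios into the harmonic mean; the variable names are illustrative.

```python
from fractions import Fraction

tp, fp, fn, tn = 8, 2, 6, 14
total = tp + fp + fn + tn                 # 30

accuracy = Fraction(tp + tn, total)       # 22/30 = 11/15
precision = Fraction(tp, tp + fp)         # 8/10  = 4/5
recall = Fraction(tp, tp + fn)            # 8/14  = 4/7
f1 = 2 * precision * recall / (precision + recall)

# Same F1 computed straight from the counts: 16/24 = 2/3.
f1_from_counts = Fraction(2 * tp, 2 * tp + fp + fn)

for name, value in [("accuracy", accuracy), ("precision", precision),
                    ("recall", recall), ("F1", f1)]:
    print(f"{name:9s} = {value} ≈ {float(value):.3f}")
# accuracy  = 11/15 ≈ 0.733
# precision = 4/5 ≈ 0.800
# recall    = 4/7 ≈ 0.571
# F1        = 2/3 ≈ 0.667
```

Working with the counts directly avoids the small rounding drift that comes from plugging 0.571 back into the F1 formula.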

Quick check

Q1. A confusion matrix has TP = 8, FP = 2, FN = 6, TN = 14. Compute the precision. (Numerical answer, 2 decimals)
Q2. Same matrix (TP = 8, FP = 2, FN = 6, TN = 14). Compute the recall. (Numerical answer, 3 decimals)
Q3. Same matrix (TP = 8, FP = 2, FN = 6, TN = 14, total 30). Compute the accuracy. (Numerical answer, 3 decimals)
Q4. With precision = 0.80 and recall = 0.571 (from the matrix above), compute the F1 score. (Numerical answer, 3 decimals)
Q5. A dataset is 99% negative. A model predicts 'negative' for every sample. Which statements are TRUE? (Select all that apply)
Q6. Which statements about precision vs recall are correct? (Select all that apply)

Practice this in an interview

What is a confusion matrix and what four quantities does it report?

A confusion matrix tallies predictions against ground truth in a 2x2 table: true positives, true negatives, false positives, and false negatives. From those four cells every classification metric — accuracy, precision, recall, F1, specificity — can be derived. It exposes *which kind* of error a model makes, not just how often it errs.

What is the Precision-Recall curve, and why does it outperform ROC-AUC on imbalanced datasets?

The PR curve plots precision against recall as the decision threshold varies. On imbalanced datasets it is more informative than ROC because neither precision nor recall uses the true-negative count, whereas FPR = FP/(FP+TN) stays small whenever negatives are abundant and so flatters ROC-AUC. A model that looks good on ROC can still have dismal precision, which PR-AUC immediately exposes. PR-AUC is the better metric whenever the positive class is rare and the quality of the flagged positives is what matters.
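
A single operating point shows the effect. The counts below are hypothetical (10 positives among 1,000 samples), chosen only to illustrate how a point that looks strong in ROC space can sit at low precision.

```python
# Hypothetical matrix at one threshold: 10 actual positives, 990 actual negatives.
tp, fn = 8, 2
fp, tn = 50, 940

tpr = tp / (tp + fn)          # recall, the ROC y-axis
fpr = fp / (fp + tn)          # the ROC x-axis; the large TN count keeps it small
precision = tp / (tp + fp)    # the PR y-axis; TN does not appear at all

print(f"TPR = {tpr:.3f}, FPR = {fpr:.3f}, precision = {precision:.3f}")
# TPR = 0.800, FPR = 0.051, precision = 0.138
```

On the ROC plot this point lies near the top-left corner, yet fewer than one in seven flagged samples is actually positive.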

What is the ROC curve and what does AUC actually measure?

The ROC curve plots True Positive Rate (recall) against False Positive Rate at every decision threshold. AUC — the area under that curve — equals the probability that the model ranks a randomly chosen positive example above a randomly chosen negative one. A random classifier scores 0.5; a perfect classifier scores 1.0.

What are the key regression metrics — MAE, RMSE, MAPE, R² — and what are their failure modes?

MAE, RMSE, MAPE, and R² each measure a different aspect of regression quality and each has a regime where it misleads. RMSE is dominated by outliers; MAE is robust but hides large-error tails; MAPE is undefined when the actual value is zero and is asymmetric, penalising over-prediction more heavily than under-prediction because the percentage error of a non-negative under-prediction is capped at 100%; R² can appear high even when absolute errors are large, and can be negative, yet is still commonly misread as a percentage-correct. Choosing the right metric requires knowing the cost structure of the prediction task.
