datarekha

Naive Bayes

A classifier that applies Bayes' rule under one bold shortcut: features are conditionally independent given the class. Few parameters, fast, and a GATE DA regular.

9 min read Intermediate GATE DA Lesson 87 of 122

What you'll learn

  • Predict the class that maximises P(class)·∏ P(featureᵢ | class)
  • Why 'naive' means the conditional-independence assumption — features independent given the class
  • Counting parameters for a Naive Bayes model with K classes and D features
  • Computing a posterior and a misclassification probability (a real 2025 question)

Before you start

Bayes’ theorem gave us P(class | evidence) from P(evidence | class) and a prior. Naive Bayes turns that single rule into a full classifier: to label a new point, score every class and pick the winner. The catch is that the evidence is now several features at once, and computing their joint likelihood exactly would need a number of parameters that grows exponentially with the number of features. Naive Bayes makes one bold simplification to get around that, and it works surprisingly well.

The decision rule

Drop the evidence denominator P(x) from Bayes’ rule — it is the same for every class, so it cannot change which class is largest. Naive Bayes predicts the class with the biggest prior times likelihood:

prediction = argmax_c  P(c) · ∏ᵢ P(xᵢ | c)      (P(c): prior; ∏ᵢ P(xᵢ | c): per-feature likelihoods)
Score every class by prior times the product of per-feature likelihoods; the largest score wins.
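
As a rough illustration of the rule, here is a minimal Python sketch. The function name `predict`, the dictionary layout, and the toy spam/ham numbers are all assumptions made up for this example; they do not come from any GATE question or library.

```python
from math import prod

def predict(priors, likelihoods, x):
    """Score each class by prior times the product of per-feature likelihoods.

    priors:      {class: P(class)}
    likelihoods: {class: [ {value: P(x_i = value | class)} for each feature i ]}
    x:           tuple of observed feature values
    """
    scores = {
        c: priors[c] * prod(likelihoods[c][i][v] for i, v in enumerate(x))
        for c in priors
    }
    # The evidence P(x) is never computed: it is the same for every class.
    return max(scores, key=scores.get), scores

# Hypothetical numbers: two binary features (say, "free" present, "winner" present).
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": [{1: 0.8, 0: 0.2}, {1: 0.6, 0: 0.4}],
    "ham":  [{1: 0.1, 0: 0.9}, {1: 0.05, 0: 0.95}],
}

label, scores = predict(priors, likelihoods, x=(1, 0))
print(label, scores)  # spam wins: 0.4*0.8*0.4 = 0.128 vs 0.6*0.1*0.95 = 0.057
```

With many features the product of small probabilities can underflow, so practical implementations usually sum log-probabilities instead; the argmax is unchanged.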

The product ∏ P(xᵢ | c) over features is the whole trick. The full joint likelihood P(x₁, x₂, …, x_D | c) is replaced by a product of one-feature terms. That replacement is exact only if the features are independent once you know the class — which brings us to the word “naive”.

Why “naive” — the conditional-independence assumption

“Naive” names the modelling assumption: given the class, the features are conditionally independent of one another. Formally Naive Bayes assumes P(x₁, …, x_D | c) = ∏ᵢ P(xᵢ | c). In a spam filter this pretends the word “free” and the word “winner” appear independently once you fix “this email is spam” — almost never literally true, yet the ranking of classes usually survives it.

The payoff is parameter frugality. For K classes and D features each taking V values:

  • each class-conditional table P(xᵢ | c) has V−1 free probabilities (they sum to 1),
  • there are D features and K classes, giving K · D · (V−1) class-conditional parameters,
  • plus K−1 free priors (they too sum to 1).

So a model with K = 3 classes and D = 4 binary features (V = 2) needs 3·4·1 = 12 likelihood parameters plus 2 priors = 14 in total. Modelling the full joint instead would need on the order of K·V^D numbers — exponential in D. That linear-vs-exponential gap is exactly why Naive Bayes scales to thousands of text features.
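
The count above can be checked with a short sketch. The helper names `naive_bayes_params` and `full_joint_params` are hypothetical; the second uses the exact free-parameter count K·(V^D − 1) + (K − 1) for an unrestricted joint table, which is the same order as K·V^D.

```python
def naive_bayes_params(K, D, V):
    """K*D*(V-1) class-conditional parameters plus K-1 free priors."""
    return K * D * (V - 1) + (K - 1)

def full_joint_params(K, D, V):
    """One table of V**D cells per class (V**D - 1 free), plus K-1 free priors."""
    return K * (V ** D - 1) + (K - 1)

print(naive_bayes_params(3, 4, 2))  # 14
print(full_joint_params(3, 4, 2))   # 47

# Growth with D for K = 3 classes and binary features: linear vs exponential.
for D in (4, 10, 20):
    print(D, naive_bayes_params(3, D, 2), full_joint_params(3, D, 2))
```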

How GATE asks this

A pure NAT. You are handed priors and per-feature likelihoods and asked to pick the predicted class, compute one posterior, or report the misclassification probability (one minus the posterior of the predicted class; with two classes, this is the posterior of the class you did not predict). A second flavour is a counting NAT/MCQ: “how many parameters does this Naive Bayes model have?”, answered with K·D·(V−1) + (K−1). Both appeared in 2024 and 2025.

Worked example — a real 2025 question

Two classes with priors P(y1) = 1/3 and P(y2) = 2/3. For a feature value x the likelihoods are P(x | y1) = 3/4 and P(x | y2) = 1/4. Predict the class for x, and find the probability that this prediction is wrong.

Score each class with prior times likelihood (the unnormalised posteriors):

y1 :  P(x|y1)·P(y1) = (3/4)(1/3) = 1/4   = 0.2500
y2 :  P(x|y2)·P(y2) = (1/4)(2/3) = 1/6   ≈ 0.1667

1/4 > 1/6  →  predict y1.

The prediction is wrong exactly when the true class is y2, so the misclassification probability is the normalised posterior P(y2 | x):

P(y2 | x) = (1/6) / (1/4 + 1/6) = (1/6) / (5/12) = 12/30 = 0.40

So we predict y1 and the misclassification probability = 0.40. (This is a real GATE DA 2025 NAT — the answer is 0.40.) As a check, P(y1 | x) = (1/4)/(5/12) = 0.60, and 0.60 + 0.40 = 1.
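
The same arithmetic can be reproduced exactly with Python's `fractions` module. This is only a sketch of the calculation above; the variable names are chosen for this example.

```python
from fractions import Fraction as F

priors = {"y1": F(1, 3), "y2": F(2, 3)}
likelihood_x = {"y1": F(3, 4), "y2": F(1, 4)}  # P(x | class)

# Unnormalised posteriors: prior times likelihood.
scores = {c: priors[c] * likelihood_x[c] for c in priors}

# Normalise by the evidence P(x) = sum of the scores.
evidence = sum(scores.values())
posteriors = {c: s / evidence for c, s in scores.items()}

predicted = max(posteriors, key=posteriors.get)
misclassification = 1 - posteriors[predicted]

print(predicted, posteriors[predicted], misclassification)  # y1 3/5 2/5
```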

Quick check

Q1. Priors P(y1)=1/3, P(y2)=2/3; likelihoods P(x|y1)=3/4, P(x|y2)=1/4. After predicting the class for x, what is the misclassification probability? (numerical answer, 2 decimals)
Q2. Same setup (priors 1/3, 2/3; likelihoods 3/4, 1/4). What is the posterior P(y1 | x) of the predicted class? (numerical answer, 2 decimals)
Q3. A Naive Bayes model has K=3 classes and D=4 features, each feature taking V=2 values. How many parameters total (class-conditional likelihoods plus priors)? (numerical answer)
Q4. What exactly does the word 'naive' refer to, and which statements about it are correct? (select all that apply)
Q5. During training, the word 'casino' never appeared in any email labelled 'ham' (not-spam), so P('casino' | ham) = 0. A test email contains 'casino'. What happens, and what is the standard fix? (select all that apply)
Q6. Why does Naive Bayes drop the denominator P(x) from Bayes' theorem when making a prediction?

Practice this in an interview

How does Naive Bayes work, and why is it called 'naive'?

Naive Bayes applies Bayes' theorem to classify by computing the posterior probability of each class given the features. It is naive because it assumes all features are conditionally independent given the class label — an assumption that is almost never true in practice, yet the classifier still works surprisingly well for text and other sparse data.

Walk me through Bayes' theorem with a disease-screening base-rate example.

Bayes' theorem updates a prior probability with new evidence: P(H|E) = P(E|H) P(H) / P(E). In disease testing, ignoring the low base rate (prior) makes a positive test look far more alarming than it really is — most positives are false positives when the disease is rare.
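
To make the base-rate effect concrete, here is a small sketch with hypothetical figures (1% prevalence, 99% sensitivity, 5% false-positive rate). These numbers are assumptions picked for illustration, not data from any real test.

```python
from fractions import Fraction as F

# Hypothetical screening-test figures (assumptions for this example).
prevalence = F(1, 100)           # P(disease), the prior
sensitivity = F(99, 100)         # P(positive | disease)
false_positive_rate = F(5, 100)  # P(positive | no disease)

# P(positive) by the law of total probability.
p_positive = sensitivity * prevalence + false_positive_rate * (1 - prevalence)

# Bayes' theorem: P(disease | positive) = P(positive | disease) P(disease) / P(positive)
posterior = sensitivity * prevalence / p_positive

print(posterior, float(posterior))  # 1/6, about 0.167
```

Even with a highly sensitive test, only about one in six positives is a true case under these assumed figures, because healthy people vastly outnumber sick ones.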

How do decision trees and gradient boosting libraries handle categorical features natively, and when is label encoding safe?

sklearn trees require numeric input and treat label-encoded integers as ordinal, which imposes a false ordering. One-hot encoding is correct but expensive for high-cardinality features. XGBoost (v2+) and LightGBM support native categorical splits that find the optimal binary partition of categories without ordinal assumptions.

How does k-nearest neighbours work, and why is it called a lazy learner?

KNN stores the entire training set and defers all computation to prediction time: for a new point it finds the k closest training examples by distance, then returns the majority class (classification) or mean value (regression). It is called lazy because there is no training phase — the model is the data itself.
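
A minimal sketch of that description, assuming Euclidean distance and a simple majority vote. The function name `knn_predict` and the toy points are hypothetical.

```python
from collections import Counter
from math import dist

def knn_predict(train, query, k=3):
    """train: list of (point, label) pairs; query: a point of the same dimension."""
    # "Lazy": all the work happens here, at prediction time.
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data: two well-separated clusters.
train = [
    ((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
    ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b"),
]

print(knn_predict(train, (0.5, 0.5)))  # a
print(knn_predict(train, (5.5, 5.5)))  # b
```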
