How does Naive Bayes work, and why is it called 'naive'?

Naive Bayes applies Bayes' theorem to classify by computing the posterior probability of each class given the features. It is naive because it assumes all features are conditionally independent given the class label — an assumption that is almost never true in practice, yet the classifier still works surprisingly well for text and other sparse data.

Why is Naive Bayes called 'naive,' and why does it still work so well for text classification?

It's 'naive' because it assumes all features are conditionally independent given the class, which is almost never literally true. It still works for text because classification only needs the top-scoring class to be right: violating independence distorts the probability estimates (they are often overconfident), but the argmax is frequently unchanged. It's also fast to train, needs relatively little data because each feature's distribution is estimated separately, and handles high-dimensional sparse word counts well.
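A rough sketch of that point, with numbers invented purely for illustration: suppose one piece of evidence gets counted three times, as if it were three independent features. That is the worst possible violation of independence. The predicted class stays the same, but the posterior drifts away from the honest value.

```python
from fractions import Fraction as F

# Illustrative numbers only (assumed for this sketch, not from the lesson).
priors = {"spam": F(1, 2), "ham": F(1, 2)}
likelihood = {"spam": F(3, 5), "ham": F(2, 5)}   # P(evidence | class)

def nb_posterior(copies):
    """Posterior when the same evidence is multiplied in `copies` times."""
    scores = {c: priors[c] * likelihood[c] ** copies for c in priors}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

honest = nb_posterior(1)    # evidence counted once: the correct posterior
tripled = nb_posterior(3)   # perfectly correlated copies treated as independent

print(max(honest, key=honest.get), honest["spam"])     # spam 3/5
print(max(tripled, key=tripled.get), tripled["spam"])  # spam 27/35
```

The argmax is spam both times, but the reported confidence rises from 0.60 to about 0.77 without any new information.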

What is the zero-probability problem in Naive Bayes and how do you fix it?

If a feature value never appears with a given class in training, its conditional probability is zero, and since Naive Bayes multiplies probabilities, the whole posterior for that class becomes zero regardless of other evidence. The fix is Laplace (additive) smoothing, which adds a small count to every feature-class combination so no probability is ever exactly zero. This is essential for text where many words are unseen per class.
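A minimal sketch of additive smoothing, assuming a multinomial word-count model and a tiny made-up vocabulary. The function name `smoothed_likelihoods` and the toy data are hypothetical; `alpha` is the added pseudo-count, and `alpha=0` gives the unsmoothed estimate.

```python
from collections import Counter

def smoothed_likelihoods(words_in_class, vocabulary, alpha=1.0):
    """Estimate P(word | class) with additive (Laplace) smoothing."""
    counts = Counter(words_in_class)
    # Every vocabulary word gets alpha extra counts, so the denominator grows too.
    denom = len(words_in_class) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / denom for w in vocabulary}

# Toy data, invented for illustration: 'casino' never appears in ham.
vocabulary = ["free", "winner", "meeting", "casino"]
ham_words = ["meeting", "meeting", "free", "meeting"]

unsmoothed = smoothed_likelihoods(ham_words, vocabulary, alpha=0.0)
smoothed = smoothed_likelihoods(ham_words, vocabulary, alpha=1.0)

print(unsmoothed["casino"])  # 0.0
print(smoothed["casino"])    # 0.125
```

With `alpha=1` the unseen word gets probability 1/8 instead of 0, and the four smoothed probabilities still sum to 1.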

Walk me through Bayes' theorem with a disease-screening base-rate example.

Bayes' theorem updates a prior probability with new evidence: P(H|E) = P(E|H) P(H) / P(E). In disease testing, ignoring the low base rate (prior) makes a positive test look far more alarming than it really is — most positives are false positives when the disease is rare.
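To put numbers on it, here is a small sketch with a hypothetical screening test: 1% prevalence, 99% sensitivity, 95% specificity. These figures are assumptions chosen for illustration, and `posterior_positive` is a made-up helper name.

```python
def posterior_positive(prior, sensitivity, specificity):
    """P(disease | positive test) via Bayes' theorem."""
    true_pos = sensitivity * prior                     # P(E|H) * P(H)
    false_pos = (1.0 - specificity) * (1.0 - prior)    # P(E|not H) * P(not H)
    return true_pos / (true_pos + false_pos)           # divide by P(E)

# Hypothetical test: 1% prevalence, 99% sensitivity, 95% specificity.
p = posterior_positive(prior=0.01, sensitivity=0.99, specificity=0.95)
print(round(p, 3))  # 0.167
```

Even with a test this accurate, only about 1 positive in 6 actually has the disease, because the healthy 99% of the population produces far more false positives than the sick 1% produces true ones.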

Naive Bayes — GATE DA

What you'll learn

Predict the class that maximises P(class)·∏ P(featureᵢ | class)

Why 'naive' means the conditional-independence assumption — features independent given the class

Counting parameters for a Naive Bayes model with K classes and D features

Computing a posterior and a misclassification probability (a real 2025 question)

Last lesson left us wanting a classifier that speaks in probabilities — one that scores each class by how likely the evidence is under it, rather than hoarding every training point and counting votes. Bayes’ theorem already hands us that score: it turns P(evidence | class) and a prior into P(class | evidence). Naive Bayes wires that one rule into a full classifier — to label a new point, score every class and pick the winner.

The trouble is the evidence is now several features at once. Computing their exact joint likelihood, P(x₁, x₂, …, x_D | c), would need a table of probabilities that grows exponentially with the number of features — hopeless past a handful. So Naive Bayes makes one bold simplification to get around it, and the surprise of the whole method is how well it works anyway.

The decision rule

Drop the evidence denominator P(x) from Bayes’ rule — it is the same for every class, so it cannot change which class comes out largest. Naive Bayes then predicts the class with the biggest prior times likelihood:

ŷ = argmax_c  P(c) · ∏ᵢ P(xᵢ | c)

Score every class by prior times the product of per-feature likelihoods; the largest score wins.

The product ∏ P(xᵢ | c) over features is the whole trick. The full joint likelihood P(x₁, x₂, …, x_D | c) is replaced by a product of one-feature terms — and that replacement is exact only if the features are independent once you know the class. Which brings us to the word “naive.”
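As a sketch, the rule fits in a few lines. The model below is hypothetical (two classes, two features, invented probabilities), and `predict` is our own helper name. It sums logs instead of multiplying probabilities, a common implementation choice that this lesson does not itself prescribe; since log is monotonic, the winning class is the same.

```python
import math

def predict(priors, likelihoods, x):
    """Return the argmax class and all scores log P(c) + sum_i log P(x_i | c).

    priors:      {class: P(class)}
    likelihoods: {class: [ {value: P(x_i = value | class)} for each feature i ]}
    x:           list of observed feature values, one per feature
    """
    scores = {}
    for c, prior in priors.items():
        score = math.log(prior)
        for table, value in zip(likelihoods[c], x):
            score += math.log(table[value])   # assumes no zero probabilities
        scores[c] = score
    return max(scores, key=scores.get), scores

# Hypothetical model, numbers invented for illustration.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": [{"yes": 0.8, "no": 0.2}, {"yes": 0.6, "no": 0.4}],
    "ham":  [{"yes": 0.1, "no": 0.9}, {"yes": 0.3, "no": 0.7}],
}
label, scores = predict(priors, likelihoods, ["yes", "no"])
print(label)  # spam
```

Here spam scores 0.4·0.8·0.4 = 0.128 against ham's 0.6·0.1·0.7 = 0.042, so spam wins even though ham has the larger prior.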

Why “naive” — the conditional-independence assumption

“Naive” names the modelling assumption: given the class, the features are conditionally independent of one another. Formally, Naive Bayes assumes P(x₁, …, x_D | c) = ∏ᵢ P(xᵢ | c). In a spam filter this pretends the word “free” and the word “winner” appear independently once you have fixed “this email is spam” — almost never literally true, and yet the ranking of classes usually survives it intact.

The payoff is parameter frugality. For K classes and D features each taking V values:

each class-conditional table P(xᵢ | c) has V−1 free probabilities (they sum to 1),
there are D features and K classes, giving K · D · (V−1) class-conditional parameters,
plus K−1 free priors (they too sum to 1).

So a model with K = 3 classes and D = 4 binary features (V = 2) needs 3·4·1 = 12 likelihood parameters plus 2 priors = 14 in total. Modelling the full joint instead would need on the order of K·V^D numbers — exponential in D. That linear-versus-exponential gap is exactly why Naive Bayes scales to thousands of text features. In code, the count is the formula spelled out:

K, D, V = 3, 4, 2              # classes, features, values per feature
class_cond = K * D * (V - 1)   # per-feature likelihood tables
priors     = K - 1             # priors sum to 1, so K-1 are free

print("class-conditional params:", class_cond)
print("prior params:", priors)
print("total params:", class_cond + priors)

class-conditional params: 12
prior params: 2
total params: 14
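To see the linear-versus-exponential gap in numbers, here is a rough comparison with a full joint table per class. The exact full-joint count used below, K·(V^D − 1) + (K − 1), is our own bookkeeping under the assumption of one unrestricted table per class; both function names are made up for this sketch.

```python
def naive_bayes_params(K, D, V):
    """Free parameters under the conditional-independence assumption."""
    return K * D * (V - 1) + (K - 1)

def full_joint_params(K, D, V):
    """Free parameters if each class gets one table over all V**D feature combinations."""
    return K * (V ** D - 1) + (K - 1)

for D in (4, 10, 20):
    print(D, naive_bayes_params(3, D, 2), full_joint_params(3, D, 2))
# 4 14 47
# 10 32 3071
# 20 62 3145727
```

At 20 binary features the naive model needs 62 numbers while the full joint needs over three million.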

How GATE asks this

A pure NAT. You are handed priors and per-feature likelihoods and asked to pick the predicted class, compute one posterior, or report the misclassification probability (one minus the posterior of the predicted class; with two classes, that is simply the posterior of the class you did not predict). A second flavour is a counting NAT/MCQ, “how many parameters does this Naive Bayes model have?”, answered with K·D·(V−1) + (K−1). Both appeared in 2024 and 2025.

Worked example — a real GATE DA 2025 NAT

Two classes with priors P(y1) = 1/3 and P(y2) = 2/3. For a feature value x the likelihoods are P(x | y1) = 3/4 and P(x | y2) = 1/4. Predict the class for x, and find the probability that this prediction is wrong.

Score each class with prior times likelihood — the unnormalised posteriors:

y1 :  P(x|y1)·P(y1) = (3/4)(1/3) = 1/4   = 0.2500
y2 :  P(x|y2)·P(y2) = (1/4)(2/3) = 1/6   ≈ 0.1667

1/4 > 1/6  →  predict y1.

So the likelihood wins out over the prior here. The prediction is wrong exactly when the true class is y2, so the misclassification probability is the normalised posterior P(y2 | x):

P(y2 | x) = (1/6) / (1/4 + 1/6) = (1/6) / (5/12) = 2/5 = 0.40

So we predict y1 with a misclassification probability of 0.40 — the real GATE DA 2025 answer. As a check, P(y1 | x) = (1/4)/(5/12) = 0.60, and 0.60 + 0.40 = 1.
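The same arithmetic can be checked with exact fractions. This is just a verification sketch of the numbers given in the question; the variable names are our own.

```python
from fractions import Fraction as F

priors = {"y1": F(1, 3), "y2": F(2, 3)}
likelihoods = {"y1": F(3, 4), "y2": F(1, 4)}   # P(x | class)

# Unnormalised posteriors: prior times likelihood.
scores = {c: priors[c] * likelihoods[c] for c in priors}
evidence = sum(scores.values())                 # P(x) = 5/12
posteriors = {c: s / evidence for c, s in scores.items()}

predicted = max(posteriors, key=posteriors.get)
misclassification = 1 - posteriors[predicted]

print(predicted, posteriors[predicted], misclassification)  # y1 3/5 2/5
```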

In one breath

Naive Bayes turns Bayes’ rule into a classifier — predict argmax_c P(c)·∏ᵢ P(xᵢ | c), dropping the class-independent denominator P(x) — and its “naive” conditional-independence assumption (features independent given the class) is what lets the joint likelihood factor into a product of one-feature terms, cutting the parameter count to a linear K·D·(V−1) + (K−1) instead of exponential; the assumption is usually false yet the ranking of classes survives, so it classifies well, provided you guard the one fatal flaw — a single zero likelihood that wipes a whole class out, cured by Laplace smoothing.

Practice

Quick check

Q1. Recall: What exactly does the word 'naive' refer to, and which statements about it are correct? (select all that apply)

Q2. Recall: Why does Naive Bayes drop the denominator P(x) from Bayes' theorem when making a prediction?

Q3. Trace: A Naive Bayes model has K=3 classes and D=4 features, each feature taking V=2 values. How many parameters total (class-conditional likelihoods plus priors)? (numerical answer)

Q4. Trace: Priors P(y1)=1/3, P(y2)=2/3; likelihoods P(x|y1)=3/4, P(x|y2)=1/4. After predicting the class for x, what is the misclassification probability? (numerical answer, 2 decimals)

Q5. Trace: Same setup (priors 1/3, 2/3; likelihoods 3/4, 1/4). What is the posterior P(y1 | x) of the predicted class? (numerical answer, 2 decimals)

Q6. Create: During training, the word 'casino' never appeared in any email labelled 'ham' (not-spam), so P('casino' | ham) = 0. A test email contains 'casino'. What happens, and what is the standard fix? (select all that apply)

A question to carry forward

Naive Bayes pays for its speed with a fib: that the features ignore each other once the class is fixed. For count features, like words in an email, the fib is usually harmless, because the ranking of classes tends to survive it. But picture features that are continuous and entwined: a person’s height and weight, which plainly rise together. Pretend those are independent and you throw away the very correlation that separates the classes.

So keep the beautiful recipe — model each class, then apply Bayes — but stop lying about independence. Describe each class instead as a smooth bell-shaped cloud in feature space, with a shape that captures how the features lean together. Here is the thread onward: when you model every class as such a cloud and force them all to share the same shape, what does the decision boundary between them turn out to be — and why does that one sharing assumption collapse a curved problem into a straight line?
