Naive Bayes
A classifier that applies Bayes' rule under one bold shortcut: features are conditionally independent given the class. Few parameters, fast, and a GATE DA regular.
What you'll learn
- Predict the class that maximises P(class)·∏ P(featureᵢ | class)
- Why 'naive' means the conditional-independence assumption — features independent given the class
- Counting parameters for a Naive Bayes model with K classes and D features
- Computing a posterior and a misclassification probability (a real 2025 question)
Before you start
Bayes’ theorem gave us P(class | evidence) from P(evidence | class) and a prior.
Naive Bayes turns that single rule into a full classifier: to label a new point,
score every class and pick the winner. The catch is the evidence is now several
features at once — and computing their joint likelihood exactly would need an
impossible number of parameters. Naive Bayes makes one bold simplification to get
around that, and it works surprisingly well.
The decision rule
Drop the evidence denominator P(x) from Bayes’ rule — it is the same for every
class, so it cannot change which class is largest. Naive Bayes predicts the class
with the biggest prior times likelihood:
ŷ = argmax_c P(c) · ∏ᵢ P(xᵢ | c)
The product ∏ P(xᵢ | c) over features is the whole trick. The full joint
likelihood P(x₁, x₂, …, x_D | c) is replaced by a product of one-feature terms.
That replacement is exact only if the features are independent once you know the
class — which brings us to the word “naive”.
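The rule above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `predict`, the dictionary layout, and the spam/ham numbers are all hypothetical and chosen only to make the arithmetic easy to follow.

```python
from math import prod

def predict(priors, likelihoods, x):
    """Score every class as prior * product of per-feature likelihoods.

    priors:      {class: P(class)}
    likelihoods: {class: [ {value: P(x_i = value | class)} for each feature i ]}
    x:           tuple of observed feature values, one per feature
    Returns (best_class, unnormalised_scores).
    """
    scores = {
        c: priors[c] * prod(table[v] for table, v in zip(likelihoods[c], x))
        for c in priors
    }
    # P(x) is the same for every class, so the argmax does not need it.
    return max(scores, key=scores.get), scores

# Hypothetical numbers: two binary features, "contains 'free'" and "contains 'winner'".
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": [{1: 0.8, 0: 0.2}, {1: 0.5, 0: 0.5}],
    "ham":  [{1: 0.1, 0: 0.9}, {1: 0.05, 0: 0.95}],
}

label, scores = predict(priors, likelihoods, x=(1, 1))
print(label, scores)
```

With these made-up tables the spam score is 0.4 · 0.8 · 0.5 and the ham score is 0.6 · 0.1 · 0.05, so spam wins without ever computing P(x).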
Why “naive” — the conditional-independence assumption
“Naive” names the modelling assumption: given the class, the features are
conditionally independent of one another. Formally Naive Bayes assumes
P(x₁, …, x_D | c) = ∏ᵢ P(xᵢ | c). In a spam filter this pretends the word
“free” and the word “winner” appear independently once you fix “this email is spam”
— almost never literally true, yet the ranking of classes usually survives it.
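One way to see exactly what is being assumed is to set it beside the chain rule, which holds for any distribution. The sketch below uses the same symbols as the text; the feature ordering is arbitrary.

```latex
\begin{align*}
P(x_1, \dots, x_D \mid c)
  &= \prod_{i=1}^{D} P(x_i \mid x_1, \dots, x_{i-1}, c)
  && \text{(chain rule, always exact)} \\
  &= \prod_{i=1}^{D} P(x_i \mid c)
  && \text{(only if the features are conditionally independent given } c\text{)}
\end{align*}
```

The naive step drops the earlier features x₁, …, xᵢ₋₁ from every conditioning set and keeps only the class.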
The payoff is parameter frugality. For K classes and D features each taking
V values:
- each class-conditional table P(xᵢ | c) has V−1 free probabilities (they sum to 1),
- there are D features and K classes, giving K · D · (V−1) class-conditional parameters,
- plus K−1 free priors (they too sum to 1).
So a model with K = 3 classes and D = 4 binary features (V = 2) needs
3·4·1 = 12 likelihood parameters plus 2 priors = 14 in total. Modelling the
full joint instead would need on the order of K·V^D numbers — exponential in D.
That linear-vs-exponential gap is exactly why Naive Bayes scales to thousands of
text features.
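The count is easy to check in code. The sketch below uses hypothetical function names and assumes every feature takes the same number of values V, as in the text. For the full joint it uses the exact count K·(V^D − 1) + (K − 1): one free table over all V^D feature combinations per class, plus the priors.

```python
def naive_bayes_param_count(K, D, V):
    """Free parameters of a Naive Bayes model: K classes, D features, V values each."""
    likelihood_params = K * D * (V - 1)   # one (V-1)-parameter table per class and feature
    prior_params = K - 1                  # the priors sum to 1
    return likelihood_params + prior_params

def full_joint_param_count(K, D, V):
    """Free parameters if P(x_1, ..., x_D | c) is modelled without any independence assumption."""
    return K * (V ** D - 1) + (K - 1)

print(naive_bayes_param_count(3, 4, 2))    # the K = 3, D = 4, V = 2 example from the text
print(full_joint_param_count(3, 4, 2))
print(naive_bayes_param_count(2, 1000, 2), full_joint_param_count(2, 20, 2))
```

For the example in the text the naive model needs 14 parameters against 47 for the full joint, and the gap widens quickly: 20 binary features already push the full joint past two million.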
How GATE asks this
A pure NAT. You are handed priors and per-feature likelihoods and asked to
pick the predicted class, compute one posterior, or report the misclassification
probability (one minus the posterior of the predicted class; with two classes, the
posterior of the class you did not predict). A second flavour is
a counting NAT/MCQ: “how many parameters does this Naive Bayes model have?” — answered
with K·D·(V−1) + (K−1). Both appeared in 2024 and 2025.
Worked example — a real 2025 question
Two classes with priors P(y1) = 1/3 and P(y2) = 2/3. For a feature value x the likelihoods are P(x | y1) = 3/4 and P(x | y2) = 1/4. Predict the class for x, and find the probability that this prediction is wrong.
Score each class with prior times likelihood (the unnormalised posteriors):
y1 : P(x|y1)·P(y1) = (3/4)(1/3) = 1/4 = 0.2500
y2 : P(x|y2)·P(y2) = (1/4)(2/3) = 1/6 ≈ 0.1667
1/4 > 1/6 → predict y1.
The prediction is wrong exactly when the true class is y2, so the
misclassification probability is the normalised posterior P(y2 | x):
P(y2 | x) = (1/6) / (1/4 + 1/6) = (1/6) / (5/12) = 12/30 = 0.40
So we predict y1 and the misclassification probability = 0.40. (This is a
real GATE DA 2025 NAT — the answer is 0.40.) As a check, P(y1 | x) = (1/4)/(5/12) = 0.60, and 0.60 + 0.40 = 1.
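The same arithmetic can be reproduced with exact fractions. This is only a sketch of the calculation above; the variable names are illustrative, and the priors and likelihoods are the ones given in the question.

```python
from fractions import Fraction as F

priors = {"y1": F(1, 3), "y2": F(2, 3)}
likelihood = {"y1": F(3, 4), "y2": F(1, 4)}   # P(x | class) for the observed x

# Unnormalised posteriors: prior times likelihood.
scores = {c: priors[c] * likelihood[c] for c in priors}

# Normalise by the evidence P(x) = sum of the scores.
evidence = sum(scores.values())
posterior = {c: s / evidence for c, s in scores.items()}

predicted = max(posterior, key=posterior.get)
p_error = 1 - posterior[predicted]            # probability the prediction is wrong

print(predicted, posterior, float(p_error))
```

Using `Fraction` keeps 1/4, 1/6 and 5/12 exact, so the final 2/5 matches the hand calculation with no rounding.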
Practice this in an interview
Naive Bayes applies Bayes' theorem to classify by computing the posterior probability of each class given the features. It is naive because it assumes all features are conditionally independent given the class label, an assumption that is almost never true in practice, yet the classifier still works surprisingly well for text and other sparse data.
Bayes' theorem updates a prior probability with new evidence: P(H|E) = P(E|H) P(H) / P(E). In disease testing, ignoring the low base rate (prior) makes a positive test look far more alarming than it really is — most positives are false positives when the disease is rare.
sklearn trees require numeric input and treat label-encoded integers as ordinal, which imposes a false ordering. One-hot encoding is correct but expensive for high-cardinality features. XGBoost (v2+) and LightGBM support native categorical splits that find the optimal binary partition of categories without ordinal assumptions.
KNN stores the entire training set and defers all computation to prediction time: for a new point it finds the k closest training examples by distance, then returns the majority class (classification) or mean value (regression). It is called lazy because there is no training phase — the model is the data itself.