Naive Bayes
Bayes' rule plus one bold assumption — that features are independent — gives a fast, surprisingly strong classifier, especially for text. Why 'naive' works, and the Laplace-smoothing gotcha.
What you'll learn
- How Naive Bayes turns Bayes' rule into a product of per-feature likelihoods
- Why the (wrong) independence assumption still works so well for text
- Laplace smoothing and why you compute in log-space
Naive Bayes is the classifier that shouldn’t work but does. It rests on an assumption that’s almost always false — that your features are independent — yet it’s fast, needs little data, and remains a strong baseline for text. It’s also the cleanest illustration of Bayes’ rule in all of ML, which makes it a favorite interview topic.
Bayes’ rule, applied to classification
For a class C and observed features, Bayes’ rule says:
P(C | features) ∝ P(C) × P(features | C)
Here P(C | features) is the posterior, P(C) is the prior, and P(features | C) is the likelihood.
To classify, compute that score for each class and pick the largest. The hard
part is P(features | C), the joint probability of all features together: the
number of possible feature combinations grows exponentially with the number of
features, so estimating it directly would take far more data than you will ever have.
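As a minimal sketch of the scoring step, the snippet below multiplies a prior by a likelihood for each class, normalizes, and picks the largest. The class names and all probabilities are made-up illustrative numbers, and `posterior_scores` is a hypothetical helper, not part of any library.

```python
def posterior_scores(priors, likelihoods):
    """Return normalized posteriors P(C | features) for each class.

    priors: dict mapping class -> P(C)
    likelihoods: dict mapping class -> P(features | C) for the observed features
    """
    # Unnormalized score: prior times likelihood (the right-hand side of Bayes' rule).
    scores = {c: priors[c] * likelihoods[c] for c in priors}
    # Dividing by the sum turns scores into probabilities without changing the ranking.
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


# Hypothetical numbers for a single email (assumed, not estimated from data).
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {"spam": 0.03, "ham": 0.005}

posteriors = posterior_scores(priors, likelihoods)
predicted = max(posteriors, key=posteriors.get)
print(posteriors, predicted)
```

Because normalization divides every score by the same constant, the argmax can be taken on the unnormalized scores directly, which is why the rule is usually written with ∝ rather than =.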
The “naive” leap
Naive Bayes makes one bold simplification: assume the features are conditionally independent given the class. Then the joint likelihood becomes a simple product of per-feature likelihoods:
P(features | C) = P(f₁|C) × P(f₂|C) × P(f₃|C) × …
Now each term is easy to estimate by counting.
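To make the counting concrete, here is a small sketch that estimates P(word | C) as a word's share of all word occurrences in class C (the word-count, or multinomial, flavor of Naive Bayes) and multiplies those terms with the prior. The four training messages are invented for illustration and `score` is a hypothetical helper. No smoothing is applied, so a word never seen with a class would zero out that class's score.

```python
from collections import Counter

# Tiny made-up training set: (class label, tokenized message).
training = [
    ("spam", ["free", "cash", "now"]),
    ("spam", ["free", "offer", "now"]),
    ("ham", ["meeting", "now"]),
    ("ham", ["free", "lunch", "meeting"]),
]

# Count messages per class (for priors) and word occurrences per class (for likelihoods).
class_counts = Counter(label for label, _ in training)
word_counts = {label: Counter() for label in class_counts}
for label, words in training:
    word_counts[label].update(words)


def score(label, message):
    """Unnormalized P(C) times the product of P(word | C), all estimated by counting."""
    prior = class_counts[label] / len(training)
    total_words = sum(word_counts[label].values())
    likelihood = 1.0
    for word in message:
        likelihood *= word_counts[label][word] / total_words
    return prior * likelihood


message = ["free", "now"]
scores = {label: score(label, message) for label in class_counts}
predicted = max(scores, key=scores.get)
print(scores, predicted)
```

For this message the spam score is 0.5 × (2/6) × (2/6) and the ham score is 0.5 × (1/5) × (1/5), so spam wins.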
The independence assumption is wrong — “new” and “york” are obviously not independent — but it rarely hurts the ranking that decides the class, even when the probability estimates themselves are off. That’s the surprise: a wrong model that still makes the right call.
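One way to see this is to feed the model the same evidence several times, which is the extreme case of correlated features. In the sketch below, a single feature that mildly favors class A is counted 1, 2, and 10 times as if each copy were independent. The likelihoods 0.6 and 0.4, the equal priors, and the helper `posterior_a` are all assumptions chosen for illustration.

```python
def posterior_a(p_given_a, p_given_b, copies, prior_a=0.5):
    """Naive Bayes posterior for class A when one feature is counted `copies` times."""
    score_a = prior_a * p_given_a ** copies
    score_b = (1 - prior_a) * p_given_b ** copies
    return score_a / (score_a + score_b)


# Hypothetical likelihoods: the feature is somewhat more likely under class A.
for copies in (1, 2, 10):
    print(copies, round(posterior_a(0.6, 0.4, copies), 4))
```

The posterior for A climbs from 0.6 with one copy to roughly 0.69 with two and above 0.98 with ten, even though no new information was added. The probability is increasingly overconfident, but the predicted class is A every time.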
Next
You’ve now met the core algorithms. The big leap in tabular performance comes from combining them — bagging, boosting & stacking.
Practice this in an interview
It's 'naive' because it assumes all features are conditionally independent given the class, which is almost never literally true. It still works for text because classification only needs the right argmax: the predicted class is often correct even when the independence assumption is violated and the probability estimates are miscalibrated. It's also fast, needs little data, and handles high-dimensional sparse word counts well.
Naive Bayes applies Bayes' theorem to classify by computing the posterior probability of each class given the features. It is naive because it assumes all features are conditionally independent given the class label — an assumption that is almost never true in practice, yet the classifier still works surprisingly well for text and other sparse data.
If a feature value never appears with a given class in training, its conditional probability is zero, and since Naive Bayes multiplies probabilities, the whole posterior for that class becomes zero regardless of other evidence. The fix is Laplace (additive) smoothing, which adds a small count to every feature-class combination so no probability is ever exactly zero. This is essential for text where many words are unseen per class.
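The zero-probability problem and its fix can be sketched in a few lines. The example below assumes made-up word counts, equal priors, and the common add-alpha form P(word | C) = (count + alpha) / (total + alpha × vocabulary size), with alpha = 1 for Laplace smoothing. It also sums log-probabilities instead of multiplying raw probabilities, since long products underflow in floating point. The helper names are hypothetical, and `math.prod` requires Python 3.8 or later.

```python
import math
from collections import Counter

# Made-up word counts per class, as if tallied from a small training set.
word_counts = {
    "spam": Counter({"free": 2, "cash": 1, "now": 2, "offer": 1}),
    "ham": Counter({"meeting": 2, "now": 1, "free": 1, "lunch": 1}),
}
priors = {"spam": 0.5, "ham": 0.5}
vocabulary = set().union(*word_counts.values())


def word_probability(word, label, alpha):
    """P(word | class) with additive smoothing; alpha=0 gives the raw count ratio."""
    total = sum(word_counts[label].values())
    return (word_counts[label][word] + alpha) / (total + alpha * len(vocabulary))


def log_score(label, message, alpha=1.0):
    """log P(C) + sum of log P(word | C): the product of probabilities, in log-space."""
    return math.log(priors[label]) + sum(
        math.log(word_probability(word, label, alpha)) for word in message
    )


message = ["free", "cash", "meeting"]

# Without smoothing, each class has one unseen word, so both likelihoods collapse to zero.
unsmoothed = {
    label: math.prod(word_probability(w, label, alpha=0) for w in message)
    for label in priors
}

# With alpha=1 every word gets a nonzero probability and the classes can be compared.
log_scores = {label: log_score(label, message) for label in priors}
predicted = max(log_scores, key=log_scores.get)
print(unsmoothed, log_scores, predicted)

# Why logs: a long product of small probabilities underflows to 0.0 in floating point,
# while the equivalent sum of logs stays finite.
print(math.prod([0.01] * 1000), sum(math.log(0.01) for _ in range(1000)))
```

In this toy setup "meeting" never appears in spam and "cash" never appears in ham, so the unsmoothed likelihood is exactly zero for both classes and the classifier cannot choose. With smoothing, the ham score comes out higher. Because the logarithm is monotonic, taking the argmax over log scores selects the same class as taking it over the original products.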
Bagging trains many independent models in parallel on bootstrap samples and averages them, which mainly reduces variance; boosting trains models sequentially so each corrects its predecessor's errors, which mainly reduces bias. Use bagging (e.g. random forests) when your base learner is high-variance and overfits; use boosting (e.g. gradient boosting) when you need to squeeze out bias and maximize accuracy, accepting more tuning and overfitting risk.