datarekha

Naive Bayes

Bayes' rule plus one bold assumption (that features are independent given the class) gives a fast, surprisingly strong classifier, especially for text. Why 'naive' works, and the Laplace-smoothing gotcha.

7 min read Beginner Machine Learning Lesson 11 of 33

What you'll learn

  • How Naive Bayes turns Bayes' rule into a product of per-feature likelihoods
  • Why the (wrong) independence assumption still works so well for text
  • Laplace smoothing and why you compute in log-space

Naive Bayes is the classifier that shouldn’t work but does. It rests on an assumption that’s almost always false (that your features are independent of one another once you know the class), yet it’s fast, needs little data, and remains a strong baseline for text. It’s also one of the cleanest illustrations of Bayes’ rule in ML, which makes it a favorite interview topic.

Bayes’ rule, applied to classification

For a class C and observed features, Bayes’ rule says:

P(C | features)  ∝  P(C)  ×  P(features | C)
   posterior        prior      likelihood

To classify, compute that score for each class and pick the largest. The hard part is P(features | C) — the joint probability of all features together, which needs an impossible amount of data to estimate directly.
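A minimal sketch of that scoring rule, assuming the joint likelihoods are somehow already known. The class names and every number here are hypothetical, chosen only to show the arithmetic:

```python
# Hypothetical numbers, chosen only to illustrate the scoring rule.
priors = {"spam": 0.4, "ham": 0.6}         # P(C)
likelihoods = {"spam": 0.05, "ham": 0.01}  # P(features | C), assumed known here

# Unnormalized posterior score for each class: P(C) * P(features | C)
scores = {c: priors[c] * likelihoods[c] for c in priors}

# The predicted class is the one with the largest score.
predicted = max(scores, key=scores.get)

# Dividing by the sum of the scores recovers proper posterior probabilities.
total = sum(scores.values())
posteriors = {c: s / total for c, s in scores.items()}

print(predicted, posteriors)
```

The normalizing step is optional for classification: dividing every score by the same total never changes which class is largest, which is why the rule is written with ∝ rather than =.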

The “naive” leap

Naive Bayes makes one bold simplification: assume the features are conditionally independent given the class. Then the joint likelihood becomes a simple product of per-feature likelihoods:

P(features | C) = P(f₁|C) × P(f₂|C) × P(f₃|C) × …

Now each term is easy to estimate by counting: for example, P(word | spam) comes from how often that word shows up in the spam training emails.
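Here is one way the counting could look in practice. This is a sketch, not a reference implementation: the four-email training set is made up, the helper names `likelihood` and `joint_likelihood` are hypothetical, and it estimates each term as a plain relative frequency of word occurrences with no smoothing.

```python
from collections import Counter

# Tiny hypothetical training set: (label, words in the email).
train = [
    ("spam", ["free", "prize", "now"]),
    ("spam", ["free", "offer"]),
    ("ham", ["meeting", "now"]),
    ("ham", ["project", "meeting", "notes"]),
]

# Count how often each word occurs in each class.
word_counts = {"spam": Counter(), "ham": Counter()}
for label, words in train:
    word_counts[label].update(words)

# Total number of word occurrences seen in each class.
total_words = {c: sum(cnt.values()) for c, cnt in word_counts.items()}


def likelihood(word, label):
    """Estimate P(word | class) as a relative frequency (no smoothing)."""
    return word_counts[label][word] / total_words[label]


def joint_likelihood(words, label):
    """Naive assumption: P(features | C) is the product of per-word terms."""
    result = 1.0
    for w in words:
        result *= likelihood(w, label)
    return result


print(likelihood("free", "spam"))                 # 2 of the 5 spam words
print(joint_likelihood(["free", "now"], "spam"))  # product of two per-word terms
print(joint_likelihood(["free", "now"], "ham"))   # "free" never appears in ham
```

Notice the last call: because "free" never appears in the ham emails, its estimated likelihood is zero and the whole product for ham collapses to zero.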

The independence assumption is wrong (“new” and “york” are obviously not independent), but it often leaves the ranking that decides the class intact, even when the probability estimates themselves are off. That’s the surprise: a wrong model that still makes the right call.
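To see that effect with numbers, here is a small sketch under a deliberately extreme assumption: “york” appears exactly when “new” does, so it adds no real evidence. The two classes A and B, the likelihood values, and the helper `posterior_a` are all hypothetical.

```python
def posterior_a(like_a, like_b, prior_a=0.5):
    """Posterior P(A | evidence) for two classes A and B via Bayes' rule."""
    score_a = prior_a * like_a
    score_b = (1 - prior_a) * like_b
    return score_a / (score_a + score_b)


# Hypothetical likelihoods for the word "new" under each class.
p_new_a, p_new_b = 0.6, 0.4

# "york" is assumed to appear exactly when "new" does, so the correct
# posterior counts the "new" evidence only once.
true_post = posterior_a(p_new_a, p_new_b)

# Naive Bayes treats "york" as independent and multiplies the evidence twice.
naive_post = posterior_a(p_new_a * p_new_a, p_new_b * p_new_b)

print(round(true_post, 3), round(naive_post, 3))
```

The naive estimate for class A is pushed further from 0.5 than the true posterior (roughly 0.69 against 0.6), so the probability is overconfident. Both values sit on the same side of 0.5, though, so the predicted class is unchanged.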

Quick check

Q1. What is the 'naive' assumption in Naive Bayes?
Q2. A word in your test email never appeared in the spam training data. Without smoothing, what happens?
Q3. Why does Naive Bayes work well for text despite the independence assumption being false?

Next

You’ve now met the core algorithms. The big leap in tabular performance comes from combining them — bagging, boosting & stacking.

Practice this in an interview

Why is Naive Bayes called 'naive,' and why does it still work so well for text classification?

It's 'naive' because it assumes all features are conditionally independent given the class, which is almost never literally true. It still works for text because the predicted class (the argmax) is often correct even when the violated assumption leaves the probability estimates miscalibrated. It's also fast, needs little data, and handles high-dimensional sparse word counts well.

How does Naive Bayes work, and why is it called 'naive'?

Naive Bayes applies Bayes' theorem to classify by computing the posterior probability of each class given the features. It is naive because it assumes all features are conditionally independent given the class label — an assumption that is almost never true in practice, yet the classifier still works surprisingly well for text and other sparse data.

What is the zero-probability problem in Naive Bayes and how do you fix it?

If a feature value never appears with a given class in training, its conditional probability is zero, and since Naive Bayes multiplies probabilities, the whole posterior for that class becomes zero regardless of other evidence. The fix is Laplace (additive) smoothing, which adds a small count to every feature-class combination so no probability is ever exactly zero. This is essential for text where many words are unseen per class.
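A minimal sketch of the fix, assuming word-count features and a smoothing constant `alpha` (with `alpha=1` being the classic Laplace case). The counts, priors, and the helper names `smoothed_likelihood` and `log_score` are hypothetical. The sketch also scores in log-space, summing log-probabilities instead of multiplying probabilities, so that long emails do not underflow to zero.

```python
import math
from collections import Counter

# Hypothetical per-class word counts from a tiny training set.
word_counts = {
    "spam": Counter({"free": 2, "prize": 1, "now": 1, "offer": 1}),
    "ham": Counter({"meeting": 2, "now": 1, "project": 1, "notes": 1}),
}
priors = {"spam": 0.5, "ham": 0.5}

# Vocabulary: every distinct word seen in any class.
vocab = set(word_counts["spam"]) | set(word_counts["ham"])


def smoothed_likelihood(word, label, alpha=1.0):
    """P(word | class) with additive smoothing; alpha=1 is Laplace smoothing."""
    count = word_counts[label][word]  # 0 if the word was never seen in this class
    total = sum(word_counts[label].values())
    return (count + alpha) / (total + alpha * len(vocab))


def log_score(words, label, alpha=1.0):
    """log P(C) plus the sum of log P(word | C) over the words in the email."""
    return math.log(priors[label]) + sum(
        math.log(smoothed_likelihood(w, label, alpha)) for w in words
    )


# "meeting" never appears in spam and "free" never appears in ham, so
# without smoothing both classes would score exactly zero.
email = ["free", "prize", "meeting"]
scores = {c: log_score(email, c) for c in priors}

print(max(scores, key=scores.get), scores)
```

With smoothing, the unseen word only lowers a class's score instead of wiping it out, so the remaining evidence still decides the result.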

Bagging vs boosting — how do they differ, and when does each help?

Bagging trains many independent models in parallel on bootstrap samples and averages them, which mainly reduces variance; boosting trains models sequentially so each corrects its predecessor's errors, which mainly reduces bias. Use bagging (e.g. random forests) when your base learner is high-variance and overfits; use boosting (e.g. gradient boosting) when you need to squeeze out bias and maximize accuracy, accepting more tuning and overfitting risk.
