What is the zero-probability problem in Naive Bayes and how do you fix it?

If a feature value never appears with a given class in training, its estimated conditional probability is zero, and since Naive Bayes multiplies probabilities, the whole posterior for that class becomes zero regardless of other evidence. The fix is Laplace (additive) smoothing, which adds a small count to every feature-class combination so no probability is ever exactly zero. This is essential for text, where most of the vocabulary never appears in any one class's training documents.
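
To make the smoothing step concrete, here is a minimal sketch in plain Python. The toy documents, the vocabulary, and the helper name word_likelihoods are made up for illustration; alpha=1.0 corresponds to add-one (Laplace) smoothing and alpha=0.0 to raw counting.

```python
from collections import Counter

def word_likelihoods(docs, vocab, alpha=1.0):
    """Estimate P(word | class) from the documents of ONE class,
    adding alpha to every word count (additive smoothing)."""
    counts = Counter(w for doc in docs for w in doc.split())
    total = sum(counts[w] for w in vocab)
    return {w: (counts[w] + alpha) / (total + alpha * len(vocab)) for w in vocab}

# Toy spam documents and vocabulary (illustrative, not real data).
spam_docs = ["win free money", "free prize"]
vocab = ["win", "free", "money", "prize", "meeting"]

unsmoothed = word_likelihoods(spam_docs, vocab, alpha=0.0)
smoothed = word_likelihoods(spam_docs, vocab, alpha=1.0)

print(unsmoothed["meeting"])  # 0.0 -> would zero out the whole spam score
print(smoothed["meeting"])    # (0 + 1) / (5 + 5) = 0.1
```

Note that the smoothed estimates still sum to 1 over the vocabulary, because alpha is added once per word in both the numerator and the denominator.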

Why is Naive Bayes called 'naive,' and why does it still work so well for text classification?

It's 'naive' because it assumes all features are conditionally independent given the class, which is almost never literally true. It still works for text because classification only needs the right argmax: violating independence tends to distort the probability estimates (typically pushing them toward 0 or 1) without changing which class scores highest. It's also fast, needs little data, and handles high-dimensional sparse word counts well.

How does Naive Bayes work, and why is it called 'naive'?

Naive Bayes applies Bayes' theorem to classify by computing the posterior probability of each class given the features. It is naive because it assumes all features are conditionally independent given the class label — an assumption that is almost never true in practice, yet the classifier still works surprisingly well for text and other sparse data.

Bagging vs boosting — how do they differ, and when does each help?

Bagging trains many independent models in parallel on bootstrap samples and averages them, which mainly reduces variance; boosting trains models sequentially so each corrects its predecessor's errors, which mainly reduces bias. Use bagging (e.g. random forests) when your base learner is high-variance and overfits; use boosting (e.g. gradient boosting) when you need to squeeze out bias and maximize accuracy, accepting more tuning and overfitting risk.

Naive Bayes — Machine Learning

Naive Bayes is the classifier that shouldn’t work but does. It rests on an assumption that’s almost always false — that your features are independent — yet it’s fast, needs little data, and remains a strong baseline for text. It’s also the cleanest illustration of Bayes’ rule in all of ML, which makes it a favorite interview topic.

Bayes’ rule, applied to classification

For a class C and observed features, Bayes’ rule says:

P(C | features) ∝ P(C) × P(features | C)
posterior ∝ prior × likelihood

To classify, compute that score for each class and pick the largest. The hard part is P(features | C), the joint probability of all features together: the number of possible feature combinations grows exponentially with the number of features, so estimating it directly would take an impossible amount of data.

The “naive” leap

Naive Bayes makes one bold simplification: assume the features are conditionally independent given the class. Then the joint likelihood becomes a simple product of per-feature likelihoods:

P(features | C) = P(f₁|C) × P(f₂|C) × P(f₃|C) × …

Now each term is easy to estimate by counting. Trace it on the email “free prize now”, with these per-word likelihoods learned from training (and an even P(spam) = P(ham) = 0.5 prior):

word	likelihood if spam	likelihood if ham
free	0.40	0.02
prize	0.30	0.01
now	0.20	0.10

Multiply each column through, times the 0.5 prior:

spam score ∝ 0.5 × 0.40 × 0.30 × 0.20 = 0.0120
ham score ∝ 0.5 × 0.02 × 0.01 × 0.10 = 0.00001

Normalize: P(spam | email) = 0.0120 / (0.0120 + 0.00001) ≈ 0.999 → SPAM. Three cheap multiplications per class, no joint distribution in sight.
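
The same trace can be reproduced in a few lines of Python. This is a sketch that hard-codes the likelihoods and priors from the table above; the function name posterior is an illustrative choice, not part of any library, and math.prod needs Python 3.8 or later.

```python
import math

# Per-word likelihoods and class priors copied from the worked example.
likelihoods = {
    "spam": {"free": 0.40, "prize": 0.30, "now": 0.20},
    "ham":  {"free": 0.02, "prize": 0.01, "now": 0.10},
}
priors = {"spam": 0.5, "ham": 0.5}

def posterior(words):
    """Score each class as prior x product of per-word likelihoods, then normalize."""
    scores = {
        c: priors[c] * math.prod(likelihoods[c][w] for w in words)
        for c in priors
    }
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

post = posterior(["free", "prize", "now"])
print(max(post, key=post.get), round(post["spam"], 3))  # spam 0.999
```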

The independence assumption is wrong — “new” and “york” are obviously not independent — but it rarely hurts the ranking that decides the class, even when the probability estimates themselves are off. That’s the surprise: a wrong model that still makes the right call.
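
One way to see this is to feed the model a perfectly redundant feature. The sketch below takes a single word with the likelihoods 0.20 (spam) and 0.10 (ham) and pretends it shows up as 1, 2, or 3 identical copies that Naive Bayes treats as independent evidence, which is about the most extreme violation of independence possible. The helper spam_posterior is hypothetical and exists only for this illustration.

```python
def spam_posterior(p_spam, p_ham, copies, prior=0.5):
    """Posterior P(spam | evidence) when one feature is counted `copies` times
    as if each copy were independent evidence."""
    spam = prior * p_spam ** copies
    ham = (1 - prior) * p_ham ** copies
    return spam / (spam + ham)

results = [spam_posterior(0.20, 0.10, copies) for copies in (1, 2, 3)]

for copies, p in zip((1, 2, 3), results):
    print(copies, round(p, 3), "spam" if p > 0.5 else "ham")
# 1 0.667 spam
# 2 0.8 spam
# 3 0.889 spam
```

The evidence is the same in all three cases, so the honest probability is the first one. Double counting makes the estimate more confident than it should be, but the predicted class never changes.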

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["win free money now", "free entry winner", "team meeting at noon",
         "project update attached", "claim your prize free", "lunch tomorrow?",
         "urgent click this link", "review the report please"]
labels = [1, 1, 0, 0, 1, 0, 1, 0]   # 1 = spam, 0 = ham

# Bag-of-words counts → Multinomial Naive Bayes (with Laplace smoothing built in)
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(texts, labels)

for t in ["free prize click now", "meeting about the project"]:
    p = clf.predict_proba([t])[0][1]
    print(f"{'SPAM' if p>0.5 else 'HAM ':4} ({p:.2f})  ->  {t!r}")

Output:

SPAM (0.96)  ->  'free prize click now'
HAM  (0.10)  ->  'meeting about the project'

In one breath

Naive Bayes applies Bayes’ rule: P(C | features) ∝ P(C) × P(features | C) — score each class, pick the largest.
The “naive” assumption — features are conditionally independent given the class — turns the intractable joint likelihood into a simple product of per-feature likelihoods you estimate by counting.
The assumption is usually false, but it rarely flips the argmax, so the class ranking stays right even when the probabilities are off — a wrong model that makes the right call.
The zero-frequency trap: one unseen word → likelihood 0 → zeroes the whole product; fix with Laplace (add-one) smoothing, and compute in log-space to avoid underflow (sklearn’s MultinomialNB does both).
It’s the fast text baseline (spam, sentiment, topics) — one-pass training, tiny data, huge vocabularies. Variants: Multinomial (counts), Bernoulli (presence), Gaussian (continuous).
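
The underflow point in the summary can be checked directly. The sketch below uses made-up numbers: two hypothetical classes whose 100 per-word likelihoods are all 1e-5 and 2e-5 respectively. Multiplying them in ordinary floating point collapses both scores to zero, while summing logarithms keeps them comparable.

```python
import math

# 100 hypothetical per-word likelihoods for each of two classes (made-up values).
probs_a = [1e-5] * 100
probs_b = [2e-5] * 100

# Direct products underflow: the true values are far below the smallest float64.
product_a = math.prod(probs_a)
product_b = math.prod(probs_b)
print(product_a, product_b)  # 0.0 0.0 -> the two classes can no longer be compared

# Summing logs preserves the ranking, since log is monotonic.
log_a = sum(math.log(p) for p in probs_a)
log_b = sum(math.log(p) for p in probs_b)
print(round(log_a, 2))       # about -1151.29
print(log_b > log_a)         # True -> class b still wins the argmax
```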

Quick check

Q1. What is the 'naive' assumption in Naive Bayes?

Q2. A word in your test email never appeared in the spam training data. Without smoothing, what happens?

Q3. Why does Naive Bayes work well for text despite the independence assumption being false?

You’ve now met the core algorithms. The big leap in tabular performance comes from combining them — bagging, boosting & stacking.
