What is the zero-probability problem in Naive Bayes and how do you fix it?
If a feature value never appears with a given class in training, its estimated conditional probability is zero, and since Naive Bayes multiplies probabilities, the whole posterior for that class becomes zero regardless of other evidence. The fix is Laplace (additive) smoothing, which adds a small count to every feature-class combination so no probability is ever exactly zero. This matters most for text, where most of the vocabulary never appears in any one class's training data.
How to think about it
The crisp answer
Naive Bayes computes a class score by multiplying the conditional probabilities of each feature. If any one of those is zero — because that feature value never co-occurred with the class in training — the entire product collapses to zero, vetoing the class no matter how strong the other evidence is. The standard fix is Laplace (additive) smoothing.
Why it happens
Because the posterior factorizes as a product, P(class | x) ∝ P(class) × Π P(featureᵢ | class), a single zero factor zeros the whole thing. In text this is rampant: a test document almost always contains words never seen in some class’s training set.
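To see the collapse in numbers, here is a minimal sketch. The prior and the per-word likelihoods are made-up values for illustration, not estimates from real data.

```python
import math

# Hypothetical values: P(ham) and P(word | ham) for the words of one test email.
# "refinance" never appeared in ham training data, so its unsmoothed estimate is 0.
prior_ham = 0.6
likelihoods_ham = {"meeting": 0.05, "tomorrow": 0.04, "refinance": 0.0}

# Unnormalized posterior: prior times the product of the per-word likelihoods.
score_ham = prior_ham * math.prod(likelihoods_ham.values())
print(score_ham)  # 0.0
```

The two ham-like words contribute nothing once the single zero factor enters the product.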
The fix in words
Add a constant α (usually 1) to every count and add α × (number of possible values) to the denominator. Now every feature-class probability is strictly positive, so an unseen combination dampens the class score rather than vetoing it. With α = 1 this is Laplace, or add-one, smoothing; the general form with any α > 0 is called Lidstone smoothing, and α < 1 smooths more lightly.
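The rule can be written as a one-line estimator. This is a sketch under assumed names: `smoothed_prob` and its parameters are illustrative, not taken from any library, and the counts are invented.

```python
def smoothed_prob(count, class_total, num_values, alpha=1.0):
    """Additive-smoothing estimate of P(feature value | class).

    count       -- times this feature value occurred with the class
    class_total -- total feature occurrences observed for the class
    num_values  -- number of possible feature values (vocabulary size for text)
    alpha       -- smoothing constant; 1.0 gives Laplace (add-one) smoothing
    """
    return (count + alpha) / (class_total + alpha * num_values)

# Hypothetical counts for a feature with 3 possible values, 10 observations total.
counts = [7, 3, 0]
probs = [smoothed_prob(c, class_total=10, num_values=3) for c in counts]
print(probs)  # 8/13, 4/13 and 1/13: roughly 0.615, 0.308, 0.077
```

Because α appears in both the numerator and the denominator, the estimates still sum to 1, and the unseen value gets 1/13 instead of 0.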
Concrete example
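Suppose you are classifying email as spam or ham, and the word “refinance” never appeared in your ham training data. Without smoothing, P(“refinance” | ham) = 0, so any email containing “refinance” gets a ham posterior of exactly zero and is labeled spam, however ham-like the rest of it is. With Laplace smoothing the word only contributes a low likelihood, letting the other words still influence the decision.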
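The same scenario can be reproduced end to end on a toy corpus. Everything here is hypothetical: the five training emails, the test email, and the helper names `likelihood` and `score` exist only for this sketch.

```python
from collections import Counter

# Hypothetical toy training data; "refinance" occurs only in spam.
train = {
    "spam": ["refinance your loan today", "free meeting invite refinance today"],
    "ham": ["meeting notes for today", "team meeting today at noon",
            "notes from the meeting"],
}

word_counts = {cls: Counter(w for doc in docs for w in doc.split())
               for cls, docs in train.items()}
vocab = {w for counts in word_counts.values() for w in counts}
num_docs = sum(len(docs) for docs in train.values())

def likelihood(word, cls, alpha):
    """P(word | cls) with additive smoothing; alpha = 0 means no smoothing."""
    counts = word_counts[cls]
    return (counts[word] + alpha) / (sum(counts.values()) + alpha * len(vocab))

def score(doc, cls, alpha):
    """Unnormalized posterior: class prior times the product of word likelihoods."""
    s = len(train[cls]) / num_docs
    for word in doc.split():
        s *= likelihood(word, cls, alpha)
    return s

email = "meeting today refinance"
for alpha in (0.0, 1.0):
    print(alpha, {cls: score(email, cls, alpha) for cls in train})
# With alpha = 0.0 the ham score is exactly 0.0; with alpha = 1.0 it is small but positive.
```

On data this small spam still scores higher after smoothing, but ham is back in contention, so further ham-like words could tip the decision.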
The common trap
Forgetting smoothing entirely, or applying it inconsistently: adding α to the numerator without adding α × (number of possible values) to the denominator gives estimates that no longer sum to 1. A related practical trick is to sum log-probabilities instead of multiplying raw probabilities, which avoids floating-point underflow from many tiny factors. Follow-up: “Why log space?” Products of many small probabilities underflow to zero, while sums of logs stay in a representable range, and because log is monotonic the highest-scoring class is unchanged.
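A quick way to demonstrate the underflow point, assuming a long document whose 400 word likelihoods are all 1e-3 (an invented, deliberately uniform value):

```python
import math

# Hypothetical: 400 word likelihoods of 1e-3 each.
probs = [1e-3] * 400

# The true product is 1e-1200, far below the smallest positive float64.
raw_product = math.prod(probs)

# The log-space score is an ordinary finite number, about -2763.1.
log_score = sum(math.log(p) for p in probs)

print(raw_product, log_score)
```

Log space complements smoothing rather than replacing it: `math.log(0.0)` raises `ValueError`, so a zero likelihood still has to be smoothed away first.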