How does Naive Bayes work, and why is it called 'naive'?
Naive Bayes applies Bayes' theorem to classify by computing the posterior probability of each class given the features. It is naive because it assumes all features are conditionally independent given the class label — an assumption that is almost never true in practice, yet the classifier still works surprisingly well for text and other sparse data.
How to think about it
Naive Bayes is a probabilistic classifier that is fast to train, works with little training data, and is easy to interpret. The interview goal is to explain both the Bayes’ theorem machinery and the independence assumption, including when that assumption costs you performance.
The mechanism
Bayes’ theorem: P(y | x) ∝ P(y) · P(x | y)
For a feature vector x = (x_1, …, x_d), the naive conditional independence assumption says:
P(x | y) = P(x_1 | y) · P(x_2 | y) · … · P(x_d | y)
This collapses estimation of one joint distribution over d features into d separate one-dimensional distributions per class, which stays tractable even from a small training set.
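To put a number on that saving, here is a small sketch that counts free parameters per class, assuming every feature is binary. The function names are made up for illustration; the binary-feature setting is an assumption, not something the text above requires.

```python
def joint_param_count(d: int) -> int:
    """Free parameters per class for a full joint table over d binary features."""
    # 2**d possible feature combinations, minus 1 because probabilities sum to 1.
    return 2 ** d - 1


def naive_param_count(d: int) -> int:
    """Free parameters per class under conditional independence."""
    # One Bernoulli parameter P(x_j = 1 | y) per feature.
    return d


for d in (5, 10, 20):
    print(d, joint_param_count(d), naive_param_count(d))
# 5 31 5
# 10 1023 10
# 20 1048575 20
```

The full joint grows exponentially in d, while the naive factorization grows linearly, which is why it can be estimated from far fewer examples.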
The predicted class is the one that maximises the posterior:
ŷ = argmax_y P(y) · ∏ P(x_j | y)
Computationally, the product is replaced with a sum of log-probabilities to avoid floating-point underflow. Because the logarithm is monotonic, the argmax is unchanged.
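The argmax rule and the log-space trick can be sketched in plain Python. This is a minimal illustration for binary features, not a library implementation; the function name, the class labels, and all probabilities below are invented for the example.

```python
import math

# Multiplying many small probabilities underflows to zero in floating point.
probs = [0.01] * 1000
product = 1.0
for p in probs:
    product *= p                               # ends up as exactly 0.0
log_sum = sum(math.log(p) for p in probs)      # about -4605.17, still usable


def predict_log_space(priors, likelihoods, x):
    """Return (best_class, log_scores) for a binary feature vector x.

    priors:      {class: P(y)}
    likelihoods: {class: [P(x_j = 1 | y) for each feature j]}
    """
    scores = {}
    for y, prior in priors.items():
        score = math.log(prior)
        for p, x_j in zip(likelihoods[y], x):
            # Use P(x_j = 1 | y) if the feature is present, else its complement.
            score += math.log(p if x_j else 1.0 - p)
        scores[y] = score
    return max(scores, key=scores.get), scores


# Hypothetical two-feature spam example.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {"spam": [0.8, 0.1], "ham": [0.1, 0.5]}
label, scores = predict_log_space(priors, likelihoods, [1, 0])
print(label)
```

Here the unnormalised scores are 0.4 · 0.8 · 0.9 = 0.288 for spam and 0.6 · 0.1 · 0.5 = 0.03 for ham, so the prediction is spam. The scores are never normalised because only their order matters.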
Variants
| Variant | Feature type | P(x_j \| y) modeled as |
|---|---|---|
| Gaussian NB | Continuous | Normal distribution |
| Multinomial NB | Counts (e.g., word counts) | Multinomial |
| Bernoulli NB | Binary (word presence) | Bernoulli |
| Complement NB | Text, imbalanced classes | Complement of class counts |
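The first three variants differ only in the per-feature log-likelihood term that gets added to the class score. The sketch below shows one hedged way to write those terms; the function names are illustrative, and Complement NB is left out because it changes how the parameters are estimated rather than the form of the term.

```python
import math


def gaussian_log_lik(x: float, mean: float, var: float) -> float:
    """Gaussian NB: log density of a continuous feature under N(mean, var)."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def bernoulli_log_lik(x: int, p: float) -> float:
    """Bernoulli NB: log probability of a binary feature, with p = P(x_j = 1 | y)."""
    return math.log(p) if x == 1 else math.log(1.0 - p)


def multinomial_log_lik(count: int, p: float) -> float:
    """Multinomial NB: contribution of a token seen `count` times.

    The multinomial coefficient is omitted because it is the same for
    every class and so does not affect the argmax.
    """
    return count * math.log(p)


print(gaussian_log_lik(0.0, mean=0.0, var=1.0))
print(bernoulli_log_lik(0, p=0.25))
print(multinomial_log_lik(3, p=0.5))
```

Note that Bernoulli NB explicitly penalises absent features through the 1 - p branch, whereas a zero count contributes nothing under the multinomial term.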
Why it still works
Even though features are correlated (the words “machine” and “learning” co-occur), classification only requires ranking the class posteriors correctly, not producing calibrated probabilities. In high-dimensional sparse spaces such as text, the bias introduced by the independence assumption is often smaller than the variance that would come from estimating the full joint distribution from a limited sample.
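A toy calculation makes the ranking-versus-calibration point concrete. Assume one informative binary feature that is accidentally duplicated n times (an extreme case of correlation); Naive Bayes counts the same evidence n times. The probabilities and the function name are invented for illustration.

```python
def posterior_pos(n_copies: int, p_pos: float = 0.7, p_neg: float = 0.4,
                  prior_pos: float = 0.5) -> float:
    """Naive Bayes posterior P(pos | x) when the same feature, observed as 1,
    appears n_copies times and is wrongly treated as independent evidence."""
    num = prior_pos * p_pos ** n_copies
    den = num + (1.0 - prior_pos) * p_neg ** n_copies
    return num / den


for n in (1, 2, 5, 10):
    print(n, round(posterior_pos(n), 4))
```

With a single copy the posterior is about 0.64, which is the correct value for this evidence. With five copies it climbs to about 0.94, so the probability is badly overconfident, yet the predicted class is the same in every case.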
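```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Text classification pipeline: TF-IDF features fed into Multinomial NB.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB(alpha=1.0)),  # alpha=1.0 is Laplace (add-one) smoothing
])

# docs_train: list of raw text documents; y_train: their class labels.
pipe.fit(docs_train, y_train)
```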
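alpha=1.0 (Laplace smoothing) prevents zero probabilities for tokens that never appear with a given class in the training data. Without it, a single such token would zero out that class's entire product.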
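The effect of smoothing can be checked by hand with the usual additive-smoothing estimate, (count + alpha) / (class_total + alpha · vocab_size). The sketch below uses invented token counts for a hypothetical "spam" class; the function name is illustrative.

```python
def smoothed_prob(count: float, class_total: float, vocab_size: int,
                  alpha: float = 1.0) -> float:
    """Additive (Laplace/Lidstone) estimate of P(token | class)."""
    return (count + alpha) / (class_total + alpha * vocab_size)


# Hypothetical token counts observed in the "spam" class.
spam_counts = {"free": 3, "win": 2, "meeting": 0}
total = sum(spam_counts.values())   # 5
vocab = len(spam_counts)            # 3

unsmoothed = {t: smoothed_prob(c, total, vocab, alpha=0.0) for t, c in spam_counts.items()}
smoothed = {t: smoothed_prob(c, total, vocab, alpha=1.0) for t, c in spam_counts.items()}

print(unsmoothed["meeting"])  # 0.0, which would zero out the whole product
print(smoothed["meeting"])    # 0.125
print(sum(smoothed.values())) # 1.0
```

With alpha=1.0 the estimates become 0.5, 0.375, and 0.125: the unseen token gets a small nonzero probability and the distribution still sums to 1.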