datarekha
Machine Learning · Easy · Asked at Google · Asked at Amazon

How does Naive Bayes work, and why is it called 'naive'?

The short answer

Naive Bayes applies Bayes' theorem to classify by computing the posterior probability of each class given the features. It is naive because it assumes all features are conditionally independent given the class label — an assumption that is almost never true in practice, yet the classifier still works surprisingly well for text and other sparse data.

How to think about it

Naive Bayes is a probabilistic classifier that is fast, needs little training data, and is interpretable. The interview goal is to explain both the Bayes’ theorem machinery and the independence assumption, including when that assumption costs you performance.

The mechanism

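Bayes’ theorem: P(y | x) = P(y) · P(x | y) / P(x)

The evidence P(x) is the same for every class, so it can be dropped when comparing classes: P(y | x) ∝ P(y) · P(x | y)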

For a feature vector x = (x_1, …, x_d), the naive conditional independence assumption says:

P(x | y) = P(x_1 | y) · P(x_2 | y) · … · P(x_d | y)

This collapses estimation of one joint distribution over d features into d separate one-dimensional distributions per class, each of which can be estimated even from a small training set.
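To make the saving concrete, here is a small sketch that counts free parameters per class. It assumes binary features purely for illustration (the text above does not fix a feature type), and both function names are hypothetical.

```python
def joint_param_count(d: int) -> int:
    """Free parameters of a full joint table over d binary features, per class."""
    # 2**d possible feature combinations, minus 1 because probabilities sum to 1.
    return 2 ** d - 1


def naive_param_count(d: int) -> int:
    """Free parameters under conditional independence, per class."""
    # One Bernoulli parameter P(x_j = 1 | y) for each of the d features.
    return d


for d in (5, 10, 20):
    print(d, joint_param_count(d), naive_param_count(d))
```

With d = 20 the full joint needs 1,048,575 parameters per class, while the naive factorisation needs 20.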

The predicted class is the one that maximises the posterior:

ŷ = argmax_y P(y) · ∏_{j=1}^{d} P(x_j | y)

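Computationally, the product is replaced by a sum of logarithms, log P(y) + Σ_j log P(x_j | y), to avoid floating-point underflow. Because the logarithm is monotonic, the argmax is unchanged.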
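The underflow problem is easy to reproduce. The sketch below uses made-up likelihood values for two hypothetical classes with 2000 features each; the numbers and helper names are assumptions chosen only to trigger underflow, not estimates from real data.

```python
import math

# Hypothetical per-feature likelihoods P(x_j | y) for two classes, 2000 features each.
likelihoods = {
    "a": [1e-3] * 2000,
    "b": [2e-3] * 2000,
}
priors = {"a": 0.5, "b": 0.5}


def product_score(label: str) -> float:
    """P(y) * prod_j P(x_j | y), computed directly (underflows for large d)."""
    score = priors[label]
    for p in likelihoods[label]:
        score *= p
    return score


def log_score(label: str) -> float:
    """log P(y) + sum_j log P(x_j | y), the same quantity in log space."""
    return math.log(priors[label]) + sum(math.log(p) for p in likelihoods[label])


# Both direct products collapse to 0.0, so the classes cannot be ranked.
print(product_score("a"), product_score("b"))

# The log scores stay finite and still identify the class with the larger posterior.
best = max(priors, key=log_score)
print(best)
```

Here both direct products underflow to zero, while the log scores remain finite and pick class "b", whose per-feature likelihoods are larger.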

Variants

| Variant | Feature type | P(x_j \| y) modeled as |
|---|---|---|
| Gaussian NB | Continuous | Normal distribution |
| Multinomial NB | Counts (e.g., word counts) | Multinomial |
| Bernoulli NB | Binary (word presence) | Bernoulli |
| Complement NB | Text, imbalanced classes | Complement of class counts |

Why it still works

Even though features are correlated (the words “machine” and “learning” tend to co-occur), classification only requires getting the ranking of the class posteriors right, not producing calibrated probabilities. In high-dimensional sparse spaces such as text, the bias introduced by the independence assumption is often smaller than the variance you would incur by estimating the full joint distribution from a limited sample.
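A toy calculation can illustrate the gap between ranking and calibration. The sketch below assumes two classes with equal priors and one binary feature that is then duplicated several times, which is the most extreme form of correlation. The probabilities and the function name are hypothetical.

```python
def nb_posterior_a(p_a: float, p_b: float, copies: int, prior_a: float = 0.5) -> float:
    """Naive Bayes posterior for class A when one observed feature is counted `copies` times.

    p_a and p_b are P(feature = 1 | A) and P(feature = 1 | B).
    Treating perfectly correlated copies as independent multiplies the same
    likelihood in repeatedly.
    """
    score_a = prior_a * p_a ** copies
    score_b = (1 - prior_a) * p_b ** copies
    return score_a / (score_a + score_b)


honest = nb_posterior_a(0.8, 0.4, copies=1)    # 2/3: the evidence counted once
inflated = nb_posterior_a(0.8, 0.4, copies=5)  # 32/33: the same evidence counted five times

print(round(honest, 3), round(inflated, 3))
```

Both posteriors favour class A, so the predicted label is the same, but the duplicated features push the probability from about 0.667 to about 0.970. The ranking survives while the calibration does not.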

from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb",    MultinomialNB(alpha=1.0)),   # alpha = Laplace smoothing
])
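# docs_train: list of raw text documents; y_train: matching class labels (assumed defined elsewhere)
pipe.fit(docs_train, y_train)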

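alpha=1.0 (Laplace smoothing) prevents zero likelihoods for tokens that never occur with a given class in the training data. Without it, a single such token would drive that class's posterior to zero regardless of the other evidence.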
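The effect of smoothing can be checked by hand. The sketch below uses the standard additive-smoothing estimate for a multinomial model, (count + alpha) / (total + alpha · V), with made-up counts. The function name and the numbers are assumptions for illustration.

```python
def smoothed_likelihood(count_wc: int, total_c: int, vocab_size: int, alpha: float = 1.0) -> float:
    """Additively smoothed estimate of P(word | class).

    count_wc:   times the word occurs in documents of the class
    total_c:    total word occurrences in documents of the class
    vocab_size: number of distinct words in the vocabulary (V)
    """
    return (count_wc + alpha) / (total_c + alpha * vocab_size)


# A word never seen with this class: 0 occurrences out of 100 tokens, V = 50.
unsmoothed = smoothed_likelihood(0, 100, 50, alpha=0.0)
smoothed = smoothed_likelihood(0, 100, 50, alpha=1.0)

print(unsmoothed, smoothed)
```

With alpha = 0 the estimate is exactly zero, which would wipe out the class's whole product. With alpha = 1 it becomes 1/150, small but nonzero.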
