What is the difference between stemming and lemmatization?
The short answer
Stemming strips suffixes using heuristic rules and may produce non-words, while lemmatization uses a vocabulary and morphological analysis to return the canonical dictionary form. Lemmatization is slower and works best with part-of-speech information, but it returns a real dictionary form for every word its vocabulary covers, making it preferable when interpretability matters.
How to think about it
Both techniques reduce inflected words to a common base to collapse vocabulary size and improve recall in retrieval tasks.
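Stemming applies a sequence of suffix-stripping rules (Porter, Snowball). It is fast, deterministic, and needs no dictionary, but the rules are written per language (Porter targets English; Snowball ships a separate algorithm for each supported language) and operate purely on characters: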
- “running” → “run”
- “studies” → “studi” (not a real word)
- “generous” → “gener” (collides with “generate”)
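To make the mechanism concrete, here is a minimal sketch of a suffix-stripping stemmer. The rule list, the `MIN_STEM_LEN` guard, and the name `toy_stem` are all hypothetical and far simpler than Porter's actual rules; the point is only that the output depends on characters, never on a dictionary.

```python
# Toy suffix-stripping stemmer. The rules below are illustrative
# assumptions, not the Porter rule set.
RULES = [
    ("ies", "i"),  # studies  -> studi
    ("ing", ""),   # jumping  -> jump
    ("ous", ""),   # generous -> gener
    ("ate", ""),   # generate -> gener
    ("s", ""),     # cats     -> cat
]

MIN_STEM_LEN = 3  # assumed guard so short words such as "sing" are left alone


def toy_stem(word: str) -> str:
    """Apply the first matching rule; no dictionary is consulted."""
    word = word.lower()
    for suffix, replacement in RULES:
        stem_len = len(word) - len(suffix)
        if word.endswith(suffix) and stem_len >= MIN_STEM_LEN:
            return word[:stem_len] + replacement
    return word


for w in ["studies", "jumping", "generous", "generate", "sing"]:
    print(w, "->", toy_stem(w))
# studies -> studi
# jumping -> jump
# generous -> gener
# generate -> gener
# sing -> sing
```

Because this sketch has no doubled-consonant cleanup, it would turn “running” into “runn”; real stemmers add further rules to handle cases like that.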
Lemmatization consults a morphological dictionary (WordNet in NLTK) and optionally uses POS tags to pick the correct lemma. NLTK's WordNetLemmatizer treats every word as a noun unless a tag is given:
- “running” (verb) → “run”
- “studies” (verb) → “study”
- “better” (adjective) → “good”
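The role of the POS tag can be sketched with a tiny lookup table standing in for a real morphological dictionary. Everything here (`LEMMA_TABLE`, `toy_lemmatize`, the entries themselves) is a hypothetical illustration, not how WordNet stores its data.

```python
# Toy lemmatizer: a hand-written table keyed by (word, pos) stands in
# for a real morphological dictionary. All entries are illustrative.
LEMMA_TABLE = {
    ("running", "v"): "run",
    ("running", "n"): "running",
    ("studies", "v"): "study",
    ("studies", "n"): "study",
    ("better", "a"): "good",
    ("better", "v"): "better",
}


def toy_lemmatize(word: str, pos: str = "n") -> str:
    """Return the lemma for (word, pos), or the word unchanged if unknown."""
    return LEMMA_TABLE.get((word.lower(), pos), word)


print(toy_lemmatize("better", pos="a"))  # good
print(toy_lemmatize("better", pos="v"))  # better
print(toy_lemmatize("running"))          # running (noun reading by default)
print(toy_lemmatize("blorple"))          # blorple (not in the table)
```

The same surface form maps to different lemmas depending on the tag, which is why a wrong or missing tag is a common source of surprising output.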
```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

# The lemmatizer needs the WordNet corpus; some NLTK versions also need "omw-1.4".
nltk.download("wordnet", quiet=True)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["running", "studies", "better"]
for w in words:
    # pos='v' tells the lemmatizer to treat every word as a verb
    print(f"{w:12} stem={stemmer.stem(w):10} lemma={lemmatizer.lemmatize(w, pos='v')}")

# running      stem=run        lemma=run
# studies      stem=studi      lemma=study
# better       stem=better     lemma=better   (needs pos='a' for 'good')
```
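Since the right `pos` value differs per word, a common pattern is to derive it from a tagger's output. The sketch below maps Penn Treebank tags to the single-letter POS codes WordNet uses. The helper name `penn_to_wordnet` and the hand-written `tagged` list are assumptions for illustration; in practice the tags might come from a tagger such as `nltk.pos_tag`.

```python
def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet POS letter (default: noun)."""
    if tag.startswith("JJ"):
        return "a"  # adjective (JJ, JJR, JJS)
    if tag.startswith("VB"):
        return "v"  # verb (VB, VBD, VBG, ...)
    if tag.startswith("RB"):
        return "r"  # adverb (RB, RBR, RBS)
    return "n"      # nouns, plus a fallback for every other tag


# Hand-written (word, tag) pairs standing in for real tagger output.
tagged = [("running", "VBG"), ("studies", "NNS"), ("better", "JJR")]
for word, tag in tagged:
    print(word, tag, "->", penn_to_wordnet(tag))
# running VBG -> v
# studies NNS -> n
# better JJR -> a
```

The returned letter can be passed as the `pos` argument of `lemmatizer.lemmatize(word, pos=...)`, so that “better” tagged as an adjective is looked up with `pos='a'`.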
When to choose which
| | Stemming | Lemmatization |
|---|---|---|
| Speed | Very fast | Slower (dictionary lookup) |
| Output | May be a non-word | Dictionary form for known words |
| Accuracy | Lower | Higher |
| Use case | Search indexing | Chatbots, QA, NLU |
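As a rough illustration of the search-indexing use case, the sketch below builds a small inverted index twice, once on raw tokens and once on normalized tokens. The `normalize` function is a hypothetical stand-in for a real stemmer, and the documents are invented for the example.

```python
from collections import defaultdict


def normalize(token: str) -> str:
    """Hypothetical stand-in for a stemmer: 'studies' and 'study' -> 'studi'."""
    token = token.lower()
    if token.endswith("ies"):
        return token[:-3] + "i"
    if token.endswith("y") and len(token) > 2:
        return token[:-1] + "i"
    if token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def build_index(docs, key=lambda t: t.lower()):
    """Map each normalized token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[key(token)].add(doc_id)
    return index


docs = {
    1: "she studies graph algorithms",
    2: "a study of graph search",
    3: "cooking recipes",
}

raw = build_index(docs)
stemmed = build_index(docs, key=normalize)

query = "study"
print(sorted(raw.get(query.lower(), set())))         # [2]
print(sorted(stemmed.get(normalize(query), set())))  # [1, 2]
```

The key “studi” is not a word, but that does not matter here because it is never shown to a user. It only has to be the same for the query and the documents, which is why stemming is often good enough for indexing.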