NLP & LLMs · Easy · Asked at Amazon · Asked at Microsoft

What is the difference between stemming and lemmatization?

The short answer

Stemming strips suffixes with heuristic rules and can produce non-words, while lemmatization uses a vocabulary and morphological analysis to return the canonical dictionary form (the lemma). Lemmatization is slower, but it returns a real dictionary word whenever the input is in its vocabulary, which makes it preferable when interpretability matters.

How to think about it

Both techniques reduce inflected words to a common base to collapse vocabulary size and improve recall in retrieval tasks.
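To make the recall effect concrete, here is a minimal sketch of a tiny inverted index built with and without normalization. The `BASE_FORM` table, `normalize`, and `build_index` are hypothetical names invented for this illustration; the hand-written table merely stands in for a real stemmer or lemmatizer.

```python
from collections import defaultdict
from typing import Dict, Set

# Hypothetical hand-written table standing in for a real stemmer or
# lemmatizer; it only covers the words used in this example.
BASE_FORM = {"running": "run", "runs": "run", "ran": "run"}

def normalize(token: str) -> str:
    """Lowercase a token and map it to its base form if one is known."""
    token = token.lower()
    return BASE_FORM.get(token, token)

def build_index(docs: Dict[int, str], use_normalization: bool) -> Dict[str, Set[int]]:
    """Build an inverted index: term -> ids of the documents containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            term = normalize(token) if use_normalization else token.lower()
            index[term].add(doc_id)
    return index

docs = {1: "She runs daily", 2: "He ran home", 3: "Running is fun"}

raw = build_index(docs, use_normalization=False)
norm = build_index(docs, use_normalization=True)

# Vocabulary size and the documents matched by the query "run".
print(len(raw), sorted(raw.get("run", set())))    # 9 []
print(len(norm), sorted(norm.get("run", set())))  # 7 [1, 2, 3]
```

Without normalization the query "run" matches nothing, because the index only knows "runs", "ran", and "running" as separate terms. With normalization the three forms share one entry, so the vocabulary shrinks and all three documents are retrieved.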

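Stemming applies a sequence of suffix-stripping rules (Porter, Snowball). It is fast, deterministic, and needs no dictionary, but the rules are language-specific (Porter targets English; Snowball provides a separate stemmer per language) and operate purely on characters: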

“running” → “run”
“studies” → “studi” (not a real word)
“generous” → “gener” (collides with “generate”)
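The rule-based idea can be sketched in a few lines. This is a toy stemmer, not the Porter algorithm: the four rules in `RULES`, the `MIN_STEM_LEN` guard, and the function name `toy_stem` are assumptions chosen only so that the sketch reproduces the examples above.

```python
# Ordered (suffix, replacement) rules; the first matching rule wins.
# Illustrative only: the real Porter stemmer has many more rules and conditions.
RULES = [("ies", "i"), ("ing", ""), ("ous", ""), ("ate", "")]
MIN_STEM_LEN = 3  # hypothetical guard so short words such as "sing" are left alone

def toy_stem(word: str) -> str:
    """Strip the first matching suffix, working on characters only."""
    word = word.lower()
    for suffix, replacement in RULES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= MIN_STEM_LEN:
            stem += replacement
            # After removing "-ing", undouble a trailing consonant
            # ("runn" -> "run"), keeping "ll", "ss", and "zz" intact.
            if (
                suffix == "ing"
                and stem[-1] == stem[-2]
                and stem[-1] not in "aeioulsz"
            ):
                stem = stem[:-1]
            return stem
    return word  # no rule applied

for w in ["running", "studies", "generous", "generate", "sing"]:
    print(w, "->", toy_stem(w))
# running -> run
# studies -> studi
# generous -> gener
# generate -> gener
# sing -> sing
```

Because the rules never consult a dictionary, nothing stops "studies" from becoming the non-word "studi", or "generous" and "generate" from collapsing to the same stem.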

Lemmatization consults a morphological dictionary (WordNet in NLTK) and optionally uses POS tags to pick the correct root form:

“running” (verb) → “run”
“studies” (verb) → “study”
“better” (adjective) → “good”
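The role of the POS tag is easiest to see with a lookup keyed on both the word and its part of speech. The sketch below uses a hypothetical seven-entry `LEXICON` and a made-up `toy_lemmatize` function; a real lemmatizer would consult WordNet or a similar resource instead. The default `pos="n"` mirrors the noun default of NLTK's `WordNetLemmatizer.lemmatize`.

```python
# Hypothetical miniature lexicon keyed by (surface form, POS tag).
# POS letters follow the WordNet convention: n=noun, v=verb, a=adjective, r=adverb.
LEXICON = {
    ("running", "v"): "run",
    ("running", "n"): "running",
    ("studies", "v"): "study",
    ("studies", "n"): "study",
    ("better", "a"): "good",
    ("better", "r"): "well",
    ("better", "v"): "better",
}

def toy_lemmatize(word: str, pos: str = "n") -> str:
    """Return the lemma for (word, pos); unknown pairs come back unchanged."""
    return LEXICON.get((word.lower(), pos), word)

print(toy_lemmatize("running", "v"))  # run
print(toy_lemmatize("running", "n"))  # running
print(toy_lemmatize("better", "a"))   # good
print(toy_lemmatize("better", "v"))   # better
print(toy_lemmatize("blorp"))         # blorp
```

The same surface form maps to different lemmas depending on the tag, which is why a wrong or missing POS tag is the most common reason a lemmatizer appears to do nothing.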

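import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

# WordNetLemmatizer needs the WordNet corpus; this downloads it once.
nltk.download("wordnet", quiet=True)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Pair each word with its WordNet POS tag: 'v' = verb, 'a' = adjective.
words = [("running", "v"), ("studies", "v"), ("better", "a")]
for w, pos in words:
    print(f"{w:12} stem={stemmer.stem(w):10} lemma={lemmatizer.lemmatize(w, pos=pos)}")
# running      stem=run        lemma=run
# studies      stem=studi      lemma=study
# better       stem=better     lemma=good
# With pos='v', "better" would come back unchanged; the adjective tag is
# what lets the lemmatizer map it to "good".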
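In practice the POS argument usually comes from a tagger rather than being typed by hand. Assuming Penn Treebank-style tags (the tag set `nltk.pos_tag` produces by default for English), a small helper can translate them into the single-letter tags WordNet expects. The helper name `penn_to_wordnet` and the hand-written tags below are illustrative assumptions, not part of NLTK.

```python
def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet POS letter ('n', 'v', 'a', 'r')."""
    if tag.startswith("J"):
        return "a"  # JJ, JJR, JJS: adjectives
    if tag.startswith("V"):
        return "v"  # VB, VBD, VBG, VBN, VBP, VBZ: verbs
    if tag.startswith("RB"):
        return "r"  # RB, RBR, RBS: adverbs
    return "n"      # nouns and everything else fall back to the noun default

# Hand-written (word, tag) pairs standing in for real tagger output.
tagged = [("better", "JJR"), ("running", "VBG"), ("studies", "NNS")]
for word, tag in tagged:
    print(word, tag, "->", penn_to_wordnet(tag))
# better JJR -> a
# running VBG -> v
# studies NNS -> n
```

The returned letter can then be passed as the `pos` argument of the lemmatizer, so each token is lemmatized under the part of speech it has in its sentence.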

When to choose which

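Criterion	Stemming	Lemmatization
Speed	Very fast (rule-based)	Slower (dictionary lookup, optional POS tagging)
Output	May be a non-word	Dictionary form for known words
Accuracy	Lower (over- and under-stemming)	Higher, given the correct POS
Use case	Search indexing	Chatbots, QA, NLU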
