datarekha
NLP & LLMs | Easy | Asked at Amazon | Asked at Microsoft

What is the difference between stemming and lemmatization?

The short answer

Stemming strips suffixes using heuristic rules and may produce non-words, while lemmatization uses a vocabulary and morphological analysis to return the canonical dictionary form (the lemma). Lemmatization is slower, but for any word its dictionary covers it returns a real word, which makes it preferable when interpretability matters.

How to think about it

Both techniques reduce inflected words to a common base to collapse vocabulary size and improve recall in retrieval tasks.

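Stemming applies a sequence of suffix-stripping rules (Porter, Snowball). It is fast, deterministic, and needs no dictionary, but the rules are written per language (Porter targets English; Snowball provides separate rule sets for other languages) and operate purely on characters: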

  • “running” → “run”
  • “studies” → “studi” (not a real word)
  • “generous” → “gener” (collides with “generate”)
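
To make the rule-based idea concrete, here is a minimal sketch of a suffix stripper. It is a toy, not the Porter algorithm: the `RULES` table and the `toy_stem` function are hypothetical names invented for illustration, and real stemmers apply many more rules with conditions on the remaining stem (for example, undoubling the consonant in "running").

```python
# Toy suffix stripper: a hypothetical, heavily simplified stand-in for a real
# stemmer such as Porter. Rules are tried in order; the first usable match wins.
RULES = [
    ("ies", "i"),  # studies  -> studi
    ("ing", ""),   # jumping  -> jump
    ("ous", ""),   # generous -> gener
    ("ate", ""),   # generate -> gener
    ("s", ""),     # cats     -> cat
]


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)] + replacement
            if len(stem) >= 3:
                return stem
    return word  # no rule applied


for w in ["studies", "jumping", "generous", "generate", "cats", "bus"]:
    print(f"{w:10} -> {toy_stem(w)}")
# studies    -> studi
# jumping    -> jump
# generous   -> gener
# generate   -> gener
# cats       -> cat
# bus        -> bus
```

Because the rules only look at characters, "generous" and "generate" end up with the same stem even though their meanings differ, which is the collision described above.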

Lemmatization consults a morphological dictionary (WordNet in NLTK) and uses the word's POS tag, when one is supplied, to pick the correct lemma (NLTK's WordNetLemmatizer assumes a noun by default):

  • “running” (verb) → “run”
  • “studies” (verb) → “study”
  • “better” (adjective) → “good”

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

# The lemmatizer needs the WordNet corpus; some NLTK releases also require "omw-1.4".
nltk.download("wordnet", quiet=True)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["running", "studies", "better"]
# pos='v' lemmatizes every word as a verb; omitting pos would default to noun.
for w in words:
    print(f"{w:12} stem={stemmer.stem(w):10} lemma={lemmatizer.lemmatize(w, pos='v')}")
# running      stem=run        lemma=run
# studies      stem=studi      lemma=study
# better       stem=better     lemma=better  (needs pos='a' for 'good')
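
The "better" row shows that the lemma depends on the POS passed in. In practice the POS usually comes from a tagger that emits Penn Treebank tags, which then have to be mapped to WordNet's single-letter codes. The sketch below shows one common way to do that mapping; `penn_to_wordnet` is a hypothetical helper name, and falling back to noun for unknown tags is an assumption chosen to mirror the lemmatizer's own default.

```python
def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet POS code ('a', 'v', 'r', or 'n')."""
    if tag.startswith("J"):
        return "a"  # adjectives: JJ, JJR, JJS
    if tag.startswith("V"):
        return "v"  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("RB"):
        return "r"  # adverbs: RB, RBR, RBS
    return "n"      # nouns, plus an assumed fallback for every other tag


for tag in ["JJR", "VBG", "NNS", "RB", "DT"]:
    print(f"{tag:4} -> {penn_to_wordnet(tag)}")
# JJR  -> a
# VBG  -> v
# NNS  -> n
# RB   -> r
# DT   -> n
```

With NLTK, the tags could come from `nltk.pos_tag` (which needs its own tagger data downloaded), and the mapped code would be passed as the `pos` argument of `lemmatize`.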

When to choose which

            Stemming           Lemmatization
Speed       Very fast          Slower (dictionary lookup)
Output      May be non-word    Dictionary form (real word)
Accuracy    Lower              Higher
Use case    Search indexing    Chatbots, QA, NLU
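
A rough illustration of why stemming suits search indexing: if documents and queries pass through the same normalizer, a non-word stem such as "studi" is harmless, because it is only ever used as a lookup key. The sketch below uses a hypothetical two-rule `crude_stem` function as a placeholder for a real stemmer; `build_index` and `search` are likewise illustrative names, not part of any library.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set


def crude_stem(token: str) -> str:
    """Hypothetical two-rule normalizer, not a real stemming algorithm."""
    if token.endswith("ies"):
        return token[:-3] + "i"  # studies -> studi
    if token.endswith("y") and len(token) > 3:
        return token[:-1] + "i"  # study -> studi
    return token


def build_index(docs: List[str], normalize: Callable[[str], str]) -> Dict[str, Set[int]]:
    """Map each normalized token to the ids of the documents containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in text.lower().split():
            index[normalize(token)].add(doc_id)
    return index


def search(index: Dict[str, Set[int]], query: str, normalize: Callable[[str], str]) -> List[int]:
    """Look up a one-word query after applying the same normalizer as the index."""
    return sorted(index.get(normalize(query.lower()), set()))


docs = ["the study was short", "three studies were long", "she studies daily"]


def identity(token: str) -> str:
    return token


exact = build_index(docs, identity)
stemmed = build_index(docs, crude_stem)

print(search(exact, "study", identity))      # [0]
print(search(stemmed, "study", crude_stem))  # [0, 1, 2]
```

The exact-match index misses the two documents that say "studies", while the stemmed index retrieves all three. The user never sees the stem, which is why its readability matters less here than in a chatbot or QA system.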
