datarekha
NLP & LLMs · Easy · Asked at Amazon · Asked at Microsoft

What are stop words and when should you remove them?

The short answer

Stop words are high-frequency function words — 'the', 'is', 'at', 'which' — that typically carry little discriminative content for tasks like classification or retrieval. Removing them reduces vocabulary size and noise, but for tasks like sentiment analysis or question answering, some function words can be semantically important and should be kept.

How to think about it

Stop words are words that appear so frequently across all documents that they contribute near-zero TF-IDF weight and dilute the signal in bag-of-words features. Classic examples: “a”, “the”, “and”, “to”, “in”, “is”.
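To make the "near-zero TF-IDF weight" claim concrete, here is a minimal sketch that computes plain inverse document frequency, log(N / df), on a toy corpus. The corpus and the `idf` helper are made up for illustration. Library implementations usually apply smoothing, so their values for stop words are small rather than exactly zero.

```python
import math

# Toy corpus (made up for illustration); each document is already tokenized.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the stock market fell sharply".split(),
    "the court dismissed the case".split(),
]

def idf(term, docs):
    """Plain inverse document frequency: log(N / df), with no smoothing."""
    df = sum(1 for doc in docs if term in doc)  # documents containing the term
    return math.log(len(docs) / df)

for term in ["the", "cat", "market"]:
    print(term, round(idf(term, docs), 3))
# the 0.0
# cat 0.693
# market 1.386
```

Because "the" occurs in every document, its IDF is log(1) = 0, so any TF-IDF product for it vanishes no matter how often it appears.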

Why remove them

  • Reduces feature dimensionality, speeding up training and inference.
  • Prevents high-frequency noise from dominating cosine similarity scores.
  • Makes remaining features more discriminative.
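
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Fetch the stop word list and tokenizer data on first run.
nltk.download("stopwords", quiet=True)
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)  # newer NLTK releases look for this instead of "punkt"

text = "the quick brown fox jumps over the lazy dog"
stop = set(stopwords.words("english"))  # a set gives fast membership tests

# Lowercase each token before the lookup because the NLTK list is lowercase.
tokens = word_tokenize(text)
filtered = [t for t in tokens if t.lower() not in stop]
print(filtered)
# ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']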
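The cosine similarity point in the list above can be illustrated with raw term counts. This is a rough sketch: the `STOP` set, the `cosine` and `strip_stop` helpers, and both sentences are assumptions chosen for the example, not part of NLTK.

```python
import math
from collections import Counter

# Small illustrative stop list (an assumption, not NLTK's full list).
STOP = {"the", "is", "on", "a", "of", "in"}

def cosine(a, b):
    """Cosine similarity between two token lists using raw term counts."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def strip_stop(tokens):
    """Drop every token that appears in the illustrative stop list."""
    return [t for t in tokens if t not in STOP]

# Two unrelated sentences that share only function words.
doc_a = "the price of the stock is in the news".split()
doc_b = "the cat is on the roof of the house".split()

raw = cosine(doc_a, doc_b)
cleaned = cosine(strip_stop(doc_a), strip_stop(doc_b))
print(f"with stop words:    {raw:.2f}")      # 0.73
print(f"without stop words: {cleaned:.2f}")  # 0.00
```

The two sentences have no content words in common, yet the shared "the", "of", and "is" push the raw score to 0.73. After filtering, the score drops to 0, which matches the actual topical overlap.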

When NOT to remove stop words

| Task | Reason to keep stop words |
|---|---|
| Sentiment analysis | “not bad” vs “bad”: “not” is critical |
| Question answering | “who”, “what”, “where” define the question type |
| Named entity recognition | Context tokens help boundary detection |
| Language modelling | All tokens are needed for fluency |
| Transformer fine-tuning | The model was pretrained on full text, and self-attention learns how much weight each token deserves |
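The sentiment row can be shown in a few lines. In this sketch, `STOP` is a small illustrative subset (NLTK's English list does include "not"), and `KEEP` is a hypothetical allow-list of negations that is subtracted from the stop list before filtering.

```python
# Illustrative stop list (an assumption); NLTK's English list also contains "not".
STOP = {"the", "was", "is", "a", "not", "at", "all"}
# Hypothetical keep-list of negations to protect from removal.
KEEP = {"not", "no", "never"}

def remove_stop(tokens, stop):
    """Drop tokens whose lowercase form is in the given stop set."""
    return [t for t in tokens if t.lower() not in stop]

review = "the movie was not bad at all".split()

naive = remove_stop(review, STOP)           # negation is lost
careful = remove_stop(review, STOP - KEEP)  # negation survives

print(naive)    # ['movie', 'bad']
print(careful)  # ['movie', 'not', 'bad']
```

With the naive filter, a mildly positive review collapses to "movie bad", which flips its polarity for any bag-of-words classifier.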

Custom stop lists: domain-specific corpora need custom stop words. In a legal corpus, “court”, “case”, and “law” may appear in nearly every document, so they behave as stop words and carry little discriminative signal, even though general-purpose lists typically omit them.
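One common way to find such domain stop words is to flag terms with very high document frequency. The sketch below assumes pre-tokenized, lowercased documents. The `high_df_terms` helper, its `min_fraction` threshold of 0.8, and the legal snippets are all hypothetical, and the threshold would need tuning on a real corpus.

```python
from collections import Counter

def high_df_terms(docs, min_fraction=0.8):
    """Return terms that occur in at least `min_fraction` of the documents."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    cutoff = min_fraction * len(docs)
    return {term for term, count in df.items() if count >= cutoff}

# Made-up legal snippets, already lowercased and tokenized.
docs = [
    "court granted motion case dismissed".split(),
    "court denied appeal case remanded".split(),
    "court heard case patent infringement".split(),
    "court reviewed case contract breach".split(),
]

print(sorted(high_df_terms(docs)))  # ['case', 'court']
```

The resulting set can be merged with a general-purpose list, for example with a set union, before filtering tokens.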
