NLP & LLMs | Easy | Asked at Amazon | Asked at Microsoft

What are stop words and when should you remove them?

The short answer

Stop words are high-frequency function words — 'the', 'is', 'at', 'which' — that typically carry little discriminative content for tasks like classification or retrieval. Removing them reduces vocabulary size and noise, but for tasks like sentiment analysis or question answering, some function words can be semantically important and should be kept.

How to think about it

Stop words are words that appear in nearly every document, so their inverse document frequency is close to zero and they add almost nothing to TF-IDF features. In raw bag-of-words counts, where no such down-weighting applies, they dilute the signal instead. Classic examples: “a”, “the”, “and”, “to”, “in”, “is”.
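
To make the near-zero weight concrete, here is a minimal sketch that computes the unsmoothed textbook IDF, log(N / df), over a three-document toy corpus. The corpus and the helper name inverse_document_frequency are made up for illustration; libraries such as scikit-learn use a smoothed variant by default, so their numbers will differ.

```python
import math
from collections import Counter

# Toy corpus (hypothetical); each document is already lowercased.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang in the tree",
]

def inverse_document_frequency(documents):
    """Return {term: log(N / df)} using the unsmoothed textbook IDF."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc.split()))  # count each term once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

idf = inverse_document_frequency(docs)
for term in ("the", "cat", "bird"):
    print(f"{term}: {idf[term]:.3f}")
# the: 0.000
# cat: 0.405
# bird: 1.099
```

Because “the” occurs in every document, its IDF is exactly zero here, so its TF-IDF weight vanishes no matter how often it appears within a document.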

Why remove them

Reduces feature dimensionality, speeding up training and inference.
Prevents high-frequency noise from dominating cosine similarity scores.
Makes remaining features more discriminative.

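import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-time downloads: the stop word lists and the tokenizer data.
nltk.download("stopwords", quiet=True)
nltk.download("punkt", quiet=True)      # tokenizer data used by older NLTK releases
nltk.download("punkt_tab", quiet=True)  # tokenizer data used by newer NLTK releases

text = "the quick brown fox jumps over the lazy dog"
stop = set(stopwords.words("english"))  # a set gives O(1) membership tests

tokens = word_tokenize(text)
# Lowercase before the lookup because the stop list is all lowercase.
filtered = [t for t in tokens if t.lower() not in stop]
print(filtered)
# ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']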
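
The point about cosine similarity can be seen with raw term counts. The sketch below uses two made-up sentences on unrelated topics and a tiny illustrative stop list (not NLTK's); the helper names cosine and content_tokens are assumptions for this example.

```python
import math
from collections import Counter

STOP = {"the", "is", "on", "in", "a"}  # tiny illustrative stop list, not NLTK's

def cosine(a, b):
    """Cosine similarity between two token lists using raw term counts."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def content_tokens(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [t for t in text.lower().split() if t not in STOP]

doc_a = "the cat is on the mat in the hall"
doc_b = "the stock is on the rise in the market"

raw = cosine(doc_a.split(), doc_b.split())
clean = cosine(content_tokens(doc_a), content_tokens(doc_b))
print(f"with stop words:    {raw:.2f}")
print(f"without stop words: {clean:.2f}")
# with stop words:    0.80
# without stop words: 0.00
```

The two sentences share no content words, yet the shared function words alone push the raw similarity to 0.80.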

When NOT to remove stop words

Task	Reason to keep stop words
Sentiment analysis	“not bad” vs “bad”: “not” flips the polarity
Question answering	“who”, “what”, “where” define the question type
Named entity recognition	Context tokens help boundary detection, and some names contain function words (“Bank of America”)
Language modelling	All tokens are needed for fluency
Transformer fine-tuning	Pretrained models expect natural text, and self-attention learns to down-weight uninformative tokens itself
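
The sentiment row is easy to reproduce. This sketch assumes a small hypothetical stop list that includes negations (NLTK's English list, for instance, contains “not”) and a helper called remove_stop_words written for this example; it compares naive filtering with a version that keeps negations.

```python
# Hypothetical stop list; several standard lists also include negations.
STOP = {"the", "is", "was", "a", "not", "no"}
NEGATIONS = {"not", "no", "never"}

def remove_stop_words(text, stop_words):
    """Lowercase, split on whitespace, and drop anything in stop_words."""
    return [t for t in text.lower().split() if t not in stop_words]

reviews = ["the film was not bad", "the film was bad"]

naive = [remove_stop_words(r, STOP) for r in reviews]
safe = [remove_stop_words(r, STOP - NEGATIONS) for r in reviews]

print(naive)  # [['film', 'bad'], ['film', 'bad']]
print(safe)   # [['film', 'not', 'bad'], ['film', 'bad']]
```

With the naive list, both reviews collapse to the same tokens, so no downstream classifier can tell them apart. Subtracting the negations from the stop list is one simple way to keep the distinction.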

Custom stop lists: domain-specific corpora often need their own stop words. In a legal corpus, “court”, “case”, and “law” may appear in almost every document, so they behave like stop words and carry little discriminative signal even though they are content words in general English.
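
One way to derive such a list is a document-frequency cutoff. The sketch below flags any term that appears in more than a chosen share of documents; the function name corpus_stop_words, the 0.9 threshold, and the toy legal corpus are all assumptions for illustration. scikit-learn's vectorizers offer a comparable max_df parameter.

```python
from collections import Counter

def corpus_stop_words(documents, max_doc_ratio=0.9):
    """Return terms that appear in more than max_doc_ratio of the documents."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc.lower().split()))  # once per document
    cutoff = max_doc_ratio * len(documents)
    return {term for term, df in doc_freq.items() if df > cutoff}

# Toy corpus (hypothetical), standing in for a real legal collection.
legal_docs = [
    "court dismissed case on appeal",
    "court heard case about patent",
    "court reviewed case about contract",
    "court adjourned case until monday",
]

print(sorted(corpus_stop_words(legal_docs)))
# ['case', 'court']
```

The threshold is a tuning knob: set it too low and informative terms get dropped, so it is worth inspecting the resulting list before using it.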
