What are stop words and when should you remove them?
Stop words are high-frequency function words such as 'the', 'is', 'at', and 'which' that typically carry little discriminative content for tasks like classification or retrieval. Removing them reduces vocabulary size and noise, but in sentiment analysis or question answering some function words are semantically important and should be kept.
How to think about it
Stop words are words that appear in nearly every document, so their inverse document frequency (and therefore their TF-IDF weight) is close to zero, while in unweighted bag-of-words features their raw counts dilute the signal. Classic examples: “a”, “the”, “and”, “to”, “in”, “is”.
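The near-zero weight can be checked by hand. The sketch below computes an unsmoothed IDF, log(N / df), over a made-up four-document corpus; the corpus and the `idf` helper are illustrative assumptions, not part of any library.

```python
import math

# Toy corpus (made up for illustration): each document is a list of tokens.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the stock market fell sharply".split(),
    "the market for cat food is growing".split(),
]

def idf(term, docs):
    """Unsmoothed inverse document frequency: log(N / df)."""
    df = sum(1 for doc in docs if term in doc)  # number of docs containing term
    return math.log(len(docs) / df)

for term in ["the", "cat", "market", "stock"]:
    print(f"{term:>6}: idf = {idf(term, docs):.3f}")
#    the: idf = 0.000
#    cat: idf = 0.288
# market: idf = 0.693
#  stock: idf = 1.386
```

Because “the” occurs in every document, its IDF is exactly zero here. Libraries often apply smoothing to this formula, in which case the value is small rather than exactly zero.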
Why remove them
- Reduces feature dimensionality, speeding up training and inference.
- Prevents high-frequency noise from dominating cosine similarity scores.
- Makes remaining features more discriminative.
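```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Fetch the stop word list and tokenizer data (no-op if already present).
nltk.download("stopwords", quiet=True)
nltk.download("punkt", quiet=True)
# Newer NLTK releases load the tokenizer from "punkt_tab" instead of "punkt".
nltk.download("punkt_tab", quiet=True)

text = "the quick brown fox jumps over the lazy dog"
stop = set(stopwords.words("english"))  # a set gives fast membership tests

tokens = word_tokenize(text)
# Lowercase before the lookup because the NLTK list is all lowercase.
filtered = [t for t in tokens if t.lower() not in stop]
print(filtered)
# ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```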
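To make the cosine similarity point concrete, here is a minimal sketch that compares two unrelated sentences on raw term counts, with and without stop words. The `STOP` set is a small hand-written list for illustration (not NLTK's), and `cosine` and `strip_stop` are hypothetical helpers.

```python
import math
from collections import Counter

# Small hand-written stop list, for illustration only (not NLTK's full list).
STOP = {"the", "is", "in", "a", "of", "and", "to", "on"}

def cosine(a_tokens, b_tokens):
    """Cosine similarity between raw term-count vectors."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def strip_stop(tokens):
    """Drop every token that appears in the stop list."""
    return [t for t in tokens if t not in STOP]

# Two sentences on unrelated topics that share only function words.
doc_a = "the cat is on the mat in the house".split()
doc_b = "the price of the stock is in the news".split()

sim_raw = cosine(doc_a, doc_b)
sim_filtered = cosine(strip_stop(doc_a), strip_stop(doc_b))
print(f"with stop words:    {sim_raw:.2f}")       # 0.73
print(f"without stop words: {sim_filtered:.2f}")  # 0.00
```

The two sentences have no content words in common, yet the repeated “the” alone pushes their raw-count similarity to 0.73. After filtering, the score drops to zero, which matches the intuition that they are about different things.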
When NOT to remove stop words
| Task | Reason to keep stop words |
|---|---|
| Sentiment analysis | “not bad” vs “bad”: “not” is critical |
| Question answering | “who”, “what”, “where” define question type |
| Named entity recognition | Context tokens help boundary detection |
| Language modelling | All tokens needed for fluency |
| Transformer fine-tuning | Pretrained models expect natural, unfiltered text; self-attention learns which tokens to down-weight |
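The sentiment row is easy to reproduce. In the sketch below, `STOP` is a small hand-written list that, like NLTK's English list, contains “not”; `NEGATIONS` and `remove_stop_words` are hypothetical names chosen for illustration. One possible mitigation, shown here, is to subtract negation words from the stop list before filtering.

```python
# Small hand-written stop list for illustration; like many general-purpose
# lists (including NLTK's English list), it contains the negation "not".
STOP = {"the", "is", "was", "a", "this", "not", "no", "at", "all"}

# Assumed set of negation words to protect in sentiment tasks.
NEGATIONS = {"not", "no", "never", "nor"}

def remove_stop_words(text, stop_words):
    """Lowercase, split on whitespace, and drop tokens in stop_words."""
    return [t for t in text.lower().split() if t not in stop_words]

positive = "this film was not bad at all"
negative = "this film was bad"

# Naive filtering: both reviews collapse to the same tokens.
naive_pos = remove_stop_words(positive, STOP)
naive_neg = remove_stop_words(negative, STOP)
print(naive_pos, naive_neg)  # ['film', 'bad'] ['film', 'bad']

# Negation-aware filtering: keep negation words by removing them from the list.
safe_stop = STOP - NEGATIONS
safe_pos = remove_stop_words(positive, safe_stop)
safe_neg = remove_stop_words(negative, safe_stop)
print(safe_pos, safe_neg)    # ['film', 'not', 'bad'] ['film', 'bad']
```

With the naive list, a classifier sees identical features for a positive and a negative review. Keeping the negations preserves the distinction at the cost of a slightly larger vocabulary.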
Custom stop lists: domain-specific corpora often need custom stop words. In a legal corpus, “court”, “case”, and “law” may appear in nearly every document, so they behave like stop words and carry little discriminative signal.
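One way to derive such a list is from document frequency: treat any term that occurs in at least a chosen fraction of documents as a stop word. The sketch below assumes a toy legal corpus; `corpus_stop_words` and its `min_doc_fraction` threshold are hypothetical, and the 0.8 default is an arbitrary starting point that would need tuning on real data.

```python
from collections import Counter

def corpus_stop_words(docs, min_doc_fraction=0.8):
    """Return terms that occur in at least min_doc_fraction of the documents."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return {t for t, df in doc_freq.items() if df / n_docs >= min_doc_fraction}

# Toy legal corpus, made up for illustration.
legal_docs = [
    "the court dismissed the case for lack of standing".split(),
    "the court heard the case on appeal".split(),
    "the case was remanded to the lower court".split(),
    "the court ruled that the contract was void in this case".split(),
]

custom = corpus_stop_words(legal_docs)
print(sorted(custom))
# ['case', 'court', 'the']
```

The result can be merged with a general-purpose list before filtering. scikit-learn's text vectorizers expose a similar idea through their `max_df` parameter, which ignores terms above a document-frequency threshold.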