What are the main approaches for converting raw text into features for a machine learning model?
Bag-of-words (CountVectorizer) and TF-IDF represent each document as a sparse vector with one entry per vocabulary term (raw counts for bag-of-words, frequency-weighted scores for TF-IDF) and are fast baselines for linear models. Static word embeddings (Word2Vec, GloVe) map each word to a dense, semantically meaningful vector; averaging those vectors gives a document representation. Contextual embeddings from transformer encoders (BERT, sentence-transformers) take word order and context into account and often outperform bag-of-words, at higher compute cost.
How to think about it
The right approach depends on the task complexity, dataset size, and latency budget. Start simple; move to heavier representations only if the simpler ones are insufficient.
Bag-of-words and TF-IDF
CountVectorizer creates one column per vocabulary token and fills in the raw count of that token in each document. This representation ignores word order and semantics, but it works well for short-text classification with linear models.
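To make the counting concrete, here is a minimal standard-library sketch of what a bag-of-words matrix contains. The helper name `bag_of_words` and the whitespace tokenizer are assumptions for illustration; CountVectorizer's own tokenization differs (for example, its default pattern drops single-character tokens) and it returns a sparse matrix rather than nested lists.

```python
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count_matrix) for a list of documents."""
    # Naive tokenization (an assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    # One column per distinct token, in sorted order.
    vocab = sorted({tok for toks in tokenized for tok in toks})
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)  # missing tokens count as 0
        matrix.append([counts[tok] for tok in vocab])
    return vocab, matrix


docs = ["the cat sat", "the dog sat on the cat"]
vocab, matrix = bag_of_words(docs)
print(vocab)   # ['cat', 'dog', 'on', 'sat', 'the']
print(matrix)  # [[1, 0, 0, 1, 1], [1, 1, 1, 1, 2]]
```

Shuffling the words inside a document leaves its row unchanged, which is exactly the loss of word order described above.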
TF-IDF (Term Frequency-Inverse Document Frequency) down-weights tokens that appear in nearly every document (common words) and up-weights tokens that are rare and therefore more discriminative.
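As a rough illustration of the weighting, the sketch below computes the smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, which is the formula scikit-learn documents for TfidfVectorizer with its default `smooth_idf=True`. The function name `smoothed_idf`, the whitespace tokenizer, and the toy corpus are assumptions; the real vectorizer also multiplies by term frequency and L2-normalizes each row.

```python
import math


def smoothed_idf(docs):
    """Map each token to ln((1 + n) / (1 + df)) + 1 over the given documents."""
    # Each document becomes a set, since document frequency ignores repeats.
    tokenized = [set(doc.lower().split()) for doc in docs]
    n_docs = len(tokenized)
    vocab = sorted(set().union(*tokenized))
    idf = {}
    for tok in vocab:
        df = sum(tok in toks for toks in tokenized)  # documents containing tok
        idf[tok] = math.log((1 + n_docs) / (1 + df)) + 1
    return idf


docs = ["the cat sat", "the dog barked", "the bird sang"]
idf = smoothed_idf(docs)
print(idf["the"])            # 1.0 (appears in every document)
print(round(idf["cat"], 3))  # 1.693 (appears in one document)
```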
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
# texts_train: list of str; labels_train: matching class labels (assumed to be loaded already)
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=50_000, ngram_range=(1, 2),
                              sublinear_tf=True, min_df=3)),
    ("clf", LogisticRegression(max_iter=500)),
])
pipe.fit(texts_train, labels_train)
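ngram_range=(1, 2) adds bigrams, capturing short phrases. sublinear_tf=True replaces the raw count with 1 + log(count), reducing the influence of very frequent tokens. min_df=3 drops terms that appear in fewer than three documents, and max_features=50_000 keeps only the 50,000 most frequent terms.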
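The two transformations can be sketched in a few lines of plain Python. The helper names `word_ngrams` and `dampened_tf` are hypothetical and only mirror the behavior described above; they are not scikit-learn functions.

```python
import math


def word_ngrams(tokens, n_min=1, n_max=2):
    """All contiguous n-grams with n_min <= n <= n_max, joined by spaces."""
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out


def dampened_tf(count):
    """Sublinear term frequency: 1 + ln(count) for count > 0, else 0."""
    return 1 + math.log(count) if count > 0 else 0.0


print(word_ngrams("not very good".split()))
# ['not', 'very', 'good', 'not very', 'very good']
print(dampened_tf(1), round(dampened_tf(100), 2))
# 1.0 5.61
```

A token that occurs 100 times ends up with a weight under six times that of a token occurring once, and the bigram "not very" becomes its own feature.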
Static word embeddings
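Word2Vec, GloVe, and FastText map each word to a dense vector (typically 100–300 dimensions) that captures semantic similarity. A simple document representation is the mean of its word vectors. The result is far lower-dimensional than TF-IDF on a large vocabulary, and it generalizes to words that never appeared in the labeled training data (for example, synonyms of training words), as long as the pretrained embeddings cover them.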
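import numpy as np
# Assumes a pre-loaded GloVe dict: {word: np.ndarray of shape (dim,)}
def embed(text, glove, dim=300):
    # Average the vectors of in-vocabulary tokens; out-of-vocabulary tokens are skipped.
    vectors = [glove[w] for w in text.lower().split() if w in glove]
    # Fall back to a zero vector when no token is in the vocabulary.
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)
X_train_emb = np.array([embed(t, glove) for t in texts_train])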
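To see why averaged vectors generalize across synonyms, the following standard-library sketch compares documents by cosine similarity. The 3-dimensional `toy_vectors` are invented for illustration and are not real GloVe values; `mean_vector` and `cosine` are hypothetical helper names.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
toy_vectors = {
    "good":  [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.0],
    "bad":   [-0.8, 0.1, 0.1],
    "movie": [0.0, 0.1, 0.9],
    "film":  [0.1, 0.0, 0.9],
}


def mean_vector(text, vectors, dim=3):
    """Average the vectors of known tokens; zero vector if none are known."""
    found = [vectors[w] for w in text.lower().split() if w in vectors]
    if not found:
        return [0.0] * dim
    return [sum(col) / len(found) for col in zip(*found)]


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


a = mean_vector("good movie", toy_vectors)
b = mean_vector("great film", toy_vectors)
c = mean_vector("bad movie", toy_vectors)
print(cosine(a, b) > cosine(a, c))  # True
```

"good movie" and "great film" share no tokens, so their TF-IDF vectors would be orthogonal, yet their averaged embeddings are nearly identical in this toy setup.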
Sentence and contextual embeddings
sentence-transformers produces a single fixed-length embedding per sentence using a fine-tuned transformer encoder (BERT, or a smaller distilled model such as the MiniLM used below). These embeddings capture word order and context (e.g., “bank” near “river” vs. “bank” near “loan”).
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
X_train_emb = model.encode(texts_train, batch_size=64, show_progress_bar=True)
Use these embeddings as features for any downstream classifier.
Comparison
| Approach | Dimensionality | Captures semantics | Speed | Out-of-vocabulary (OOV) words |
|---|---|---|---|---|
| TF-IDF | High (sparse) | No | Fast | Dropped |
| Avg word vectors | Low-mid (dense) | Partial | Fast | Dropped (Word2Vec, GloVe); subword n-grams (FastText) |
| Transformer embeddings | Low-mid (dense) | Yes | Slow | Handled by subword tokenization |
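The subword entry for FastText refers to character n-grams: a word is wrapped in boundary markers and split into overlapping pieces, so an unseen word can still be represented through pieces it shares with known words. The sketch below only shows the n-gram extraction; `char_ngrams` is a hypothetical helper, and the default range of 3 to 6 follows the FastText paper.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of '<word>' for n_min <= n <= n_max."""
    marked = f"<{word}>"  # '<' and '>' mark the word boundaries
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams


print(char_ngrams("where", 3, 3))
# ['<wh', 'whe', 'her', 'ere', 're>']

# An unseen word still overlaps with a known one through shared pieces.
shared = set(char_ngrams("where", 3, 3)) & set(char_ngrams("wherever", 3, 3))
print(sorted(shared))
# ['<wh', 'ere', 'her', 'whe']
```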
Preprocessing before any of these
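Lowercase, remove HTML/special characters, and optionally lemmatize. For TF-IDF, also remove stop words (stop_words="english"). Do not remove stop words for transformer models, which were trained on full sentences; lemmatizing is likewise unnecessary for them, and their tokenizers apply whatever casing the model expects.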
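A minimal cleaning step might look like the sketch below, assuming the input is lightly marked-up text. The name `clean_text` and its `lowercase` flag are hypothetical, and the regex-based tag stripping is deliberately crude; messy real-world HTML is better handled by a proper HTML parser.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")  # crude match for HTML tags
WS_RE = re.compile(r"\s+")       # any run of whitespace


def clean_text(raw, lowercase=True):
    """Strip tags, decode HTML entities, optionally lowercase, tidy spaces."""
    text = TAG_RE.sub(" ", raw)  # strip tags before decoding entities
    text = html.unescape(text)   # e.g. '&amp;' -> '&'
    if lowercase:
        text = text.lower()
    return WS_RE.sub(" ", text).strip()


raw = "<p>Fish &amp; Chips<br>are  GREAT</p>"
print(clean_text(raw))                   # fish & chips are great
print(clean_text(raw, lowercase=False))  # Fish & Chips are GREAT
```

Passing `lowercase=False` keeps the original casing, which suits transformer models whose tokenizers handle casing on their own.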