datarekha
Machine Learning · Medium · Asked at Google, Amazon, Microsoft, Spotify

What are the main approaches for converting raw text into features for a machine learning model?

The short answer

Bag-of-words (CountVectorizer) and TF-IDF represent each document as a sparse vector of raw or weighted word counts and are fast baselines for linear models. Static word embeddings (Word2Vec, GloVe) map each word to a dense, semantically meaningful vector; averaging those vectors gives a document representation. Contextual embeddings from transformer encoders (BERT, sentence-transformers) capture sentence-level meaning and often outperform bag-of-words, at a higher compute cost.

How to think about it

The right approach depends on the task complexity, dataset size, and latency budget. Start simple; move to heavier representations only if the simpler ones are insufficient.

Bag-of-words and TF-IDF

CountVectorizer creates one column per vocabulary token and fills in that token's raw count for each document. It ignores word order and semantics but works well for short-text classification with linear models.

TF-IDF (Term Frequency-Inverse Document Frequency) down-weights tokens that appear in nearly every document (common words) and up-weights tokens that are rare and therefore more discriminative.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=50_000, ngram_range=(1, 2),
                              sublinear_tf=True, min_df=3)),
    ("clf",   LogisticRegression(max_iter=500)),
])
pipe.fit(texts_train, labels_train)

ngram_range=(1, 2) adds bigrams, capturing short phrases. sublinear_tf=True replaces raw count with 1 + log(count), reducing the influence of very frequent tokens.

Static word embeddings

Word2Vec, GloVe, and FastText map each word to a dense vector (typically 100–300 dimensions) that captures semantic similarity. A simple document representation is the mean of its word vectors. The result is far lower-dimensional than TF-IDF on a large vocabulary, and it generalizes to synonyms that never appear in the labeled training data, provided they are in the embedding vocabulary.

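import numpy as np

# Assumes a pre-loaded GloVe dict: {word: np.ndarray of shape (dim,)}
def embed(text, glove, dim=300):
    # Whitespace-tokenize and keep only in-vocabulary words (OOV words are dropped)
    vectors = [glove[w] for w in text.lower().split() if w in glove]
    # Mean-pool; fall back to a zero vector when no word is in the vocabulary
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

X_train_emb = np.array([embed(t, glove) for t in texts_train])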
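The snippet above assumes the GloVe dict already exists. One possible way to build it is sketched below, assuming the standard GloVe text format (one word per line followed by its space-separated vector components). The helper name parse_glove and the two-dimensional toy vectors are hypothetical, chosen only to keep the example self-contained.

```python
import numpy as np

def parse_glove(lines):
    """Build {word: vector} from GloVe-style lines: 'word v1 v2 ...'."""
    glove = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        glove[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return glove

def embed(text, glove, dim):
    """Mean of in-vocabulary word vectors; zero vector if none are found."""
    vectors = [glove[w] for w in text.lower().split() if w in glove]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

# Toy 2-d "embeddings" standing in for a real GloVe file
toy_lines = ["cat 1.0 0.0", "dog 0.0 1.0"]
glove = parse_glove(toy_lines)

doc_vec = embed("Cat dog unicorn", glove, dim=2)   # "unicorn" is OOV and skipped
empty_vec = embed("unicorn", glove, dim=2)         # no known words at all
print(doc_vec, empty_vec)
```

For a real embedding file, the same function can be given an open file handle instead of a list, since it only iterates over lines.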

Sentence and contextual embeddings

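The sentence-transformers library produces a single fixed-length embedding per sentence using a fine-tuned transformer encoder such as BERT. Because the underlying token representations are contextual, these embeddings reflect word order and context (e.g., “bank” near “river” vs. “bank” near “loan”).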

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
X_train_emb = model.encode(texts_train, batch_size=64, show_progress_bar=True)

Use these embeddings as features for any downstream classifier.
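As a rough sketch of that last step, the example below trains a logistic regression on an embedding matrix. To stay runnable without downloading a model, it uses synthetic 384-dimensional vectors (the output size of all-MiniLM-L6-v2) as a stand-in for the output of model.encode; the data and labels are placeholders, not real results.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in for model.encode(texts): two well-separated clusters of 384-d vectors
X_pos = rng.normal(loc=1.0, size=(20, 384))
X_neg = rng.normal(loc=-1.0, size=(20, 384))
X_emb = np.vstack([X_pos, X_neg])
y = np.array([1] * 20 + [0] * 20)

# Dense embeddings plug into a linear classifier like any other feature matrix
clf = LogisticRegression(max_iter=1000)
clf.fit(X_emb, y)
train_acc = clf.score(X_emb, y)
print(train_acc)
```

In practice X_emb would be the array returned by model.encode, and accuracy should be measured on held-out data rather than on the training set.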

Comparison

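| Approach               | Dimensionality  | Semantic | Speed | OOV handling       |
|------------------------|-----------------|----------|-------|--------------------|
| TF-IDF                 | High (sparse)   | No       | Fast  | Drop               |
| Avg word vectors       | Low-mid (dense) | Partial  | Fast  | Subword (FastText) |
| Transformer embeddings | Low-mid (dense) | Yes      | Slow  | Yes                |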

Preprocessing before any of these

Lowercase, remove HTML/special characters, and optionally lemmatize. TfidfVectorizer already lowercases by default (lowercase=True). For TF-IDF, also consider removing stop words (stop_words="english"), but do not remove them for transformer models, which were trained on full sentences.
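A minimal cleaning function along these lines might look as follows. It is a sketch aimed at the TF-IDF path and assumes English, ASCII-oriented text; the function name clean_for_tfidf and the regular expressions are illustrative choices, and lemmatization is left out.

```python
import html
import re

_TAG_RE = re.compile(r"<[^>]+>")            # HTML tags such as <p> or </div>
_NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")  # anything but lowercase letters, digits, whitespace
_WS_RE = re.compile(r"\s+")

def clean_for_tfidf(text):
    text = html.unescape(text)            # decode entities, e.g. "&amp;" becomes "&"
    text = _TAG_RE.sub(" ", text)         # strip HTML tags
    text = text.lower()
    text = _NON_ALNUM_RE.sub(" ", text)   # drop punctuation and special characters
    return _WS_RE.sub(" ", text).strip()  # collapse runs of whitespace

cleaned = clean_for_tfidf("<p>Great phone &amp; camera, 10/10!</p>")
print(cleaned)
```

For transformer models, a lighter version that only strips markup is usually the safer starting point, since aggressive normalization removes information the model can use.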
