Why do dense word embeddings outperform one-hot vectors?
One-hot vectors are high-dimensional, sparse, and treat all words as equidistant — they carry zero semantic information. Dense embeddings place similar words close together in a low-dimensional space, enabling models to generalize from seen words to unseen but related ones.
How to think about it
One-hot encoding represents each word as a vector of zeros with a single 1 at the word’s index. For a vocabulary of size V, every word is a V-dimensional sparse vector.
Problems:
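- Dimensionality: a 50,000-word vocabulary means every input is a 50,000-dimensional vector, so a dense first layer needs 50,000 weights per hidden unit, and memory and compute grow with vocabulary size.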
- No similarity: the dot product of any two distinct one-hot vectors is 0. “Cat” and “kitten” are as distant as “cat” and “galaxy”.
- No generalization: a model trained on “cat” learns nothing transferable to “kitten”.
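The similarity problem can be checked directly. What follows is a minimal sketch in plain Python, assuming a hypothetical four-word vocabulary; the helper names `one_hot`, `dot`, and `euclidean` are illustrative and do not come from any library.

```python
import math

# Hypothetical toy vocabulary; a real one would have tens of thousands of entries.
vocab = ["cat", "kitten", "dog", "galaxy"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a V-dimensional list with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

cat, kitten, galaxy = one_hot("cat"), one_hot("kitten"), one_hot("galaxy")

# Distinct one-hot vectors are orthogonal, so every pair scores 0.
print(dot(cat, kitten), dot(cat, galaxy))  # 0 0

# Every pair is also the same distance apart: sqrt(2).
print(euclidean(cat, kitten) == euclidean(cat, galaxy))  # True
```

Whatever pair of distinct words is chosen, the result is the same, which is why the representation gives a model nothing to generalize from.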
Dense embeddings (Word2Vec, GloVe, fastText) compress each word into a 50- to 300-dimensional real-valued vector learned from co-occurrence statistics in a large corpus:
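```python
import numpy as np
from gensim.models import Word2Vec

# Toy corpus for illustration only; meaningful vectors need far more text.
sentences = [["the", "cat", "sat"], ["the", "kitten", "slept"], ["a", "dog", "ran"]]

# Parameter names follow the gensim 4.x API.
model = Word2Vec(sentences, vector_size=10, window=2, min_count=1, epochs=200)

cat = model.wv["cat"]
kitten = model.wv["kitten"]
dog = model.wv["dog"]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# On a realistic corpus, cat/kitten should score higher than cat/dog;
# on this three-sentence corpus the values are dominated by noise.
print(cosine(cat, kitten))
print(cosine(cat, dog))
```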
Comparison summary
| Property | One-hot | Dense embedding |
|---|---|---|
| Dimensions | V (50k+) | 50-300 |
| Sparse | Yes | No |
| Semantic similarity | None | Encoded |
| Generalization | None | Strong |
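To put rough numbers on the Dimensions row, here is a back-of-the-envelope sketch. It assumes float32 storage (4 bytes per value), a 300-dimensional embedding, and a hypothetical batch of 1,000 tokens held as dense arrays. Sparse storage formats would shrink the one-hot figure considerably, so treat it as an upper bound, not a benchmark.

```python
V = 50_000            # vocabulary size from the table
d = 300               # embedding dimension (upper end of the 50-300 range)
n_tokens = 1_000      # hypothetical batch size, chosen for illustration
bytes_per_float = 4   # assumes float32

one_hot_bytes = n_tokens * V * bytes_per_float
dense_bytes = n_tokens * d * bytes_per_float

print(f"one-hot: {one_hot_bytes / 1e6:.1f} MB")        # one-hot: 200.0 MB
print(f"dense:   {dense_bytes / 1e6:.1f} MB")          # dense:   1.2 MB
print(f"ratio:   {one_hot_bytes / dense_bytes:.0f}x")  # ratio:   167x
```

The embedding lookup table itself still stores V × d values, but that is a fixed cost paid once, not per token.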
Practical impact: classifiers trained on pretrained embeddings typically need fewer labelled examples, because the embedding already encodes prior knowledge about word relationships. A model that has seen labelled “cat” examples can carry part of what it learned over to “kitten”, since the two vectors lie close together.
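A toy illustration of that transfer follows. The 3-dimensional dense vectors are hand-picked and purely hypothetical; they are not taken from Word2Vec, GloVe, or any trained model. A nearest-neighbour classifier (the `predict` helper, also illustrative) sees labels for "cat" and "galaxy" only, then is asked about "kitten".

```python
import math

def cosine(a, b):
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm

# Hypothetical dense vectors, hand-picked so that related words point the same way.
dense = {
    "cat":    [0.9, 0.8, 0.1],
    "kitten": [0.8, 0.9, 0.2],
    "galaxy": [0.1, 0.0, 0.9],
}
# One-hot vectors over the same three-word vocabulary.
one_hot = {
    "cat":    [1, 0, 0],
    "kitten": [0, 1, 0],
    "galaxy": [0, 0, 1],
}

train_labels = {"cat": "animal", "galaxy": "space"}  # "kitten" is never labelled

def similarities(vectors, word):
    """Cosine similarity between `word` and each labelled training word."""
    return {seen: cosine(vectors[word], vectors[seen]) for seen in train_labels}

def predict(vectors, word):
    """Label of the most similar training word, or None if all are tied."""
    sims = similarities(vectors, word)
    if len(set(sims.values())) == 1:
        return None  # the vectors give no evidence either way
    best = max(sims, key=sims.get)
    return train_labels[best]

print(predict(dense, "kitten"))    # animal
print(predict(one_hot, "kitten"))  # None
```

With the dense vectors, "kitten" inherits the label of its neighbour "cat"; with one-hot vectors both similarities are 0, so the classifier has nothing to go on.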