NLP & LLMs | Easy | Asked at Google, Meta, Amazon

Why do dense word embeddings outperform one-hot vectors?

For: Data Scientist, ML Engineer, AI / LLM Engineer

The short answer

One-hot vectors are high-dimensional, sparse, and treat all words as equidistant — they carry zero semantic information. Dense embeddings place similar words close together in a low-dimensional space, enabling models to generalize from seen words to unseen but related ones.

How to think about it

One-hot encoding represents each word as a vector of zeros with a single 1 at the word’s index. For a vocabulary of size V, every word is a V-dimensional sparse vector.
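To make the definition concrete, here is a minimal NumPy sketch of one-hot encoding. The four-word vocabulary and the helper name `one_hot` are hypothetical, chosen only for illustration.

```python
import numpy as np

# Hypothetical toy vocabulary; a real one would hold tens of thousands of words.
vocab = ["cat", "kitten", "dog", "galaxy"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a V-dimensional vector with a single 1 at the word's index."""
    vec = np.zeros(len(vocab), dtype=np.float32)
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("kitten"))  # [0. 1. 0. 0.]
```

The vector length equals the vocabulary size, so every word added to the vocabulary adds one more dimension to every vector.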

Problems:

Dimensionality: a 50,000-word vocabulary means a 50,000-dimensional input, so any layer that consumes it needs on the order of V weights per output unit, and memory and compute grow with the vocabulary.
No similarity: the dot product of any two distinct one-hot vectors is 0, so “cat” and “kitten” are exactly as far apart as “cat” and “galaxy”.
No generalization: because no two words share a feature, a model trained on “cat” learns nothing transferable to “kitten”.
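The "no similarity" point can be checked numerically. The sketch below assumes a hypothetical vocabulary of five words and uses the rows of an identity matrix as their one-hot vectors.

```python
import numpy as np

V = 5                     # hypothetical tiny vocabulary size
one_hots = np.eye(V)      # row i is the one-hot vector for word i

# Matrix of all pairwise dot products: the identity, so off-diagonal entries are 0.
dots = one_hots @ one_hots.T

# Euclidean distance between every pair of words.
diffs = one_hots[:, None, :] - one_hots[None, :, :]
dists = np.linalg.norm(diffs, axis=-1)

off_diagonal = ~np.eye(V, dtype=bool)   # mask selecting pairs of distinct words
print(np.all(dots[off_diagonal] == 0))               # True
print(np.allclose(dists[off_diagonal], np.sqrt(2)))  # True
```

Every pair of distinct words has dot product 0 and Euclidean distance sqrt(2), whatever the words mean, so the representation gives a model nothing to rank related words above unrelated ones.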

Dense embeddings (Word2Vec, GloVe, fastText) compress each word into a 50-300 dimensional real-valued vector learned from distributional co-occurrence:

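import numpy as np
from gensim.models import Word2Vec

# Toy corpus for illustrating the API only. It is far too small to learn
# meaningful similarities, so the printed values are essentially noise.
sentences = [["the","cat","sat"],["the","kitten","slept"],["a","dog","ran"]]
# A fixed seed with a single worker reduces run-to-run variation.
model = Word2Vec(sentences, vector_size=10, window=2, min_count=1, epochs=200, seed=0, workers=1)

# Look up the learned 10-dimensional vector for each word.
cat = model.wv["cat"]
kitten = model.wv["kitten"]
dog = model.wv["dog"]

def cosine(a, b):
    """Cosine similarity: dot product of the two vectors after length normalization."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# With a large training corpus, related pairs such as cat/kitten and cat/dog
# would be expected to score well above unrelated pairs.
print(cosine(cat, kitten))
print(cosine(cat, dog))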
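The two representations are closely linked: multiplying a one-hot vector by a weight matrix just selects one row of that matrix. That is why embedding layers are usually implemented as table lookups. The sketch below uses a random, untrained matrix `E` as a stand-in for learned embeddings; the sizes and names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 3                   # hypothetical vocabulary size and embedding width
E = rng.normal(size=(V, d))   # stand-in for a learned embedding matrix (random here)

idx = 4                       # index of some word in the vocabulary
x = np.zeros(V)
x[idx] = 1.0                  # that word's one-hot vector

via_matmul = x @ E            # full multiply, almost entirely by zeros
via_lookup = E[idx]           # direct row read, same result

print(np.allclose(via_matmul, via_lookup))  # True
```

Seen this way, a dense embedding is the first-layer weight row for a word, stored and reused directly instead of being recomputed from a V-dimensional input.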

Comparison summary

Property	One-hot	Dense embedding
Dimensions	V (50k+)	50-300
Sparse	Yes	No
Semantic similarity	None	Encoded
Generalization	None	Strong

Practical impact: classifiers trained on pretrained embeddings typically need fewer labelled examples, because the embedding already encodes prior knowledge about word relationships. What a model learns from “cat” examples carries over in part to “kitten” examples, since the two vectors lie close together.
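A small sketch of that transfer effect uses a nearest-neighbour classifier over cosine similarity. The 2-D dense vectors are hand-picked for illustration and not learned from data; the labels and helper names are likewise hypothetical.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-picked 2-D vectors for illustration only; real embeddings are learned from text.
dense = {
    "cat":    np.array([0.9, 0.1]),
    "kitten": np.array([0.8, 0.2]),
    "galaxy": np.array([0.1, 0.9]),
}
vocab = list(dense)
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

train = {"cat": "animal", "galaxy": "space"}  # labelled words; "kitten" is unseen

def similarities(vectors, query):
    """Cosine similarity between the query word and each labelled training word."""
    return {w: cosine(vectors[query], vectors[w]) for w in train}

def predict(vectors, query):
    """Return the label of the most similar labelled word."""
    sims = similarities(vectors, query)
    return train[max(sims, key=sims.get)]

print(similarities(one_hot, "kitten"))  # both similarities are 0.0: no signal
print(predict(dense, "kitten"))         # animal
```

With one-hot vectors the unseen word ties against every labelled word, so any prediction would be arbitrary. With the dense vectors, “kitten” sits next to “cat” and inherits its label.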
