What is the key difference between Word2Vec embeddings and BERT's contextual embeddings?
Word2Vec assigns a single static vector to each word type (vocabulary entry) regardless of surrounding words, so a polysemous word like 'bank' gets the same representation in every sentence. BERT computes a separate vector for each token occurrence as a function of the whole input sequence, so 'bank' in a sentence about money ends up with a different vector than 'bank' in a sentence about a river.
How to think about it
Static embeddings (Word2Vec, GloVe)
Training finds one embedding per vocabulary entry by optimizing a prediction objective over a large corpus. At inference, you look up the word ID in an embedding table — fast and simple, but the vector is context-free.
from gensim.models import Word2Vec
model = Word2Vec.load("word2vec.model")
# Both sentences map to the SAME vector for "bank"
v1 = model.wv["bank"] # "river bank"
v2 = model.wv["bank"] # "savings bank"
print((v1 == v2).all()) # True — identical
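To make the "prediction objective" concrete, here is a minimal sketch of how skip-gram style training pairs could be generated. The helper `skipgram_pairs` and the two example sentences are hypothetical illustrations, not part of gensim, and real training adds pieces such as negative sampling that are left out here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for words within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

# Hypothetical toy corpus: two senses of "bank"
river = skipgram_pairs("we sat on the river bank".split(), window=1)
money = skipgram_pairs("the bank approved the loan".split(), window=1)

# Context words that the single "bank" vector is trained to predict
bank_contexts = [ctx for center, ctx in river + money if center == "bank"]
print(bank_contexts)  # ['river', 'the', 'approved']
```

Because both senses feed the same vocabulary entry, the one learned vector for "bank" has to account for river contexts and financial contexts at once, which is why it cannot separate the two meanings.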
Contextual embeddings (BERT)
BERT runs a full Transformer encoder over the input sequence. In every layer, self-attention lets each token attend to all tokens in the sequence, so the output vector for token position i is a function of the entire input, not just the word type.
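import torch
import torch.nn.functional as F
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()  # disable dropout so outputs are deterministic

def embed(sentence, word="bank"):
    """Return the final-layer vector for `word` as it occurs in `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # Compare the target word's own vector, not [CLS]: [CLS] summarizes the
    # whole sentence and would differ between any two distinct sentences.
    # Assumes `word` is a single WordPiece token in the vocabulary.
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return out.last_hidden_state[0, tokens.index(word)]

e1 = embed("I deposited money at the bank")
e2 = embed("We picnicked on the river bank")
print(F.cosine_similarity(e1.unsqueeze(0), e2.unsqueeze(0)).item())
# Below 1.0: the same word gets a different vector in each context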
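The way attention makes a token's output depend on its neighbors can be sketched without any model download. The following is a toy, single-layer version in plain Python, under loud assumptions: the 2-d vectors in `STATIC` are made-up numbers, and the query, key, and value projections are taken to be the identity, which real BERT layers do not do.

```python
import math

# Hypothetical 2-d static input vectors (made-up values for illustration)
STATIC = {
    "river": [1.0, 0.0],
    "money": [0.0, 1.0],
    "bank": [0.5, 0.5],
}

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """One scaled dot-product attention layer with identity Q/K/V projections."""
    x = [STATIC[t] for t in tokens]
    d = len(x[0])
    out = []
    for q in x:
        # Score this token against every token in the sequence, itself included
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        # Output is a weighted mix of all input vectors
        out.append([sum(w * v[i] for w, v in zip(weights, x)) for i in range(d)])
    return out

bank_in_river = self_attention(["river", "bank"])[1]
bank_in_money = self_attention(["money", "bank"])[1]
print(bank_in_river, bank_in_money)
```

The input vector for "bank" is identical in both calls, yet the outputs differ because each one mixes in the neighboring token: with these toy numbers the results are `[0.75, 0.25]` and `[0.25, 0.75]`.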
Comparison
| Aspect | Word2Vec | BERT |
|---|---|---|
| Vector per word | One (static) | One per occurrence (dynamic) |
| Polysemy handling | None (all senses share one vector) | Implicit (each occurrence's vector reflects its context) |
| Compute at inference | O(1) table lookup | Full forward pass |
| Training objective | Skip-gram or CBOW (predict context words, or the center word from its context) | Masked LM + next sentence prediction (NSP) |
| Typical downstream use | Fixed feature input | Fine-tune end-to-end (or use as a frozen feature extractor) |
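For the masked LM objective in the table, a rough sketch of the input corruption step may help. It follows the proportions described for BERT pretraining (about 15% of positions selected; of those, 80% replaced with `[MASK]`, 10% replaced with a random token, 10% left unchanged). The function `mask_tokens`, the toy vocabulary, and the raised `mask_prob` in the demo are assumptions for illustration; the sketch works on whole words rather than WordPiece IDs and omits NSP entirely.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels); labels are None where nothing is predicted."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover the original token here
            roll = rng.random()
            if roll < 0.8:
                masked.append("[MASK]")
            elif roll < 0.9:
                masked.append(rng.choice(vocab))  # random replacement
            else:
                masked.append(tok)  # kept as-is but still predicted
        else:
            labels.append(None)  # position does not contribute to the loss
            masked.append(tok)
    return masked, labels

tokens = "i deposited money at the bank".split()
vocab = ["river", "loan", "tree", "bank", "money"]  # hypothetical toy vocabulary
# mask_prob is raised well above 0.15 so this short demo selects some positions
masked, labels = mask_tokens(tokens, vocab, mask_prob=0.5, seed=0)
print(masked)
print(labels)
```

Since the model has to predict each selected token from the words on both sides of it, this objective pushes the encoder to produce representations that depend on the full context.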