datarekha
NLP & LLMs · Medium · Asked at Google, Meta, OpenAI, Amazon

What is the key difference between Word2Vec embeddings and BERT's contextual embeddings?

The short answer

Word2Vec assigns a single static vector to each word type in its vocabulary, regardless of the surrounding words, so a polysemous word like 'bank' always gets the same representation. BERT produces a different vector for each occurrence of a token, computed from the whole input sequence, which lets it separate word senses by context.

How to think about it

Static embeddings (Word2Vec, GloVe)

Training finds one embedding per vocabulary entry by optimizing a prediction objective over a large corpus. At inference, you look up the word ID in an embedding table. This is fast and simple, but the vector is context-free.

from gensim.models import Word2Vec

model = Word2Vec.load("word2vec.model")
# Both sentences map to the SAME vector for "bank"
v1 = model.wv["bank"]  # "river bank"
v2 = model.wv["bank"]  # "savings bank"
print((v1 == v2).all())  # True: the lookup ignores the surrounding words
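
The gensim snippet needs a trained model file on disk. As a rough, dependency-free sketch of the same idea, the toy table below stands in for a trained embedding matrix. The vocabulary, the vector values, and the `lookup` helper are all made up for illustration and do not come from any real model.

```python
# Toy static embedding table: one fixed vector per vocabulary entry.
# The vocabulary and the numbers are hypothetical, chosen only for illustration.
vocab = {"river": 0, "savings": 1, "bank": 2}
table = [
    [0.9, 0.1, 0.0],  # river
    [0.0, 0.2, 0.9],  # savings
    [0.4, 0.5, 0.3],  # bank
]

def lookup(sentence):
    """Map each known word to its row in the table; context is never consulted."""
    return [table[vocab[word]] for word in sentence.split()]

river_ctx = lookup("river bank")
money_ctx = lookup("savings bank")

# "bank" is the last word in both phrases and receives the identical row.
print(river_ctx[-1] == money_ctx[-1])  # True
```

Because `lookup` only indexes by word ID, nothing about the neighbouring words can influence the result, which is exactly the limitation described above.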

Contextual embeddings (BERT)

BERT runs a full Transformer encoder over the input sequence. In every layer, each token attends to all tokens in the sequence through self-attention. The output vector at token position i is therefore a function of the entire input, not just of the word type at that position.

import torch
import torch.nn.functional as F
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def embed_word(sentence, word="bank"):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # Find the position of the target word among the WordPiece tokens
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    idx = tokens.index(word)
    # Final-layer hidden state at that position (not the [CLS] vector)
    return out.last_hidden_state[0, idx]

e1 = embed_word("I deposited money at the bank")
e2 = embed_word("We picnicked on the river bank")
print(F.cosine_similarity(e1, e2, dim=0).item())
# < 1.0: the same word "bank" gets a different vector in each context
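
To see why the vector at one position depends on the rest of the sequence, here is a minimal single-head self-attention sketch in plain Python. It is a deliberate simplification and not BERT's actual implementation: the query, key, and value projections are replaced by the identity, there are no positional embeddings, multiple heads, or stacked layers, and the input vectors in `static` are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Single-head self-attention with identity Q/K/V projections (a simplification)."""
    d = len(vectors[0])
    out = []
    for q in vectors:
        # Scaled dot-product score between this token and every token in the sequence
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors]
        weights = softmax(scores)
        # Output is the attention-weighted average of all token vectors
        out.append([sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(d)])
    return out

# Hypothetical context-free input vectors, one per word type
static = {"river": [1.0, 0.0], "savings": [0.0, 1.0], "bank": [0.5, 0.5]}

# Same input vector for "bank", but a different neighbour in each sequence
bank_after_river = self_attention([static["river"], static["bank"]])[1]
bank_after_savings = self_attention([static["savings"], static["bank"]])[1]

print(bank_after_river != bank_after_savings)  # True
```

The input vector for "bank" is identical in both calls. Only the neighbouring token changes, and that alone shifts the output at the "bank" position, which is the mechanism behind contextual embeddings.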

Comparison

| Aspect | Word2Vec | BERT |
| --- | --- | --- |
| Vector per word | One (static) | One per occurrence (dynamic) |
| Polysemy handling | None | Full |
| Compute at inference | O(1) table lookup | Full forward pass |
| Training objective | Predict context words | Masked LM + NSP |
| Downstream tasks | Feature input | Fine-tune end-to-end |