NLP & LLMs | Medium | Asked at Google, Meta, OpenAI, Amazon

What is the key difference between Word2Vec embeddings and BERT's contextual embeddings?

For Data Scientist, ML Engineer, AI / LLM Engineer

The short answer

Word2Vec assigns a single static vector to each word type (one per vocabulary entry) regardless of surrounding words, so a polysemous word like 'bank' always gets the same representation. BERT computes a different vector for each occurrence of a token from the full input sequence, so the vector for 'bank' next to 'river' differs from the vector for 'bank' next to 'savings', which lets downstream layers tell the senses apart.

How to think about it

Static embeddings (Word2Vec, GloVe)

Training finds one embedding per vocabulary entry by optimizing a prediction objective over a large corpus. At inference, you look up the word ID in an embedding table — fast and simple, but the vector is context-free.

from gensim.models import Word2Vec

model = Word2Vec.load("word2vec.model")
# Both sentences map to the SAME vector for "bank"
v1 = model.wv["bank"]  # "river bank"
v2 = model.wv["bank"]  # "savings bank"
print((v1 == v2).all())  # True — identical
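
To make the "prediction objective" concrete, here is a minimal sketch of how skip-gram style training pairs could be generated. The function name `skipgram_pairs` and the `window` parameter are illustrative assumptions, not gensim's API, and real Word2Vec training adds details such as subsampling and negative sampling that are omitted here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs within a symmetric window (hypothetical helper)."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

river = skipgram_pairs("we sat on the river bank".split())
money = skipgram_pairs("she went to the savings bank".split())

# Every pair whose center word is "bank", from both sentences
bank_pairs = [p for p in river + money if p[0] == "bank"]
print(bank_pairs)
# [('bank', 'the'), ('bank', 'river'), ('bank', 'the'), ('bank', 'savings')]
```

All four pairs update the same single row of the embedding table, which is why the river sense and the financial sense end up blended into one vector.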

Contextual embeddings (BERT)

BERT runs a full Transformer encoder over the input sequence. In every layer, self-attention lets each token attend to all tokens in the sequence, to its left and to its right. The output vector at token position i is therefore a function of the entire input, not just of the word type at that position.

import torch
import torch.nn.functional as F
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def embed_word(sentence, word="bank"):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # Find the position of the word itself rather than using [CLS],
    # so we compare the two "bank" vectors directly.
    # Assumes the word is a single WordPiece token, as "bank" is here.
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return out.last_hidden_state[0, tokens.index(word)]

e1 = embed_word("I deposited money at the bank")
e2 = embed_word("We picnicked on the river bank")
print(F.cosine_similarity(e1, e2, dim=0).item())
# < 1.0: the same word gets a different vector in each context
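
The mechanism behind this can be sketched without any pretrained model. The toy below is a single attention head in pure Python with made-up 2-d input vectors and identity query/key/value projections. These are simplifying assumptions for illustration; BERT itself uses learned projections, multiple heads, positional embeddings and many stacked layers.

```python
import math

# Made-up static input vectors; in a real model these are learned.
INPUT = {
    "river": [1.0, 0.0],
    "money": [0.0, 1.0],
    "bank":  [0.5, 0.5],
}

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head scaled dot-product attention with identity projections."""
    vecs = [INPUT[t] for t in tokens]
    d = len(vecs[0])
    outputs = []
    for q in vecs:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vecs]
        weights = softmax(scores)
        # Each output is a weighted average over every token in the sequence
        outputs.append([sum(w * v[i] for w, v in zip(weights, vecs)) for i in range(d)])
    return outputs

bank_by_river = self_attention(["river", "bank"])[1]
bank_by_money = self_attention(["money", "bank"])[1]
print(bank_by_river)
print(bank_by_money)
print(bank_by_river == bank_by_money)  # False
```

The input vector for "bank" is identical in both calls, yet the output differs because it mixes in whichever neighbour is present. That is the whole idea of a contextual embedding in miniature.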

Comparison

Aspect	Word2Vec	BERT
Vector per word	One (static)	One per occurrence (dynamic)
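Polysemy handling	None (all senses share one vector)	Context-dependent vector per occurrence
Compute at inference	O(1) table lookup	Full forward pass
Training objective	Skip-gram or CBOW (predict words within a local window)	Masked LM + next sentence prediction (NSP)
Downstream tasks	Usually used as fixed input features	Usually fine-tuned end-to-end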
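
For the BERT side of the "Training objective" row, the masked LM input corruption can be sketched as follows. This is a simplified illustration: `mask_tokens`, its parameters and the toy vocabulary are hypothetical names, and each position is sampled independently here, whereas the original implementation selects a fixed share of positions per sequence. The 15% selection rate and the 80/10/10 split between `[MASK]`, a random token and the unchanged token follow the BERT paper.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels); labels are None where nothing is predicted."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover the original token here
            roll = rng.random()
            if roll < 0.8:
                corrupted.append("[MASK]")
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # random replacement
            else:
                corrupted.append(tok)  # left unchanged but still scored
        else:
            labels.append(None)  # position does not contribute to the loss
            corrupted.append(tok)
    return corrupted, labels

sentence = "we picnicked on the river bank".split()
toy_vocab = ["river", "bank", "money", "the", "on"]
corrupted, labels = mask_tokens(sentence, toy_vocab, mask_prob=0.5, seed=1)
print(corrupted)
print(labels)
```

To fill a masked position the model has to use the words on both sides of it, which is what pushes its hidden states to encode context.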
