NLP & LLMs | Medium | Asked at Google, Stanford, Meta

How does GloVe differ from Word2Vec in learning word embeddings?

The short answer

GloVe (Global Vectors) first aggregates a co-occurrence matrix over the entire corpus and then fits word vectors to its log counts, a form of weighted matrix factorization that directly encodes how often pairs of words co-occur. Word2Vec streams over local context windows with a prediction objective and never materializes the global statistics. The GloVe paper reports slightly better results on analogy tasks, while Word2Vec's per-occurrence updates are often credited with handling rare words better; in practice both differences are small and sensitive to hyperparameters.

How to think about it

GloVe intuition

GloVe (Pennington et al., 2014) starts from a corpus-wide co-occurrence matrix X where X[i][j] counts how often word j appears near word i. It then learns vectors w_i and w_j such that:

w_i · w_j + b_i + b_j ≈ log X[i][j]

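This means the dot product of two word vectors, plus per-word bias terms, approximates the logarithm of their co-occurrence count (a count rather than a probability: the biases absorb each word's overall frequency). GloVe fits this by weighted least squares over the nonzero entries of X, down-weighting rare pairs and capping the influence of very frequent ones. In the original formulation w_j comes from a separate set of context vectors. Because the objective is derived from ratios of co-occurrence probabilities, differences between vectors correspond to log ratios, which is the stated motivation for linear analogies such as king - man + woman ≈ queen. The analogy structure is therefore encouraged by the objective, although it is still learned from data rather than guaranteed.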
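The two stages described above can be sketched in plain Python: count co-occurrences once, then score a candidate pair of vectors against the log count. This is a minimal illustration, not the reference implementation. The function names are hypothetical, and the 1/distance weighting, x_max = 100 and alpha = 0.75 are the defaults described in the GloVe paper, assumed here rather than taken from this page.

```python
import math
from collections import defaultdict


def cooccurrence_counts(sentences, window=2):
    """Symmetric counts X[(word, context)]; a pair d tokens apart adds 1/d."""
    counts = defaultdict(float)
    for tokens in sentences:
        for pos, word in enumerate(tokens):
            # Look left only and add both directions, so each pair is visited once.
            for ctx_pos in range(max(0, pos - window), pos):
                weight = 1.0 / (pos - ctx_pos)
                counts[(word, tokens[ctx_pos])] += weight
                counts[(tokens[ctx_pos], word)] += weight
    return dict(counts)


def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(x): grows with the count and is capped at 1 for x >= x_max."""
    return (x / x_max) ** alpha if x < x_max else 1.0


def glove_pair_loss(w_i, w_j, b_i, b_j, x_ij):
    """Weighted squared error between the model score and log X[i][j]."""
    score = sum(a * b for a, b in zip(w_i, w_j)) + b_i + b_j
    return glove_weight(x_ij) * (score - math.log(x_ij)) ** 2


X = cooccurrence_counts([["the", "cat", "sat"], ["the", "dog", "sat"]])
# Only the nonzero entries of X contribute terms to the objective.
print(len(X), "nonzero co-occurrence entries")
```

A full trainer would sum glove_pair_loss over all nonzero entries and minimize it by gradient descent; the corpus is only read during the counting pass.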

Word2Vec recap

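Word2Vec (Mikolov et al., 2013) never builds an explicit co-occurrence matrix. Its skip-gram variant slides a window over the corpus and, when trained with negative sampling, learns a binary classifier that distinguishes observed (center, context) pairs from randomly sampled noise words. Every occurrence of a pair triggers its own update, so a frequent pair is seen many times per epoch, whereas GloVe collapses all occurrences into a single count before training.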
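As a rough sketch of that procedure, the snippet below generates skip-gram training pairs from a token list and computes the negative-sampling loss for one pair. The function names are hypothetical and the noise vectors are passed in directly; in Word2Vec they would be drawn from a smoothed unigram distribution, which is omitted here for brevity.

```python
import math


def skipgram_pairs(tokens, window=2):
    """Yield one (center, context) pair per occurrence inside the window."""
    pairs = []
    for pos, center in enumerate(tokens):
        lo = max(0, pos - window)
        hi = min(len(tokens), pos + window + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos != pos:
                pairs.append((center, tokens[ctx_pos]))
    return pairs


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sgns_loss(center_vec, context_vec, noise_vecs):
    """Binary logistic loss: true pair labelled 1, each noise word labelled 0."""
    loss = -math.log(sigmoid(dot(center_vec, context_vec)))
    for noise_vec in noise_vecs:
        loss -= math.log(sigmoid(-dot(center_vec, noise_vec)))
    return loss


# A repeated pair appears once per occurrence, unlike a single matrix count.
print(skipgram_pairs(["the", "cat", "sat", "the", "cat"], window=1))
```

Note that the pair ("the", "cat") shows up twice in the printed list: the model takes a separate gradient step for each occurrence rather than one step on an aggregated count.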

Key differences

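Aspect	GloVe	Word2Vec
Training signal	Global co-occurrence counts (weighted matrix factorization)	Local window prediction
Memory	High (co-occurrence matrix, up to O(V^2) entries; stored sparsely in practice)	Low (streams over the corpus)
Rare words	Worse (sparse, down-weighted counts)	Often better (per-occurrence updates)
Analogy tasks	Slightly better in the original paper	Comparable
Interpretability	Explicit log-bilinear regression on counts	Implicit (objective less directly tied to corpus statistics)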

import gensim.downloader as api

# Load pre-trained GloVe vectors (Stanford release, 100 dimensions).
# The first call downloads the model and caches it locally.
glove = api.load("glove-wiki-gigaword-100")

# Semantic arithmetic: king - man + woman
result = glove.most_similar(
    positive=["woman", "king"],
    negative=["man"],
    topn=3,
)
print(result)  # list of (word, cosine similarity); 'queen' typically in top 3
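To make the analogy query less opaque, here is a rough pure-Python sketch of the kind of computation behind it: add the unit-normalized positive vectors, subtract the negative ones, and rank the remaining words by cosine similarity to the result. The toy 2-D vectors and the analogy helper are hypothetical, and gensim's actual implementation is vectorized and may differ in details.

```python
import math


def unit(v):
    """Scale a vector to length 1 (assumes it is not all zeros)."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def cosine(u, v):
    return sum(a * b for a, b in zip(unit(u), unit(v)))


def analogy(vectors, positive, negative, topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, unit(vectors[word]))]
    for word in negative:
        target = [t - x for t, x in zip(target, unit(vectors[word]))]
    # The query words themselves are excluded from the candidates.
    excluded = set(positive) | set(negative)
    scored = [(w, cosine(target, v)) for w, v in vectors.items() if w not in excluded]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Hand-made vectors: first axis ~ "royalty", second axis ~ "gender".
toy = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}
print(analogy(toy, positive=["woman", "king"], negative=["man"]))
```

With these hand-made vectors "queen" ranks first by construction; real embeddings only approximate this geometry, which is why the pre-trained query is described as typically, not always, returning it.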

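Which to choose in practice: for most applications the two models perform similarly. Pre-trained GloVe vectors are a common academic baseline, while Word2Vec with negative sampling is usually the simpler option for training from scratch on a custom corpus, since it streams over the text without a separate co-occurrence counting pass. For production, pre-trained contextual embeddings (BERT, sentence-transformers) generally outperform both on downstream tasks, at a higher compute cost.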
