How does GloVe differ from Word2Vec in learning word embeddings?
GloVe (Global Vectors) first aggregates a global word-word co-occurrence matrix over the entire corpus and then fits vectors to it with a weighted least-squares objective, so corpus-wide counts are the direct training signal. Word2Vec slides a local context window over the text and optimizes a prediction objective, so it sees global statistics only implicitly, through many repeated updates. The GloVe paper reports a small edge on analogy tasks; with comparable hyperparameters the two are usually close, and neither has a consistent advantage on rare words.
How to think about it
GloVe intuition
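GloVe (Pennington et al., 2014) starts from a corpus-wide co-occurrence matrix X, where X[i][j] counts how often word j appears within a context window around word i. It then learns a word vector w_i and a separate context vector w̃_j, plus scalar biases b_i and b̃_j, by weighted least squares over the nonzero entries of X, such that:
w_i · w̃_j + b_i + b̃_j ≈ log X[i][j]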
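This means the dot product of the two vectors, together with the bias terms, approximates the log of the pair's co-occurrence count (a count, not a normalized probability). The objective is derived from the observation that ratios of co-occurrence probabilities carry meaning, so differences between vectors are designed to encode those ratios. Analogy arithmetic (king - man + woman ≈ queen) is therefore motivated by the objective, but it remains an empirical property of the trained vectors, not a guarantee.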
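To make the objective concrete, here is a minimal pure-Python sketch of the two ingredients: building distance-weighted co-occurrence counts and scoring one pair with the weighted squared error. The function names (`cooccurrence_counts`, `weight`, `pair_loss`) and the toy corpus are made up for illustration; the 1/d distance weighting and the defaults `x_max=100`, `alpha=0.75` follow the values reported in the GloVe paper. It is not the reference implementation and it does no training.

```python
import math
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Symmetric counts; a pair that is d tokens apart contributes 1/d."""
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            # Look only to the left and add both directions, so each
            # co-occurrence is counted once per direction.
            for j in range(max(0, i - window), i):
                d = i - j
                counts[(word, tokens[j])] += 1.0 / d
                counts[(tokens[j], word)] += 1.0 / d
    return counts

def weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(x): down-weights rare pairs, caps frequent pairs at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def pair_loss(w_i, w_j_ctx, b_i, b_j_ctx, x_ij):
    """One term of the sum: f(X_ij) * (w_i . w_j + b_i + b_j - log X_ij)^2."""
    dot = sum(a * b for a, b in zip(w_i, w_j_ctx))
    return weight(x_ij) * (dot + b_i + b_j_ctx - math.log(x_ij)) ** 2

# Hypothetical toy corpus, already tokenized.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]
X = cooccurrence_counts(corpus, window=2)

# Only pairs that actually co-occur are stored, so memory grows with the
# number of nonzero entries rather than with the full V x V matrix.
print(len(X), "nonzero entries")
print(X[("sat", "on")])
```

Training would minimize the sum of `pair_loss` over all stored pairs, typically with a stochastic gradient method; pairs that never co-occur are skipped entirely, which sidesteps log 0.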
Word2Vec recap
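Word2Vec never builds an explicit co-occurrence matrix. Skip-gram slides a window over the text and, for each center word, predicts the surrounding context words. With negative sampling, this becomes a binary classification task: distinguish true (center, context) pairs from pairs formed with randomly drawn noise words. Every co-occurrence in the text is a separate training example, so a pair that co-occurs n times is visited about n times per epoch (fewer if frequent words are subsampled).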
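The sketch below illustrates that training signal in pure Python: it enumerates (center, context) pairs from a window and computes the negative-sampling loss for a single pair. The helper names (`skipgram_pairs`, `sgns_loss`) and the toy sentence are illustrative assumptions; a real implementation also draws noise words from a smoothed unigram distribution and updates the vectors by gradient descent, both of which are omitted here.

```python
import math

def skipgram_pairs(tokens, window=2):
    """Yield (center, context) for every context position inside the window."""
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sgns_loss(center_vec, context_vec, negative_vecs):
    """Loss for one true pair plus k noise words (binary classification)."""
    # The true pair should score high ...
    loss = -math.log(sigmoid(dot(center_vec, context_vec)))
    # ... and each noise pair should score low.
    for neg in negative_vecs:
        loss -= math.log(sigmoid(-dot(center_vec, neg)))
    return loss

# Hypothetical toy sentence in which "the cat" occurs twice.
tokens = "the cat and the cat".split()
pairs = list(skipgram_pairs(tokens, window=1))

# The repeated pair shows up as two separate training examples.
print(pairs.count(("the", "cat")))

# With all-zero vectors every sigmoid is 0.5, so the loss is (1 + k) * ln 2.
zero = [0.0, 0.0]
print(sgns_loss(zero, zero, [zero, zero]))
```

The contrast with GloVe is visible in the data flow: here the pair ("the", "cat") is processed once per occurrence, whereas GloVe would first sum those occurrences into a single count and fit that count directly.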
Key differences
| Aspect | GloVe | Word2Vec |
|---|---|---|
| Training signal | Global matrix factorization | Local window prediction |
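| Memory | High (co-occurrence counts: up to O(V^2), sparse in practice) | Low (streams over the corpus) |
| Rare words | Weak (few, noisy counts; low-count pairs are down-weighted) | Also weak (few updates); negative sampling mainly helps frequent words |
| Analogy tasks | Slightly better in the original paper | Comparable once hyperparameters are tuned |
| Interpretability | Explicit log-bilinear regression on counts | Less explicit; skip-gram with negative sampling implicitly factorizes a shifted PMI matrix (Levy and Goldberg, 2014) |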
```python
import gensim.downloader as api

# Load pre-trained GloVe vectors (Stanford release); downloaded on first use
glove = api.load("glove-wiki-gigaword-100")

# Semantic arithmetic: king - man + woman
result = glove.most_similar(
    positive=["woman", "king"],
    negative=["man"],
    topn=3,
)
print(result)  # 'queen' typically in top 3
```
Which to choose in practice: for most applications the two models perform similarly, so convenience usually decides. Pre-trained GloVe vectors are a common baseline; Word2Vec is often the easier choice for training from scratch on a custom corpus, because it streams over the text and needs no co-occurrence matrix. On most downstream tasks, pre-trained contextual embeddings (BERT, sentence-transformers) generally outperform both, at a higher compute cost.