How does Word2Vec work, and what is the difference between Skip-gram and CBOW?
Word2Vec trains a shallow neural network to predict context from a target word (Skip-gram) or a target word from its context (CBOW), learning dense vector representations as a by-product. Skip-gram works better for rare words; CBOW is faster and suits large corpora.
How to think about it
Word2Vec (Mikolov et al., 2013) exploits the distributional hypothesis — words appearing in similar contexts have similar meanings — to learn vector embeddings from unlabelled text.
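Skip-gram: given a center word, predict each surrounding word within a window. Objective: maximize P(context | center) over every (center, context) pair. Because each pair triggers its own update to the center word's vector, a rare word gets a full gradient step for every context it appears in instead of being averaged in with its neighbors, which is why Skip-gram tends to represent rare words well.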
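To make the training signal concrete, here is a minimal sketch of how (center, context) pairs can be enumerated for one sentence. The helper name `skipgram_pairs` is made up for illustration, and the window is fixed; real implementations typically also shrink the window at random and subsample frequent words, which is omitted here.

```python
def skipgram_pairs(tokens, window=2):
    """Return every (center, context) pair within `window` positions.

    Illustrative helper, not part of any library.
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


pairs = skipgram_pairs(["the", "king", "rules", "the", "land"], window=2)
print(len(pairs))
print(pairs[:4])
```

Each pair becomes one training example, so a five-word sentence already yields more than a dozen updates.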
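CBOW (Continuous Bag of Words): average the context word vectors and predict the center word. Training is faster because each window yields one prediction rather than one per context word, and the averaging smooths the embeddings, which tends to favor frequent words.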
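The CBOW forward pass can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the reference implementation: the four-word vocabulary, the names `W_in`, `W_out`, and `cbow_probs`, and the full softmax output are all chosen for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "king", "rules", "land"]          # toy vocabulary (assumed)
word_to_id = {w: i for i, w in enumerate(vocab)}
dim = 8

W_in = rng.normal(scale=0.1, size=(len(vocab), dim))   # input embeddings
W_out = rng.normal(scale=0.1, size=(len(vocab), dim))  # output weights


def cbow_probs(context_words):
    """Average the context vectors, then softmax over the vocabulary."""
    ids = [word_to_id[w] for w in context_words]
    h = W_in[ids].mean(axis=0)          # one averaged hidden vector per window
    scores = W_out @ h                  # one score per vocabulary word
    scores = scores - scores.max()      # shift for numerical stability
    exp = np.exp(scores)
    return exp / exp.sum()


# Context around the center word "king" in "the king rules the land"
probs = cbow_probs(["the", "rules", "the"])
print(dict(zip(vocab, probs.round(3))))
```

Training would push up the probability assigned to "king" for this window; only the forward pass is shown.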
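Both architectures use a single hidden layer with no nonlinearity, so it acts as a linear projection. The learned input-to-hidden weight matrix is the embedding table: row i is the vector for word i. The hidden-to-output matrix holds a second set of "context" vectors that is usually discarded after training.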
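A short check of why the input weight matrix is the embedding table: multiplying a one-hot input by the matrix selects exactly one row. The sizes and the name `W_in` below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 5, 3
W_in = rng.normal(size=(vocab_size, dim))   # input-to-hidden weights

word_id = 2
one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1.0

hidden = one_hot @ W_in                     # what the "network" computes
lookup = W_in[word_id]                      # what implementations actually do

print(np.allclose(hidden, lookup))
```

This is why practical implementations skip the matrix multiplication and index the row directly.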
```python
from gensim.models import Word2Vec

# Toy corpus: far too small for meaningful vectors, but enough to exercise the API
sentences = [
    ["the", "king", "rules", "the", "land"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["man", "and", "woman", "are", "equal"],
]

# Skip-gram (sg=1); CBOW is sg=0
model = Word2Vec(sentences, vector_size=50, window=2, sg=1, min_count=1, epochs=100)

# Nearest neighbours of "king" by cosine similarity
print(model.wv.most_similar("king", topn=3))

# Classic analogy test: king - man + woman
result = model.wv.most_similar(positive=["king", "woman"], negative=["man"])
print(result[0])  # 'queen' on a large corpus; not reliable on three sentences
```
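Negative sampling (NS) makes training tractable: instead of a full softmax over the vocabulary, the model contrasts the true context word against a small set of k randomly sampled “noise” words. The noise words are drawn from the unigram distribution raised to the 3/4 power, which boosts rare words relative to their raw frequency, and each update touches k + 1 output vectors rather than all V.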
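The negative-sampling objective for a single (center, context) pair can be sketched as follows. The word counts, the function names, and k = 3 are assumptions made for the example; the sketch also does not filter out the true context word when drawing negatives, which a careful implementation would.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def noise_distribution(counts, power=0.75):
    """Unigram counts raised to `power`, normalized to sum to 1."""
    weights = np.asarray(counts, dtype=float) ** power
    return weights / weights.sum()


def negative_sampling_loss(center_vec, context_vec, negative_vecs):
    """-log sigma(u_o . v_c) - sum_k log sigma(-u_k . v_c)."""
    pos = np.log(sigmoid(context_vec @ center_vec))
    neg = np.log(sigmoid(-(negative_vecs @ center_vec))).sum()
    return -(pos + neg)


rng = np.random.default_rng(0)
counts = [50, 10, 5, 1]                      # hypothetical word frequencies
p_noise = noise_distribution(counts)

dim, k = 8, 3
W_in = rng.normal(scale=0.1, size=(len(counts), dim))    # center vectors
W_out = rng.normal(scale=0.1, size=(len(counts), dim))   # context vectors

center_id, context_id = 1, 2
negative_ids = rng.choice(len(counts), size=k, p=p_noise)  # k noise words
loss = negative_sampling_loss(W_in[center_id], W_out[context_id], W_out[negative_ids])
print(p_noise.round(3), float(loss))
```

Minimizing this loss raises the score of the true pair and lowers the scores of the sampled pairs, with a cost that depends on k rather than on the vocabulary size.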