What are embeddings and why are they central to modern deep learning?
An embedding is a dense, learned vector representation of a discrete or high-dimensional object — a word, image, user, product — in a continuous low-dimensional space. Proximity in embedding space reflects semantic or behavioural similarity, making embeddings a universal interface between raw data and neural networks.
How to think about it
Before embeddings, discrete inputs like words were one-hot encoded: a vocabulary of 50 000 words becomes a 50 000-dimensional sparse vector in which 49 999 entries are zero. This is memory-inefficient and carries no semantic information. Every pair of distinct one-hot vectors is orthogonal and equally far apart, so “king” and “queen” are as far apart as “king” and “banana”.
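The equal-distance property of one-hot vectors can be checked directly. The following is a minimal sketch in plain Python; the four-word vocabulary and its indices are hypothetical and chosen only to keep the vectors short.

```python
import math

def one_hot(index, vocab_size):
    """Return a list of zeros with a single 1.0 at position `index`."""
    vec = [0.0] * vocab_size
    vec[index] = 1.0
    return vec

# Hypothetical toy vocabulary; the index assigned to each word is arbitrary.
vocab = {"king": 0, "queen": 1, "banana": 2, "apple": 3}

king = one_hot(vocab["king"], len(vocab))
queen = one_hot(vocab["queen"], len(vocab))
banana = one_hot(vocab["banana"], len(vocab))

# Euclidean distance between any two distinct one-hot vectors is sqrt(2),
# and their dot product is 0, regardless of what the words mean.
dist_king_queen = math.dist(king, queen)
dist_king_banana = math.dist(king, banana)
dot_king_queen = sum(a * b for a, b in zip(king, queen))

print(dist_king_queen, dist_king_banana, dot_king_queen)
```

Because the distances are identical by construction, a model fed one-hot inputs has to learn every relationship between words from scratch.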
What an embedding layer does
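It is a learned lookup table: integer index → dense vector of dimension d, stored as a trainable matrix with one row per item. During training the rows shift so that items which play similar roles for the training objective end up close together in d-dimensional space.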
import torch
import torch.nn as nn
# Map a vocabulary of 10,000 tokens to 256-dimensional vectors
embed = nn.Embedding(num_embeddings=10_000, embedding_dim=256)
token_ids = torch.tensor([42, 7, 1337]) # batch of token indices
vecs = embed(token_ids) # shape: (3, 256)
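One way to connect the lookup table back to one-hot encoding is to note that indexing a row gives the same result as multiplying a one-hot vector by the weight matrix. The sketch below checks this on toy sizes (10 tokens, 4 dimensions, both chosen only for illustration) and also shows that only the rows actually looked up receive a gradient.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)  # reproducible random initialisation

vocab_size, dim = 10, 4  # toy sizes, chosen only for illustration
embed = nn.Embedding(num_embeddings=vocab_size, embedding_dim=dim)

token_ids = torch.tensor([2, 7, 2])

# Path 1: direct table lookup.
looked_up = embed(token_ids)  # shape: (3, 4)

# Path 2: one-hot encode, then multiply by the weight matrix.
one_hot = F.one_hot(token_ids, num_classes=vocab_size).float()  # shape: (3, 10)
multiplied = one_hot @ embed.weight  # shape: (3, 4)

same = torch.allclose(looked_up, multiplied)

# Backpropagate a dummy loss: only rows 2 and 7 were used, so only they get a gradient.
looked_up.sum().backward()
rows_with_grad = embed.weight.grad.abs().sum(dim=1).nonzero().flatten().tolist()

print(same, rows_with_grad)
```

The lookup avoids ever materialising the sparse one-hot vector, which is why it is the standard way to feed discrete inputs to a network.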
Why they matter
- Compression: 256 floats per token instead of 50 000 binary values.
- Generalisation: tokens that occur in similar contexts receive similar gradient updates and end up with nearby vectors, so what the model learns about one token carries over to its neighbours, including in combinations it never saw during training.
- Reuse: pretrained embeddings (Word2Vec, GloVe, BERT’s token embeddings) transfer across tasks. This is the backbone of transfer learning in NLP.
- Multimodal alignment: systems like CLIP train an image encoder and a text encoder jointly so that embed("a dog") ≈ embed(photo_of_dog). This enables zero-shot retrieval.
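As an illustration of the reuse point, PyTorch's nn.Embedding.from_pretrained builds an embedding layer from an existing matrix of vectors. This is a minimal sketch: the 3×3 matrix is made up and merely stands in for vectors that would normally be loaded from a pretrained model such as GloVe.

```python
import torch
import torch.nn as nn

# Stand-in for vectors loaded from a pretrained model.
# Real vectors would come from a file; these values are made up.
pretrained = torch.tensor([
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
])

# freeze=True keeps the table fixed while the rest of the model trains.
frozen = nn.Embedding.from_pretrained(pretrained, freeze=True)

# freeze=False lets the vectors be fine-tuned on the new task.
tunable = nn.Embedding.from_pretrained(pretrained.clone(), freeze=False)

ids = torch.tensor([0, 2])
vecs = frozen(ids)  # rows 0 and 2 of the pretrained table

print(frozen.weight.requires_grad, tunable.weight.requires_grad)
```

Freezing is a common choice when the downstream dataset is small; fine-tuning tends to help when there is enough task data to adjust the vectors without overfitting.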
Applications beyond NLP
Recommender systems embed users and items (Netflix, Spotify). Knowledge graphs embed entities and relations. Molecule encoders embed atoms and bonds for drug discovery. Anywhere you have discrete or complex objects, an embedding gives the network something to work with.
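For the recommender case, one simple formulation scores a user-item pair by the dot product of their two embeddings. The sketch below assumes this matrix-factorisation-style setup; the class name DotProductRecommender, the table sizes, and the IDs are hypothetical, and production systems are typically more elaborate.

```python
import torch
import torch.nn as nn

class DotProductRecommender(nn.Module):
    """Scores a (user, item) pair by the dot product of their embeddings."""

    def __init__(self, num_users: int, num_items: int, dim: int = 32):
        super().__init__()
        self.user_embed = nn.Embedding(num_users, dim)
        self.item_embed = nn.Embedding(num_items, dim)

    def forward(self, user_ids: torch.Tensor, item_ids: torch.Tensor) -> torch.Tensor:
        u = self.user_embed(user_ids)  # (batch, dim)
        v = self.item_embed(item_ids)  # (batch, dim)
        return (u * v).sum(dim=-1)     # (batch,), one score per pair

torch.manual_seed(0)
model = DotProductRecommender(num_users=100, num_items=500)

# Hypothetical interactions: user 3 with items 10 and 11, user 42 with item 10.
user_ids = torch.tensor([3, 3, 42])
item_ids = torch.tensor([10, 11, 10])
scores = model(user_ids, item_ids)

print(scores.shape)
```

Training such a model against observed ratings or clicks pulls the vectors of users and the items they interact with towards each other.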
The dot product or cosine similarity between two embedding vectors is a fast, differentiable proxy for semantic similarity — which is why vector databases and approximate nearest-neighbour search are now infrastructure primitives.
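A brute-force version of the similarity search described above can be sketched in a few lines. The helper top_k_cosine and the 2-D vectors are hypothetical; real vector databases replace the exhaustive comparison with approximate indexes to scale to millions of vectors.

```python
import torch
import torch.nn.functional as F

def top_k_cosine(query: torch.Tensor, table: torch.Tensor, k: int = 3):
    """Return indices and cosine similarities of the k rows of `table` closest to `query`."""
    q = F.normalize(query, dim=-1)  # unit-length query, shape (d,)
    t = F.normalize(table, dim=-1)  # unit-length rows, shape (n, d)
    sims = t @ q                    # cosine similarity per row, shape (n,)
    values, indices = torch.topk(sims, k)
    return indices.tolist(), values.tolist()

# Made-up 2-D embeddings, small enough to check by eye.
table = torch.tensor([
    [1.0, 0.0],   # row 0: same direction as the query
    [0.9, 0.1],   # row 1: almost the same direction
    [0.0, 1.0],   # row 2: orthogonal
    [-1.0, 0.0],  # row 3: opposite direction
])
query = torch.tensor([2.0, 0.0])

indices, values = top_k_cosine(query, table, k=2)
print(indices, values)
```

Normalising first means the dot product equals the cosine similarity, so the length of the query vector does not affect the ranking.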