What do 'parameters' mean in a language model and what do they actually store?
Parameters are the learnable floating-point numbers — weights and biases — that define a neural network's behaviour. In a transformer LLM, they are distributed across token embedding matrices, multi-head attention projection matrices (Q, K, V, O), and feed-forward network layers. They encode compressed statistical associations between tokens learned during training, not explicit facts or rules.
How to think about it
When someone says “a 70-billion-parameter model,” they mean the neural network contains 70 billion individual floating-point numbers that were learned during training. These numbers collectively determine what the model outputs for any given input.
Where the parameters live in a transformer
A transformer decoder block has four main parameter groups:
1. Token embeddings — a matrix of shape [vocab size x embedding dimension]. Each row is a learned vector representation for one vocabulary token. For a 50,000-token vocabulary with embedding dimension 4096, this matrix alone holds 200 million parameters.
2. Attention projections — for each layer and each attention head, four weight matrices project the token embedding into query (Q), key (K), value (V), and output (O) spaces. These capture which tokens should attend to which others.
3. Feed-forward network (FFN) — two large linear layers with a non-linearity between them. In most architectures the inner dimension is 4x the embedding dimension. This is often where the bulk of parameters reside: two-thirds of GPT-2’s parameters are in FFN layers.
4. Layer norm and biases — small by comparison but present in each layer.
What parameters actually store
Parameters are not a lookup table of facts. They are weights that transform one vector into another in a composition of matrix multiplications. Factual associations are distributed across millions of weights — there is no single weight that “stores” the capital of France. This distributional encoding is why:
- The model can generalise to novel phrasings.
- It is hard to surgically edit a single fact without affecting others.
- Catastrophic forgetting occurs when fine-tuning overwrites widely distributed weights.
Parameter count and memory
At fp16 (2 bytes per parameter), a 7B-parameter model requires ~14 GB of GPU memory just for weights — before accounting for activations, KV cache, and optimizer states.
memory (GB) ≈ params (B) × bytes_per_param / 1e9
7B model @ fp16: 7 × 2 = 14 GB