Distillation: Teaching a Small Model to Mimic a Big One
Quantization shrinks a model's weights. Distillation trains a brand-new, smaller model to copy a big one's behaviour — fewer layers, genuinely faster. How it works, and when to pick it over quantization.
What you'll learn
- How a small student learns from a large teacher — soft labels, temperature, and 'dark knowledge'
- The distillation loss: matching the teacher's whole probability distribution, not just its top answer
- Quantization vs distillation — which compression strategy to use, and why the best models use both
The wall quantization can’t break
Quantization made your 70B model fit in 40 GB. But it is still a 70B model — seventy billion multiply-adds for every single token you generate. On a phone, inside a browser, or under a tight latency budget, “smaller weights” is not enough. You need fewer weights.
That is a different problem with a different answer: train a brand-new, smaller model — fewer layers, fewer parameters — to copy what the big one does. The big model is the teacher; the small one is the student. This is knowledge distillation. The clever part is not the shrinking — it is how the student learns.
Learn from the whole answer, not just the pick
Here is the insight that makes it work. Show a well-trained classifier a photo of a dog. It does not just say “dog” — it outputs a probability for every class:
dog 90% · wolf 8% · cat 1.9% · ship 0.09% · car 0.01%
The hard label — the ground truth — is only “dog.” Train the student on that one-hot answer and you throw almost everything away. But look at what the teacher actually said: a dog is a little like a wolf, barely like a cat, and nothing like a car. Those ratios are real knowledge about how the world is shaped — Geoffrey Hinton called it “dark knowledge.” It is free supervision the student could never get from the bare word “dog.”
So distillation trains the student to match the teacher’s entire probability distribution — its soft labels — not just the winning class.
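To see why the soft labels carry more signal, here is a minimal sketch in plain Python. It uses the dog example's probabilities from the text and a hypothetical untrained student that predicts every class equally. For softmax with cross-entropy, the gradient on each logit is the student's probability minus the target's, so the two targets can be compared directly.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["dog", "wolf", "cat", "ship", "car"]
teacher_probs = [0.90, 0.08, 0.019, 0.0009, 0.0001]  # soft labels from the text
hard_label = [1.0, 0.0, 0.0, 0.0, 0.0]               # one-hot "dog"

# Hypothetical untrained student: equal logits give a uniform prediction.
student_probs = softmax([0.0] * len(classes))

# For softmax + cross-entropy, d(loss)/d(logit_i) = student_i - target_i.
grad_hard = [s - t for s, t in zip(student_probs, hard_label)]
grad_soft = [s - t for s, t in zip(student_probs, teacher_probs)]

for name, gh, gs in zip(classes, grad_hard, grad_soft):
    print(f"{name:5s} hard={gh:+.4f} soft={gs:+.4f}")
```

With the hard label, every wrong class receives the same push. With the soft labels, "wolf" is pushed down less than "car", which is the similarity structure the student is meant to absorb.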
The temperature knob
There is a catch: a confident teacher’s distribution is too sharp to be useful. Ninety-nine percent on the right answer drowns out the interesting ratios underneath. So we soften it with a temperature T inside the softmax:
p_i = exp(z_i / T) / Σ_j exp(z_j / T)
where z_i are the teacher’s raw logits. At T = 1 you get the normal, spiky distribution. Raise T and the probabilities flatten, lifting the tiny ones into view: exactly the “dog is more like a wolf than a car” signal the student needs to learn. Teacher and student use the same T during training; at inference the student drops back to T = 1.
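The effect of T can be sketched in a few lines of plain Python. The logits here are an assumption: they are simply the logs of the dog example's probabilities, so that T = 1 reproduces that distribution. The temperatures 2 and 4 are illustrative choices, not values from the text.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Softmax of logits / T, computed in a numerically stable way."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Assumed logits: log-probabilities of the dog example (dog, wolf, cat, ship, car).
teacher_probs = [0.90, 0.08, 0.019, 0.0009, 0.0001]
logits = [math.log(p) for p in teacher_probs]

for T in (1.0, 2.0, 4.0):
    probs = softmax_with_temperature(logits, T)
    print(f"T={T}:", [round(p, 4) for p in probs])
```

As T grows, the top class loses mass and the small classes gain it, while the ranking of the classes stays the same.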
The distillation loss
The student is trained against two targets at once:
- Soft loss: match the teacher’s softened distribution. This is the KL divergence between the student’s and teacher’s T-softened outputs, scaled by T² (the scaling keeps the gradient magnitude steady as you change T).
- Hard loss: the ordinary cross-entropy against the true label, so the student stays anchored to ground truth even where the teacher is wrong.
L = α · L_soft(T) + (1 − α) · L_hard
α balances the two; a heavy weight on the soft term (around 0.9) is common, because the soft targets carry the richer signal.
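The combined loss can be sketched for a single example in plain Python. This is an illustration under stated assumptions, not a reference implementation: the KL direction is taken as KL(teacher ‖ student), the usual convention; T = 4 and the example logits are arbitrary; only α = 0.9 comes from the text. A real training loop would use a framework's batched, differentiable equivalents.

```python
import math

def log_softmax(logits, T=1.0):
    """Log of softmax(logits / T), computed stably."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    log_total = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_total for s in scaled]

def distillation_loss(student_logits, teacher_logits, true_index, T=4.0, alpha=0.9):
    """alpha * T^2 * KL(teacher_T || student_T) + (1 - alpha) * cross-entropy."""
    log_pt = log_softmax(teacher_logits, T)  # teacher, softened
    log_ps = log_softmax(student_logits, T)  # student, softened with the same T
    kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_pt, log_ps))
    soft = (T ** 2) * kl
    # Hard loss uses the student's ordinary T = 1 distribution.
    hard = -log_softmax(student_logits, 1.0)[true_index]
    return alpha * soft + (1 - alpha) * hard

teacher_logits = [5.0, 2.5, 1.0, -2.0, -4.0]  # assumed example values
student_logits = [2.0, 1.0, 0.5, 0.0, -0.5]   # assumed example values
print(distillation_loss(student_logits, teacher_logits, true_index=0))
```

When the student's logits equal the teacher's, the soft term is zero and only the hard term remains.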
Three flavours of distillation
Response-based is the classic above: the student matches the teacher’s final output distribution (its soft labels). Simple, model-agnostic, and most of what people mean by “distillation.”
Feature-based goes deeper: the student is also nudged to match the teacher’s intermediate activations — the hidden representations a few layers in — not just the final answer. This transfers more, but you must map the (differently shaped) layers between teacher and student. DistilBERT uses a version of this.
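One way to handle the shape mismatch is a small learned projection from the student's hidden size to the teacher's, followed by a distance between the two vectors. The sketch below assumes a linear projection and a mean-squared-error distance; both are illustrative choices, not the recipe of any particular model, and the vectors and weights are made-up toy values.

```python
def project(hidden, weights):
    """Linear map: `weights` has one row per output (teacher) dimension."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weights]

def feature_loss(student_hidden, teacher_hidden, projection):
    """Mean squared error between the projected student and teacher activations."""
    projected = project(student_hidden, projection)
    squared = [(p - t) ** 2 for p, t in zip(projected, teacher_hidden)]
    return sum(squared) / len(teacher_hidden)

# Toy values: a 2-dimensional student layer matched to a 3-dimensional teacher layer.
student_hidden = [1.0, 2.0]
teacher_hidden = [1.0, 2.0, 5.0]
projection = [[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]]  # would be trained jointly with the student

print(feature_loss(student_hidden, teacher_hidden, projection))
```

In training, this term would be added to the response-based loss with its own weight.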
Sequence / data distillation is how most small language models are made today. You do not have the teacher’s logits for a closed model like GPT-4 — so instead you have it generate thousands of high-quality answers, and fine-tune the student on that synthetic dataset. The teacher’s “soft” knowledge is baked into the text it produces. Stanford’s Alpaca and the Orca line were built this way; so are countless small open models.
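The data-generation step can be sketched as a simple loop. Everything named here is hypothetical: `teacher_generate` is a stand-in for whatever text API the teacher exposes, and the prompt/completion record layout and the empty-answer filter are assumptions for illustration.

```python
import json

def teacher_generate(prompt):
    """Hypothetical stand-in for a call to a closed teacher model's text API."""
    return f"[teacher answer to: {prompt}]"

def build_distillation_dataset(prompts, generate=teacher_generate):
    """Collect (prompt, teacher answer) pairs for fine-tuning a student."""
    dataset = []
    for prompt in prompts:
        answer = generate(prompt)
        if answer.strip():  # minimal quality filter: skip empty answers
            dataset.append({"prompt": prompt, "completion": answer})
    return dataset

prompts = ["Explain temperature in softmax.", "What is a soft label?"]
dataset = build_distillation_dataset(prompts)

# One JSON object per line is a common layout for fine-tuning data.
for record in dataset:
    print(json.dumps(record))
```

The student is then fine-tuned on these records with ordinary next-token cross-entropy; no teacher logits are involved.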
It works — the receipts
| Student | Teacher | Result |
|---|---|---|
| DistilBERT | BERT-base | 40% smaller, 60% faster, ~97% of BERT’s GLUE score |
| DistilGPT-2 | GPT-2 | ~2× faster at close quality |
| Alpaca 7B | text-davinci-003 | instruction-following from 52K distilled examples, for ~$600 |
The headline number to remember: a good distillation keeps roughly 95–97% of the teacher’s quality at a fraction of the size — if the student is large enough and the data is good. Push the student too small and quality falls off a cliff; there is no free lunch, only a very good trade.
Quantization vs distillation
Both shrink a model, but they change completely different things — and that is exactly why you often want both.
| | Quantization | Distillation |
|---|---|---|
| What shrinks | fewer bits per weight | fewer weights (a new model) |
| Architecture | identical | brand-new, smaller |
| Params & FLOPs | unchanged — same math, fewer bytes | genuinely fewer → faster per token |
| Cost to apply | minutes; no training (PTQ) | a full training run + teacher inference over data |
| You need | the model + a few calibration samples | a teacher + a training set + GPU time |
| Typical size cut | 2–4× (FP16 → INT8 or INT4) | 2–10× (you choose the student) |
| Quality hit | small (≈2–5 pts at INT4) | depends on student size + data |
| Reversible? | yes — reload FP16 | no — it is a different model |
The mental model: quantization makes each weight cheaper to store and move; distillation leaves fewer weights to compute. Quantization saves memory and bandwidth, which helps most when inference is memory-bound; distillation cuts the FLOPs themselves, so every token needs less arithmetic on any hardware.
And they compose. Distillation and quantization are complementary, not rivals: the strongest small models are usually distilled first to remove compute, then quantized to remove memory.
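A back-of-the-envelope sketch of how the two savings stack. It assumes a dense model that needs roughly one multiply-add per weight per token, counts weight storage only (no activations or KV cache), and uses a hypothetical 7B student; the 70B teacher matches the example earlier in the text.

```python
def weight_memory_gb(params, bits_per_weight):
    """Storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9

def macs_per_token(params):
    """Rough rule for a dense model: about one multiply-add per weight per token."""
    return params

configs = {
    "70B teacher, FP16": (70e9, 16),
    "70B teacher, INT4": (70e9, 4),   # quantized only
    "7B student, FP16": (7e9, 16),    # distilled only (hypothetical student size)
    "7B student, INT4": (7e9, 4),     # distilled, then quantized
}

for name, (params, bits) in configs.items():
    mem = weight_memory_gb(params, bits)
    macs = macs_per_token(params) / 1e9
    print(f"{name:20s} {mem:6.1f} GB   {macs:5.1f}B multiply-adds/token")
```

Quantization changes only the memory column; distillation changes both columns; applying both gives the smallest figure in each.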
Next
Speculative Decoding — keep your big model’s quality but get a small model’s speed, by letting a tiny draft model run ahead and having the big one check its work.