datarekha

Distillation: Teaching a Small Model to Mimic a Big One

Quantization shrinks a model's weights. Distillation trains a brand-new, smaller model to copy a big one's behaviour — fewer layers, genuinely faster. How it works, and when to pick it over quantization.

9 min read Advanced Generative AI Lesson 17 of 24

What you'll learn

  • How a small student learns from a large teacher — soft labels, temperature, and 'dark knowledge'
  • The distillation loss: matching the teacher's whole probability distribution, not just its top answer
  • Quantization vs distillation — which compression strategy to use, and why the best models use both

The wall quantization can’t break

Quantization made your 70B model fit in 40 GB. But it is still a 70B model — seventy billion multiply-adds for every single token you generate. On a phone, inside a browser, or under a tight latency budget, “smaller weights” is not enough. You need fewer weights.

That is a different problem with a different answer: train a brand-new, smaller model — fewer layers, fewer parameters — to copy what the big one does. The big model is the teacher; the small one is the student. This is knowledge distillation. The clever part is not the shrinking — it is how the student learns.


Learn from the whole answer, not just the pick

Here is the insight that makes it work. Show a well-trained classifier a photo of a dog. It does not just say “dog” — it outputs a probability for every class:

dog 90% · wolf 8% · cat 1.9% · ship 0.09% · car 0.01%

The hard label — the ground truth — is only “dog.” Train the student on that one-hot answer and you throw almost everything away. But look at what the teacher actually said: a dog is a little like a wolf, barely like a cat, and nothing like a car. Those ratios are real knowledge about how the world is shaped — Geoffrey Hinton called it “dark knowledge.” It is free supervision the student could never get from the bare word “dog.”

So distillation trains the student to match the teacher’s entire probability distribution — its soft labels — not just the winning class.


The temperature knob

There is a catch: a confident teacher’s distribution is too sharp to be useful. Ninety-nine percent on the right answer drowns out the interesting ratios underneath. So we soften it with a temperature T inside the softmax:

p_i = exp(z_i / T) / Σ_j exp(z_j / T)

where z_i are the teacher’s raw logits. At T = 1 you get the normal, spiky distribution. Raise T and the probabilities flatten, lifting the tiny ones into view: exactly the “dog is more like a wolf than a car” signal the student needs to learn. Teacher and student use the same T during training; at inference the student drops back to T = 1.
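A minimal sketch of the softening effect, using only the standard library. The logit values and the name `softmax_with_temperature` are illustrative assumptions, not taken from any particular model or library.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Return softmax(logits / T) as a list of probabilities."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max before exponentiating, for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for the classes: dog, wolf, cat, ship, car
teacher_logits = [9.0, 6.6, 5.1, 2.1, -0.1]

for T in (1.0, 2.0, 4.0):
    probs = softmax_with_temperature(teacher_logits, T)
    print(f"T={T}: " + ", ".join(f"{p:.4f}" for p in probs))
```

As T grows, the top probability shrinks and the tail classes become visible, while the ranking of the classes stays the same.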


The distillation loss

The student is trained against two targets at once:

  • Soft loss: match the teacher’s softened distribution. This is the KL divergence between the teacher’s and the student’s T-softened outputs, scaled by T² (the scaling keeps the gradient magnitude steady as you change T).
  • Hard loss: the ordinary cross-entropy against the true label, so the student stays anchored to ground truth even where the teacher is wrong.

L = α · L_soft(T)  +  (1 − α) · L_hard

α balances the two; a heavy weight on the soft term (around 0.9) is common, because the soft targets carry the richer signal.
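The combined loss can be sketched for a single example as follows. This is an illustration under stated assumptions: the KL direction is taken as KL(teacher ‖ student), the default T = 4 is an arbitrary illustrative choice, and the function names are hypothetical. A real training loop would compute this over batches in a deep-learning framework.

```python
import math

def log_softmax(logits, T=1.0):
    """Log of softmax(logits / T), computed stably."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_norm for s in scaled]

def distillation_loss(student_logits, teacher_logits, true_class, T=4.0, alpha=0.9):
    """alpha * soft loss + (1 - alpha) * hard loss for one example."""
    # Soft loss: KL(teacher || student) on the T-softened distributions, times T^2.
    log_p_teacher = log_softmax(teacher_logits, T)
    log_p_student = log_softmax(student_logits, T)
    kl = sum(math.exp(lt) * (lt - ls)
             for lt, ls in zip(log_p_teacher, log_p_student))
    soft = (T ** 2) * kl
    # Hard loss: ordinary cross-entropy against the true label, at T = 1.
    hard = -log_softmax(student_logits, 1.0)[true_class]
    return alpha * soft + (1 - alpha) * hard

# Hypothetical logits for: dog, wolf, cat, ship, car (true class is index 0)
teacher = [9.0, 6.6, 5.1, 2.1, -0.1]
untrained_student = [0.0, 0.0, 0.0, 0.0, 0.0]
print(distillation_loss(untrained_student, teacher, true_class=0))
```

When the student’s logits equal the teacher’s, the soft term is zero and only the weighted hard loss remains.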


Three flavours of distillation

Response-based is the classic above: the student matches the teacher’s final output distribution (its soft labels). Simple, model-agnostic, and most of what people mean by “distillation.”

Feature-based goes deeper: the student is also nudged to match the teacher’s intermediate activations — the hidden representations a few layers in — not just the final answer. This transfers more, but you must map the (differently shaped) layers between teacher and student. DistilBERT uses a version of this.
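One way to picture the layer mapping is a projection from the student’s hidden size to the teacher’s, followed by a distance between the two vectors. The sketch below uses mean squared error and a fixed projection matrix purely for illustration. In practice the projection would typically be learned along with the student, and other distances (cosine, for instance) are also used. All names and numbers here are hypothetical.

```python
def project(hidden, weight):
    """Map a student hidden vector (length d_s) through a d_t x d_s matrix."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]

def feature_loss(student_hidden, teacher_hidden, weight):
    """Mean squared error between the projected student vector and the teacher vector."""
    projected = project(student_hidden, weight)
    squared = [(p - t) ** 2 for p, t in zip(projected, teacher_hidden)]
    return sum(squared) / len(teacher_hidden)

# Toy sizes: the student has 2 hidden units, the teacher has 3.
projection = [[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]]
student_hidden = [1.0, 2.0]
teacher_hidden = [1.0, 2.0, 5.0]

print(feature_loss(student_hidden, teacher_hidden, projection))
```

This term would be added to the response-based loss rather than replace it.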

Sequence / data distillation is how most small language models are made today. You do not have the teacher’s logits for a closed model like GPT-4 — so instead you have it generate thousands of high-quality answers, and fine-tune the student on that synthetic dataset. The teacher’s “soft” knowledge is baked into the text it produces. Stanford’s Alpaca and the Orca line were built this way; so are countless small open models.
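The data-collection step might look roughly like this. `teacher_generate` stands in for whatever call returns the teacher’s text for a prompt. It is a hypothetical placeholder, not a real API, and the `prompt`/`response` record layout is likewise an assumption.

```python
import json

def build_distillation_dataset(prompts, teacher_generate):
    """Collect (prompt, teacher answer) pairs for supervised fine-tuning."""
    records = []
    for prompt in prompts:
        answer = teacher_generate(prompt).strip()
        if not answer:  # drop empty generations
            continue
        records.append({"prompt": prompt, "response": answer})
    return records

def to_jsonl(records):
    """Serialise records as JSON Lines, one training example per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

# Stand-in teacher so the sketch runs offline; a real one would call a large model.
def fake_teacher(prompt):
    return f"(teacher answer to: {prompt})"

dataset = build_distillation_dataset(
    ["What is distillation?", "Define temperature."], fake_teacher
)
print(to_jsonl(dataset))
```

The student is then fine-tuned on these pairs with ordinary next-token cross-entropy; no teacher logits are involved.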


It works — the receipts

| Student | Teacher | Result |
|---|---|---|
| DistilBERT | BERT-base | 40% smaller, 60% faster, ~97% of BERT’s GLUE score |
| DistilGPT-2 | GPT-2 | ~2× faster at close quality |
| Alpaca 7B | text-davinci-003 | instruction-following from 52K distilled examples, for ~$600 |

The headline number to remember: a good distillation keeps roughly 95–97% of the teacher’s quality at a fraction of the size — if the student is large enough and the data is good. Push the student too small and quality falls off a cliff; there is no free lunch, only a very good trade.


Quantization vs distillation

Both shrink a model, but they change completely different things — and that is exactly why you often want both.

| | Quantization | Distillation |
|---|---|---|
| What shrinks | bits per weight | number of weights (a new model) |
| Architecture | identical | brand-new, smaller |
| Params & FLOPs | unchanged: same math, fewer bytes | genuinely fewer → faster per token |
| Cost to apply | minutes; no training (PTQ) | a full training run + teacher inference over data |
| You need | the model + a few calibration samples | a teacher + a training set + GPU time |
| Typical size cut | about 4× (FP16 → INT4) | 2–10× (you choose the student) |
| Quality hit | small (≈2–5 pts at INT4) | depends on student size + data |
| Reversible? | yes: reload FP16 | no: it is a different model |

The mental model: quantization makes each weight cheaper to store; distillation leaves fewer weights to compute with. Quantization saves memory and bandwidth; distillation removes actual FLOPs, so every token needs less arithmetic no matter how the weights are stored.

And they compose. The strongest small models are usually distilled first, then quantized:

70B teacher (FP16 · 140 GB; slow, huge) → distill (fewer FLOPs) → 7B student (FP16 · 14 GB; ~10× fewer FLOPs) → quantize (fewer bits) → 7B · INT4 (3.5 GB; fast and tiny)

Distill to cut FLOPs, then quantize to cut bytes: 40× smaller than the teacher, and faster per token, not just smaller.

Distillation and quantization are complements, not rivals. Distill first to remove compute; quantize second to remove memory.
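The sizes in this pipeline follow from simple arithmetic: parameters × bits per weight ÷ 8 gives bytes. A small sketch, assuming weights dominate the footprint and ignoring quantization metadata such as scales and zero-points; `model_size_gb` is a hypothetical helper.

```python
def model_size_gb(num_params, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

teacher_gb = model_size_gb(70e9, 16)      # 70B teacher in FP16
student_gb = model_size_gb(7e9, 16)       # 7B student in FP16
student_int4_gb = model_size_gb(7e9, 4)   # 7B student in INT4

print(f"teacher FP16:  {teacher_gb:.1f} GB")
print(f"student FP16:  {student_gb:.1f} GB")
print(f"student INT4:  {student_int4_gb:.1f} GB")
print(f"total shrink:  {teacher_gb / student_int4_gb:.0f}x")
```

The 10× comes from distillation (fewer parameters) and the 4× from quantization (fewer bits), which multiply to the overall 40×.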


Quiz

Quick check

Q1. Why does distillation soften the teacher's outputs with a high temperature T before the student learns from them?
Q2. A team needs to run a 70B model on a 24 GB GPU for a one-off demo next week. They have no training data and no time to train. Quantization or distillation?
Q3. Which statement about quantization vs distillation is correct?

Next

Speculative Decoding — keep your big model’s quality but get a small model’s speed, by letting a tiny draft model run ahead and having the big one check its work.

Practice this in an interview

Why are smaller language models (SLMs) sometimes preferable to larger ones?

Smaller models win on latency, inference cost, on-device deployment, and fine-tuning feasibility. When trained on high-quality, curated data and aligned for a narrow task, a 7B–13B model can match or exceed a general-purpose 70B+ model on that specific workload while using a fraction of the compute budget.

How do you train a deep learning model when you have very little labelled data?

Small labelled datasets call for a layered strategy: transfer learning from a pretrained backbone, heavy data augmentation, self-supervised pretraining on unlabelled data, and regularisation to prevent the model memorising the few examples it sees.

How does batch size affect training — speed, convergence, and generalisation?

Larger batches give more accurate gradient estimates and enable higher GPU utilisation, but they tend to converge to sharper minima that generalise worse. Smaller batches introduce gradient noise that acts as implicit regularisation, helping the optimiser escape sharp minima and often finding flatter, better-generalising solutions — at the cost of slower wall-clock training per epoch.

What is mixed precision training and why does it matter?

Mixed precision training stores weights and activations in float16 (or bfloat16) for forward/backward passes while keeping a float32 master copy of weights for the update step. This halves memory usage and delivers 2–4x throughput on modern tensor cores, with negligible accuracy loss when used with loss scaling.
