What is model quantization, and how does it affect quality?

Quantization stores weights and sometimes activations in lower-precision formats to cut memory and speed up inference, ranging from 16-bit (FP16 or BF16) down to INT8 and INT4. Lower precision saves more memory but can degrade accuracy; techniques like calibration, GPTQ, AWQ, and keeping sensitive layers higher-precision minimize the loss.
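To make the trade-off concrete, here is a minimal sketch of the simplest scheme: symmetric, per-tensor INT8 quantization in NumPy. The toy weight matrix, the single shared scale, and the helper names `quantize_int8` and `dequantize` are assumptions for illustration. This is not how GPTQ or AWQ work; those methods choose scales and rounding far more carefully.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: map floats to int8 using one scale."""
    scale = np.abs(w).max() / 127.0              # largest magnitude maps to +/-127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 codes."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(256, 256)).astype(np.float32)  # toy weight matrix

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

max_err = np.abs(w - w_hat).max()
print("bytes as float32:", w.nbytes, " bytes as int8:", q.nbytes)
print("max abs error:", max_err, " half a quantization step:", scale / 2)
```

The rounding error per weight is bounded by about half a quantization step, which is why a single outlier weight (it inflates the scale) can hurt every other weight in the tensor.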

How would you reduce the cost of serving an ML or LLM model in production without hurting quality?

Work top-down: start at the model layer with quantization, distillation, or routing easy requests to cheaper models, since model choices drive every downstream cost. Then optimize the runtime with batching and caching (including prompt caching for LLMs), and finally match infrastructure to the load using autoscaling on queue depth and spot or batch capacity. Track cost per token or per prediction alongside latency percentiles and accuracy so optimizations never silently degrade quality.

Why are smaller language models (SLMs) sometimes preferable to larger ones?

Smaller models win on latency, inference cost, on-device deployment, and fine-tuning feasibility. When trained on high-quality, curated data and aligned for a narrow task, a 7B–13B model can match or exceed a general-purpose 70B+ model on that specific workload while using a fraction of the compute budget.

How do you train a deep learning model when you have very little labelled data?

Small labelled datasets call for a layered strategy: transfer learning from a pretrained backbone, heavy data augmentation, self-supervised pretraining on unlabelled data, and regularisation to prevent the model memorising the few examples it sees.

Distillation: Teaching a Small Model to Mimic a Big One — Generative AI

The wall quantization can’t break

Quantization made your 70B model fit in 40 GB. But it is still a 70B model — seventy billion multiply-adds for every single token you generate. On a phone, inside a browser, or under a tight latency budget, “smaller weights” is not enough. You need fewer weights.

That is a different problem with a different answer: train a brand-new, smaller model — fewer layers, fewer parameters — to copy what the big one does. The big model is the teacher; the small one is the student. This is knowledge distillation. The clever part is not the shrinking — it is how the student learns.

Learn from the whole answer, not just the pick

Here is the insight that makes it work. Show a well-trained classifier a photo of a dog. It does not just say “dog” — it outputs a probability for every class:

dog 90% · wolf 8% · cat 1.9% · ship 0.09% · car 0.01%

The hard label — the ground truth — is only “dog.” Train the student on that one-hot answer and you throw almost everything away. But look at what the teacher actually said: a dog is a little like a wolf, barely like a cat, and nothing like a car. Those ratios are real knowledge about how the world is shaped — Geoffrey Hinton called it “dark knowledge.” It is free supervision the student could never get from the bare word “dog.”

So distillation trains the student to match the teacher’s entire probability distribution — its soft labels — not just the winning class.

The temperature knob

There is a catch: a confident teacher’s distribution is too sharp to be useful. Ninety-nine percent on the right answer drowns out the interesting ratios underneath. So we soften it with a temperature T inside the softmax:

p_i = exp(z_i / T) / Σ_j exp(z_j / T)

where z are the teacher’s raw logits. At T = 1 you get the normal, spiky distribution. Raise T and the probabilities flatten, lifting the tiny ones into view — exactly the “dog is more like a wolf than a car” signal the student needs to learn. That is precisely the move the two charts above show: the same logits go from 0.92-on-dog to a spread where wolf, cat, ship and car all become legible. Teacher and student use the same T during training; at inference the student drops back to T = 1.
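If you want to watch the flattening happen, the sketch below applies a temperature-scaled softmax to one set of logits at several temperatures. The logits are illustrative (they are the same teacher logits this lesson's loss example uses), and the helper name `softmax_t` and the class names are assumptions for the demo.

```python
import numpy as np

def softmax_t(z, T=1.0):
    """Temperature-scaled softmax; subtracting the max is only for numerical stability."""
    s = z / T
    e = np.exp(s - s.max())
    return e / e.sum()

classes = ["dog", "wolf", "cat", "ship", "car"]
z_teacher = np.array([8.0, 5.5, 3.0, 0.5, 0.0])  # illustrative logits

for T in (1.0, 2.0, 4.0, 8.0):
    p = softmax_t(z_teacher, T)
    row = "  ".join(f"{c} {v:.3f}" for c, v in zip(classes, p))
    print(f"T={T:>3}: {row}")
```

Two things hold at every temperature: the probabilities still sum to 1, and the ranking of the classes never changes. Only the gaps between them shrink as T grows.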

The distillation loss

The student is trained against two targets at once:

Soft loss — match the teacher’s softened distribution. This is the KL divergence between the student’s and teacher’s T-softened outputs, scaled by T² (the scaling keeps the gradient magnitude steady as you change T).
Hard loss — the ordinary cross-entropy against the true label, so the student stays anchored to ground truth even where the teacher is wrong.

L = α · L_soft(T)  +  (1 − α) · L_hard

α balances the two; a heavy weight on the soft term (around 0.9) is common, because the soft targets carry the richer signal.

import numpy as np

# Teacher and student logits for 5 classes on one example.
# True label = class 0 ("dog"). The student is close but not identical.
z_teacher = np.array([8.0, 5.5, 3.0, 0.5, 0.0])
z_student = np.array([6.0, 4.0, 2.5, 1.0, 0.5])
true_label, T, alpha = 0, 4.0, 0.9

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Soft targets: the teacher's distribution, softened by temperature T
p_teacher = softmax(z_teacher / T)
p_student = softmax(z_student / T)

# Soft loss = KL(teacher || student), scaled by T^2
kl = np.sum(p_teacher * np.log(p_teacher / p_student))
soft_loss = (T ** 2) * kl

# Hard loss = plain cross-entropy on the true label (student at T=1)
hard_loss = -np.log(softmax(z_student)[true_label])

loss = alpha * soft_loss + (1 - alpha) * hard_loss
print("teacher soft labels (T=4):", np.round(p_teacher, 3))
print(f"soft loss (KL x T^2): {soft_loss:.4f}")
print(f"hard loss (CE):       {hard_loss:.4f}")
print(f"total = 0.9*soft + 0.1*hard = {loss:.4f}")

teacher soft labels (T=4): [0.474 0.254 0.136 0.073 0.064]
soft loss (KL x T^2): 0.4171
hard loss (CE):       0.1624
total = 0.9*soft + 0.1*hard = 0.3916

Notice the proportions. The softened teacher distribution ([0.474, 0.254, …]) is the rich signal: the student is pulled toward the whole shape, not just “class 0.” The soft loss (0.4171) is about 2.5 times the hard loss (0.1624) here, and with the 0.9 weight it contributes roughly 96% of the total, so the dark knowledge, not the bare label, does most of the teaching.
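The T² factor in the soft loss can be checked numerically. The gradient of KL(teacher || student) with respect to the student's logits is (p_student − p_teacher) / T, and at higher temperatures the two distributions also move closer together, so the raw gradient shrinks roughly as 1/T². The sketch below compares the gradient norm with and without the T² factor, reusing the example's logits; the helper names are assumptions.

```python
import numpy as np

def softmax_t(z, T):
    s = z / T
    e = np.exp(s - s.max())
    return e / e.sum()

def soft_loss_grad(z_student, z_teacher, T, scale_by_T2):
    """Analytic gradient of KL(teacher || student) w.r.t. the student logits.

    d KL / d z_student = (p_student - p_teacher) / T
    """
    g = (softmax_t(z_student, T) - softmax_t(z_teacher, T)) / T
    return (T ** 2) * g if scale_by_T2 else g

z_teacher = np.array([8.0, 5.5, 3.0, 0.5, 0.0])
z_student = np.array([6.0, 4.0, 2.5, 1.0, 0.5])

norms = {}
for T in (2.0, 4.0, 8.0, 16.0):
    raw = np.linalg.norm(soft_loss_grad(z_student, z_teacher, T, False))
    scaled = np.linalg.norm(soft_loss_grad(z_student, z_teacher, T, True))
    norms[T] = (raw, scaled)
    print(f"T={T:>4}: |grad KL| = {raw:.4f}   |grad T^2*KL| = {scaled:.4f}")
```

Without the factor, doubling T at the high end cuts the soft-loss gradient to about a quarter, which would silently shift the balance toward the hard loss. With it, the magnitude levels off, so α keeps meaning the same thing as you tune T.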

Three flavours of distillation

Response-based is the classic above: the student matches the teacher’s final output distribution (its soft labels). Simple, model-agnostic, and most of what people mean by “distillation.”

Feature-based goes deeper: the student is also nudged to match the teacher’s intermediate activations — the hidden representations a few layers in — not just the final answer. This transfers more, but you must map the (differently shaped) layers between teacher and student. DistilBERT uses a version of this.
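Here is a minimal sketch of what "matching intermediate activations" can look like when the student is narrower than the teacher. Everything specific is an assumption for illustration: the shapes, the random projection `W_proj` (in real training it would be learned together with the student), and the two loss choices. It is not DistilBERT's exact recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: a batch of 4 token vectors, teacher width 768, student width 384.
h_teacher = rng.normal(size=(4, 768))
h_student = rng.normal(size=(4, 384))

# Projection that lifts student features into the teacher's space.
# Random here; during training it would be optimised with the student.
W_proj = rng.normal(scale=384 ** -0.5, size=(384, 768))

def feature_loss(h_s, h_t, W):
    """Mean squared error between projected student and teacher activations."""
    return np.mean((h_s @ W - h_t) ** 2)

def cosine_feature_loss(h_s, h_t, W):
    """1 - cosine similarity, averaged over the batch (compares direction only)."""
    a, b = h_s @ W, h_t
    cos = np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return np.mean(1.0 - cos)

print("MSE feature loss:   ", feature_loss(h_student, h_teacher, W_proj))
print("cosine feature loss:", cosine_feature_loss(h_student, h_teacher, W_proj))
```

Either term would be added to the response-based loss with its own weight. The projection is the "mapping" cost mentioned above: when the student keeps the teacher's hidden width, it can be dropped.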

Sequence / data distillation is how most small language models are made today. You do not have the teacher’s logits for a closed model like GPT-4 — so instead you have it generate thousands of high-quality answers, and fine-tune the student on that synthetic dataset. The teacher’s “soft” knowledge is baked into the text it produces. Stanford’s Alpaca and the Orca line were built this way; so are countless small open models.

The 2026 twist is reasoning-trace distillation: have a strong reasoning model emit its full chain of thought, then train a small student on those input → reasoning → answer triples. The student learns to reason, not just to answer — which is how compact open models picked up surprisingly strong math and code skills. The catch: the student inherits the teacher’s blind spots, so the most important step is a judge-filter that drops low-quality or wrong traces before training.
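The generate-then-filter loop can be sketched in a few lines. Both `teacher_generate` and `judge` below are hypothetical stand-ins: in practice the first would call a strong reasoning model and the second would be a verifier or another model. Here the task is toy addition so the whole thing runs offline, and the teacher is made deliberately wrong on some inputs to mimic a blind spot.

```python
import json

def teacher_generate(a, b):
    """Stand-in for a call to a strong reasoning model (hypothetical).

    Returns a reasoning trace and final answer for 'a + b'. To mimic a
    teacher blind spot, it is deliberately wrong whenever a + b is a multiple of 7.
    """
    total = a + b if (a + b) % 7 else a + b + 1
    return {
        "input": f"What is {a} + {b}?",
        "reasoning": f"Add {a} and {b} to get {total}.",
        "answer": str(total),
    }

def judge(example, a, b):
    """Stand-in judge: keep a trace only if its final answer checks out."""
    return example["answer"] == str(a + b)

pairs = [(a, b) for a in range(1, 6) for b in range(1, 6)]
raw = [(teacher_generate(a, b), a, b) for a, b in pairs]
kept = [ex for ex, a, b in raw if judge(ex, a, b)]

print(f"generated {len(raw)} traces, kept {len(kept)} after filtering")
print(json.dumps(kept[0], indent=2))
# `kept` is the synthetic fine-tuning set of input -> reasoning -> answer triples.
```

In this toy run the filter drops the four traces whose sum is 7. Without it, the student would be fine-tuned on those wrong answers and inherit the same blind spot.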

It works — the receipts

Student	Teacher	Result
DistilBERT	BERT-base	40% smaller, 60% faster, ~97% of BERT’s GLUE score
DistilGPT-2	GPT-2	~2× faster at close quality
Alpaca 7B	text-davinci-003	instruction-following from 52K distilled examples, for ~$600

The headline number to remember: a good distillation keeps roughly 95–97% of the teacher’s quality at a fraction of the size — if the student is large enough and the data is good. Push the student too small and quality falls off a cliff; there is no free lunch, only a very good trade.

Quantization vs distillation

Both shrink a model, but they change completely different things — and that is exactly why you often want both.

	Quantization	Distillation
What shrinks	fewer bits per weight	fewer weights (a new model)
Architecture	identical	brand-new, smaller
Params & FLOPs	unchanged — same math, fewer bytes	genuinely fewer → faster per token
Cost to apply	minutes; no training (PTQ)	a full training run + teacher inference over data
You need	the model + a few calibration samples	a teacher + a training set + GPU time
Typical size cut	≈4× (FP16 → INT4; ≈2× at INT8)	2–10× (you choose the student)
Quality hit	small (≈2–5 pts at INT4)	depends on student size + data
Reversible?	yes — reload FP16	no — it is a different model

The mental model: quantization makes each weight cheaper to store; distillation leaves fewer weights to compute. Quantization saves memory and bandwidth, which already speeds up inference when it is memory-bound; distillation removes the FLOPs themselves, which is what makes a model fundamentally cheaper per token.
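A few lines of arithmetic make the split visible. The sketch below uses two rules of thumb that are assumptions, not measurements: weight memory is parameters × bits / 8, and a forward pass costs about two FLOPs (one multiply, one add) per weight per token. The 70B teacher and 7B student sizes are hypothetical, and KV cache, activations, and quantization overhead such as scales are ignored.

```python
def weight_memory_gb(params, bits_per_weight):
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return params * bits_per_weight / 8 / 1e9

def flops_per_token(params):
    """Rule-of-thumb forward-pass cost: one multiply and one add per weight."""
    return 2 * params

teacher, student = 70e9, 7e9   # hypothetical teacher and student sizes

for name, params, bits in [
    ("teacher FP16", teacher, 16),
    ("teacher INT4", teacher, 4),    # quantized: fewer bytes, same FLOPs
    ("student FP16", student, 16),   # distilled: fewer weights, fewer FLOPs
    ("student INT4", student, 4),    # distilled, then quantized
]:
    print(f"{name}: {weight_memory_gb(params, bits):6.1f} GB, "
          f"{flops_per_token(params) / 1e9:6.0f} GFLOPs/token")
```

Under these assumptions, quantizing the teacher takes its weights from 140 GB to 35 GB while the FLOPs column does not move. Only the student rows lower the FLOPs, and the last row gets both savings at once.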

And they compose. The strongest small models are usually distilled first, then quantized.

Distillation and quantization are complementary, not rivals. Distill first to remove compute; quantize second to remove memory.

In one breath

Distillation trains a brand-new, smaller student to copy a big teacher — fewer weights, so genuinely fewer FLOPs and real speed (not just fewer bytes).
The trick is soft labels: match the teacher’s whole probability distribution, whose tiny ratios (“dark knowledge”) are free supervision the bare label can’t give.
A temperature T softens the teacher’s spiky distribution so those ratios become learnable; the loss blends a heavy soft term with a light hard (true-label) term.
Flavours: response-based, feature-based, and sequence/data distillation (fine-tune on the teacher’s generated text) — including 2026’s reasoning-trace distillation, gated by a judge-filter.
Quantization cuts bytes, distillation cuts FLOPs — they compose, so the strongest small models are distilled first, then quantized (≈95–97% of teacher quality at a fraction of the size).

Quiz

Quick check

Q1. Why does distillation soften the teacher's outputs with a high temperature T before the student learns from them?

Q2. A team needs to run a 70B model on a 24 GB GPU for a one-off demo next week. They have no training data and no time to train. Quantization or distillation?

Q3. Which statement about quantization vs distillation is correct?

Next: Speculative Decoding, which keeps your big model’s quality but gets a small model’s speed by letting a tiny draft model run ahead and having the big one check its work.
