o3-level reasoning on your laptop: how distillation works

A frontier reasoning model can be breathtaking at a hard math proof or a gnarly bug — and breathtakingly expensive to run, with tens or hundreds of billions of parameters and a long, token-hungry thinking process for every query. So here is the question that drove one of 2026’s most useful ideas: could you take that reasoning ability and bottle it into a model small enough to run on a single GPU, or even a laptop?

That is exactly what reasoning distillation does.

Distillation, then reasoning distillation

Classic knowledge distillation is an old idea: train a small student model to imitate a large teacher, so the student inherits much of the teacher’s skill at a fraction of the size. You can read the mechanics in the lesson on model distillation.
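To make "imitate a large teacher" concrete, here is a minimal sketch of one common formulation of the classic distillation loss: the student is pushed toward the teacher's temperature-softened output distribution. The function names, the example logits, and the temperature of 2.0 are illustrative assumptions, and the linked lesson may present the loss differently.

```python
import math

def softmax(logits, temperature=1.0):
    # Higher temperature flattens the distribution, exposing how the teacher
    # ranks the wrong answers as well as the right one.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # KL(teacher || student) on the softened distributions, scaled by T^2
    # so the gradient magnitude stays comparable across temperatures.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# Hypothetical logits over three classes (assumed values, not real model output).
teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]
far_student = [0.5, 1.0, 4.0]

print("close student loss:", distillation_loss(teacher, close_student))
print("far student loss:  ", distillation_loss(teacher, far_student))
```

A student whose predictions already resemble the teacher's gets a loss near zero, while one that disagrees gets a large loss. In practice this term is usually mixed with an ordinary cross-entropy loss on the true labels.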

Reasoning distillation adds a twist that turns out to matter enormously. Instead of training the student only on the teacher’s final answers, you train it on the teacher’s full chains of thought — the entire step-by-step monologue the reasoning model produces on its way to the answer. The student does not just learn what the answer is; it learns how the teacher got there.
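The difference between the two training targets can be sketched in a few lines. This is an illustration under assumptions, not any lab's actual pipeline: the TeacherSample type, the function names, and the <think> delimiters are hypothetical choices made for the example.

```python
from dataclasses import dataclass

@dataclass
class TeacherSample:
    question: str
    reasoning: str  # the teacher's step-by-step chain of thought
    answer: str     # the teacher's final answer

def answer_only_target(sample: TeacherSample) -> str:
    # Answer-only distillation: the student sees just the result.
    return sample.answer

def reasoning_target(sample: TeacherSample,
                     open_tag: str = "<think>",
                     close_tag: str = "</think>") -> str:
    # Reasoning distillation: the student must reproduce the whole trace,
    # then the answer. The tag strings are an assumed delimiter convention.
    return f"{open_tag}\n{sample.reasoning}\n{close_tag}\n{sample.answer}"

def to_sft_pair(sample: TeacherSample, with_reasoning: bool = True) -> dict:
    # One supervised fine-tuning example: prompt in, completion out.
    target = reasoning_target(sample) if with_reasoning else answer_only_target(sample)
    return {"prompt": sample.question, "completion": target}

sample = TeacherSample(
    question="What is 17 * 6?",
    reasoning="17 * 6 = 10 * 6 + 7 * 6 = 60 + 42 = 102.",
    answer="102",
)

print(to_sft_pair(sample, with_reasoning=False))
print(to_sft_pair(sample, with_reasoning=True))
```

The student is then fine-tuned with ordinary next-token prediction on the completion, so the only thing that changes between the two regimes is how much of the teacher's output ends up in the target.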

The result that surprised people

When DeepSeek released R1, it also released a set of small dense models, built on open Qwen and Llama checkpoints and fine-tuned on reasoning traces generated by R1. The striking finding: the distilled small models outperformed far larger non-reasoning models on math and coding benchmarks. A modest model that had been taught to think beat a much bigger model that only knew how to answer. The reasoning skill, it turned out, was surprisingly transferable: you could move a lot of it from a giant into a dwarf, just by showing the dwarf enough worked examples of the giant thinking out loud.

This is why the 2026 roadmap keeps listing reasoning distillation as the bridge bringing frontier-level intelligence to edge devices. The expensive part — discovering how to reason through hard problems — is done once, by the teacher. The cheap part — imitating that process — is what ships to your phone.

Why a chain of thought is such a good teaching signal

A final answer is sparse supervision, close to a single bit: right or wrong. A full reasoning trace is dense supervision. It shows the student where to start, how to decompose the problem, which dead ends to abandon, and how to check the result. Because every token of the trace is a prediction target, training on the process gives the student hundreds or thousands of small lessons per example instead of one, which is much closer to learning from a worked solution than from an answer key.
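A rough way to see the gap is to count how many next-token prediction targets each kind of completion provides. The sketch below uses whitespace splitting as a crude stand-in for a real tokenizer, and the toy problem (solve 2x + 3 = 11) and its trace are made up for illustration.

```python
def count_supervised_targets(completion: str) -> int:
    # Crude stand-in for a real tokenizer: one target per whitespace-separated piece.
    # In next-token training, every completion token is something the student
    # must predict, so each one carries its own training signal.
    return len(completion.split())

# Toy problem: solve 2x + 3 = 11 (hypothetical example).
answer_only = "x = 4"
with_trace = (
    "Subtract 3 from both sides: 2x = 8. "
    "Divide both sides by 2: x = 4. "
    "Check: 2*4 + 3 = 11. "
    "x = 4"
)

n_answer = count_supervised_targets(answer_only)
n_trace = count_supervised_targets(with_trace)

print(f"answer only: {n_answer} targets")
print(f"with trace:  {n_trace} targets")
print(f"ratio:       {n_trace / n_answer:.1f}x")
```

Even on this tiny example the trace supplies several times more targets than the bare answer. Real reasoning traces often run to thousands of tokens, so the gap there is far larger.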

The takeaway

Reasoning distillation flips the cost structure of intelligence. The hard, expensive work of learning to think happens once in a giant model; that thinking is then captured as traces and poured into models small enough to run on a single GPU or a laptop. It is a major reason that “o3-level reasoning” no longer lives only in a data center, and the clearest example of why reasoning is a skill you can teach, not just a size you can buy.