Reading a model's mind: sparse autoencoders explained
Sparse autoencoders pull human-readable features out of an LLM's tangled activations — the breakthrough tool of mechanistic interpretability.
We can build a model that writes code and explains quantum mechanics, and yet we cannot straightforwardly answer a simpler question: what is it actually representing inside? A neural network is a giant pile of numbers, and for most of the deep-learning era its internals were a black box. Mechanistic interpretability is the field trying to change that: to reverse-engineer the concepts a model uses, the way a biologist maps what neurons do. In 2024 that effort got its breakout tool.
The problem: one neuron is not one idea
The naive hope was that individual neurons would each stand for a clean concept: a “dog” neuron, a “Python” neuron. They do not. A single neuron typically fires for a grab-bag of unrelated things, and a single concept is spread across many neurons. The reason is superposition: a model needs to represent vastly more features than it has neurons, so it packs them in as overlapping directions in activation space, much as you can get more colors than you have crayons by mixing them. That packing is efficient for the model and a nightmare for anyone trying to read it. The features are real, but they are entangled.
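To make superposition concrete, here is a minimal numerical sketch, not anything taken from a real model. It assumes features are random unit directions in neuron space. The sizes (256 neurons, 1,024 features) and the three active feature indices are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_neurons, n_features = 256, 1024  # hypothetical: 4x more features than neurons

# Assumption: each feature is a random unit-length direction in neuron space.
directions = rng.standard_normal((n_features, n_neurons))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Overlap (cosine similarity) between distinct features: small, but not zero,
# because 1,024 directions cannot all be orthogonal in 256 dimensions.
overlap = directions @ directions.T
np.fill_diagonal(overlap, 0.0)
max_overlap = float(np.abs(overlap).max())

# Activate three features at once; the activation vector is simply their sum.
active = [3, 141, 400]
activation = directions[active].sum(axis=0)

# Project the activation onto every feature direction. Active features score
# near 1; inactive ones only pick up interference from the overlaps.
readout = directions @ activation
top3 = sorted(np.argsort(readout)[-3:].tolist())

print(f"largest overlap between two features: {max_overlap:.2f}")
print("three strongest features by readout:", top3)
```

Every neuron carries a little of every feature here, which is why reading single neurons tells you so little, while the right directions still separate the active features from the rest.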
The tool: sparse autoencoders
A sparse autoencoder (SAE) is a small network trained to do one thing: take a model’s tangled internal activations, expand them into a much wider set of features, and reconstruct the original activations from those features while forcing only a few of them to be active at once. That sparsity constraint is the magic. It pushes the SAE to discover a dictionary of features that each capture one thing, and the features it finds are far more interpretable and monosemantic than the directions you get from other methods, effectively un-mixing the superposition.
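The forward pass and training objective can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the architecture of any particular published SAE: it assumes a ReLU encoder with an L1 sparsity penalty (one common variant), and the names `sae_forward`, `sae_loss`, `l1_coeff`, and all sizes are hypothetical. The weights are random and untrained, so only the shapes and the form of the loss are meaningful.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode activations into wide non-negative features, then decode back."""
    # ReLU keeps features >= 0, so a feature is either active or exactly off.
    features = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)
    reconstruction = features @ W_dec + b_dec
    return features, reconstruction

def sae_loss(x, features, reconstruction, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that pushes features to zero."""
    mse = np.mean(np.sum((x - reconstruction) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(features), axis=1))
    return mse + l1_coeff * sparsity

rng = np.random.default_rng(0)
d_model, n_features, batch = 16, 128, 4  # hypothetical sizes: 8x expansion
x = rng.standard_normal((batch, d_model))  # stand-in for real model activations

# Untrained random parameters, for shape-checking only.
W_enc = rng.standard_normal((d_model, n_features)) * 0.1
b_enc = np.zeros(n_features)
W_dec = rng.standard_normal((n_features, d_model)) * 0.1
b_dec = np.zeros(d_model)

features, reconstruction = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = float(sae_loss(x, features, reconstruction))
print(features.shape, reconstruction.shape, loss)
```

Training would minimize this loss by gradient descent over many batches of activations. The coefficient on the L1 term trades reconstruction quality against how few features fire.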
From toy models to real ones
The leap that made this famous was scale. Anthropic applied sparse autoencoders to a production model, Claude 3 Sonnet, and pulled out millions of interpretable features — concrete things like a feature for the Golden Gate Bridge, features for code errors, for sycophancy, even for kinds of deception. Crucially, these were not just labels you read off: turning a feature up or down changed the model’s behavior in the predicted way, which is what tells you the feature is causal, not cosmetic. The work continued into 2025 with circuit tracing and attribution graphs that follow how features connect across layers — moving from “what concepts exist” toward “how the model uses them to reach an answer.”
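The “turning a feature up or down” step comes down to vector arithmetic on the activations. The sketch below is a simplified stand-in: it assumes steering means adding a scaled copy of a feature’s decoder direction, and the published experiments may differ in detail. The function name `steer`, the random vectors, and the strength of 5.0 are all hypothetical.

```python
import numpy as np

def steer(activation, feature_direction, strength):
    """Push an activation vector along one feature's (normalized) direction."""
    unit = feature_direction / np.linalg.norm(feature_direction)
    return activation + strength * unit

rng = np.random.default_rng(0)
d_model = 16  # hypothetical activation width
activation = rng.standard_normal(d_model)         # stand-in for one token's activations
feature_direction = rng.standard_normal(d_model)  # stand-in for one SAE decoder row

# Measure how strongly the feature is expressed before and after steering.
unit = feature_direction / np.linalg.norm(feature_direction)
before = float(activation @ unit)
boosted = float(steer(activation, feature_direction, 5.0) @ unit)
suppressed = float(steer(activation, feature_direction, -5.0) @ unit)
print(before, boosted, suppressed)
```

In a real experiment the steered vector would be written back into the model’s forward pass, and the test of causality is whether the output text changes as the feature’s label predicts. This sketch covers only the edit itself.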
The honest caveats
It is early, and the technique is imperfect. Reconstructing a model’s activations through an SAE is lossy — swapping the SAE’s reconstruction back in causes a 10-40% drop on downstream tasks, so the features do not yet capture everything the model uses. Superposition also gets worse in larger models, making clean, monosemantic features harder to guarantee at frontier scale. Reading a mind, it turns out, is hard even when you have every neuron in front of you.
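Measuring the downstream drop requires running the full model, but a simpler proxy for how lossy a reconstruction is can be computed from the activations alone. The sketch below assumes fraction of variance unexplained as that proxy. It is not the downstream-task metric quoted above, and the function name and synthetic data are hypothetical.

```python
import numpy as np

def fraction_variance_unexplained(x, reconstruction):
    """0.0 means a perfect reconstruction; higher means more information lost."""
    residual = np.sum((x - reconstruction) ** 2)
    total = np.sum((x - x.mean(axis=0)) ** 2)
    return float(residual / total)

rng = np.random.default_rng(0)
x = rng.standard_normal((256, 16))  # stand-in for real model activations

# Two hypothetical reconstructions: an exact copy, and one with added noise
# standing in for whatever an imperfect SAE fails to capture.
perfect = x.copy()
noisy = x + 0.3 * rng.standard_normal(x.shape)

fvu_perfect = fraction_variance_unexplained(x, perfect)
fvu_noisy = fraction_variance_unexplained(x, noisy)
print(fvu_perfect, fvu_noisy)
```

A low value on this proxy does not guarantee a small downstream drop, since the part of the activation the SAE misses may be exactly the part the model relies on.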
The takeaway
For most of these models’ history, we judged them purely by behavior, treating them as the ultimate black box. Sparse autoencoders cracked the lid: they show that the concepts inside are real, nameable, and even editable, just tangled by superposition into a form we could not read directly. They do not fully demystify the transformer yet, but they are the strongest evidence so far that we will not be stuck treating these systems as inscrutable forever.