One-hot encoding and the curse of high cardinality

A data science team at a mid-size e-commerce company spent three weeks debugging a gradient-boosted model that refused to converge. Memory kept blowing up. Training time stretched from minutes to hours. The feature matrix, when they finally printed its shape, had 14,000 columns for a dataset with 180,000 rows. The culprit was a single preprocessing line that had seemed entirely unremarkable: pd.get_dummies(df['zip_code']).

That one line had turned a column with 12,000 distinct zip codes into 12,000 binary columns, each of which was 1 only in the rows belonging to its zip code and 0 everywhere else. The matrix was 99.99 percent zeros. They had taken a column containing useful geographic signal and buried it under a sparse wall of nothing.
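A scaled-down sketch of what that line does, assuming synthetic zip-like labels (the sizes and names below are illustrative, not the team's data). Because each row has exactly one 1, the share of zeros is 1 - 1/n_columns:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows, n_zips = 5_000, 500  # illustrative sizes, far smaller than the anecdote

# Hypothetical zip-like labels; the modulo guarantees every label appears.
codes = rng.permutation(np.arange(n_rows) % n_zips)
df = pd.DataFrame({"zip_code": [f"Z{c:05d}" for c in codes]})

# One binary column per distinct value.
dummies = pd.get_dummies(df["zip_code"])

# Each row holds a single 1, so the mean of the matrix is 1 / n_zips.
zero_fraction = 1.0 - dummies.to_numpy().mean()

print(dummies.shape)           # rows x distinct zip codes
print(f"{zero_fraction:.4f}")  # share of entries that are zero
```

The same arithmetic at 12,000 columns gives 1 - 1/12,000, which is the 99.99 percent figure.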

This is the curse of high cardinality. It is worth understanding precisely, because the instinct that produces it — “categories need to be numbers, so I’ll make a column per category” — is not wrong in principle. It just fails spectacularly at scale.

Why categories need to become numbers at all

Machine learning models are, at their core, functions that compute weighted sums, dot products, or distances. None of those operations are defined for the string "New York". The model cannot subtract "California" from "Texas" and extract anything meaningful. So the values must become numbers before any arithmetic can happen.

The naïve approach is to assign integers: 0 for Alabama, 1 for Alaska, 2 for Arizona. This is called label encoding (or ordinal encoding), and it introduces a false ordering. A model that does arithmetic on the codes will treat Alaska as halfway between Alabama and Arizona, and Arizona as twice Alaska. That arithmetic is gibberish. The integer encoding has poisoned the feature with a structure that does not exist in the data.

One-hot encoding (OHE) solves this by making the ordering problem disappear. Instead of one column with integers 0 through 49, you get 50 binary columns — one per state — and each row has exactly one 1. The model can learn a separate weight for each state independently, with no implied ordering or distance between them. For a column with 3 or 5 or even 20 distinct values, this is clean, interpretable, and correct.

The problem begins when the cardinality — the number of distinct values — climbs into the hundreds or thousands.

What sparsity does to a model

Picture the transformation geometrically. A categorical column with 3 values maps each row to one of three points in a 3-dimensional space, where each point sits at a corner of a simplex. Every point is equidistant from every other. The geometry is honest: no category is implied to be “closer” to another.
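The equidistance claim is easy to check numerically. A minimal sketch with three state names, comparing the pairwise distances implied by integer codes with those implied by one-hot rows:

```python
import numpy as np
import pandas as pd

states = pd.Series(["Alabama", "Alaska", "Arizona"])

# Integer codes: distances depend on the arbitrary alphabetical order.
codes = states.astype("category").cat.codes.to_numpy(dtype=float)
ordinal_dist = np.abs(codes[:, None] - codes[None, :])

# One-hot rows: each category sits at its own corner of the simplex.
onehot = pd.get_dummies(states).to_numpy(dtype=float)
diff = onehot[:, None, :] - onehot[None, :, :]
onehot_dist = np.sqrt((diff ** 2).sum(axis=-1))

print(ordinal_dist)  # Alabama-Arizona is twice as far as Alabama-Alaska
print(onehot_dist)   # every off-diagonal entry is sqrt(2)
```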

Now take a column with 5,000 values. Each row maps to a corner of a 5,000-dimensional simplex. On average, each column is 0 in 4,999 out of every 5,000 rows. The feature matrix is almost entirely zeros.

Sparsity is not just a memory complaint. It is a statistical one. A model learns weights by observing patterns — by seeing the same feature value appear alongside the same label, repeatedly, across many training examples. A zip code that appears in only four rows of your training set contributes almost no learnable signal. The weight the model assigns to that zip code’s column will be poorly estimated, dominated by noise, and quite possibly overfit to those four specific rows. You are not encoding geographic signal — you are encoding individual examples.

This is the mechanism behind the phrase “blows up and overfits.” The matrix gets large enough to cause memory errors. The model, if it fits at all, memorizes rare categories rather than generalizing from them.

Figure: (left) a 3-value column encodes cleanly, and every column is useful in every row; (right) a 5,000-value column produces a sparse matrix where 99.98% of entries are zero and most weights are estimated from fewer than five examples.

The threshold is lower than you think

The conventional wisdom says “use OHE for low-cardinality columns.” In practice, “low” means fewer than 20 or 30 values, not fewer than 200. The problem is not purely about memory. It is about the ratio of examples to categories.

A column with 100 distinct values in a dataset with 100,000 rows averages 1,000 examples per category. That is learnable. The same column in a dataset with 500 rows averages 5 examples per category — barely enough to estimate a mean, let alone a stable weight under regularization. The cardinality threshold that matters is relative to dataset size, not absolute. If n_categories / n_rows is larger than roughly 0.01, start thinking about alternatives.
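One way to make that check routine is a small report per column. This is a sketch: the function name and the min_rows cutoff are illustrative choices, not a standard API.

```python
import pandas as pd

def cardinality_report(col: pd.Series, min_rows: int = 5) -> dict:
    """Summarize how thinly a categorical column spreads its rows."""
    counts = col.value_counts()  # rows per distinct value
    return {
        "n_rows": len(col),
        "n_categories": int(counts.size),
        "ratio": counts.size / len(col),
        "mean_rows_per_category": len(col) / counts.size,
        # Categories with fewer than min_rows examples (hypothetical cutoff).
        "rare_categories": int((counts < min_rows).sum()),
    }

# The two cases from the text: 100 categories over 100,000 rows and over 500 rows.
big = cardinality_report(pd.Series([i % 100 for i in range(100_000)]))
small = cardinality_report(pd.Series([i % 100 for i in range(500)]))

print(big["ratio"], big["mean_rows_per_category"])
print(small["ratio"], small["mean_rows_per_category"])
```

On real data the rare_categories count is usually the more telling number, since category frequencies tend to be far from uniform.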

Target encoding: let the label do the compression

Target encoding (also called mean encoding) replaces each category with the mean of the target variable within that category. For a binary classification problem, the zip code "94103" becomes the fraction of rows with zip "94103" where the target was 1. The entire column is compressed into a single float column.
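In pandas, the plain version is a one-line group-by. A minimal sketch on a made-up six-row frame (the column names are illustrative). Note that it averages over all rows, including the row being encoded, which is the leakage problem discussed shortly:

```python
import pandas as pd

# Hypothetical toy data: three zip codes and a binary target.
df = pd.DataFrame({
    "zip_code": ["94103", "94103", "94103", "10001", "10001", "60614"],
    "target":   [1, 0, 1, 0, 0, 1],
})

# Replace each zip code with the mean target of its group.
df["zip_te"] = df.groupby("zip_code")["target"].transform("mean")

print(df)
```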

This is powerful for exactly the reason OHE is weak: rare categories get a number that reflects what little data there is about them, and the encoding is dense — one column, no sparsity. The signal is direct: the model can immediately see that certain zips have high fraud rates or high conversion rates without needing to discover it implicitly through weights.

The risk is target leakage. If you compute the mean target over the entire training set and then use that to encode training features, the encoding for a given row is computed using that row’s own label. This is a form of circular reasoning. The model will learn to use a feature that is partly derived from the answer it is trying to predict. The fix is to compute target encodings with cross-validation, encoding each fold using means computed from the other folds, never letting a row see its own label during encoding.
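The cross-validated fix can be sketched in a few lines of pandas. The function name and the fallback for unseen categories (the mean of the other folds) are assumptions made for this example; libraries that implement target encoding differ in these details.

```python
import numpy as np
import pandas as pd

def oof_target_encode(cat: pd.Series, y: pd.Series,
                      n_splits: int = 5, seed: int = 0) -> pd.Series:
    """Out-of-fold mean encoding: each row is encoded from the other folds only."""
    rng = np.random.default_rng(seed)
    fold = rng.permutation(len(cat)) % n_splits  # random, near-equal folds
    encoded = pd.Series(np.nan, index=cat.index, dtype=float)
    for k in range(n_splits):
        held_out = fold == k
        # Category means computed without the held-out rows.
        means = y[~held_out].groupby(cat[~held_out]).mean()
        fallback = y[~held_out].mean()  # for categories absent from the other folds
        encoded.loc[held_out] = cat[held_out].map(means).fillna(fallback).to_numpy()
    return encoded

# Toy data: "b" appears once, so its own label must never reach its encoding.
cat = pd.Series(["a"] * 10 + ["b"])
y = pd.Series([1, 0] * 5 + [1])
encoded = oof_target_encode(cat, y)
print(encoded)
```

The single "b" row has target 1, yet its encoding comes from the other folds, so it cannot simply echo its own label. At inference time, new rows would be encoded with means computed over the full training set.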

A second risk is that rare categories still produce noisy means. The fix is Bayesian smoothing: blend the category mean toward the global mean, weighted by the count of examples. A zip with 3 examples gets a heavily smoothed estimate; a zip with 5,000 examples keeps its own mean almost exactly. CatBoost applies a closely related idea internally (ordered target statistics, which smooth toward a prior and compute each row’s encoding only from rows that precede it in a random permutation), which is why it can take high-cardinality categoricals without a separate encoding step.
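One common form of that blend is (n * category_mean + m * global_mean) / (n + m), where n is the category's row count and m is a pseudo-count. A minimal sketch follows. The value m = 20 is an arbitrary illustrative choice, and for brevity the function works on the full data rather than out of fold.

```python
import pandas as pd

def smoothed_target_means(cat: pd.Series, y: pd.Series, m: float = 20.0) -> pd.Series:
    """Per-category estimate: (n * category_mean + m * global_mean) / (n + m)."""
    global_mean = y.mean()
    stats = y.groupby(cat).agg(["sum", "count"])
    # n * category_mean is just the category's target sum.
    return (stats["sum"] + m * global_mean) / (stats["count"] + m)

# Toy data: a rare zip with 3 rows (all positive) and a common zip with 5,000 rows at a 10% rate.
cat = pd.Series(["rare"] * 3 + ["common"] * 5_000)
y = pd.Series([1] * 3 + [1] * 500 + [0] * 4_500)

smoothed = smoothed_target_means(cat, y)
print(smoothed)
```

The rare zip's raw mean of 1.0 is pulled most of the way toward the global rate, while the common zip stays essentially at its own 0.10.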

Feature hashing: cardinality without the matrix

Feature hashing (the hashing trick) maps each category value through a hash function to a fixed-size integer bucket, then uses that bucket index to set a 1 in a fixed-size binary vector. The output always has the same number of columns — say, 256 or 1,024 — regardless of how many distinct values exist in the column.

This is memory-safe and works on categories you have never seen before (new values just hash to some bucket). The cost is hash collisions: two distinct categories can hash to the same bucket and share a weight. When there are more distinct values than buckets, collisions are guaranteed: 12,000 zip codes hashed into 1,024 buckets put about 12 zips in each bucket on average. This is often tolerable, because the frequent categories dominate the weight their bucket learns, but the bucket count is a tuning knob; raise it if validation performance suffers. The weights that emerge are less interpretable than either OHE or target encoding (you cannot look at the model and say “zip 94103 has weight 0.37”), but the predictions are often just as good.
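A minimal sketch of the mechanism, using only the standard library's hashlib plus NumPy. The helper names are hypothetical. Python's built-in hash() is avoided because it is salted per process for strings. Production code would more likely use a library implementation such as scikit-learn's FeatureHasher, which also applies a sign to each entry.

```python
import hashlib
import numpy as np

def hash_bucket(value: str, n_buckets: int) -> int:
    """Map a category string to a stable bucket index in [0, n_buckets)."""
    digest = hashlib.blake2b(value.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "little") % n_buckets

def hash_encode(values, n_buckets: int = 1024) -> np.ndarray:
    """Fixed-width binary matrix: one 1 per row, at the hashed bucket."""
    out = np.zeros((len(values), n_buckets), dtype=np.uint8)
    for row, value in enumerate(values):
        out[row, hash_bucket(str(value), n_buckets)] = 1
    return out

# 12,000 synthetic zip-like strings hashed into 1,024 buckets.
zips = [f"{z:05d}" for z in range(12_000)]
buckets = [hash_bucket(z, 1024) for z in zips]
n_used = len(set(buckets))

encoded = hash_encode(zips[:100], n_buckets=1024)
print(encoded.shape)           # width is fixed by n_buckets, not by cardinality
print(n_used, "buckets used")  # compare with 12,000 distinct values
```

Comparing the number of buckets used with the number of distinct values is a quick way to see how much weight sharing a given bucket count implies.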

Hashing is the right choice when you want a robust, drop-in replacement for OHE on high-cardinality columns with no risk of the encoding failing on new values at inference time. E-commerce product IDs, page URLs, and user agent strings are natural fits.

Embeddings: the continuous geometry of categories

The deepest solution is to learn an embedding — a dense vector representation of each category, typically 4 to 32 dimensions, that is trained jointly with the rest of the model. In a neural network, this means adding an embedding layer: the model looks up a row of a learned weight matrix indexed by the category integer, and that row is treated as the feature vector for that category.
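The lookup itself is simple enough to sketch in NumPy. The table here is random and is not trained; in a real network it would be a trainable parameter inside a deep learning framework's embedding layer. The sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_categories, dim = 10_000, 8  # illustrative sizes

# Embedding table: one row per category (random stand-in for learned weights).
table = rng.normal(scale=0.1, size=(n_categories, dim))

ids = np.array([3, 9_999, 3, 42])  # integer category codes for four rows
vectors = table[ids]               # the lookup: shape (4, 8)

# Equivalent to one-hot times the table, without materializing the one-hot matrix.
onehot = np.zeros((len(ids), n_categories))
onehot[np.arange(len(ids)), ids] = 1.0
same = np.allclose(vectors, onehot @ table)

print(vectors.shape, same)
```

Seen this way, an embedding layer is a one-hot encoding followed by a linear layer, computed by indexing so the sparse matrix never has to exist.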

The resulting embeddings can be remarkable. Train a word embedding model and you discover that the vector for "king" minus "man" plus "woman" is close to "queen". Train an embedding on zip codes and zips that behave alike with respect to the target, which often means nearby zips, tend to cluster together in the learned space, even though the model was never told anything about geography. The embedding has extracted latent structure of the category from the data.

This is why large recommendation systems, such as those at Netflix, Amazon, and Spotify, lean so heavily on embeddings. User IDs and item IDs, with millions of distinct values, cannot practically be one-hot encoded. They are embedded. The embedding vectors are the features, and they compress an enormous discrete space into a compact, geometrically meaningful real-valued space.

The catch is that learning good embeddings requires substantial data. A category that appears in fewer than a few hundred rows will have a poorly trained embedding vector. For moderate-cardinality columns with lots of data, embeddings are the most powerful option. For sparse categoricals with rare values, target encoding with smoothing tends to outperform them.

When trees get a free pass

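It is worth pausing on the fact that gradient-boosted tree models (XGBoost, LightGBM, CatBoost) occupy a separate conceptual category here. Trees do not compute dot products or weighted sums. They partition the feature space with threshold comparisons. A tree therefore tolerates integer-encoded categories far better than a linear model does: it never multiplies the code by a weight, so the code’s magnitude carries no meaning. The ordering is not entirely harmless, though. On an integer column a split has the form code <= t, so isolating a single category takes two splits, and which groups of categories are easy to separate depends on the arbitrary order of the codes. A direct test such as zip == 94103 is available only when the library knows the column is categorical.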

LightGBM and CatBoost have explicit support for categorical features, and recent XGBoost releases have added it as well. Tell the library which columns are categorical and it handles them internally: CatBoost with ordered target statistics, a leakage-aware relative of smoothed target encoding, and LightGBM by sorting the categories by their gradient statistics at each split and searching for the best partition. For tree-based models, one-hot encoding is often unnecessary, and the sparse matrix it produces is sometimes actively harmful (it makes the tree consider many near-useless splits on nearly-empty columns). The correct advice for trees is: use the library’s native categorical support and let it decide how to encode internally.
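On the data side, declaring a column categorical is usually a dtype change. A minimal pandas sketch with made-up column names; the model-fitting step is left as comments because it depends on which library is installed.

```python
import pandas as pd

# Hypothetical frame with one categorical and one numeric feature.
df = pd.DataFrame({
    "zip_code": ["94103", "10001", "94103", "60614"],
    "amount": [12.5, 40.0, 7.25, 19.0],
})

# Mark the column as categorical instead of one-hot encoding it.
df["zip_code"] = df["zip_code"].astype("category")

print(df.dtypes)
print(df["zip_code"].cat.categories.tolist())

# LightGBM's default categorical_feature="auto" picks up pandas "category" columns.
# CatBoost instead takes the column names or indices via its cat_features argument.
```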

The OHE problem is most acute for linear models (including logistic regression) and neural networks: models that need every input to be numeric and that learn a separate weight for each input column.

The decision in practice

Here is how a senior practitioner actually works through this:

If the column has fewer than 20 or so distinct values and the training set is large enough that every category appears many times, one-hot encode it. It is clean, interpretable, and the downstream model can reason about individual categories transparently.

If the column has hundreds to low thousands of values, use target encoding with cross-validation and Bayesian smoothing. One float column, dense, directly encodes the relationship between category and label.

If the column has many thousands of values and you need something that will not break on unseen values at inference time, use feature hashing. Start with 512 or 1,024 buckets and increase the count if collisions hurt validation performance.

If you are building a neural network on top of a high-cardinality ID column (users, products, pages), use an embedding layer and let the model learn the representation jointly.

If you are using a gradient-boosted tree library, pass the column as categorical and let the library do it.
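The rules above can be written down as a small helper. This is a sketch rather than a definitive policy: the function name, the model labels, and the 1,000 and 5,000 cutoffs (standing in for “high-cardinality” and “many thousands”) are assumptions made for the example, while the 30-value limit and the 100-rows-per-category floor follow the thresholds discussed earlier.

```python
def suggest_encoding(n_categories: int, n_rows: int,
                     model: str = "linear",
                     needs_unseen_values: bool = False) -> str:
    """Rule-of-thumb encoder choice; thresholds are illustrative."""
    if model == "gbdt":
        return "native categorical support"
    if model == "neural_net" and n_categories > 1_000:  # assumed cutoff
        return "embedding layer"
    # Low cardinality and at least ~100 rows per category (ratio <= 0.01).
    if n_categories <= 30 and n_rows / n_categories >= 100:
        return "one-hot"
    if n_categories > 5_000 and needs_unseen_values:    # assumed cutoff
        return "feature hashing"
    return "target encoding (cross-validated, smoothed)"

print(suggest_encoding(5, 10_000))
print(suggest_encoding(800, 100_000))
print(suggest_encoding(12_000, 180_000, needs_unseen_values=True))
print(suggest_encoding(1_000_000, 50_000_000, model="neural_net"))
print(suggest_encoding(12_000, 180_000, model="gbdt"))
```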

The underlying principle is that encoding is a form of compression. OHE is lossless, but at high cardinality it produces a matrix too large and too sparse to be useful. Every alternative trades some interpretability or guarantees for a representation that is compact enough for the model to actually learn from. The right trade depends on your model class, your data size, and your cardinality.

The thing worth internalizing

One-hot encoding is not wrong. It is a correct and elegant solution to the problem of representing discrete identity without imposing false ordering. The mistake is applying it by default to every categorical column without asking what the cardinality is.

The encoding choice is not a preprocessing detail. It determines the shape of the feature space your model operates in. A feature space with 12,000 sparse binary dimensions is not “the same information” as a feature space with one dense float derived from target means. The information content is similar in theory; the learnability is completely different in practice. Sparse high-dimensional spaces are difficult to generalize in, which is the entire content of the phrase “curse of dimensionality.”

The team that blew up their feature matrix with zip codes eventually switched to target encoding with 5-fold cross-validation. The 12,000 zip-code columns collapsed into one, and the matrix dropped from 14,000 columns to about 2,000. Training time went from hours to minutes. And because the encoding directly captured the relationship between geography and the label, the model’s accuracy improved. The feature they had buried under 12,000 sparse columns turned out to be one of the three most predictive signals in the dataset. They had been encoding it in a way that made it invisible to the model.

That is the curse of high cardinality: not that the information does not exist, but that a naive encoding hides it behind a wall of zeros.