How do you handle high-cardinality categorical features in machine learning?
One-hot encoding becomes impractical when a categorical feature has hundreds or thousands of unique levels, producing a sparse matrix that slows training and causes overfitting on rare categories. Better approaches include target encoding with smoothing, frequency encoding, hashing, learned embeddings, or grouping rare categories into an 'Other' bucket, each with different tradeoffs on leakage risk and information retention.
How to think about it
A feature like user_id, product_sku, or postal_code can have millions of unique values. One-hot encoding produces millions of columns; the vast majority are zeros, most categories appear rarely, and the encoder cannot handle unseen categories at inference time.
Strategy 1 — Frequency / count encoding
Replace each category with how many times it appears in the training set (or its relative frequency). Captures popularity without creating new columns. No leakage risk from labels.
freq = X_train["city"].value_counts()
X_train["city_freq"] = X_train["city"].map(freq).fillna(0)
X_test["city_freq"] = X_test["city"].map(freq).fillna(0)
Strategy 2 — Target encoding with smoothing
Replace each category with the smoothed mean of the target for that category (see dedicated question). Highly predictive but requires strict CV-inside-pipeline discipline to avoid leakage.
Strategy 3 — Hashing (feature hashing)
Hash each category string into a fixed-size integer space (e.g., 256 buckets). Handles unseen categories automatically; requires no pre-fitting. Collisions mean distinct categories may share a bucket, introducing noise.
from sklearn.feature_extraction import FeatureHasher
hasher = FeatureHasher(n_features=256, input_type="string")
X_hashed = hasher.transform(X_train[["city"]].astype(str).to_dict("records"))
Strategy 4 — Grouping rare categories
Replace categories below a frequency threshold with a single "_rare_" token. Reduces the cardinality before any downstream encoding.
threshold = 50
counts = X_train["category"].value_counts()
rare = counts[counts < threshold].index
X_train["category"] = X_train["category"].where(
~X_train["category"].isin(rare), other="_rare_"
)
Strategy 5 — Learned embeddings
In neural networks, treat the high-cardinality feature as an embedding lookup. The embedding dimension is typically min(50, (n_categories + 1) // 2). Used extensively in recommendation systems and entity embedding papers.
Decision guide
| Approach | Works with unseen? | Leakage risk | Best model fit |
|---|---|---|---|
| One-hot | No (fix: handle_unknown) | None | Linear, low cardinality |
| Frequency | Yes (map to 0) | None | Any |
| Target encoding | Yes (global mean fallback) | High if misused | Trees, linear |
| Hashing | Yes | None | Any, large scale |
| Embeddings | Yes | None | Neural networks |