datarekha
Machine Learning Medium Asked at AirbnbAsked at UberAsked at AmazonAsked at Booking.com

How do you handle high-cardinality categorical features in machine learning?

The short answer

One-hot encoding becomes impractical when a categorical feature has hundreds or thousands of unique levels, producing a sparse matrix that slows training and causes overfitting on rare categories. Better approaches include target encoding with smoothing, frequency encoding, hashing, learned embeddings, or grouping rare categories into an 'Other' bucket, each with different tradeoffs on leakage risk and information retention.

How to think about it

A feature like user_id, product_sku, or postal_code can have millions of unique values. One-hot encoding produces millions of columns; the vast majority are zeros, most categories appear rarely, and the encoder cannot handle unseen categories at inference time.

Strategy 1 — Frequency / count encoding

Replace each category with how many times it appears in the training set (or its relative frequency). Captures popularity without creating new columns. No leakage risk from labels.

freq = X_train["city"].value_counts()
X_train["city_freq"] = X_train["city"].map(freq).fillna(0)
X_test["city_freq"]  = X_test["city"].map(freq).fillna(0)

Strategy 2 — Target encoding with smoothing

Replace each category with the smoothed mean of the target for that category (see dedicated question). Highly predictive but requires strict CV-inside-pipeline discipline to avoid leakage.

Strategy 3 — Hashing (feature hashing)

Hash each category string into a fixed-size integer space (e.g., 256 buckets). Handles unseen categories automatically; requires no pre-fitting. Collisions mean distinct categories may share a bucket, introducing noise.

from sklearn.feature_extraction import FeatureHasher

hasher = FeatureHasher(n_features=256, input_type="string")
X_hashed = hasher.transform(X_train[["city"]].astype(str).to_dict("records"))

Strategy 4 — Grouping rare categories

Replace categories below a frequency threshold with a single "_rare_" token. Reduces the cardinality before any downstream encoding.

threshold = 50
counts = X_train["category"].value_counts()
rare = counts[counts < threshold].index
X_train["category"] = X_train["category"].where(
    ~X_train["category"].isin(rare), other="_rare_"
)

Strategy 5 — Learned embeddings

In neural networks, treat the high-cardinality feature as an embedding lookup. The embedding dimension is typically min(50, (n_categories + 1) // 2). Used extensively in recommendation systems and entity embedding papers.

Decision guide

ApproachWorks with unseen?Leakage riskBest model fit
One-hotNo (fix: handle_unknown)NoneLinear, low cardinality
FrequencyYes (map to 0)NoneAny
Target encodingYes (global mean fallback)High if misusedTrees, linear
HashingYesNoneAny, large scale
EmbeddingsYesNoneNeural networks
Learn it properly Data leakage

Keep practising

All Machine Learning questions

Explore further

Skip to content