Machine Learning | Medium | Asked at: Airbnb, Uber, Amazon, Booking.com

How do you handle high-cardinality categorical features in machine learning?

For: Data Scientist, ML Engineer, AI / LLM Engineer

The short answer

One-hot encoding becomes impractical when a categorical feature has hundreds or thousands of unique levels, producing a sparse matrix that slows training and causes overfitting on rare categories. Better approaches include target encoding with smoothing, frequency encoding, hashing, learned embeddings, or grouping rare categories into an 'Other' bucket, each with different tradeoffs on leakage risk and information retention.

How to think about it

A feature like user_id, product_sku, or postal_code can have millions of unique values. One-hot encoding then produces millions of columns. Almost every entry is zero, most categories appear too rarely for the model to learn a reliable weight, and a default encoder has no column for a category it first sees at inference time.
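
To make the blow-up concrete, here is a minimal sketch on toy data. The 1,000 user ids and the "u_new" value are invented for illustration; only pandas is assumed.

```python
import pandas as pd

# Toy data (hypothetical): 1,000 rows, each with its own user id.
df = pd.DataFrame({"user_id": [f"u{i}" for i in range(1000)]})

# One column per distinct category.
one_hot = pd.get_dummies(df["user_id"])
n_rows, n_cols = one_hot.shape          # 1000 rows, 1000 columns

# Fraction of non-zero cells: exactly one per row.
density = one_hot.to_numpy().sum() / (n_rows * n_cols)

# A category never seen in training has no column of its own.
# Aligning it to the training columns leaves a row of zeros.
unseen = pd.get_dummies(pd.Series(["u_new"])).reindex(
    columns=one_hot.columns, fill_value=0
)
unseen_nonzero = int(unseen.to_numpy().sum())
```

With one column per id, only 0.1% of the cells carry information, and the unseen id is indistinguishable from "no category at all".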

Strategy 1 — Frequency / count encoding

Replace each category with how many times it appears in the training set (or its relative frequency). Captures popularity without creating new columns. No leakage risk from labels.

freq = X_train["city"].value_counts()
X_train["city_freq"] = X_train["city"].map(freq).fillna(0)
X_test["city_freq"]  = X_test["city"].map(freq).fillna(0)

Strategy 2 — Target encoding with smoothing

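Replace each category with the smoothed mean of the target for that category (see dedicated question). Smoothing shrinks the mean of small categories toward the global mean. Highly predictive, but the encoding must be fit inside each cross-validation fold (out-of-fold); otherwise each row's own label leaks into its feature.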
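
A minimal sketch of smoothed, out-of-fold target encoding, assuming pandas and NumPy. The helper names (smoothed_means, target_encode_oof), the smoothing weight m, and the toy data are illustrative choices, not a fixed recipe. The smoothed value for a category is (sum of targets + m * global mean) / (count + m).

```python
import numpy as np
import pandas as pd

def smoothed_means(cats, y, m):
    """Per-category target mean, shrunk toward the global mean with weight m."""
    prior = y.mean()
    stats = y.groupby(cats).agg(["sum", "count"])
    return (stats["sum"] + m * prior) / (stats["count"] + m), prior

def target_encode_oof(cats, y, m=10.0, n_splits=5, seed=0):
    """Encode each training row using statistics from the other folds only."""
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(cats)) % n_splits   # balanced random folds
    encoded = pd.Series(np.nan, index=cats.index, dtype=float)
    for k in range(n_splits):
        held_out = folds == k
        means, prior = smoothed_means(cats[~held_out], y[~held_out], m)
        # Categories absent from the other folds fall back to their global mean.
        encoded[held_out] = cats[held_out].map(means).fillna(prior).to_numpy()
    return encoded

# Toy training data (hypothetical).
train = pd.DataFrame({
    "city": ["a"] * 6 + ["b"] * 3 + ["c"],
    "y":    [1, 1, 1, 0, 1, 1, 0, 0, 1, 1],
})
train["city_te"] = target_encode_oof(train["city"], train["y"], m=2.0)

# For test data, fit once on the full training set; unseen cities get the prior.
means, prior = smoothed_means(train["city"], train["y"], m=2.0)
test_te = pd.Series(["a", "zzz"]).map(means).fillna(prior)
```

Here city "a" encodes to (5 + 2 * 0.7) / (6 + 2) = 0.8 on test data, and the unseen "zzz" falls back to the global mean 0.7. The single "c" row never sees its own label during training, which is the point of the out-of-fold loop.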

Strategy 3 — Hashing (feature hashing)

Hash each category string into a fixed-size integer space (e.g., 256 buckets). Handles unseen categories automatically; requires no pre-fitting. Collisions mean distinct categories may share a bucket, introducing noise.

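from sklearn.feature_extraction import FeatureHasher

hasher = FeatureHasher(n_features=256, input_type="string")
# input_type="string" expects one iterable of strings per row, so wrap each value in a list.
# (Passing dict records here would hash the column name "city" instead of its values.)
X_hashed = hasher.transform([[city] for city in X_train["city"].astype(str)])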
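
To show what the hashing step does underneath, here is a standard-library sketch of mapping a category to a bucket. The function name hash_bucket and the choice of SHA-256 are assumptions for illustration; this is not how FeatureHasher is implemented internally. Python's built-in hash() is avoided because string hashes are salted per process, so buckets would differ between training and serving.

```python
import hashlib

def hash_bucket(category: str, n_buckets: int = 256) -> int:
    """Map a category string to a stable bucket index in [0, n_buckets)."""
    digest = hashlib.sha256(category.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets

# No fitting step: any string, including one never seen before, gets a bucket.
paris_bucket = hash_bucket("Paris")
unseen_bucket = hash_bucket("a-city-not-in-training")

# With more categories than buckets, collisions are unavoidable.
cities = [f"city_{i}" for i in range(1000)]       # hypothetical category values
used_buckets = {hash_bucket(c) for c in cities}
n_collided = len(cities) - len(used_buckets)      # categories sharing a bucket
```

Increasing n_buckets reduces collisions at the cost of a wider feature space, which is the main knob to tune with this approach.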

Strategy 4 — Grouping rare categories

Replace categories below a frequency threshold with a single "_rare_" token. Reduces the cardinality before any downstream encoding.

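threshold = 50
counts = X_train["category"].value_counts()
# Keep only categories seen at least `threshold` times in training.
frequent = counts[counts >= threshold].index
X_train["category"] = X_train["category"].where(
    X_train["category"].isin(frequent), other="_rare_"
)
# Apply the same mapping at inference; unseen categories (and missing values) also become "_rare_".
X_test["category"] = X_test["category"].where(
    X_test["category"].isin(frequent), other="_rare_"
)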

Strategy 5 — Learned embeddings

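In neural networks, treat the high-cardinality feature as an embedding lookup: each category maps to an integer index, and the index selects a row of a trainable matrix. A common rule of thumb for the embedding dimension is min(50, (n_categories + 1) // 2). Reserve one index for unknown categories so values first seen at inference still resolve to a vector. Embeddings are used extensively in recommendation systems and in the entity-embedding literature.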
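
A minimal sketch of the lookup mechanics, assuming NumPy only. The table here is randomly initialized and never trained; in a real model it would be a trainable layer (for example torch.nn.Embedding) updated by backpropagation. The SKU values, the "<unk>" token, and the helper names are hypothetical.

```python
import numpy as np

def embedding_dim(n_categories: int) -> int:
    """Rule-of-thumb embedding size quoted above."""
    return min(50, (n_categories + 1) // 2)

# Build the vocabulary from training data; index 0 is reserved for unknowns.
train_skus = ["sku_1", "sku_2", "sku_3", "sku_1", "sku_4", "sku_5"]
vocab = {"<unk>": 0}
for sku in train_skus:
    vocab.setdefault(sku, len(vocab))

n_categories = len(vocab) - 1            # 5 distinct training categories
dim = embedding_dim(n_categories)        # (5 + 1) // 2 = 3

# Stand-in for the trainable weight matrix: one row per index.
rng = np.random.default_rng(0)
table = rng.normal(scale=0.01, size=(len(vocab), dim))

def lookup(values):
    """Return one embedding row per value; unseen values use the <unk> row."""
    idx = np.array([vocab.get(v, 0) for v in values])
    return table[idx]

vectors = lookup(["sku_2", "sku_new"])   # shape (2, 3)
```

The second row of vectors is the shared "<unk>" embedding, which is what lets the model accept categories it never saw during training.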

Decision guide

Approach	Works with unseen?	Leakage risk	Best model fit
One-hot	No (fix: `handle_unknown`)	None	Linear, low cardinality
Frequency	Yes (map to 0)	None	Any
Target encoding	Yes (global mean fallback)	High if misused	Trees, linear
Hashing	Yes	None	Any, large scale
Embeddings	Yes (reserve an unknown index)	None	Neural networks
