datarekha
Machine Learning Medium Asked at AmazonAsked at AirbnbAsked at UberAsked at Google

What is target encoding, when is it better than one-hot encoding, and how does it cause data leakage?

The short answer

Target encoding replaces each category with the mean of the target variable for that category, reducing dimensionality on high-cardinality features. Naively computed on the full training set, the encoded value for a row contains information from that row's own label, introducing leakage that makes cross-validation scores look better than real-world performance.

How to think about it

Target encoding (also called mean encoding) is the go-to for high-cardinality categorical features like postal codes, product IDs, or user IDs where one-hot encoding would produce thousands of sparse columns.

The idea

For each category level, replace the category string with the mean (or any aggregate) of the target across all training rows that share that level:

city = "Mumbai"  →  mean(sale_price | city = "Mumbai")

A linear model can now use a single numeric column instead of thousands of binary columns, and trees can split on it efficiently.

Why naive mean encoding leaks

If you compute mean(y | category) on the full training set and then use that column to train a model, each row’s encoded value was computed using that row’s own label. The model learns a feature that is partially derived from the target it is trying to predict — a direct form of leakage. Cross-validation scores will be inflated because the leak is present in both folds.

The correct approach: encode inside CV folds

import pandas as pd
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from category_encoders import TargetEncoder
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge

# category_encoders handles leave-one-out or k-fold internally
pipe = Pipeline([
    ("te", TargetEncoder(cols=["city"], smoothing=10)),
    ("reg", Ridge()),
])

scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring="r2")

The TargetEncoder inside the pipeline is re-fit on each training fold during cross_val_score, so no fold’s test rows contaminate their own encoding.

Smoothing

For categories with few observations, the category mean is noisy. Smoothing blends the category mean toward the global mean proportionally to sample size:

encoded = (n_cat * mean_cat + k * mean_global) / (n_cat + k)

where k is a smoothing factor. This prevents rare categories from getting extreme encoded values.

When to prefer target encoding over one-hot

  • Feature has more than ~15–20 levels and the model is linear or gradient-boosted.
  • Memory or inference latency is constrained (fewer columns).
  • The categorical variable has a genuine relationship with the target (required for the encoding to be meaningful).
Learn it properly Data leakage

Keep practising

All Machine Learning questions

Explore further

Skip to content