How do decision trees and gradient boosting libraries handle categorical features natively, and when is label encoding safe?
sklearn trees require numeric input and treat label-encoded integers as ordinal, which imposes a false ordering. One-hot encoding is correct but expensive for high-cardinality features. XGBoost (v2+) and LightGBM support native categorical splits that find the optimal binary partition of categories without ordinal assumptions.
How to think about it
Why label encoding causes problems
Assigning integers 0, 1, 2, … to categories implicitly tells a tree that category_2 is “between” category_1 and category_3. The tree will split at thresholds like x <= 1.5, which only makes sense if the ordering is meaningful (e.g., low/medium/high). For nominal categories (colour, city, product type) this is semantically wrong and forces the tree to use more splits than necessary.
Correct for nominal categories. Creates binary indicator columns; the tree can partition categories into any subset. Downside: d new columns for a feature with d categories, which causes dimensionality explosion for features with thousands of values.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
pipe = Pipeline([
("ohe", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
("rf", RandomForestClassifier(n_estimators=200))
])
Native categorical support in LightGBM
LightGBM finds the optimal binary split among all 2^(k-1) partitions of k categories using a Fisher (2k) approximation based on gradient statistics — no ordinal assumption.
import lightgbm as lgb
train_data = lgb.Dataset(X_train, label=y_train,
categorical_feature=["city", "product_type"])
params = {
"objective": "binary",
"num_leaves": 63,
"cat_smooth": 10, # smoothing to handle low-count categories
"min_data_per_group": 5
}
model = lgb.train(params, train_data, num_boost_round=300)
XGBoost (v2+)
import xgboost as xgb
# Mark categoricals as pandas Categorical dtype
X_train["city"] = X_train["city"].astype("category")
model = xgb.XGBClassifier(
enable_categorical=True,
tree_method="hist" # required for categorical support
)
model.fit(X_train, y_train)
When label encoding is safe: only when the ordinal relationship is genuine (e.g., education level: high school=0, bachelor=1, master=2). In that case label encoding is not just safe but preferable — it preserves the ordering in a single column.