What is feature engineering, and can you walk through how you'd engineer features to improve a model?

Feature engineering is creating, transforming, or selecting input variables so a model can capture patterns more easily. Common techniques include scaling, encoding categoricals, binning, interaction and ratio features, date/time decomposition, and domain-derived aggregates. It often matters more than the choice of algorithm because models can only learn from the signal present in their inputs.

Which models require feature scaling and which don't, and why?

Models that rely on distances, variances, or penalized and gradient-based optimization (KNN, K-means, SVM, PCA, regularized linear/logistic regression, neural networks) need scaling because they're sensitive to feature magnitudes. Tree-based models (decision trees, random forests, gradient boosting) are insensitive to scale because each split compares a single feature against a threshold, so any monotonic rescaling leaves the partitions unchanged. Standardization and min-max scaling are the usual choices; fit them on training data only.

What is feature leakage and how do you prevent it during feature engineering and preprocessing?

Feature leakage occurs when information from the test set or from the future leaks into training features, making a model appear more accurate than it will be in production. It arises from fitting preprocessing steps on the full dataset, using post-event information as a predictor, or computing aggregates across train-test boundaries. Prevention requires strict pipeline discipline: all stateful transformations must be fit only on training data.
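A minimal sketch of the "fit only on training data" rule, assuming scikit-learn's `StandardScaler` and a synthetic income-like column (the data and variable names are illustrative, not from the original):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.gamma(2.0, 30000.0, size=(200, 1))   # synthetic, skewed "income" column

# Split first, then fit the stateful transform on the training rows only.
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)
scaler = StandardScaler().fit(X_train)

# Both splits are transformed with the statistics learned from training data.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# The leaky variant, shown only for contrast: its mean and variance
# already include the test rows.
leaky_scaler = StandardScaler().fit(X)
```

The test split will generally not have exactly zero mean after scaling, and that is expected: it is being measured against statistics it did not contribute to.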

How do you handle high-cardinality categorical features in machine learning?

One-hot encoding becomes impractical when a categorical feature has hundreds or thousands of unique levels: it produces a wide, sparse matrix that slows training and invites overfitting on rare categories. Better approaches include target encoding with smoothing, frequency encoding, hashing, learned embeddings, or grouping rare categories into an 'Other' bucket, each with different tradeoffs on leakage risk and information retention.
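Two of the cheaper options above, frequency encoding and rare-category grouping, can be sketched in pandas. The zip codes, the `min_count` threshold, and the `group_rare` helper are hypothetical choices for illustration:

```python
import pandas as pd

train = pd.DataFrame({"zip_code": ["10001"] * 4 + ["94105"] * 3 + ["60601", "73301"]})
test = pd.DataFrame({"zip_code": ["10001", "60601", "99999"]})

# Frequency encoding: replace each level with its share of the TRAINING rows.
freq = train["zip_code"].value_counts(normalize=True)
train["zip_freq"] = train["zip_code"].map(freq)
test["zip_freq"] = test["zip_code"].map(freq).fillna(0.0)   # unseen level -> 0

# Rare grouping: levels seen fewer than min_count times in training become "Other".
min_count = 3   # hypothetical threshold
counts = train["zip_code"].value_counts()
frequent = counts[counts >= min_count].index

def group_rare(series):
    """Keep frequent levels, collapse everything else into 'Other'."""
    return series.where(series.isin(frequent), "Other")

train["zip_grouped"] = group_rare(train["zip_code"])
test["zip_grouped"] = group_rare(test["zip_code"])
```

Neither encoding looks at the label, so the leakage risk is lower than with target encoding, at the cost of discarding whatever target signal the individual levels carry.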

Feature engineering & encoding — Machine Learning

Here’s the open secret of tabular machine learning: the algorithm is rarely what wins. Swap XGBoost for LightGBM and your score barely moves. But change how you represent the data — how you encode a category, whether you scale, what features you build — and the score can jump. Survey the people who actually win Kaggle and the refrain is the same: feature engineering, not algorithm choice, is what separates the top tabular solutions. The model is a commodity. The features are the edge.

Encoding categoricals — the daily decision

Models eat numbers, but data is full of categories (city, product, device). Three ways to turn a category into numbers, and the right one depends on cardinality and the model:

One-hot — one 0/1 column per category. Perfect for low cardinality (device ∈ {ios, android, web}). For high cardinality it explodes the feature count into a sparse mess.
Ordinal — map each category to an integer. Only correct when the category is genuinely ordered (size ∈ {S, M, L}). For unordered categories it invents a false ranking a linear model takes literally.
Target encoding — replace each category with the mean target for that category. Compact and powerful for high cardinality (zip_code, user_id) — but it peeks at the label, so it must be fit inside cross-validation or it leaks.

Here is the same city column under all three encodings — watch how differently each one hands the category to a model (target y = [1, 0, 1, 0]):

row	`city`	one-hot → `(Paris, Tokyo, Cairo)`	ordinal	target = mean of `y`
1	Paris	`1, 0, 0`	0	1.00
2	Tokyo	`0, 1, 0`	1	0.00
3	Paris	`1, 0, 0`	0	1.00
4	Cairo	`0, 0, 1`	2	0.00

One-hot keeps categories independent — no false order, but one column each. Ordinal is a single column, yet it tells a linear model Cairo (2) > Paris (0), a ranking that isn’t real. Target is one column carrying real signal — and exactly why it must be cross-fit: those 1.00 / 0.00 values were computed from the very labels you are trying to predict, so fitting them on data a fold will later be tested on leaks the answer.
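The table can be reproduced in a few lines of pandas, followed by a cross-fit version of the target column. This is a sketch under assumptions: `out_of_fold_target_encode` is a hypothetical helper, the two-fold split is chosen only because the toy frame has four rows, and unseen categories fall back to the training-fold mean.

```python
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({"city": ["Paris", "Tokyo", "Paris", "Cairo"], "y": [1, 0, 1, 0]})

# The three encodings from the table.
one_hot = pd.get_dummies(df["city"], dtype=int)[["Paris", "Tokyo", "Cairo"]]
ordinal = df["city"].map({"Paris": 0, "Tokyo": 1, "Cairo": 2})
naive_target = df.groupby("city")["y"].transform("mean")   # uses every row's own label

def out_of_fold_target_encode(frame, col, target, n_splits=2):
    """Encode each row using category means computed on the OTHER folds only."""
    encoded = pd.Series(index=frame.index, dtype=float)
    for train_idx, test_idx in KFold(n_splits=n_splits).split(frame):
        train = frame.iloc[train_idx]
        means = train.groupby(col)[target].mean()
        fallback = train[target].mean()   # used for categories unseen in this fold
        values = frame.iloc[test_idx][col].map(means).fillna(fallback)
        encoded.iloc[test_idx] = values.to_numpy()
    return encoded

oof_target = out_of_fold_target_encode(df, "city", "y")
```

On this toy frame the naive column reproduces the labels exactly for Tokyo and Cairo, while the out-of-fold column assigns them the fallback mean because no other row shares their category. scikit-learn 1.3 and later include a `TargetEncoder` whose `fit_transform` applies a similar cross-fitting scheme, which is usually preferable to a hand-rolled helper.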

Scaling numerics — matters, except when it doesn’t

Should you standardize features to mean-0, variance-1? It depends entirely on the model:

Linear models, SVMs, k-NN, neural nets — yes. They compare feature magnitudes directly, so an unscaled income (tens of thousands) drowns out age (tens). Standardize or min-max scale.
Trees and tree ensembles (random forest, XGBoost) — no effect. A tree splits on a threshold (income > 50000?), so multiplying a feature by 1000 changes nothing. Don’t bother.
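Both claims can be checked on synthetic data. In this sketch (all names and numbers are illustrative assumptions) only `age` carries signal and `income` is noise on a much larger scale; a decision tree is compared before and after multiplying a feature by 1000, and k-NN is compared with and without standardization.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 600
age = rng.integers(18, 70, n)                # tens
income = rng.integers(20_000, 150_000, n)    # tens of thousands, pure noise here
X = np.column_stack([age, income]).astype(float)
y = (age < 35).astype(int)                   # only age determines the label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Tree: multiply the age column by 1000 and compare predictions.
scale = np.array([1000.0, 1.0])
tree_raw = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
tree_rescaled = DecisionTreeClassifier(random_state=0).fit(X_tr * scale, y_tr)
same_tree_preds = np.array_equal(
    tree_raw.predict(X_te), tree_rescaled.predict(X_te * scale)
)

# k-NN: raw distances are dominated by income; standardizing restores age.
knn_raw = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)).fit(X_tr, y_tr)
knn_raw_acc = knn_raw.score(X_te, y_te)
knn_scaled_acc = knn_scaled.score(X_te, y_te)

print(f"tree predictions unchanged by rescaling: {same_tree_preds}")
print(f"k-NN accuracy raw: {knn_raw_acc:.3f}, standardized: {knn_scaled_acc:.3f}")
```

The rescaled tree simply learns a threshold 1000 times larger, so the rows on each side of the split are the same. The unscaled k-NN picks neighbors almost entirely by income, which tells it nothing about the label.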

The right way: ColumnTransformer + Pipeline

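The professional pattern applies different transforms to different columns and bundles everything into one Pipeline, so the exact same transformation is re-fit inside each CV fold and re-applied at inference. This is what keeps preprocessing from leaking: no encoder or scaler ever sees the rows it is scored on. It cannot rescue a feature that itself encodes post-event information, so that check stays with you.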

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 600
df = pd.DataFrame({
    "device": rng.choice(["ios","android","web"], n),
    "age":    rng.integers(18, 70, n),
    "income": rng.gamma(2, 30000, n).round(0),
})
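# Cast each condition to int so the three add up to a 0-3 count;
# adding boolean Series directly with + behaves like a logical OR.
signal = (df.device == "ios").astype(int) + (df.age < 35).astype(int) + (df.income > 70000).astype(int)
df["buy"] = (signal + rng.normal(0, 0.8, n) > 1.2).astype(int)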
X, y = df[["device","age","income"]], df["buy"]

# Different transforms per column, all inside ONE pipeline:
pre = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["device"]),
    ("num", StandardScaler(), ["age", "income"]),   # scaling helps the linear model
])
pipe = Pipeline([("pre", pre), ("clf", LogisticRegression(max_iter=500))])

# The transforms are re-fit inside each fold — no leakage.
scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
print(f"5-fold AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
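To see what the pipeline protects against, here is a hedged sketch that target-encodes a high-cardinality ID against purely random labels. Any AUC above 0.5 is then leakage by construction. `MeanTargetEncoder` is a hypothetical minimal transformer written for this illustration, not a scikit-learn class, and the data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
user_id = rng.permutation(np.repeat(np.arange(500), 4))   # 500 IDs, 4 rows each
X = pd.DataFrame({"user_id": user_id})
y = pd.Series(rng.integers(0, 2, len(X)))                 # random labels: no real signal

class MeanTargetEncoder(TransformerMixin, BaseEstimator):
    """Hypothetical single-column target encoder (no smoothing)."""

    def fit(self, X, y):
        keys = pd.Series(np.asarray(X).ravel())
        target = pd.Series(np.asarray(y, dtype=float))
        self.global_mean_ = target.mean()
        self.means_ = target.groupby(keys).mean()
        return self

    def transform(self, X):
        keys = pd.Series(np.asarray(X).ravel())
        encoded = keys.map(self.means_).fillna(self.global_mean_)
        return encoded.to_numpy().reshape(-1, 1)

# Leaky: encode on the full dataset first, then cross-validate.
leaky_feature = y.groupby(X["user_id"]).transform("mean").to_frame("te")
leaky_auc = cross_val_score(
    LogisticRegression(), leaky_feature, y, cv=5, scoring="roc_auc"
).mean()

# Honest: the encoder lives in the pipeline and is re-fit on each training fold.
pipe = Pipeline([("te", MeanTargetEncoder()), ("clf", LogisticRegression())])
honest_auc = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean()

print(f"encode-then-CV AUC: {leaky_auc:.3f}   pipeline AUC: {honest_auc:.3f}")
```

The encode-then-CV score comes out well above chance even though the labels are noise, because each row's encoding contains its own label. The pipeline version hovers around 0.5, which is the truthful answer for this data.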

Beyond encoding — the features that actually win

Encoding gets the data in; these build the edge:

Aggregations (groupby). “Average spend per user,” “transactions in the last 7 days.” Group-level statistics are the single most powerful tabular feature family — they inject context a single row can’t see.
Interactions. debt / income, price × quantity. Ratios and products capture relationships the model would otherwise have to discover.
Datetime features. From a timestamp, extract day-of-week, hour, is-weekend, days-since-event. A raw timestamp is nearly useless; its parts are gold.
Binning & transforms. Bucketing a skewed feature, or log(income), can linearize a relationship for linear models.
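Each of the four families above takes about one line of pandas. The sketch below uses a made-up five-row transactions frame; the column names, bin edges, and labels are assumptions for illustration.

```python
import numpy as np
import pandas as pd

tx = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "ts": pd.to_datetime([
        "2024-01-06 09:30", "2024-01-08 18:00", "2024-01-07 12:00",
        "2024-01-09 23:15", "2024-01-10 08:45",
    ]),
    "price": [10.0, 20.0, 5.0, 15.0, 40.0],
    "quantity": [1, 2, 4, 2, 1],
})

# Interaction: a product the model would otherwise have to discover.
tx["amount"] = tx["price"] * tx["quantity"]

# Aggregation: a group-level statistic broadcast back onto each row.
tx["user_avg_amount"] = tx.groupby("user_id")["amount"].transform("mean")

# Datetime parts (Monday is 0, so 5 and 6 are the weekend).
tx["day_of_week"] = tx["ts"].dt.dayofweek
tx["hour"] = tx["ts"].dt.hour
tx["is_weekend"] = tx["day_of_week"] >= 5

# Transform and binning for a skewed numeric.
tx["log_amount"] = np.log1p(tx["amount"])
tx["amount_bin"] = pd.cut(
    tx["amount"], bins=[0, 15, 35, np.inf], labels=["low", "mid", "high"]
)
```

One caution that follows from the leakage discussion: a group aggregate computed over all rows mixes training and test information, so in a real workflow it should be computed from training rows (or from rows earlier in time) and then joined onto the rest.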

In one breath

On tabular data the algorithm is a commodity; the features are the edge — how you represent the data moves the score more than the model choice.
Encode categoricals by cardinality + model: one-hot (low cardinality), ordinal (only genuinely ordered categories), target encoding (high cardinality — compact, but peeks at the label, so it must be cross-fit).
Scale numerics for magnitude-comparing models (linear, SVM, k-NN, neural nets); skip it for trees, whose threshold splits are scale-invariant.
Wire every transform through a ColumnTransformer + Pipeline so encoders/scalers re-fit inside each CV fold — fitting on the full data before CV leaks and inflates the score (target encoding is the worst offender).
The features that actually win: aggregations (groupby), interactions (ratios/products), datetime parts, and binning/log transforms — and for modern trees, prefer those over heavy encoding.

Quick check

Q1. You have a `zip_code` feature with 5,000 distinct values, feeding a logistic regression. Best encoding?

Q2. Does standardizing (scaling) features help a random forest?

Q3. Why wrap encoders and scalers in a Pipeline / ColumnTransformer instead of transforming the whole dataset up front?

Now that your features are honest, you need to read whether the model is underfitting or overfitting them — bias–variance & learning curves — and validate it without leaking, in train/val/test & CV.
