datarekha

Feature engineering & encoding

On tabular data the model is a commodity — the features are the edge. Encoding categoricals, scaling numerics, interactions and aggregations, all wired through a leak-proof Pipeline.

8 min read Beginner Machine Learning Lesson 3 of 33

What you'll learn

  • How to encode categoricals — one-hot, ordinal, and target encoding (and when each fits)
  • When scaling matters (linear/distance models) and when it doesn't (trees)
  • Wiring transforms through a ColumnTransformer + Pipeline so nothing leaks

Here’s the open secret of tabular machine learning: the algorithm is rarely what wins. Swap XGBoost for LightGBM and your score barely moves. But change how you represent the data — how you encode a category, whether you scale, what features you build — and the score can jump. Survey the people who actually win Kaggle and the refrain is the same: feature engineering, not algorithm choice, is what separates the top tabular solutions. The model is a commodity. The features are the edge.

Encoding categoricals — the daily decision

Models eat numbers, but data is full of categories (city, product, device). Three ways to turn a category into numbers, and the right one depends on cardinality and the model:

  • One-hot — one 0/1 column per category. Perfect for low cardinality (device ∈ {ios, android, web}). For high cardinality it explodes the feature count into a sparse mess.
  • Ordinal — map each category to an integer. Only correct when the category is genuinely ordered (size ∈ {S, M, L}). For unordered categories it invents a false ranking a linear model takes literally.
  • Target encoding — replace each category with the mean target for that category. Compact and powerful for high cardinality (zip_code, user_id) — but it peeks at the label, so it must be fit inside cross-validation or it leaks.
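The three encodings above can be sketched with pandas and scikit-learn. This is a minimal illustration, not a definitive recipe: the toy table, its column names, and the helper `target_encode_oof` are hypothetical. The target encoding is computed out-of-fold by hand, so each row's value comes only from rows in other folds and never from its own label.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy table, invented for illustration.
df = pd.DataFrame({
    "device": ["ios", "android", "web", "ios", "android", "web"],
    "size": ["S", "M", "L", "M", "S", "L"],
    "zip_code": ["10001", "10001", "94105", "94105", "10001", "60601"],
    "churned": [1, 0, 1, 1, 0, 0],
})

# One-hot: one 0/1 column per category (fine for low cardinality).
onehot = OneHotEncoder(handle_unknown="ignore")
device_onehot = onehot.fit_transform(df[["device"]]).toarray()

# Ordinal: pass the order explicitly so S < M < L is what gets encoded.
ordinal = OrdinalEncoder(categories=[["S", "M", "L"]])
size_ordinal = ordinal.fit_transform(df[["size"]]).ravel()


def target_encode_oof(cat, y, n_splits=3, seed=0):
    """Out-of-fold mean-target encoding (hypothetical helper).

    Each row is encoded with category means computed on the other folds;
    categories unseen in those folds fall back to the folds' overall mean.
    """
    encoded = pd.Series(np.nan, index=cat.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(cat):
        fit_cat, fit_y = cat.iloc[fit_idx], y.iloc[fit_idx]
        means = fit_y.groupby(fit_cat).mean()
        prior = fit_y.mean()
        encoded.iloc[enc_idx] = (
            cat.iloc[enc_idx].map(means).fillna(prior).to_numpy()
        )
    return encoded


zip_te = target_encode_oof(df["zip_code"], df["churned"])
```

On real data a smoothed variant (shrinking rare categories toward the overall mean) is a common refinement; it is left out here to keep the sketch short.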

Scaling numerics — matters, except when it doesn’t

Should you standardize features to mean-0, variance-1? It depends entirely on the model:

  • Linear models, SVMs, k-NN, neural nets: yes. Distances, regularization penalties, and gradient steps all depend on feature magnitudes, so an unscaled income (tens of thousands) drowns out age (tens). Standardize or min-max scale.
  • Trees and tree ensembles (random forest, XGBoost): no effect. A tree splits on a threshold (income > 50000?), so multiplying a feature by 1000 changes nothing. Don’t bother.
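A small experiment can make both bullets concrete. The sketch below uses synthetic data invented for illustration (the label depends only on age, and income is pure noise), so the exact accuracies are not meaningful; only the contrast between the two model families is.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for this sketch): income is noise,
# and the label is decided by age alone.
rng = np.random.default_rng(0)
n = 300
income = rng.normal(50_000, 15_000, size=n)
age = rng.normal(40, 12, size=n)
X = np.column_stack([income, age])
y = (age > 40).astype(int)

X_train, X_test = X[:200], X[200:]
y_train, y_test = y[:200], y[200:]

# k-NN on raw features: distances are dominated by income.
knn_raw = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
knn_raw_acc = knn_raw.score(X_test, y_test)

# k-NN on standardized features: the scaler is fit on the training rows only.
knn_std = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn_std.fit(X_train, y_train)
knn_std_acc = knn_std.score(X_test, y_test)

# Tree: multiplying every feature by 1000 leaves the predictions unchanged.
tree_raw = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(X_train * 1000, y_train)
same_tree_preds = np.array_equal(
    tree_raw.predict(X_test), tree_scaled.predict(X_test * 1000)
)

print(f"k-NN raw: {knn_raw_acc:.2f}  k-NN standardized: {knn_std_acc:.2f}")
print(f"tree predictions identical after rescaling: {same_tree_preds}")
```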

The right way: ColumnTransformer + Pipeline

The professional pattern applies different transforms to different columns and bundles everything into one Pipeline. During cross-validation the transforms are re-fit on each fold's training portion only, and at inference the already-fitted transforms are applied unchanged to new data. This is what makes feature engineering leak-proof.
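A minimal sketch of that pattern with scikit-learn follows. The table, its column names, and the rule that generates the label are assumptions made up for illustration; the `ColumnTransformer`, `Pipeline`, and `cross_val_score` calls are the standard scikit-learn ones.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic table (hypothetical columns, invented for this sketch).
rng = np.random.default_rng(42)
n = 200
X = pd.DataFrame({
    "income": rng.normal(50_000, 15_000, size=n),
    "age": rng.integers(18, 70, size=n).astype(float),
    "device": rng.choice(["ios", "android", "web"], size=n),
})
y = (X["age"] + 10 * (X["device"] == "ios") > 45).astype(int)

# Different transforms for different columns.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["income", "age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["device"]),
])

# Preprocessing and model travel together as one estimator.
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# In each fold the scaler and encoder are fit on that fold's training rows only.
scores = cross_val_score(model, X, y, cv=5)

# At inference the fitted transforms are reused; an unseen category
# encodes as all zeros because of handle_unknown="ignore".
model.fit(X, y)
new_rows = pd.DataFrame(
    {"income": [62_000.0], "age": [33.0], "device": ["smart_tv"]}
)
pred = model.predict(new_rows)
```

Because the whole pipeline is a single estimator, passing it to `cross_val_score` is enough to keep validation rows out of every fitted statistic.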

Beyond encoding — the features that actually win

Encoding gets the data in; these build the edge:

  • Aggregations (groupby). “Average spend per user,” “transactions in the last 7 days.” Group-level statistics are often the most powerful tabular feature family, because they inject context a single row can’t see.
  • Interactions. debt / income, price × quantity. Ratios and products capture relationships the model would otherwise have to discover.
  • Datetime features. From a timestamp, extract day-of-week, hour, is-weekend, days-since-event. A raw timestamp is nearly useless; its parts are gold.
  • Binning & transforms. Bucketing a skewed feature, or log(income), can linearize a relationship for linear models.
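The four feature families above can be sketched in a few lines of pandas. The transactions table and every column name here are hypothetical, chosen only to mirror the examples in the list. Note that the per-user average is computed over the whole toy table for brevity; on a real problem it should be computed from training rows (or past rows) only.

```python
import numpy as np
import pandas as pd

# Hypothetical transactions table, invented for illustration.
tx = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "timestamp": pd.to_datetime([
        "2024-01-05 09:30", "2024-01-06 18:00",
        "2024-01-06 12:15", "2024-01-07 23:45", "2024-01-08 08:00",
    ]),
    "price": [10.0, 20.0, 5.0, 8.0, 12.0],
    "quantity": [2, 1, 4, 1, 3],
    "income": [40_000.0, 40_000.0, 90_000.0, 90_000.0, 90_000.0],
    "debt": [10_000.0, 10_000.0, 9_000.0, 9_000.0, 9_000.0],
})

# Interactions: products and ratios.
tx["amount"] = tx["price"] * tx["quantity"]
tx["debt_to_income"] = tx["debt"] / tx["income"]

# Aggregation: average spend per user, broadcast back to each row.
tx["user_avg_amount"] = tx.groupby("user_id")["amount"].transform("mean")

# Datetime parts (dayofweek: Monday=0 ... Sunday=6).
tx["day_of_week"] = tx["timestamp"].dt.dayofweek
tx["hour"] = tx["timestamp"].dt.hour
tx["is_weekend"] = tx["day_of_week"] >= 5

# Transforms and binning for a skewed feature.
tx["log_income"] = np.log1p(tx["income"])
tx["income_bucket"] = pd.cut(
    tx["income"],
    bins=[0, 50_000, 100_000, np.inf],
    labels=["low", "mid", "high"],
)
```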

Quick check

Q1. You have a `zip_code` feature with 5,000 distinct values, feeding a logistic regression. Best encoding?
Q2. Does standardizing (scaling) features help a random forest?
Q3. Why wrap encoders and scalers in a Pipeline / ColumnTransformer instead of transforming the whole dataset up front?

Next

Now that your features are honest, you need to read whether the model is underfitting or overfitting them — bias–variance & learning curves — and validate it without leaking, in train/val/test & CV.

Practice this in an interview

What is feature engineering, and can you walk through how you'd engineer features to improve a model?

Feature engineering is creating, transforming, or selecting input variables so a model can capture patterns more easily. Common techniques include scaling, encoding categoricals, binning, interaction and ratio features, date/time decomposition, and domain-derived aggregates. It often matters more than the choice of algorithm because models can only learn from the signal present in their inputs.

What is feature leakage and how do you prevent it during feature engineering and preprocessing?

Feature leakage occurs when information from the test set or from the future leaks into training features, making a model appear more accurate than it will be in production. It arises from fitting preprocessing steps on the full dataset, using post-event information as a predictor, or computing aggregates across train-test boundaries. Prevention requires strict pipeline discipline: all stateful transformations must be fit only on training data.
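As one illustration of that discipline, here is a sketch of computing an aggregate feature without crossing the train-test boundary. The two small tables and their column names are hypothetical. The per-user statistic is learned from the training rows and then only looked up for the test rows.

```python
import pandas as pd

# Hypothetical train/test split, invented for illustration.
train = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "amount": [10.0, 30.0, 5.0, 15.0],
})
test = pd.DataFrame({
    "user_id": [1, 3],
    "amount": [100.0, 50.0],
})

# Fit: the statistics come from training rows only.
user_mean = train.groupby("user_id")["amount"].mean()
global_mean = train["amount"].mean()

# Transform: look the statistics up for both sets. Test amounts never
# influence the feature, and unseen users fall back to the training mean.
train["user_avg_amount"] = train["user_id"].map(user_mean)
test["user_avg_amount"] = test["user_id"].map(user_mean).fillna(global_mean)
```

User 3 never appears in training, so it receives the training-set mean rather than a value derived from its own test row.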

How do you handle high-cardinality categorical features in machine learning?

One-hot encoding becomes impractical when a categorical feature has hundreds or thousands of unique levels, producing a sparse matrix that slows training and causes overfitting on rare categories. Better approaches include target encoding with smoothing, frequency encoding, hashing, learned embeddings, or grouping rare categories into an 'Other' bucket, each with different tradeoffs on leakage risk and information retention.
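Three of those alternatives fit in a few lines of standard-library Python. This is a sketch under stated assumptions: the city list, the `MIN_COUNT` threshold, and the helper `hash_bucket` are hypothetical, and in a real pipeline the counts would be computed from training data only.

```python
import hashlib
from collections import Counter

# Hypothetical categorical column, invented for illustration.
cities = ["paris", "paris", "lyon", "paris", "nice", "lyon", "brest"]

counts = Counter(cities)
n_rows = len(cities)

# Frequency encoding: replace each category with its share of rows.
freq_encoded = [counts[c] / n_rows for c in cities]

# Rare-category grouping: anything seen fewer than MIN_COUNT times
# is folded into a single 'Other' bucket.
MIN_COUNT = 2
grouped = [c if counts[c] >= MIN_COUNT else "Other" for c in cities]


def hash_bucket(value, n_buckets=8):
    """Map a category to one of n_buckets via a stable hash.

    Python's built-in hash() is salted per process for strings,
    so a hashlib digest is used to keep buckets reproducible.
    """
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


# Hashing: fixed output width, no vocabulary to store, collisions possible.
hashed = [hash_bucket(c) for c in cities]
```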

Which models require feature scaling and which don't, and why?

Distance-based and gradient-based models (KNN, K-means, SVM, PCA, linear/logistic regression with regularization, neural networks) need scaling because they're sensitive to feature magnitudes. Tree-based models (decision trees, random forests, gradient boosting) are scale-invariant because they split on thresholds per feature. Standardization and min-max scaling are the usual choices, fit on training data only.
