Feature scaling: the step KNN and gradient descent never forgive

A team building a customer-churn model feeds in two features: account balance (ranging from 0 to 150,000 dollars) and tenure in years (ranging from 0 to 12). They run KNN — k-nearest neighbors, a classifier that predicts a label by polling the k training points geometrically closest to the query point — with k set to 5, and get an accuracy they are happy with. What they do not notice is that tenure is contributing essentially zero information to any prediction. The dollar column ate the geometry whole.

The damage is invisible in the final number. That is what makes this failure mode so persistent.

Distance is a ruler, and a ruler cares about units

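When KNN computes the Euclidean distance between two customers (say, customer A with a balance of 80,000 and tenure of 3 years, and customer B with a balance of 80,001 and tenure of 12 years), the gap in balance contributes 1 squared, or 1, to the squared distance, while the gap in tenure contributes 9 squared, or 81. The distance comes to about 9.06. On paper tenure dominates that particular sum, but a distance of 9 is what nine dollars of balance would produce, and customers routinely differ by thousands of dollars. Customer B is nine years more tenured, yet the model treats B as almost identical to A: measured against the spread of balances, the entire 12-year tenure range is negligible.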
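The arithmetic can be checked with a minimal sketch. Customers A and B are the ones from the example; customer C is a hypothetical third customer, added only for comparison, who matches A on tenure but holds 500 dollars more.

```python
import math

def euclidean(p, q):
    """Plain Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Each customer is (balance in dollars, tenure in years).
a = (80_000, 3)
b = (80_001, 12)   # one more dollar, nine more years of tenure
c = (80_500, 3)    # hypothetical: same tenure as A, 500 dollars more

dist_ab = euclidean(a, b)   # sqrt(1**2 + 9**2) = sqrt(82)
dist_ac = euclidean(a, c)   # sqrt(500**2 + 0**2) = 500

print(f"A-B: {dist_ab:.2f}")
print(f"A-C: {dist_ac:.2f}")
```

In raw units B sits roughly 55 times closer to A than C does, even though C matches A on tenure exactly and B is nine years away.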

The mathematics is not wrong. The geometry is faithfully computed. The mistake was upstream: treating a dollar and a year as commensurable units, as if they live on the same natural axis. They do not. The choice of units is arbitrary — you could measure balance in cents or tenure in months — and that arbitrary choice is silently baked into the model’s sense of “nearby.”
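A small sketch of how the unit choice alone can change the answer. The query and the two candidates, X and Y, are hypothetical values chosen so that the flip is easy to see; nothing here comes from a real dataset.

```python
import math

def nearest(query, candidates):
    """Return the name of the candidate closest to the query (Euclidean)."""
    return min(candidates, key=lambda name: math.dist(query, candidates[name]))

def years_to_months(point):
    """Re-express (dollars, years) as (dollars, months)."""
    balance, years = point
    return (balance, years * 12)

# Hypothetical customers as (balance in dollars, tenure in years).
query = (50_000, 5)
candidates = {
    "X": (50_030, 5),   # 30 dollars away, same tenure
    "Y": (50_000, 8),   # same balance, 3 years away
}

nearest_in_years = nearest(query, candidates)
nearest_in_months = nearest(
    years_to_months(query),
    {name: years_to_months(p) for name, p in candidates.items()},
)

print(nearest_in_years, nearest_in_months)
```

With tenure in years, Y is nearest (distance 3 against 30). With tenure in months, the same gap becomes 36 and X wins. The customers did not change; only the ruler did.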

This is not a KNN quirk. It is a structural property of any algorithm whose loss or prediction involves distance or inner products without first normalizing the space. Support vector machines (which find the maximum-margin hyperplane separating classes) are sensitive to scale because the margin is measured in feature space. K-means clustering (which assigns points to the nearest centroid and updates centroids iteratively) suffers the same distortion. Principal component analysis (which finds directions of maximum variance) will orient its first principal component almost entirely along the high-variance, high-scale feature. Neural networks using gradient descent will see wildly asymmetric gradients: a small step in weight space produces a large change in loss for the big-numbered feature and a negligible change for the small-numbered one.
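The PCA claim can be illustrated for the two-feature case with a short sketch. It uses the closed-form major-axis angle of a 2x2 covariance matrix rather than a general eigensolver, and the five customers are made-up values, so treat it as an illustration rather than a recipe.

```python
import math
import statistics

def first_pc_direction(xs, ys):
    """Unit vector of the first principal component for two features.

    Uses the closed-form major-axis angle of a 2x2 covariance matrix.
    """
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    n = len(xs)
    var_x = sum((x - mx) ** 2 for x in xs) / n
    var_y = sum((y - my) ** 2 for y in ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    theta = 0.5 * math.atan2(2 * cov, var_x - var_y)
    return (math.cos(theta), math.sin(theta))

def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# Hypothetical customers.
balance = [20_000, 45_000, 80_000, 120_000, 150_000]   # dollars
tenure = [1, 10, 3, 12, 6]                             # years

raw_direction = first_pc_direction(balance, tenure)
scaled_direction = first_pc_direction(standardize(balance), standardize(tenure))

print("raw:   ", raw_direction)
print("scaled:", scaled_direction)
```

On the raw data the first component points almost exactly along the balance axis. After standardization both features load equally, which is what happens for any two standardized features that are positively correlated.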

Same two customers, same two features. Before scaling, the balance axis dwarfs tenure, so both points appear almost identical. After scaling, the real nine-year tenure gap becomes visible distance.

Two tools, different philosophies

Standardization — subtracting the column mean and dividing by the standard deviation — maps a feature to mean 0 and standard deviation 1. The result has no intrinsic bounds; a value three standard deviations above the mean remains at 3.0 after scaling. Outliers survive but no longer dominate purely because of their unit.
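A minimal sketch of standardization using only the standard library. The sample values are hypothetical, and the sketch uses the population standard deviation; the sample standard deviation is an equally common convention.

```python
import statistics

def standardize(values):
    """Map values to mean 0 and standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)   # population std; an assumption of this sketch
    return [(v - mean) / std for v in values]

# Hypothetical customers.
balances = [20_000, 45_000, 80_000, 120_000, 150_000]   # dollars
tenures = [1, 10, 3, 12, 6]                             # years

z_balance = standardize(balances)
z_tenure = standardize(tenures)

print([round(z, 2) for z in z_balance])
print([round(z, 2) for z in z_tenure])
```

Both columns now share the same unit, standard deviations from their own mean, so neither wins a distance computation merely by being measured in dollars.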

Min-max normalization (subtracting the minimum and dividing by the range) compresses every value to the interval 0 to 1. It is sensitive to outliers in a different way: because the minimum and maximum alone set the scale, one very large value shrinks everything else toward zero. If a feature has a genuine hard ceiling (pixel brightness runs from 0 to 255 by construction), min-max makes that semantic boundary explicit. If the distribution is heavy-tailed, standardization is usually the safer of the two, although the mean and standard deviation are themselves pulled by extreme values.
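The outlier effect is easy to see in a short sketch. The values are hypothetical, and the function assumes the column is not constant (otherwise the range is zero).

```python
def min_max(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

typical = [10, 20, 30, 40, 50]
with_outlier = typical + [5_000]   # one extreme value joins the column

scaled_typical = min_max(typical)
scaled_outlier = min_max(with_outlier)

print(scaled_typical)
print([round(v, 4) for v in scaled_outlier])
```

Without the outlier the five values spread evenly across 0 to 1. With it, the same five values are squeezed below 0.01 and the outlier alone occupies the top of the range.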

Neither is universally correct. The decision is a question about the distribution of the feature and the geometry of the model, not about which name sounds more sophisticated.

The models that care and the models that do not

The honest list: KNN, SVM, k-means, PCA, logistic regression trained with gradient descent, neural networks of any depth, ridge and lasso regression (which add a penalty on weight magnitude, making scale matter directly), and any Euclidean-distance-based anomaly detector. These all see the raw feature values through a lens that is sensitive to magnitude.

Trees do not care, and this is worth understanding rather than memorizing. A decision tree (and by extension random forests and gradient-boosted trees such as XGBoost) makes splits by choosing a threshold: “is feature X above or below value t?” The threshold adapts to whatever scale X is on. Doubling all values of X just doubles t, and the same rows land on each side of the split. The information content of the split is identical. Rescaling by a positive constant is a monotonic transformation, and trees are invariant to monotonic transformations of individual features.
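The invariance can be seen with a one-split "stump". This sketch picks its threshold by counting misclassifications, a simplification of the impurity criteria that real tree libraries use, and the data are hypothetical.

```python
def best_threshold(xs, labels):
    """Pick the split 'x > t' with the fewest misclassifications."""
    points = sorted(set(xs))
    # Candidate thresholds: midpoints between consecutive distinct values.
    candidates = [(lo + hi) / 2 for lo, hi in zip(points, points[1:])]

    def errors(t):
        return sum((x > t) != bool(label) for x, label in zip(xs, labels))

    return min(candidates, key=errors)

def predict(xs, t):
    """Predict 1 for rows above the threshold, 0 otherwise."""
    return [int(x > t) for x in xs]

# Hypothetical balances in dollars, and whether each customer stayed.
balance_dollars = [500, 1_200, 3_000, 8_000, 15_000, 40_000]
stayed = [0, 0, 0, 1, 1, 1]

# The same column expressed in cents.
balance_cents = [100 * b for b in balance_dollars]

t_dollars = best_threshold(balance_dollars, stayed)
t_cents = best_threshold(balance_cents, stayed)

print(t_dollars, t_cents)
print(predict(balance_dollars, t_dollars) == predict(balance_cents, t_cents))
```

The threshold moves from 5,500 dollars to 550,000 cents, and every row falls on the same side of the split in both versions.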

This immunity is one reason tree ensembles dominate tabular competition benchmarks. It is one fewer thing to get wrong.

Gradient descent and the loss landscape

For neural networks the scaling argument is not about distance but about curvature. When features live on different scales, the loss surface — the hypersurface of loss as a function of model weights — becomes elongated along some weight dimensions and compressed along others. Gradient descent (iteratively stepping in the direction of steepest descent of the loss) will oscillate in the steep narrow directions while crawling in the shallow flat ones.

Visualize a bowl that is circular in cross-section: gradient descent walks straight to the minimum. Now squash that bowl into an ellipse. At most points the steepest-descent direction now points across the narrow axis of the ellipse rather than toward the center, and the optimizer bounces between the steep walls repeatedly before converging. That elongation is what unscaled features produce.
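The bowl picture can be reproduced with a two-weight quadratic loss. This is a sketch under simplifying assumptions: the loss is a pure quadratic, and the curvatures 1 and 100 are hypothetical stand-ins for the imbalance that unscaled features create. The learning rates are chosen to be stable for each bowl.

```python
import math

def gd_steps(curvatures, lr, start=(1.0, 1.0), tol=1e-6, max_steps=100_000):
    """Run gradient descent on L(w) = 0.5 * sum(c_i * w_i**2).

    Returns the number of steps until the weights are within tol of the
    minimum at the origin (or max_steps if that never happens).
    """
    w = list(start)
    for step in range(1, max_steps + 1):
        grad = [c * wi for c, wi in zip(curvatures, w)]
        w = [wi - lr * g for wi, g in zip(w, grad)]
        if math.hypot(*w) < tol:
            return step
    return max_steps

# Circular bowl: equal curvature in both directions.
round_steps = gd_steps((1.0, 1.0), lr=0.5)

# Elongated bowl: one direction is 100 times steeper. The learning rate must
# stay below 2 / 100 to avoid diverging along the steep direction.
elongated_steps = gd_steps((1.0, 100.0), lr=0.019)

print(round_steps, elongated_steps)
```

The steep direction caps the learning rate, and the shallow direction then crawls at that capped rate. Along the steep direction each update overshoots the minimum and flips the sign of the weight, which is the bouncing described above.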

The symptom in practice is training that takes far more epochs than expected, or that requires an unusually small learning rate (the step size at each gradient update) to remain stable. Teams that add batch normalization (a layer that re-centers and rescales activations inside the network after each mini-batch) are implicitly solving this problem from the inside. But input scaling is still good hygiene; it ensures the first layer receives a well-conditioned signal before normalization layers have anything to work with.

Unscaled inputs produce an elongated loss bowl; gradient descent bounces off the steep walls. Scaling restores circular symmetry and the optimizer walks directly to the minimum.

The leakage trap that ruins production models

Here is where practitioners get burned even after they understand scaling. The correct procedure is:

1. Split the dataset into train and test (or train, validation, and test) first.
2. Fit the scaler (compute the mean and standard deviation, or the minimum and maximum) on the training set only.
3. Apply that fitted scaler to transform both train and test.
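The steps above can be sketched without any library. `SimpleStandardizer` is a toy class written for this illustration, not a real library API, and the balances are hypothetical. A real split would usually be randomized; a fixed slice keeps the example short.

```python
import statistics

class SimpleStandardizer:
    """Toy stand-in for a library scaler: learns a mean and std, then applies them."""

    def fit(self, values):
        self.mean_ = statistics.fmean(values)
        self.std_ = statistics.pstdev(values)
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.std_ for v in values]

# Hypothetical balances; the last row is a rare extreme value.
balances = [1_000, 2_000, 3_000, 4_000, 5_000, 6_000, 7_000, 250_000]

# Step 1: split first.
train, test = balances[:6], balances[6:]

# Step 2: fit on the training rows only.
scaler = SimpleStandardizer().fit(train)

# Step 3: apply the same fitted parameters to both sets.
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

# The mistake, for contrast: fitting on everything before splitting.
leaky_scaler = SimpleStandardizer().fit(balances)

print(scaler.mean_, leaky_scaler.mean_)
```

The correctly fitted mean is 3,500. The leaky one is 34,750, because the extreme test row has already shaped the scaler that the training data passes through.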

The leakage trap is fitting the scaler on the full dataset before splitting. When you do that, the test set’s statistics — its mean, its extreme values — leak into the training process. The scaler has seen the future. In practice the contamination is often small, but in some distributions (financial data with rare extreme events, medical data with a few outlier patients) it can meaningfully inflate test performance in a way that vanishes in production.

The deeper principle is that a scaler is a model component, not a preprocessing step that happens before modeling. It learns parameters from data. Any learned parameter must be learned only on training data. This is the same reason you do not impute missing values by computing the mean of the full dataset: the mean of the full dataset is not available at inference time when a new row arrives.

In scikit-learn (the standard Python ML library), the Pipeline object encapsulates this correctly: it fits all transformers on the training fold during cross-validation and applies the fitted transformers when scoring the validation fold. Teams that apply scalers outside the pipeline often do not realize they are leaking.
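A minimal sketch of that pattern, assuming scikit-learn is installed. The data are synthetic, generated with `make_classification`, and one column is inflated to mimic a dollar-scaled feature; this is not the churn dataset from the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data.
X, y = make_classification(n_samples=300, n_features=4, random_state=0)
X[:, 0] *= 10_000   # inflate one column to mimic a large-unit feature

# The scaler lives inside the pipeline, so each cross-validation split
# refits it on that split's training rows only.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```

Calling `StandardScaler().fit_transform(X)` on the full matrix and then cross-validating the classifier alone would run without error, which is why the leak is easy to miss.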

What the number tells you

After scaling, two features with wildly different original units now live in a shared geometric space where a movement of 1.0 along any axis means “one standard deviation of that feature” (for standardization) or “the full width of the observed range” (for min-max). The model can now meaningfully ask: which feature is actually more informative about the outcome, as opposed to which feature happens to be measured in larger units?

The answer is sometimes surprising. Tenure, a low-variance, small-scale feature, can turn out to be among the most powerful predictors of churn once the balance column stops shouting. Scaling is not just a technical hygiene step. It is the act of making the data tell you what it actually knows rather than what its measurement units happen to emphasize.

When to skip it anyway

Interpretability sometimes argues against scaling. If a stakeholder needs to understand a logistic regression coefficient as “each additional thousand dollars of balance reduces the odds of churn by X percent,” standardized coefficients are harder to explain. You can standardize during training and then back-transform the coefficients for reporting, but that adds a step.
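The back-transformation is short once written out. If each input was standardized as z = (x - mean) / std, a coefficient b on z corresponds to b / std on x, and the intercept absorbs the shift. The sketch below assumes a linear model on standardized inputs (for logistic regression, the output is the log-odds); the fitted values, means, and standard deviations are hypothetical.

```python
def to_raw_units(intercept, coefs, means, stds):
    """Convert coefficients fitted on standardized features to raw-unit coefficients."""
    raw_coefs = [b / s for b, s in zip(coefs, stds)]
    raw_intercept = intercept - sum(b * m / s for b, m, s in zip(coefs, means, stds))
    return raw_intercept, raw_coefs

# Hypothetical fitted model on standardized (balance, tenure).
intercept, coefs = -0.4, [-0.8, -1.5]
means, stds = [60_000.0, 5.0], [40_000.0, 3.0]

raw_intercept, raw_coefs = to_raw_units(intercept, coefs, means, stds)

def log_odds_standardized(x):
    z = [(xi - m) / s for xi, m, s in zip(x, means, stds)]
    return intercept + sum(b * zi for b, zi in zip(coefs, z))

def log_odds_raw(x):
    return raw_intercept + sum(b * xi for b, xi in zip(raw_coefs, x))

customer = [80_000.0, 3.0]   # dollars, years
print(log_odds_standardized(customer), log_odds_raw(customer))
print(raw_coefs[0] * 1_000)  # change in log-odds per additional thousand dollars
```

Both forms give the same prediction for any customer; only the units in which the coefficients are reported change.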

Scaling also does not fix collinearity (when two features are nearly linearly dependent, scaling does not change the fundamental ill-conditioning of the design matrix), does not solve class imbalance, and does not make a noisy feature informative. It is one tool. It solves one problem — the problem of a model being unable to see past the loudest-numbered column.
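The collinearity point can be checked directly: the correlation between two features is unchanged by standardizing them. In this sketch the second feature, a credit limit of roughly twice the balance, is a hypothetical column invented to create near-dependence.

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ss_x = sum((x - mx) ** 2 for x in xs)
    ss_y = sum((y - my) ** 2 for y in ys)
    return cov / (ss_x * ss_y) ** 0.5

def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# Hypothetical, nearly collinear features.
balance = [20_000, 45_000, 80_000, 120_000, 150_000]
credit_limit = [41_000, 89_500, 161_000, 239_000, 301_500]

r_raw = pearson(balance, credit_limit)
r_scaled = pearson(standardize(balance), standardize(credit_limit))

print(r_raw, r_scaled)
```

The two features are just as redundant after scaling as before; the near-dependence is a property of the data, not of the units.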

Get it right and you stop giving the dollar column a megaphone it did not earn. Get it wrong and the model quietly learns a distorted geometry, passes all your offline metrics with a plausible number, and then behaves strangely in production in ways that will take weeks to trace back to a scaler fitted on the wrong data.

The step is unglamorous. That is why it gets skipped. That is why it keeps biting people.