What is a z-score and what is standardization used for in data science?
A z-score expresses how many standard deviations an observation is from the mean of its distribution, converting raw values to a common unitless scale. Standardization — subtracting the mean and dividing by the standard deviation — is essential before algorithms that depend on distances or regularization penalties, because it prevents features with large numeric ranges from dominating those with small ranges.
How to think about it
Define the z-score, link it to the standard normal, then discuss the practical algorithms where standardization is mandatory vs unnecessary — the latter is often overlooked in interviews.
Definition
z = (x − μ) / σ
For a sample, replace μ with x̄ and σ with s (sample SD). The z-score is dimensionless: regardless of whether X is in dollars, seconds, or kilograms, z is in “standard deviation units.” A z-score of 2 means the observation is two standard deviations above the mean.
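As a minimal sketch of the sample version, assuming NumPy is available (the `heights_cm` values are made up for illustration), the z-scores can be computed directly, and a change of units leaves them unchanged:

```python
import numpy as np

# Hypothetical sample, for illustration only.
heights_cm = np.array([160.0, 165.0, 170.0, 175.0, 180.0])

x_bar = heights_cm.mean()
s = heights_cm.std(ddof=1)  # ddof=1 gives the sample SD (n - 1 denominator)

z = (heights_cm - x_bar) / s

# Rescaling the units (cm -> m) leaves the z-scores unchanged.
heights_m = heights_cm / 100.0
z_m = (heights_m - heights_m.mean()) / heights_m.std(ddof=1)
```

Note that NumPy's `std` defaults to the population SD (`ddof=0`), so the sample version has to be requested explicitly.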
Connecting to the standard normal
If X ~ N(μ, σ²), then Z = (X − μ) / σ ~ N(0, 1). This transformation allows you to use a single standard normal table for any Gaussian-distributed variable. For example, P(X > x) = P(Z > z) = 1 − Φ(z), where z = (x − μ) / σ and Φ is the standard normal CDF.
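The tail probability above can be sketched with the standard library alone. The function names and the N(100, 15²) example are illustrative assumptions, not part of the original text:

```python
import math

def standard_normal_cdf(z: float) -> float:
    """Phi(z), computed from the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def upper_tail_prob(x: float, mu: float, sigma: float) -> float:
    """P(X > x) for X ~ N(mu, sigma^2), via the z-score of x."""
    z = (x - mu) / sigma
    return 1.0 - standard_normal_cdf(z)

# Hypothetical example: X ~ N(100, 15^2), probability of exceeding 130 (z = 2).
p = upper_tail_prob(130.0, mu=100.0, sigma=15.0)
```

Here `p` is roughly 0.0228, the familiar upper-tail area beyond two standard deviations.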
Why algorithms need standardization
Distance-based algorithms (k-NN, k-means, SVM with RBF kernel): Distance metrics aggregate differences across all features. A feature ranging from 0 to 1,000,000 (e.g., income) will completely overwhelm one ranging from 0 to 1 (e.g., a binary flag). After standardization, each feature contributes roughly equally to distances.
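A small sketch of this effect, assuming NumPy and a made-up two-column dataset (income in dollars and a binary flag), compares how much of the squared Euclidean distance between two rows comes from the flag before and after standardization:

```python
import numpy as np

# Hypothetical rows: [income in dollars, binary flag].
X = np.array([
    [50_000.0, 0.0],
    [50_500.0, 1.0],
    [90_000.0, 0.0],
    [20_000.0, 1.0],
])

def euclidean(a, b):
    return float(np.linalg.norm(a - b))

# Raw scale: the flag difference (1) is tiny next to the income difference (500).
d_raw = euclidean(X[0], X[1])

# Standardize each column with its own mean and SD.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
d_std = euclidean(Z[0], Z[1])

# Share of the squared distance that comes from the flag column.
flag_share_raw = (X[0, 1] - X[1, 1]) ** 2 / d_raw ** 2
flag_share_std = (Z[0, 1] - Z[1, 1]) ** 2 / d_std ** 2
```

On the raw data the flag contributes a negligible share of the distance between the first two rows. After standardization it accounts for nearly all of it, since those rows have similar incomes but different flags.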
Regularized models (lasso, ridge, elastic net): The regularization penalty shrinks coefficients toward zero according to their magnitude. If features are on different scales, the penalty is applied unequally: a feature with a small numeric range needs a large coefficient to have any effect, so it is penalized more heavily. Standardization puts the coefficients on a comparable footing, so the penalty treats features evenly.
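One way to see the unequal penalty is the simplest possible case: ridge regression with a single feature and no intercept, where the ridge coefficient equals the least-squares coefficient multiplied by (x·x) / (x·x + λ). The sketch below assumes NumPy, and the data and the helper `ridge_shrinkage` are hypothetical. It expresses the same feature in meters and in kilometers under the same λ:

```python
import numpy as np

def ridge_shrinkage(x, lam):
    """Factor by which a single-feature, no-intercept ridge fit shrinks the
    least-squares coefficient: (x.x) / (x.x + lam)."""
    xx = float(x @ x)
    return xx / (xx + lam)

# Hypothetical distances, first in meters and then the same data in kilometers.
x_m = np.array([-1500.0, -500.0, 500.0, 1500.0])
x_km = x_m / 1000.0

lam = 1.0  # same penalty strength in both fits (illustrative value)

shrink_m = ridge_shrinkage(x_m, lam)    # close to 1: almost no shrinkage
shrink_km = ridge_shrinkage(x_km, lam)  # 5/6: noticeably more shrinkage

# After standardization both versions are identical, so the penalty acts the same.
z_m = (x_m - x_m.mean()) / x_m.std()
z_km = (x_km - x_km.mean()) / x_km.std()
```

The fitted model changes with the choice of units even though the information in the feature is identical. Standardizing first removes that dependence.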
Gradient descent convergence: Differing feature scales produce elongated loss surfaces, which slow convergence and make training sensitive to the learning rate. Standardized features yield a more isotropic loss surface, on which gradient descent converges faster.
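For least-squares problems, the elongation can be quantified by the condition number of XᵀX, which is proportional to the Hessian of the loss. The following is a rough sketch assuming NumPy, synthetic data, and no intercept column. The helper name is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical design matrix: one feature in the tens of thousands, one near unit scale.
n = 200
X = np.column_stack([
    rng.normal(50_000.0, 15_000.0, size=n),
    rng.normal(0.5, 0.2, size=n),
])

def hessian_condition_number(A):
    """Condition number of A^T A / n, proportional to the least-squares Hessian."""
    H = A.T @ A / A.shape[0]
    return float(np.linalg.cond(H))

# Standardize each column with its own mean and SD.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

cond_raw = hessian_condition_number(X)
cond_std = hessian_condition_number(Z)
```

A condition number near 1 corresponds to nearly circular contours. The raw matrix here gives a value many orders of magnitude larger, which is what forces a very small learning rate.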
When standardization is not needed
- Tree-based models (decision trees, random forests, gradient boosting) split on individual feature thresholds, so any rescaling that preserves the ordering of values produces the same splits. They are scale-invariant by construction.
- Models with interpretable coefficients where you want to preserve the original scale for communication.
- Features that are already on a natural comparable scale.
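The scale invariance of tree splits can be checked with a toy single-split "stump". This is a hand-rolled illustration assuming NumPy, not any library's tree implementation, and the data are made up:

```python
import numpy as np

def best_stump_partition(x, y):
    """Return the boolean left/right assignment of the single threshold split
    on x that minimizes the total squared error of y. A toy stand-in for one
    tree node."""
    order = np.argsort(x)
    best_sse, best_k = np.inf, None
    for k in range(1, len(x)):  # the first k sorted points go left
        left, right = y[order[:k]], y[order[k:]]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse, best_k = sse, k
    goes_left = np.zeros(len(x), dtype=bool)
    goes_left[order[:best_k]] = True
    return goes_left

# Hypothetical data: a feature in dollars and a numeric target.
x = np.array([20_000.0, 35_000.0, 50_000.0, 80_000.0, 120_000.0])
y = np.array([1.0, 1.2, 0.9, 3.1, 3.0])

z = (x - x.mean()) / x.std()  # standardized copy of the same feature

same_partition = np.array_equal(best_stump_partition(x, y), best_stump_partition(z, y))
```

Because standardization preserves the ordering of the values, the best split separates the same rows either way. Only the numeric threshold changes.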
Min-max scaling vs z-score
Min-max scaling maps features to [0, 1]: x_scaled = (x − x_min) / (x_max − x_min). It is bounded but highly sensitive to outliers, because a single extreme value sets the min or max outright. Z-score standardization is unbounded and is not robust either: the mean and SD are both pulled by outliers. However, one extreme value's influence on the mean and SD is diluted as the sample grows, so z-scores are usually distorted less dramatically than min-max scaled values.
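To see how a single extreme value affects each scaler, here is a minimal sketch assuming NumPy and synthetic data (101 evenly spaced values plus one made-up outlier). It measures how much the outlier inflates each scaling denominator:

```python
import numpy as np

# Hypothetical feature: 101 evenly spaced values, then the same data plus one outlier.
clean = np.linspace(0.0, 100.0, 101)
with_outlier = np.append(clean, 10_000.0)

# How much each scaling denominator grows because of the single outlier.
range_inflation = np.ptp(with_outlier) / np.ptp(clean)
sd_inflation = with_outlier.std() / clean.std()

# Min-max output stays in [0, 1], but the ordinary points are squeezed near 0.
mm = (with_outlier - with_outlier.min()) / np.ptp(with_outlier)

# Z-scores are unbounded: the outlier sits many SDs out, and the rest are compressed too.
z = (with_outlier - with_outlier.mean()) / with_outlier.std()
```

In this example the range grows by a factor of 100 while the SD grows by a smaller but still large factor, so both scalers compress the ordinary points. With only a handful of observations the gap between the two narrows or can even reverse.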