Mean, Median, Mode & z-scores
Summarise a dataset in a few numbers: centre with mean/median/mode, spread with variance, and standardise with the z-score that ML preprocessing relies on.
What you'll learn
- Mean, median, mode, and range as measures of centre and spread
- Population variance (divide by n) vs sample variance (divide by n-1, Bessel's correction)
- Standard deviation and z-score standardization, z = (x − μ)/σ
- Why the mean is sensitive to outliers while the median is robust
Before you start
“What’s the typical income in this town?” — that one question already needs two answers, because the mean and the median can be very different numbers, and which one is honest depends on whether one billionaire moved in. Descriptive statistics is the small toolkit for that: a couple of ways to say “where is the centre” (mean, median, mode), a couple for “how spread out” (variance, standard deviation), and at the end a tiny rescaling — the z-score — that turns “I got 80 on this test” into “I was 2 SDs above the class mean.”
These are cheap marks on the exam, and the z-score in particular is the exact move ML preprocessing makes when it standardises features.
Measures of centre
- Mean: μ = (Σ x) / n, the arithmetic average, the balance point of the data.
- Median: sort the values; the median is the middle one (the average of the two middle values if n is even).
- Mode: the most frequently occurring value; a dataset can have one, several, or no mode, and it is the only centre that works for categorical data.
- Range: max − min, the crudest measure of spread.
For the dataset 2, 4, 4, 6, 9: mean = 25/5 = 5, median = 4 (middle of five
sorted values), mode = 4 (it appears twice), range = 9 − 2 = 7.
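The same four numbers can be checked with Python's standard `statistics` module. This is only a minimal sketch; the variable names are illustrative, and `multimode` is used so that a dataset with several modes returns all of them.

```python
import statistics

data = [2, 4, 4, 6, 9]

mean = statistics.mean(data)          # sum of values divided by n
median = statistics.median(data)      # middle value of the sorted data
modes = statistics.multimode(data)    # list of every most-frequent value
value_range = max(data) - min(data)   # max - min

print(mean, median, modes, value_range)
```

The printed values should match the hand calculation above: mean 5, median 4, a single mode 4, and range 7.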
Spread: variance and standard deviation
Variance measures the average squared distance from the mean. There are two versions, and the divisor is the whole exam trap:
- Population variance divides by n: σ² = (Σ (x − μ)²) / n.
- Sample variance divides by n − 1: s² = (Σ (x − x̄)²) / (n − 1).
That n − 1 is Bessel's correction. Deviations measured from the sample mean, which
is computed from the same data, are on average smaller than deviations from the true
population mean, so dividing by n understates the spread; dividing by n − 1 corrects
that bias when you only have a sample. Standard deviation is just the square root of
variance (σ or s), back in the original units.
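To see the two divisors side by side, here is a small sketch that reuses the dataset 2, 4, 4, 6, 9 from the previous section. The variable names are illustrative; the cross-check relies on `statistics.pvariance` (divide by n) and `statistics.variance` (divide by n − 1) from the standard library.

```python
import math
import statistics

data = [2, 4, 4, 6, 9]
n = len(data)
mean = sum(data) / n

# Sum of squared deviations from the mean, shared by both formulas.
ss = sum((x - mean) ** 2 for x in data)

pop_var = ss / n            # population variance: divide by n
sample_var = ss / (n - 1)   # sample variance: divide by n - 1 (Bessel's correction)

pop_sd = math.sqrt(pop_var)        # sigma
sample_sd = math.sqrt(sample_var)  # s

# Cross-check against the standard library.
assert math.isclose(pop_var, statistics.pvariance(data))
assert math.isclose(sample_var, statistics.variance(data))

print(pop_var, sample_var, pop_sd, sample_sd)
```

The squared deviations sum to 28, so the population variance is 28/5 = 5.6 and the sample variance is 28/4 = 7. The sample version is always the larger of the two for the same data.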
The z-score — standardization
The z-score z = (x − μ) / σ answers: how many standard deviations does a value
x sit above (positive) or below (negative) the mean? It strips away the units and
scale, which is exactly why standardization is a staple of ML preprocessing — it
puts every feature on the same footing. A z-score of 0 is exactly average; +2 means
two SDs above the mean.
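As a rough illustration of standardizing a whole feature column, the sketch below applies the z-score formula to every value. The function name `standardize` and the sample column are hypothetical, and dividing by the population standard deviation is an assumption made here; a pipeline could equally use the sample version.

```python
import statistics

def standardize(values):
    """Return the z-score of every value, using the population SD."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        raise ValueError("standard deviation is zero; z-scores are undefined")
    return [(x - mu) / sigma for x in values]

feature = [2, 4, 4, 6, 9]   # illustrative column
z = standardize(feature)
print([round(v, 3) for v in z])
```

After standardizing, the column has mean 0 and standard deviation 1, whatever its original units were.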
How GATE asks this
Usually a quick NAT (numerical answer type) question: a value, a mean, and a standard
deviation are given, and you compute the z-score to a few decimals, or you are given a
small dataset and asked for its mean/median/variance. MCQs probe the concepts: which
divisor the sample variance uses (n − 1), or which measure of centre is robust to
outliers (the median).
z-score normalization appeared in GATE DA 2024.
Worked example — a real 2024 question
A data value of 106000 comes from a distribution with mean μ = 96000 and standard deviation σ = 21000. Find its z-score.
Plug straight into the formula:
z = (x − μ) / σ
= (106000 − 96000) / 21000
= 10000 / 21000
≈ 0.476
So z ≈ 0.476 — the value sits roughly half a standard deviation above the mean.
This is a real GATE DA 2024 question, and 0.476 is the verified answer. Note how
the large raw numbers collapse to a clean, unit-free score once standardized.
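The arithmetic can be double-checked in a couple of lines. The helper name `z_score` is just an illustrative choice.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean mu."""
    return (x - mu) / sigma

z = z_score(106000, 96000, 21000)
print(round(z, 3))  # 0.476
```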
Quick check
Practice this in an interview
A z-score expresses how many standard deviations an observation is from the mean of its distribution, converting raw values to a common unitless scale. Standardization (subtracting the mean and dividing by the standard deviation) is essential before algorithms that depend on distances or regularization penalties, because it prevents features with large numeric ranges from dominating those with small ranges.
The mean is distorted by skewness and outliers, masks multimodality, and can describe a value that no individual in the dataset actually holds. Skewed, heavy-tailed, or multimodal distributions almost always require the median, percentiles, or the full distributional picture rather than the mean.
Mean is optimal for symmetric, outlier-free data; median is the go-to for skewed distributions or when outliers are real rather than errors; mode is the only sensible average for nominal/categorical data. Robustness is a formal concept: the median's breakdown point is 50%, meaning just under half the data can be corrupted before the median can be pushed arbitrarily far, while the mean's breakdown point is essentially 0%, since a single extreme value can move it without limit.
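The difference in robustness is easy to see numerically. In the sketch below the income figures are made up purely for illustration: one extreme value is appended and both measures are recomputed.

```python
import statistics

incomes = [30_000, 35_000, 40_000, 45_000, 50_000]   # made-up values
with_outlier = incomes + [1_000_000_000]             # one billionaire moves in

mean_before = statistics.mean(incomes)
median_before = statistics.median(incomes)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print(mean_before, median_before)
print(mean_after, median_after)
```

The median moves only from 40000 to 42500, while the mean jumps into the hundreds of millions and no longer describes anyone in the data.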
Right-skewed features (long tail on the right) concentrate most values at the low end while a few extreme values pull the mean up, which distorts distance-based models and linear regression. Common fixes are log, square-root, or Box-Cox transformations (log and Box-Cox need strictly positive values) that compress the tail and make the distribution closer to normal, which can improve model convergence and reduce the undue influence of large values.
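A small sketch of the log fix, under stated assumptions: the feature values are made up, `math.log1p` (log of 1 + x) is chosen here so that zeros would not break the transform, and the mean-to-median ratio is used only as a crude indicator of how far the tail drags the mean.

```python
import math
import statistics

def mean_to_median_ratio(values):
    """Crude skew indicator for positive data: how far the mean sits above the median."""
    return statistics.fmean(values) / statistics.median(values)

skewed = [1, 2, 2, 3, 3, 4, 5, 8, 20, 100]    # made-up right-skewed feature
logged = [math.log1p(x) for x in skewed]       # compress the long right tail

ratio_raw = mean_to_median_ratio(skewed)
ratio_logged = mean_to_median_ratio(logged)
print(round(ratio_raw, 2), round(ratio_logged, 2))
```

On the raw values the mean is several times the median; after the transform the two sit much closer together, which is the tail compression described above.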