Expectation, Variance & SD
Expectation is the long-run average of a random variable; variance and SD measure how far it spreads. These are the summary numbers every distribution and ML lesson leans on.
What you'll learn
- Expectation E[X] = Σ x·p(x) — the probability-weighted average
- Linearity: E[aX + b] = aE[X] + b, always
- Variance Var(X) = E[X²] − (E[X])², and Var(aX + b) = a²·Var(X)
- Standard deviation SD = √Var, back in the units of X
- Nonlinearity traps: E[g(X)] ≠ g(E[X]) in general, e.g. E[1/X] ≠ 1/E[X]
Before you start
A whole PMF table is more than anyone wants to carry around. Most of the time you only need two numbers: roughly where the variable sits, and roughly how much it swings. Those are the expectation (the average value if you repeated the experiment forever) and the variance (the average squared distance from that average). Take the square root of variance and you’re back in the original units — the standard deviation.
Pretty much every later lesson (binomial, Poisson, model error, the central
limit theorem) asks “what’s E[X] and Var(X) for this distribution?” GATE
turns that into a NAT (numerical answer type) question almost every year.
The three formulas
For a discrete random variable X with PMF p(x):
E[X] = Σ x·p(x)
Var(X) = E[(X − E[X])²] = E[X²] − (E[X])²
SD(X) = √Var(X)
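As a minimal sketch, the definitions of expectation, variance and SD for a discrete PMF can be coded directly. The PMF below and the helper names `expectation`, `variance` and `std_dev` are illustrative assumptions, not part of the lesson; exact fractions are used so no rounding creeps in.

```python
from fractions import Fraction
from math import sqrt

def expectation(pmf):
    """E[X]: sum of x * p(x) over the support."""
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """Var(X): average squared distance from the mean."""
    mu = expectation(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

def std_dev(pmf):
    """SD: square root of the variance, back in the units of X."""
    return sqrt(variance(pmf))

# Hypothetical PMF chosen for illustration: X takes the values 0, 1, 2.
pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}
assert sum(pmf.values()) == 1  # a valid PMF sums to 1

mean = expectation(pmf)  # 0*(1/4) + 1*(1/2) + 2*(1/4) = 1
var = variance(pmf)      # 1*(1/4) + 0*(1/2) + 1*(1/4) = 1/2
sd = std_dev(pmf)
print(mean, var, round(sd, 4))  # prints: 1 1/2 0.7071
```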
Two algebra rules turn these into quick answers without rebuilding sums:
- Linearity of expectation: E[aX + b] = aE[X] + b. Scaling and shifting the variable scales and shifts its mean, exactly. This holds always; the sum version, E[X + Y] = E[X] + E[Y], holds even when the variables are dependent.
- Variance under shift and scale: Var(aX + b) = a²·Var(X). A constant shift b moves every value equally, so it changes the mean but not the spread, and b vanishes. A scale a stretches deviations, and since variance is squared distance it picks up a².
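Both rules can be checked numerically by building the PMF of aX + b and recomputing from scratch. The sketch below assumes a fair die for X and arbitrary constants a = 3, b = 10; the helper `transform` is a hypothetical name introduced here.

```python
from fractions import Fraction

def expectation(pmf):
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    mu = expectation(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

def transform(pmf, a, b):
    """PMF of Y = a*X + b (assumes a != 0, so distinct x give distinct y)."""
    return {a * x + b: p for x, p in pmf.items()}

# Fair die; a and b are arbitrary illustrative constants.
die = {x: Fraction(1, 6) for x in range(1, 7)}
a, b = 3, 10
y = transform(die, a, b)

lhs_mean, rhs_mean = expectation(y), a * expectation(die) + b
lhs_var, rhs_var = variance(y), a ** 2 * variance(die)
print(lhs_mean, rhs_mean)  # both sides of E[aX + b] = aE[X] + b: 41/2 41/2
print(lhs_var, rhs_var)    # both sides of Var(aX + b) = a^2 Var(X): 105/4 105/4
```

Changing b leaves the second line untouched, which is the "b vanishes" claim in action.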
The variance formula itself is worth saying in words: the mean of the squares minus the
square of the mean, Var(X) = E[X²] − (E[X])². The two pieces are different
computations — E[X²] weights x² by p(x); (E[X])² squares the single number E[X].
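The shortcut follows from the definition of variance as the average squared distance from the mean, plus linearity of expectation. A short derivation, writing μ for E[X] (a shorthand introduced only for this derivation):

```latex
\begin{aligned}
\operatorname{Var}(X) &= E\big[(X-\mu)^2\big], \qquad \mu = E[X] \\
&= E\big[X^2 - 2\mu X + \mu^2\big] \\
&= E[X^2] - 2\mu\,E[X] + \mu^2 \\
&= E[X^2] - 2\mu^2 + \mu^2 \\
&= E[X^2] - (E[X])^2
\end{aligned}
```

The third line is where linearity is used: μ is a constant, so it factors out of the expectation.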
A handy fact for the geometric setting: if each independent trial succeeds with
probability p, the expected number of trials until the first success is 1/p. For
a fair coin, p = 1/2, so you expect 1/(1/2) = 2 tosses to see the first head.
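The 1/p result can be sanity-checked by summing k · P(first success on trial k) directly. The sketch below truncates the infinite sum at an assumed cutoff `max_trials`, which is more than enough for the probabilities shown; `expected_trials` is a hypothetical helper name.

```python
def expected_trials(p, max_trials=10_000):
    """Partial sum of k * (1 - p)^(k - 1) * p, the mean number of trials."""
    return sum(k * (1 - p) ** (k - 1) * p for k in range(1, max_trials + 1))

for p in (0.5, 0.25, 0.1):
    # The partial sum should sit very close to 1/p.
    print(p, round(expected_trials(p), 6), 1 / p)
```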
How GATE asks this
Overwhelmingly a NAT: a small PMF (or a fair die / coin) is given and you compute
E[X], Var(X), or SD to a few decimals. The reliable route is a two-row table —
one row for x·p(x) summing to E[X], one for x²·p(x) summing to E[X²] — then
Var = E[X²] − (E[X])². The occasional MCQ tests the identities instead: which of
E[aX+b] = aE[X]+b, Var(X+c) = Var(X), E[X²] = (E[X])² are always true.
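The two-row table route might look like this in code. The PMF is a made-up example, not taken from any actual GATE paper, and `two_row_table` is a hypothetical helper name.

```python
def two_row_table(pmf):
    """Return (E[X], E[X^2], Var, SD) from the x*p(x) and x^2*p(x) rows."""
    xs = sorted(pmf)
    row_xp = [x * pmf[x] for x in xs]       # row 1: x * p(x)
    row_x2p = [x * x * pmf[x] for x in xs]  # row 2: x^2 * p(x)
    e_x, e_x2 = sum(row_xp), sum(row_x2p)
    var = e_x2 - e_x ** 2
    return e_x, e_x2, var, var ** 0.5

# Hypothetical PMF for illustration.
pmf = {1: 0.2, 2: 0.5, 3: 0.3}
e_x, e_x2, var, sd = two_row_table(pmf)
print(round(e_x, 4), round(e_x2, 4), round(var, 4), round(sd, 4))
# prints: 2.1 4.9 0.49 0.7
```

Here E[X²] = 4.9 while (E[X])² = 4.41, a concrete counterexample showing that E[X²] = (E[X])² is not an identity.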
Worked example — a fair six-sided die
A fair die shows 1–6, each with probability
1/6. Find E[X], E[X²], Var(X), and SD.
E[X] = (1+2+3+4+5+6)/6 = 21/6 = 3.5
E[X²] = (1+4+9+16+25+36)/6 = 91/6 ≈ 15.1667
Var(X) = E[X²] − (E[X])²
= 91/6 − 3.5²
= 15.1667 − 12.25
= 2.9167 (exactly 35/12)
SD = √2.9167 ≈ 1.708
Note the order: square each face and average for E[X²] = 15.1667, then subtract the
square of the mean 3.5² = 12.25. Subtracting first or squaring the wrong quantity is
the usual slip. As a second mini-example, the expected number of fair-coin tosses to get
the first head is 1/(1/2) = 2.
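The die numbers can be reproduced exactly with fractions, as a quick check on the arithmetic. The last two lines also illustrate the nonlinearity trap from the learning list, using g(X) = 1/X on the same die; variable names are illustrative.

```python
from fractions import Fraction
from math import sqrt

faces = range(1, 7)
p = Fraction(1, 6)

e_x = sum(x * p for x in faces)       # 7/2
e_x2 = sum(x * x * p for x in faces)  # 91/6
var = e_x2 - e_x ** 2                 # 35/12
sd = sqrt(var)

print(e_x, e_x2, var, round(sd, 3))   # prints: 7/2 91/6 35/12 1.708

# Nonlinearity trap: E[1/X] is not 1/E[X].
e_inv = sum(Fraction(1, x) * p for x in faces)
print(e_inv, 1 / e_x)                 # prints: 49/120 2/7
```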
Practice this in an interview
Expected value is the probability-weighted average outcome of a random variable; variance measures average squared deviation from that mean. Expectation is always linear, while variances add only for independent (more generally, uncorrelated) variables; knowing these rules prevents algebraic mistakes under interview pressure.
Variance is the average squared deviation from the mean; standard deviation is its square root and lives in the same units as the data. Variance is mathematically tractable — variances of independent variables add — while standard deviation is interpretable as a typical distance from the mean.
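A small exact check of "variances of independent variables add", assuming two independent fair dice as the example: the PMF of the sum is built by multiplying the marginal probabilities, then the variance is recomputed from it.

```python
from fractions import Fraction
from itertools import product

def variance(pmf):
    mu = sum(x * p for x, p in pmf.items())
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

die = {x: Fraction(1, 6) for x in range(1, 7)}

# PMF of the sum of two independent dice: joint probability is the product.
total = {}
for (x, px), (y, py) in product(die.items(), repeat=2):
    total[x + y] = total.get(x + y, 0) + px * py

print(variance(total), 2 * variance(die))  # variances add: 35/6 35/6

sd_total = float(variance(total)) ** 0.5
sd_die = float(variance(die)) ** 0.5
print(round(sd_total, 3), round(2 * sd_die, 3))  # SDs do not add
```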
Standard deviation measures the spread of individual observations around the population mean. Standard error measures the spread of sample means around the true mean — it equals the standard deviation divided by the square root of the sample size, so it shrinks as the sample grows while the standard deviation does not.
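A rough simulation sketch of the SD versus standard error distinction, assuming fair-die rolls as the population; the sample size `n`, the number of repetitions `trials`, and the seed are arbitrary illustrative choices.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

n = 25           # sample size (illustrative choice)
trials = 10_000  # number of repeated samples

# SD of a single fair-die roll, about 1.708.
sigma = statistics.pstdev([1, 2, 3, 4, 5, 6])

# Mean of each simulated sample of n rolls.
sample_means = [
    statistics.fmean([random.randint(1, 6) for _ in range(n)])
    for _ in range(trials)
]

observed_se = statistics.pstdev(sample_means)  # spread of the sample means
predicted_se = sigma / n ** 0.5                # SD / sqrt(n)
print(round(observed_se, 3), round(predicted_se, 3))
```

The two printed values should be close; raising `n` shrinks both, while `sigma` itself stays put.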
The Normal distribution is justified by the Central Limit Theorem: averages of large i.i.d. samples with finite variance are approximately Normal regardless of the underlying distribution. It is fully characterized by mean and variance, enabling closed-form inference. It is a poor model for heavy-tailed data, skewed outcomes, bounded quantities, and rare extreme events.