When is the mean a misleading summary statistic, and what should you use instead?

The mean is distorted by skewness and outliers, masks multimodality, and can describe a value that no individual in the dataset actually holds. Skewed, heavy-tailed, or multimodal distributions almost always require the median, percentiles, or the full distributional picture rather than the mean.

What is regression to the mean, and why does it fool analysts into seeing treatment effects that do not exist?

Regression to the mean is the statistical tendency for extreme measurements to be followed by measurements closer to the population mean, purely due to random noise — not because of any intervention. Analysts who intervene after observing an extreme value and then observe improvement often incorrectly attribute the recovery to their action.
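The effect is easy to reproduce with no intervention at all. The simulation below is a minimal sketch under made-up assumptions: each unit has a stable true level, every measurement adds independent noise, and the numbers (10,000 units, a noise scale of 10, a top-5% selection) are illustrative rather than drawn from any real dataset. Units picked for an extreme first reading score closer to the population mean on the second reading, even though nothing was done to them.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical setup: a stable "true" level per unit, plus independent noise
# on every measurement. All parameters here are illustrative assumptions.
true_level = [random.gauss(100, 10) for _ in range(10_000)]
first = [t + random.gauss(0, 10) for t in true_level]
second = [t + random.gauss(0, 10) for t in true_level]  # no intervention at all

# Select the units that looked most extreme on the first measurement (top 5%).
cutoff = sorted(first)[int(0.95 * len(first))]
picked = [i for i, x in enumerate(first) if x >= cutoff]

picked_first = mean(first[i] for i in picked)
picked_second = mean(second[i] for i in picked)

print(f"picked group, first measurement : {picked_first:.1f}")
print(f"picked group, second measurement: {picked_second:.1f}")
print(f"whole population                : {mean(first):.1f}")
```

The second reading for the picked group falls back toward the population mean but stays above it, because part of the first reading was real signal and part was luck.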

When should you use mean vs median vs mode, and which is most robust to outliers?

Mean is optimal for symmetric, outlier-free data; median is the go-to for skewed distributions or when outliers are real rather than errors; mode is the only sensible average for nominal/categorical data. Robustness is a formal concept: the median's breakdown point is 50%, meaning nearly half the data can be corrupted before it can be pushed arbitrarily far, while the mean's breakdown point is essentially 0%, because a single extreme value is enough to move it without limit.

What makes a chart misleading, and how do you spot truncated y-axes, dual axes, and 3D distortion?

Charts mislead when visual area or slope no longer encodes the underlying ratio faithfully. The three most common traps are a truncated y-axis that magnifies trivial differences, dual axes that let the designer set any ratio between scales, and 3D perspective that foreshortens far elements and inflates near ones.

Averages That Lie — Business Analytics

The last lesson built an entire model — ARPU, lifetime, LTV — on the word average, then ended by doubting it: one whale and forty tiny accounts can produce the same average as everyone paying the same. This lesson is that doubt, made rigorous.

Your dashboard is not broken. The math is correct. The problem is that the arithmetic mean — the most familiar kind of average — is easily hijacked by a single big spender, and on most business data that hijacking happens constantly. By the end of this lesson you will be able to look at any “average” on a report and immediately ask the right follow-up question.

The dataset that breaks the dashboard

Imagine your business has exactly 10 customers this month:

Customer	Monthly spend
Customers 1–9	$100 each
Customer 10 (the whale)	$5,000

The mean — the arithmetic average, calculated by adding all values and dividing by the count — is:

(9 × $100 + $5,000) ÷ 10 = $5,900 ÷ 10 = $590

So the dashboard reports $590. Now go talk to your customers. Nine of the ten spend $100. One spends $5,000. The “average customer” who spends $590 does not exist. Not one actual person is near that number.

Why the mean gets dragged around

The mean gives every data point equal weight in the sum. Customer 10 contributes $5,000 to the numerator while each of the other nine contributes only $100. One observation fifty times larger than the others drags the entire average upward into empty space — a gap where nobody actually sits.
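The arithmetic above can be checked in a few lines. This is a minimal sketch using only Python's standard library; the variable names are illustrative and not part of any dashboard tooling.

```python
from statistics import mean

# Monthly spend for the ten customers in the example.
spend = [100] * 9 + [5000]

avg = mean(spend)                      # (9 * 100 + 5000) / 10
whale_share = max(spend) / sum(spend)  # fraction of the total from one customer

# Distance from the reported average to the closest real customer.
nearest_gap = min(abs(s - avg) for s in spend)

print(f"mean = ${avg:,.0f}")
print(f"whale's share of total = {whale_share:.1%}")
print(f"closest customer is ${nearest_gap:,.0f} away from the mean")
```

The last line makes the "empty space" concrete: the nearest actual customer is $490 away from the number the dashboard reports.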

This is not a flaw you can fix by collecting more data. It is a mathematical property of the mean. Whenever data has a long tail (a few very large values stretching out to the right), the mean is pulled toward that tail, away from the crowd.

The median: the honest middle

The median is the middle value when all observations are sorted from smallest to largest: at least half the values are at or below it and at least half are at or above it. It does not add values up; it just finds the center of the ranked list.

Sort our 10 customers by spend:

$100, $100, $100, $100, $100, $100, $100, $100, $100, $5,000

With 10 values the median sits between the 5th and 6th entries — both are $100 — so the median = $100.

That is the honest answer to “what does a typical customer spend?” Nine out of ten customers spend $100. The median says so; the mean does not.
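To see how differently the two statistics react, the sketch below keeps the nine $100 customers fixed and makes the whale progressively larger. The helper name `summarize` and the larger whale amounts are hypothetical, chosen only for illustration.

```python
from statistics import mean, median

def summarize(whale_spend):
    """Return (mean, median) for nine $100 customers plus one whale."""
    spend = [100] * 9 + [whale_spend]
    return mean(spend), median(spend)

# Illustrative whale sizes; only the first comes from the example dataset.
for whale in (5_000, 50_000, 500_000):
    m, med = summarize(whale)
    print(f"whale=${whale:>9,}  mean=${m:>9,.0f}  median=${med:,.0f}")
```

The mean grows with the whale while the median stays at $100, because the median depends only on the ranking of the middle values, not on how large the biggest one is.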

Visualizing the gap

The diagram below plots all 10 customers as dots along a revenue axis. Notice where the mean lives: in the empty gap between the cluster and the whale.

Figure: Nine customers cluster at $100; one whale sits at $5,000. The mean ($590) floats in the gap where no customer actually is. The median ($100) sits with the crowd.

Skewed data is the normal case in business

Revenue per customer, order values, salary distributions, session lengths, support ticket resolution times — nearly all of these are right-skewed (meaning the tail points to the right, toward large values). A handful of power users, big orders, or senior executives pull the mean up and away from the typical experience.

In skewed data the median is almost always the more honest description of “typical.” The mean is not useless — it tells you the total divided by count, which matters for budgeting — but it is a terrible proxy for the individual customer.

Percentiles: describing the whole shape

A single number — mean or median — always hides information. Percentiles give you the shape. A percentile is the value below which a given percentage of observations fall.

p50 is the 50th percentile, which is identical to the median: half of customers spend less, half spend more.
p90 means 90% of customers spend less than this value. If p90 = $350, then the top 10% of customers spend $350 or more — a very different group from the median customer.
p99 is the value that 99% of customers spend less than; the top 1% above it are your whales.

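In our 10-customer example, using the nearest-rank method, the p90 is the 9th value in the sorted list, which is $100 (all of customers 1–9 spend $100). Any percentile above the 90th lands on the 10th value, the $5,000 whale. Tools that interpolate between neighbouring values can report a different p90 on a sample this small, so check which method your dashboard uses. Percentiles make the whale visible without letting it distort every other number.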
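Percentile conventions differ between tools, and on ten data points the choice matters. The sketch below assumes the nearest-rank definition (the smallest value with at least p% of observations at or below it), which matches the "9th value" reading of the example; the helper `nearest_rank` is written here for illustration. For contrast it also calls the standard library's `statistics.quantiles` with `method="inclusive"`, which interpolates linearly between neighbouring values.

```python
import math
from statistics import quantiles

def nearest_rank(values, p):
    """Nearest-rank percentile: smallest value with at least p% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))  # 1-based position in the sorted list
    return ordered[rank - 1]

spend = [100] * 9 + [5000]

for p in (50, 90, 95, 100):
    print(f"p{p} (nearest rank) = ${nearest_rank(spend, p):,}")

# Linear interpolation between the 9th and 10th values gives a different p90
# on a sample this small.
p90_interpolated = quantiles(spend, n=10, method="inclusive")[8]
print(f"p90 (interpolated) = ${p90_interpolated:,.0f}")
```

With nearest rank, p90 is $100 and p95 already reaches $5,000; with interpolation, p90 lands partway between the crowd and the whale. Neither is wrong, but a report should say which convention it uses.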

Practically, a product team might track p50 and p90 load times separately: p50 tells you the median user experience; p90 tells you how bad it gets for the slowest 10%.

The Pareto principle and managing the tail

There is a well-documented pattern across many businesses called the Pareto principle, also called the 80/20 rule, which observes that roughly 80% of revenue tends to come from roughly 20% of customers (the exact ratio varies, but the lopsidedness is common). The 10-customer sample is an extreme version: the one whale, 10% of customers, supplies $5,000 of the $5,900 total, about 85% of revenue.

This matters for strategy: the top 20% of customers (your high-value segment) often deserve different attention — dedicated account managers, loyalty programs, early access — than the long tail of occasional small spenders. Averaging everyone together obscures that the two groups need completely different treatment.
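A quick way to check how lopsided a customer base is: sort by spend and measure the share of revenue that comes from the top slice. The function below is a minimal sketch; the name `top_share` and the rounding rule for deciding how many customers count as the "top 20%" are assumptions made for illustration.

```python
def top_share(values, top_fraction=0.2):
    """Fraction of the total contributed by the largest `top_fraction` of values."""
    ordered = sorted(values, reverse=True)
    # Number of customers in the top slice (at least one); rounding is an assumption.
    k = max(1, round(top_fraction * len(ordered)))
    return sum(ordered[:k]) / sum(ordered)

spend = [100] * 9 + [5000]
print(f"top 20% of customers -> {top_share(spend):.1%} of revenue")
print(f"top 10% of customers -> {top_share(spend, 0.1):.1%} of revenue")
```

On the example dataset the top two customers account for roughly 86% of revenue, and the whale alone for roughly 85%.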

The mean hides the whale. The median finds the crowd. Percentiles show you both.

What to report instead

Metric	What it tells you	When to use it
Mean	Total ÷ count; useful for budgeting	Always show it, but never alone
Median (p50)	The typical individual experience	Default for describing “average customer”
p90	The ceiling for 90% of customers	Spotting the power-user or high-load tail
p99	The threshold for the top 1% (the whale tier)	Account management, outlier investigation

A good dashboard shows at minimum: mean, median, and p90. If mean is close to median, the data is roughly symmetric and the mean is a fair summary. If mean is materially larger than median, you are looking at a skewed distribution and the story is in the gap.
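The reporting rule above can be sketched as one small function. Everything here is illustrative: `describe` is a hypothetical helper, p90 uses the nearest-rank convention, and the `skew_ratio` cutoff of 1.5 is an arbitrary stand-in for "materially larger", not a standard threshold.

```python
import math
from statistics import mean, median

def describe(values, skew_ratio=1.5):
    """Mean, median, nearest-rank p90, and a rough skew flag.

    skew_ratio is an arbitrary illustrative cutoff for "materially larger".
    """
    ordered = sorted(values)
    p90 = ordered[max(1, math.ceil(90 * len(ordered) / 100)) - 1]
    avg, mid = mean(ordered), median(ordered)
    return {
        "mean": avg,
        "median": mid,
        "p90": p90,
        # Flag when the mean sits well above the median (right skew).
        "skewed": mid > 0 and avg / mid > skew_ratio,
    }

print(describe([100] * 9 + [5000]))        # the whale dataset
print(describe([90, 95, 100, 105, 110]))   # a roughly symmetric dataset
```

The whale dataset is flagged as skewed because its mean is almost six times its median; the symmetric one is not, since mean and median coincide.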

In one breath

The arithmetic mean gives every value equal weight, so a single whale drags it into empty space where no real customer sits — $590 average when nine of ten spend $100. On business data, which is almost always right-skewed (a long tail of big values), the median — the middle of the ranked list — is the honest answer to “what’s typical,” and it’s immune to how extreme the outlier gets. Better still, report percentiles (p50, p90, p99) to see the whole shape, not one summary. The gap between mean and median is the story: large gap means skew, and skew means a few customers (the Pareto 20%) behave nothing like the crowd — so manage the head and the tail differently instead of averaging them into a fiction.

Practice

Quick check

Q1. In our 10-customer dataset (nine at $100, one at $5,000), why does the mean equal $590 while the median equals $100?

Q2. A startup reports 'average session length: 14 minutes.' The median session length is 3 minutes. What does this most likely mean?

Q3. Your e-commerce platform records daily orders for 100 customers. The p90 of order value is $800. What does this tell you?

A question to carry forward

We ended on a strategic hint: the whale and the long tail deserve different treatment. But notice we only ever sorted customers by one axis — how much they spend. Monetary value alone is a blunt instrument. Two customers who each spent $1,000 last year look identical on that axis, yet one bought yesterday and the other vanished eight months ago. One is your future; one is already gone.

So the question to carry forward is: what would it take to group customers by behaviour, not just by a single number? The next lesson, segmentation and RFM, stops averaging customers altogether and sorts them on three axes at once — Recency (how lately they bought), Frequency (how often), and Monetary value (how much) — turning one misleading average into a handful of segments you can actually act on.
