Averages That Lie
Your dashboard says average revenue per customer is $590 — but almost no real customer spends that. Here is why the mean misleads on skewed business data, and what to use instead.
What you'll learn
- Why the arithmetic mean is pulled by outliers and rarely represents a 'typical' customer
- What the median is and why it stays honest when data is skewed
- How percentiles (p50, p90, p99) describe the whole picture
- The Pareto principle and why you manage the head and tail of your customer base differently
Before you start
Your dashboard is not broken. The math is correct. The problem is that the arithmetic mean — the most familiar kind of average — is easily hijacked by a single big spender, and on most business data that hijacking happens constantly. By the end of this lesson you will be able to look at any “average” on a report and immediately ask the right follow-up question.
The dataset that breaks the dashboard
Imagine your business has exactly 10 customers this month:
| Customer | Monthly spend |
|---|---|
| Customers 1–9 | $100 each |
| Customer 10 (the whale) | $5,000 |
The mean — the arithmetic average, calculated by adding all values and dividing by the count — is:
(9 × $100 + $5,000) ÷ 10 = $5,900 ÷ 10 = $590
So the dashboard reports $590. Now go talk to your customers. Nine of the ten spend $100. One spends $5,000. The “average customer” who spends $590 does not exist. Not one actual person is near that number.
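The arithmetic above can be reproduced in a few lines of Python. This is a minimal sketch using only the standard library; the variable names (`monthly_spend`, `closest_gap`) are illustrative and not part of any dashboard tool.

```python
from statistics import mean

# The lesson's ten customers: nine at $100 and one whale at $5,000.
monthly_spend = [100] * 9 + [5000]

average = mean(monthly_spend)  # (9 * 100 + 5000) / 10

# Distance from the mean to the closest real customer.
closest_gap = min(abs(spend - average) for spend in monthly_spend)

print(f"mean = ${average:,.0f}")
print(f"closest customer is ${closest_gap:,.0f} away from the mean")
```

The closest real customer sits $490 away from the reported average, which is one way to put a number on "nobody is near it."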
Why the mean gets dragged around
The mean gives every data point equal weight, so each value moves the sum in proportion to its size. Customer 10 contributes $5,000 to the numerator while each of the other nine contributes only $100. One observation fifty times larger than the others drags the entire average upward into empty space, a gap where nobody actually sits.
This is not a flaw you can fix by collecting more data: if new customers arrive in the same mix, the mean stays put. It is a mathematical property of the mean. Whenever data has a long tail (a few very large values stretching out to the right), the mean is pulled toward that tail, away from the crowd.
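To see why more data does not help, here is a small sketch that scales the customer base up while keeping the same proportions. The helper `customer_mix` and its `whale_share` parameter are hypothetical, and the key assumption is that new customers follow the same 90/10 split as the original ten.

```python
from statistics import mean

def customer_mix(total_customers, whale_share=0.10):
    """Build a spend list in the lesson's proportions:
    whale_share of customers at $5,000, everyone else at $100."""
    whales = round(total_customers * whale_share)
    return [100] * (total_customers - whales) + [5000] * whales

small = customer_mix(10)       # the original ten customers
large = customer_mix(10_000)   # a thousand times more data, same mix

print(mean(small), mean(large))  # both are 590
```

A larger sample makes the estimate more stable, but it does not move the mean back toward the crowd, because the tail grows along with everything else.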
The median: the honest middle
The median is the middle value when all observations are sorted from smallest to largest: at least half the values sit at or below it and at least half at or above. It does not add values up; it just finds the center of the ranked list.
Sort our 10 customers by spend:
$100, $100, $100, $100, $100, $100, $100, $100, $100, $5,000
With 10 values the median sits between the 5th and 6th entries — both are $100 — so the median = $100.
That is the honest answer to “what does a typical customer spend?” Nine out of ten customers spend $100. The median says so; the mean does not.
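The sort-and-pick-the-middle procedure can be sketched as follows, next to the standard library's `statistics.median` for comparison. The second list, `bigger_whale`, is a made-up variation added only to show how each statistic reacts when the outlier grows.

```python
from statistics import mean, median

monthly_spend = [100, 5000, 100, 100, 100, 100, 100, 100, 100, 100]

ranked = sorted(monthly_spend)
middle = len(ranked) // 2
# Even count: average the two middle entries (5th and 6th of 10).
by_hand = (ranked[middle - 1] + ranked[middle]) / 2

print(by_hand, median(monthly_spend))  # both are 100

# Hypothetical variation: the whale spends ten times more.
bigger_whale = [100] * 9 + [50_000]
print(median(bigger_whale))  # still 100
print(mean(bigger_whale))    # jumps to 5090
```

The median depends only on rank order, so making the largest value larger changes nothing, while the mean follows the whale.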
Visualizing the gap
The diagram below plots all 10 customers as dots along a revenue axis. Notice where the mean lives: in the empty gap between the cluster and the whale.
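A rough text version of that picture can be produced with a few lines of Python. This is only a sketch: the $500 `bin_width` is an arbitrary bucket size chosen for readability, and the `bucket` helper is hypothetical.

```python
from statistics import mean

monthly_spend = [100] * 9 + [5000]
bin_width = 500  # hypothetical bucket size, chosen only for readability

def bucket(value):
    """Return the lower edge of the bucket a value falls in."""
    return int(value // bin_width) * bin_width

# Count customers per bucket.
counts = {}
for spend in monthly_spend:
    counts[bucket(spend)] = counts.get(bucket(spend), 0) + 1

mean_bucket = bucket(mean(monthly_spend))

# One row per bucket: a dot per customer, plus a marker on the mean's row.
for edge in range(0, max(monthly_spend) + bin_width, bin_width):
    dots = "o" * counts.get(edge, 0)
    marker = "  <- mean" if edge == mean_bucket else ""
    print(f"${edge:>5,}+ | {dots}{marker}")
```

Nine dots land in the first row, one in the last, and the row marked as containing the mean has no dots at all.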
Skewed data is the normal case in business
Revenue per customer, order values, salary distributions, session lengths, support ticket resolution times — nearly all of these are right-skewed (meaning the tail points to the right, toward large values). A handful of power users, big orders, or senior executives pull the mean up and away from the typical experience.
In skewed data the median is almost always the more honest description of “typical.” The mean is not useless — it tells you the total divided by count, which matters for budgeting — but it is a terrible proxy for the individual customer.
Percentiles: describing the whole shape
A single number — mean or median — always hides information. Percentiles give you the shape. A percentile is the value below which a given percentage of observations fall.
- p50 is the 50th percentile, which is identical to the median: half of customers spend less, half spend more.
- p90 means 90% of customers spend less than this value. If p90 = $350, then the top 10% of customers spend $350 or more — a very different group from the median customer.
- p99 means 99% of customers spend less than this value. The top 1% above it are your whales.
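In our 10-customer example, the nearest-rank p90 is the 9th value in the sorted list, which is $100 (customers 1–9 all spend $100). Any percentile above p90 lands on the whale's $5,000, up to p100, the maximum. Tools that interpolate between neighboring values can report a p90 somewhere between $100 and $5,000 on a sample this small, so check which method your dashboard uses. Percentiles make the whale visible without letting it distort every other number.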
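The "9th value in the sorted list" reading corresponds to the nearest-rank definition of a percentile. Below is a minimal sketch of that definition; the function name `nearest_rank_percentile` is hypothetical, and many libraries default to interpolating definitions that behave differently on small samples.

```python
import math

def nearest_rank_percentile(values, pct):
    """Smallest value with at least pct% of observations at or below it."""
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ranked = sorted(values)
    rank = math.ceil(pct * len(ranked) / 100)  # 1-based position in the list
    return ranked[rank - 1]

monthly_spend = [100] * 9 + [5000]

for pct in (50, 90, 99, 100):
    print(f"p{pct} = ${nearest_rank_percentile(monthly_spend, pct):,}")
```

With only ten customers, p50 and p90 both land on $100, while p99 and p100 land on the whale. On real data with thousands of customers the percentiles spread out and the choice of definition matters much less.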
Practically, a product team might track p50 and p90 load times separately: p50 tells you the median user experience; p90 marks where the slowest 10% of loads begin.
The Pareto principle and managing the tail
There is a well-documented pattern across many businesses called the Pareto principle, or the 80/20 rule: roughly 80% of revenue tends to come from roughly 20% of customers (the exact ratio varies, but the lopsidedness is common). The 10-customer sample is an extreme version: the single whale is 10% of customers and brings in $5,000 of the $5,900 total, about 85% of revenue.
This matters for strategy: the top 20% of customers (your high-value segment) often deserve different attention — dedicated account managers, loyalty programs, early access — than the long tail of occasional small spenders. Averaging everyone together obscures that the two groups need completely different treatment.
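One way to check how lopsided a customer base is: rank customers by spend and measure the revenue share of the top slice. The sketch below assumes a plain list of per-customer spend; `top_share` and its `top_fraction` parameter are illustrative names, and the 20% cut-off is the rule of thumb from the text rather than a fixed law.

```python
def top_share(values, top_fraction=0.20):
    """Fraction of total revenue contributed by the highest-spending customers."""
    ranked = sorted(values, reverse=True)
    # Always include at least one customer in the head.
    head_count = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:head_count]) / sum(ranked)

monthly_spend = [100] * 9 + [5000]

print(f"top 20% of customers: {top_share(monthly_spend):.1%} of revenue")
print(f"top 10% of customers: {top_share(monthly_spend, 0.10):.1%} of revenue")
```

In the example, the top two customers (the whale plus one $100 customer) account for $5,100 of the $5,900 total.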
The mean hides the whale. The median finds the crowd. Percentiles show you both.
What to report instead
| Metric | What it tells you | When to use it |
|---|---|---|
| Mean | Total ÷ count; useful for budgeting | Always show it, but never alone |
| Median (p50) | The typical individual experience | Default for describing “average customer” |
| p90 | The ceiling for 90% of customers | Spotting the power-user or high-load tail |
| p99 | The threshold for the top 1%, the whale tier | Account management, outlier investigation |
A good dashboard shows at minimum: mean, median, and p90. If mean is close to median, the data is roughly symmetric and the mean is a fair summary. If mean is materially larger than median, you are looking at a skewed distribution and the story is in the gap.
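The mean-versus-median comparison can be turned into a simple automated check. This is a sketch under stated assumptions: the 25% `tolerance` is an arbitrary illustrative threshold, not a standard, and the function assumes a positive median (true for spend data, but it would fail on a median of zero).

```python
from statistics import mean, median

def mean_median_gap(values):
    """Return (mean, median, ratio of mean to median)."""
    avg, mid = mean(values), median(values)
    return avg, mid, avg / mid

def looks_skewed(values, tolerance=0.25):
    """Flag data whose mean differs from its median by more than `tolerance`.
    The default tolerance is a hypothetical threshold for illustration."""
    _, _, ratio = mean_median_gap(values)
    return abs(ratio - 1) > tolerance

whale_data = [100] * 9 + [5000]
even_data = [90, 100, 110]

print(mean_median_gap(whale_data), looks_skewed(whale_data))
print(mean_median_gap(even_data), looks_skewed(even_data))
```

For the whale dataset the mean is 5.9 times the median, so the check flags it; for the symmetric sample the two agree and it does not.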
Next
Segmentation and RFM — stop averaging customers; group them by recency, frequency, and monetary value so you can manage each segment on its own terms.
Practice this in an interview
The mean is distorted by skewness and outliers, masks multimodality, and can describe a value that no individual in the dataset actually holds. Skewed, heavy-tailed, or multimodal distributions almost always require the median, percentiles, or the full distributional picture rather than the mean.
Regression to the mean is the statistical tendency for extreme measurements to be followed by measurements closer to the population mean, purely due to random noise — not because of any intervention. Analysts who intervene after observing an extreme value and then observe improvement often incorrectly attribute the recovery to their action.
Mean is optimal for symmetric, outlier-free data; median is the go-to for skewed distributions or when outliers are real rather than errors; mode is the only sensible average for nominal/categorical data. Robustness is a formal concept — the median's breakdown point is 50%, meaning half the data can be corrupted before it fails, while the mean's breakdown point is essentially 0%.
Charts mislead when visual area or slope no longer encodes the underlying ratio faithfully. The three most common traps are a truncated y-axis that magnifies trivial differences, dual axes that let the designer set any ratio between scales, and 3D perspective that foreshortens far elements and inflates near ones.