datarekha
Patterns June 2, 2026

Why the average customer does not exist

The mean is a liar on skewed data, and almost all business data is skewed — here is how to stop building products for a customer who never existed.

10 min read · by datarekha · statistics, analytics, business-metrics, data-literacy, distributions

Here is a number that shows up in every business review deck: average revenue per customer. It sounds rigorous. It is a single, clean, authoritative figure. And in most companies — because most business data is skewed toward a small number of very large buyers — it is quietly, systematically wrong.

Not wrong in the sense of a calculation error. Wrong in a deeper sense: it describes a customer who does not exist.

The simplest version of the problem

Imagine a SaaS product with ten customers. Nine of them pay $100 a month. One is a large enterprise — a whale, in sales jargon — paying $5,000 a month.

Total monthly revenue: (9 × 100) + 5000 = 5900 dollars. Divide by ten customers. Average revenue per customer: $590.

Now look at your customer list. Find the customer who spends $590. You will not find one. Not because the math is wrong. Because the mean — the arithmetic average, computed by summing all values and dividing by count — has been dragged far from the center of the actual distribution by one extreme value. Your ten customers cluster around $100 and one lonely spike lives at $5,000. The number $590 sits in the empty gap between them.

The median — the value that sits exactly in the middle of your sorted list, with half of observations below it and half above — tells a very different story. Sort ten customers by spend. The middle falls between the fifth and sixth customer. Both spend $100. The median is $100.

Mean: $590. Median: $100. Same data set. Nearly a factor-of-six difference.
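The arithmetic is easy to check. Here is a minimal sketch using only Python's standard library; the variable names are illustrative and not part of the original example.

```python
from statistics import mean, median

# Monthly spend for the ten customers: nine at $100, one whale at $5,000.
monthly_spend = [100] * 9 + [5000]

avg = mean(monthly_spend)    # total divided by count
mid = median(monthly_spend)  # midpoint of the sorted list

print(f"mean=${avg:,.0f}  median=${mid:,.0f}  ratio={avg / mid:.1f}")
```

Nobody in `monthly_spend` pays anything close to `avg`, which is the whole point.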

[Figure: number line of monthly spend from $0 to $5,000, marking the median ($100), the mean ($590), and the whale ($5,000), with an empty gap where nobody sits.]
Nine customers cluster at $100. One whale sits at $5,000. The mean ($590) floats in dead space between them. The median ($100) sits where the customers actually are.

Why the mean moves toward the big

The arithmetic mean has a property that is genuinely useful in symmetric distributions but catastrophic in skewed ones: every observation pulls the mean toward itself in proportion to its distance from the current center.

A customer spending $5,000 in a pool of $100 spenders does not just add one data point. It yanks the mean 490 dollars in its direction — a force no single $100 customer can counteract. Add another whale and the mean moves again. The nine ordinary customers are statistically outvoted by one extreme value.

This is not a flaw in the math. It is the mean doing exactly what it is designed to do: reflect the total divided by the count. When your goal is to know the total, the mean is perfect. When your goal is to know what a typical customer looks like — their needs, their price sensitivity, the features they actually use — the mean is the wrong question.

The median does not care about extremes. It cares about position. Move your whale from $5,000 to $50,000 and the median stays at $100. That stability is the feature: the median tells you where the bulk of your distribution lives, undistorted.
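A quick way to see both effects, the pull on the mean and the stability of the median, is to vary the whale and recompute. This is a sketch under the same ten-customer assumption; `base` and `whale` are names chosen here for illustration.

```python
from statistics import mean, median

base = [100] * 9  # the nine ordinary customers

# How far a single $5,000 whale drags the mean away from $100.
shift = mean(base + [5_000]) - mean(base)
print(f"mean shift from one whale: ${shift}")

# Grow the whale tenfold: the mean follows it, the median does not move.
for whale in (5_000, 50_000):
    spend = base + [whale]
    print(f"whale=${whale:,}  mean=${mean(spend):,.0f}  median=${median(spend):,.0f}")
```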

Business data is almost always skewed

Here is the uncomfortable truth behind most dashboards: the distributions that matter in business are rarely symmetric. Revenue, order value, session duration, support ticket count, user-generated content volume — all of these tend to cluster at low values with a long tail dragging toward high values.

Statisticians call this a right-skewed distribution (the tail extends to the right, toward higher values). The specific shape — how long and heavy that tail is — varies. But the direction is nearly universal in commerce, because the underlying drivers are multiplicative rather than additive. A power user does not use your product ten times more than average; they use it a hundred times more. An enterprise deal is not twice the price of a self-serve plan; it is twenty times.

Three patterns compress this intuition:

The mean drifts right of the median. In a typical right-skewed distribution, the mean is larger than the median. The gap between them is a signal: the bigger the gap, the more your distribution is being pulled by outliers. If the mean roughly equals the median, you are probably dealing with something close to symmetric. If the mean is 5x the median, you have a very heavy tail.

The Pareto pattern. In 1906, Vilfredo Pareto observed that 80% of Italy’s land was owned by 20% of the population. Business analysts later recognized the same 80/20 pattern — roughly 80% of revenue from 20% of customers — appearing across industries. The exact ratio shifts, but the asymmetry holds: a small cohort of high-value customers accounts for a disproportionate share of everything that matters. In SaaS it often looks closer to 90/10.

Percentiles describe shape; averages collapse it. The p50 (50th percentile, identical to the median) tells you the typical customer. The p90 (90th percentile) tells you what the top 10% looks like. The p99 tells you about your whales. Together they sketch the shape of your distribution. A single mean collapses that shape into one number and throws away the information you most need.
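To make the percentile idea concrete, here is a hedged sketch on synthetic data. The log-normal sample and its parameters are assumptions chosen only to produce a right-skewed shape; they are not drawn from any real customer base.

```python
import random
from statistics import mean, quantiles

random.seed(7)  # fixed seed so the sketch is repeatable

# Hypothetical right-skewed monthly spend (log-normal draws, not real data).
spend = [random.lognormvariate(5.0, 1.2) for _ in range(1_000)]

# quantiles(..., n=100) returns the 99 cut points p1 through p99.
cuts = quantiles(spend, n=100)
p50, p90, p99 = cuts[49], cuts[89], cuts[98]

print(f"mean={mean(spend):,.0f}")
print(f"p50={p50:,.0f}  p90={p90:,.0f}  p99={p99:,.0f}")
```

Reading the three percentiles side by side shows how quickly the tail stretches, which the mean alone cannot convey.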

The products we build for ghosts

The practical damage from mean-addiction is not just aesthetic. It shapes decisions.

A product team looks at “average session duration: 12 minutes” and concludes that users are deeply engaged. But if the median session is 90 seconds and a handful of power users average two hours, the team is building features for the power users while the majority leave within the first minute and a half. The feature roadmap optimizes for a usage pattern that describes 3% of the base.

A pricing team sets the upgrade threshold based on average usage and finds that most customers never hit it — because average usage was pulled up by a small number of heavy users who were never going to churn anyway. The paywall is invisible to the customers who need to be nudged.

A growth team celebrates “average revenue per user up 15% this quarter.” But if that increase came entirely from two new enterprise contracts while the SMB segment stagnated, the mean has hidden a segment-level crisis behind an aggregate success.
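The growth scenario can be reproduced with a few lines. The customer counts and prices below are hypothetical, picked only so that the aggregate lift comes out at 15%.

```python
from statistics import mean

# Hypothetical book of business: 98 SMB accounts at $100 in both quarters.
smb_q1 = [100] * 98
smb_q2 = [100] * 98
# Two new enterprise contracts land in Q2 (illustrative price).
enterprise_q2 = [850, 850]

arpu_q1 = mean(smb_q1)
arpu_q2 = mean(smb_q2 + enterprise_q2)
overall_growth = arpu_q2 / arpu_q1 - 1
smb_growth = mean(smb_q2) / mean(smb_q1) - 1

print(f"overall ARPU growth: {overall_growth:.0%}")
print(f"SMB ARPU growth:     {smb_growth:.0%}")
```

The aggregate figure rises while the SMB segment does not move at all, which only shows up once the two segments are computed separately.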

The mean flattens. It erases segment dynamics, hides bimodal distributions, and makes your data look healthier — or sicker — than it is.

| Statistic | What it measures | What it hides | Use when |
|---|---|---|---|
| Mean | Total ÷ count | Outlier effect, shape | Symmetric data, summing totals |
| Median (p50) | Middle value, typical | Tail magnitude | Skewed data, typical user story |
| Percentiles (p75, p90, p99) | Distribution shape | Nothing; they show shape | Always pair with median |
Mean, median, and percentiles answer different questions. None is universally better — but the mean is uniquely dangerous on skewed data when misread as “typical.”

What the gap is telling you

There is a diagnostic you can run on any metric in thirty seconds: compute both mean and median and look at their ratio.

If mean / median is close to 1.0, your distribution is roughly symmetric. The mean is trustworthy as a description of a typical value.

If mean / median is between 1.5 and 3, your distribution is moderately right-skewed. Report both. Segment before drawing conclusions.

If mean / median exceeds 3, your distribution is heavily skewed. The mean is being driven by a small number of outliers. In our ten-customer example the ratio is 590 / 100 = 5.9. That is a flashing warning sign that the mean describes almost nobody in your actual population.

The ratio is not a formal statistical test. It is a smell. It tells you to slow down, disaggregate the data, and ask who the outliers are before you let the mean drive a strategic decision.
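The smell test can be wrapped in a small helper. This is a sketch, not a formal test: the function name `skew_smell` is made up here, the thresholds are the ones given above, and the handling of ratios below 1.5 is an assumption because the bands above leave that range unlabeled.

```python
from statistics import mean, median

def skew_smell(values):
    """Return (ratio, label) for the mean/median smell test on right skew."""
    mid = median(values)
    if mid <= 0:
        raise ValueError("ratio is only meaningful for a positive median")
    ratio = mean(values) / mid
    if ratio > 3:
        label = "heavily skewed"
    elif ratio >= 1.5:
        label = "moderately skewed"
    else:
        # Assumption: anything under 1.5 is lumped together here.
        label = "roughly symmetric or mildly skewed"
    return ratio, label

ratio, label = skew_smell([100] * 9 + [5000])
print(f"{ratio:.1f} -> {label}")
```

The helper only looks for right skew; a ratio well below 1.0 would point to a left tail and deserves its own look.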

The 80/20 in practice

The Pareto principle — named for that Italian land observation but generalized into a rule of thumb across economic systems — is not a law of nature. It is an observation that tends to hold when value creation is multiplicative and concentrated.

In a typical B2B SaaS business, if you rank customers by annual contract value and look at the top 20%, you will frequently find they account for somewhere between 70% and 90% of revenue. The exact split varies, but the qualitative point is universal: your customer list is not a list of equals.
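Checking concentration on your own customer list takes one sort and two sums. A minimal sketch, with `top_share` as a hypothetical helper name and the ten-customer example standing in for real contract values:

```python
def top_share(values, fraction=0.2):
    """Share of the total contributed by the top `fraction` of customers."""
    ranked = sorted(values, reverse=True)
    k = max(1, round(len(ranked) * fraction))  # at least one customer
    return sum(ranked[:k]) / sum(ranked)

share = top_share([100] * 9 + [5000])
print(f"top 20% of customers -> {share:.0%} of revenue")
```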

This matters for several reasons that the mean obscures.

First, churn risk is not symmetric. Losing one whale can wipe out more revenue than losing twenty median customers. A 1% monthly churn rate on a portfolio where 90% of revenue comes from 10% of accounts is a very different risk profile from a 1% rate on a uniform distribution.

Second, acquisition economics look different by segment. Your customer acquisition cost (the total sales and marketing spend divided by the number of new customers acquired in a period) might average out to something manageable. But if the economics of signing an enterprise customer involve a six-month sales cycle and a solutions engineer, while the economics of a self-serve signup involve a Google ad click and a credit card form, averaging these costs together produces a number that describes neither business accurately.
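A sketch of the blended-versus-segment arithmetic follows. Every figure in it is hypothetical and chosen only to illustrate the gap; none comes from a real company.

```python
# Hypothetical quarterly sales and marketing spend per segment.
segments = {
    "enterprise": {"spend": 120_000, "new_customers": 4},
    "self_serve": {"spend": 30_000, "new_customers": 600},
}

total_spend = sum(s["spend"] for s in segments.values())
total_new = sum(s["new_customers"] for s in segments.values())
blended_cac = total_spend / total_new

cac_by_segment = {
    name: s["spend"] / s["new_customers"] for name, s in segments.items()
}

print(f"blended CAC: ${blended_cac:,.0f}")
for name, cac in cac_by_segment.items():
    print(f"{name} CAC: ${cac:,.0f}")
```

The blended figure lands between the two segment figures and matches neither, so it is a poor input for budgeting either motion.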

Third, the product your whales need is not always the product your median customer needs. Enterprise buyers often want SSO, audit logs, dedicated support, and custom SLAs. SMB buyers want to be onboarded in ten minutes and never talk to anyone. Building for the average of those two groups produces something mediocre for both.

How this shows up in the wild

E-commerce companies discovered this early. Amazon’s average order value across their entire customer base says almost nothing useful about how to design the checkout experience for any specific segment. The calculation that actually drives decisions is the distribution of order value, segmented by acquisition source, product category, and customer tenure.

Fintech companies face an extreme version. Revolut’s average balance per user looks healthy at the aggregate level. But the vast majority of accounts hold small balances — the product is used for travel money and currency conversion — while a small number of users hold large investment portfolios. Treating these as the same customer and optimizing for the mean would be incoherent product strategy.

Subscription businesses know this instinctively when they talk to customers, but their dashboards often betray them. Monthly Recurring Revenue (the predictable, recurring portion of revenue normalized to a monthly figure) is typically reported as an aggregate and as a per-customer average. Both numbers hide the bimodal or heavily skewed nature of the underlying distribution.

The analysts who catch this early — who notice that the mean is floating in empty space between two real clusters — are the ones who find the actual product and pricing insight. The ones who report the mean without the median are quietly building for a customer who was never there.

What to do instead

Reporting the mean is not wrong. It is necessary for computing totals, understanding aggregate economics, and modeling growth. The problem is reporting it alone.

Every mean should ship with its median. Every summary should note whether the two are close or divergent. If they diverge substantially, the right next move is to look at percentiles — p50, p75, p90, p99 — and let the shape of the distribution inform the question rather than a single collapsed number.

Segment before you summarize. Revenue per customer, computed across all customers in a single number, is often less useful than revenue per customer within each acquisition cohort, or within each plan tier, or within each industry vertical. Segmentation does not eliminate skew, but it isolates it to the right group, where the outlier story is actionable rather than distorting.
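Segmenting first can be as simple as grouping before summarizing. The sketch below reuses the ten-customer example; the tier labels `self_serve` and `enterprise` are assumptions added for illustration.

```python
from collections import defaultdict
from statistics import mean, median

# Ten customers tagged with a hypothetical plan tier.
customers = [("self_serve", 100)] * 9 + [("enterprise", 5000)]

by_tier = defaultdict(list)
for tier, spend in customers:
    by_tier[tier].append(spend)

# Mean and median per tier instead of one number for everyone.
summary = {tier: (mean(v), median(v)) for tier, v in by_tier.items()}
for tier, (avg, mid) in summary.items():
    print(f"{tier}: n={len(by_tier[tier])}  mean=${avg:,.0f}  median=${mid:,.0f}")
```

Within each tier the mean and median agree, so each summary describes customers who actually exist.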

Be especially suspicious of averages in presentations designed to celebrate. The mean makes things look better than they are when the tail is pulling up, and worse than they are when the tail is pulling down. Its bias has a direction, and that direction depends on which way the skew falls.

The $590 customer does not exist. Nine customers exist at $100 and one exists at $5,000, and they need different things, have different risk profiles, and generate different unit economics. The moment you replace that reality with a synthetic average, you have introduced a fiction into your decision-making process.

The mean is a useful tool. It is not a description of reality. Know the difference, always pair it with the median, and stop optimizing for a customer who was never in your database.
