datarekha

OLAP: Concept Hierarchies & Measures

Roll up sales to the year, drill back down to the day, and learn which aggregates you can merge from sub-totals — and which need every raw row.

7 min read Intermediate GATE DA Lesson 76 of 122

What you'll learn

  • Concept hierarchies — day → month → quarter → year, city → state → country
  • OLAP operations: roll-up, drill-down, slice, dice
  • Distributive measures (SUM, COUNT, MIN, MAX) merge cleanly from sub-aggregates
  • Algebraic measures (AVG, stddev) reduce to a fixed set of distributive ones
  • Holistic measures (MEDIAN, MODE, percentiles) need every underlying row

Before you start

Picture a sales dashboard. You start at “total sales by day for the last month.” You zoom out — “by month for the last year.” Zoom out again — “by year for the last decade.” Then you fix one slice — “just the Bangalore store” — and compare it across years.

Every one of those moves is a named OLAP operation, and they all rely on two ideas: a concept hierarchy telling the cube how to summarise (day rolls up to month, city rolls up to state), and a measure type telling it which aggregations are safe to combine across sub-groups. That last point matters more than it sounds — get it wrong and you’ll compute an “average of averages” that does not equal the true average.

Concept hierarchies — levels of summary

A concept hierarchy is just a chain saying “this level rolls up to that level.” A few classics:

  • Time: day → month → quarter → year
  • Location: city → state → country
  • Product: SKU → brand → category

These let the warehouse pre-build aggregates at every level. Asking for “sales by year” then becomes a one-row lookup, not a scan of every transaction.

DayMonthQuarterYearRoll-up → coarserDrill-down → finer
A four-level time hierarchy. Roll-up climbs (more aggregated, fewer rows); drill-down descends (more detail, more rows).

The four OLAP operations

  • Roll-up — go up the hierarchy: aggregate sales from month to quarter to year. Coarser, fewer rows.
  • Drill-down — go down: split year sales back into months. Finer, more rows.
  • Slice — fix one dimension to a single value: “only 2024.” The cube drops a dimension.
  • Dice — apply filters on several dimensions at once: “2024 AND Bangalore AND Electronics.” The cube shrinks to a sub-cube.

Measure categorisation — the rule that decides what you can pre-compute

The single most-tested idea here: not every aggregate can be combined from sub-aggregates. Some can. Some cannot. Three categories cover all of them.

  • Distributive — the aggregate over the whole equals an aggregate of the sub-aggregates. The big four: SUM, COUNT, MAX, MIN. Daily sales sums add up to the monthly sum; monthly sums add up to the annual sum. No raw rows needed once the lower-level totals are stored.
  • Algebraic — computable from a fixed number of distributive sub-aggregates. AVG = SUM / COUNT — store the sum and the count per month, recover the year’s average exactly. Standard deviation is similar (needs SUM, COUNT, SUM of squares).
  • Holistic — needs every underlying row; no bounded set of sub-totals is enough. MEDIAN, MODE, rank, percentiles. The median of twelve monthly medians is not the median of the year.

This is exactly why warehouses pre-materialise SUMs and COUNTs at every hierarchy level — and don’t pre-materialise medians.

How GATE asks this

Usually an MCQ: “Classify AVG as distributive, algebraic, or holistic.” Or an MSQ listing measures and asking which are distributive — the answer is always the SUM/COUNT/MIN/MAX subset. Occasionally an MSQ on which OLAP operations move you up vs down the hierarchy.

Worked example — classify three aggregates

A retail warehouse pre-computes monthly aggregates per store. You want annual totals per store without re-scanning the daily rows. Which measures can you recover, and how?

  • SUM(sales) — distributive. Add the twelve monthly sums per store. Annual total recovered exactly.
  • AVG(sales) — algebraic. Store monthly SUM and monthly COUNT separately; then annual AVG = (sum of twelve monthly sums) / (sum of twelve monthly counts). Two distributive sub-aggregates suffice, so it’s algebraic — not distributive on its own (because (avg_jan + avg_feb + ... + avg_dec) / 12 generally does NOT equal the true annual average when month lengths differ).
  • MEDIAN(sales) — holistic. You cannot combine twelve monthly medians into the annual median; for that you need the daily rows. Holistic measures must be recomputed from raw data, which is why warehouses do not pre-store them per hierarchy level.

Quick check

Quick check

0/7
Q1Classify MEDIAN(sales): distributive, algebraic, or holistic?
Q2Classify AVG(sales): distributive, algebraic, or holistic?
Q3Which measures are DISTRIBUTIVE? (select all that apply)select all that apply
Q4Which OLAP operations move you UP a concept hierarchy (i.e., to a coarser summary)? (select all that apply)select all that apply
Q5A warehouse pre-computes the monthly SUM and monthly COUNT of sales per store. From these alone, which aggregates can you exactly recover for the YEAR per store? (select all that apply)select all that apply
Q6Slicing on `Year = 2024` and `City = Bangalore` and `Category = Electronics` simultaneously is which OLAP operation?
Q7COUNT(DISTINCT customer) — the number of unique customers — was not in the worked example. To get the year's distinct-customer count, can you just add up the twelve monthly distinct-customer counts? What does that make the measure?

Practice this in an interview

All questions
What are ROLLUP, CUBE, and GROUPING SETS, and when would you choose each?

All three are extensions to GROUP BY that produce multiple levels of aggregation in a single query. ROLLUP produces hierarchical subtotals, CUBE produces all possible subtotal combinations, and GROUPING SETS lets you specify exactly which grouping combinations you want.

What is the difference between OLTP and OLAP workloads, and how does that drive database design choices?

OLTP systems handle many small, latency-sensitive transactions that read and write a few rows at a time, so they are optimized for fast point lookups and row-level locking. OLAP systems run infrequent but wide analytical queries over millions of rows, so they benefit from columnar storage, bulk scans, and denormalized schemas that minimize joins.

What is the difference between OLTP and OLAP systems, and why can't you run analytics on your production database?

OLTP (Online Transaction Processing) systems handle high-throughput, low-latency reads and writes for individual records — think order placement, user authentication. OLAP (Online Analytical Processing) systems handle complex aggregations over millions of rows for business intelligence. Running heavy analytics directly on an OLTP database locks rows, competes for I/O, and slows application queries that customers feel.

How does columnar storage work, and how does partitioning improve query performance in a data warehouse?

Columnar storage colocates values from the same column on disk, so aggregation queries read only the columns they need rather than full rows — dramatically reducing I/O on wide tables. Partitioning physically separates data into subdirectories (e.g., by date), allowing the query engine to skip entire partitions whose predicate cannot match, cutting scan volume from the full table to just the relevant slice.

Sign in to track your progress

Completed lessons, your XP, level, and streak save to your account — it's free and takes a few seconds.

Explore further

Related lessons

Skip to content