datarekha
Patterns June 2, 2026

Window functions changed how I write SQL

The leap from GROUP BY to window functions is the single biggest level-up an analyst makes — and the intuition behind OVER, PARTITION BY, and LAG is simpler than most tutorials admit.

10 min read · by datarekha · sqlanalyticswindow-functionsdata-engineering

In 2018, a senior analyst at a fintech I was consulting for handed me a SQL query to review. It was sixty lines long. It answered a question that can be answered in twelve lines today. The query computed each customer’s cumulative spend alongside their individual transactions — and it did it with a self-join that scanned the transactions table twice, a correlated subquery in the SELECT list, and a comment that said “don’t touch this, it’s fragile.” Every analyst who has been around long enough has a version of that query in their past.

Window functions made it obsolete. Not because they are syntactically clever, but because they model the problem more honestly. The problem was never “collapse transactions into totals.” The problem was “for each transaction row, how much has this customer spent up to and including this moment?” Those are different questions. GROUP BY answers the first. Window functions answer the second.

That distinction — keeping every row versus collapsing rows — is the entire conceptual leap. Everything else is mechanics.

What GROUP BY actually does to your data

GROUP BY is a reduction. You hand it five rows and ask a question; it gives you back one row and a number. The detail is gone. That is the right tool when you genuinely want a summary table: total revenue by region, average order size by category, count of users by cohort. Summary tables are useful. But they are the end of the road. Once the rows are collapsed, you cannot ask “and for each original row, how does this value compare to its group’s total?” because the original rows no longer exist.

The classic workaround is to compute the summary in a subquery and then join it back to the detail table. You have written this. So have I.

SELECT
    t.customer_id,
    t.transaction_date,
    t.amount,
    s.total_spend
FROM transactions t
JOIN (
    SELECT customer_id, SUM(amount) AS total_spend
    FROM transactions
    GROUP BY customer_id
) s ON t.customer_id = s.customer_id;

This works. It is also conceptually redundant — you are scanning the same table twice to answer one question — and it scales badly on large datasets because the optimizer has to reconcile two passes. The self-join version is worse: it requires the correlated subquery to execute once per outer row, which on a million-row table can take orders of magnitude longer than the window equivalent.

The deeper problem is readability. When someone reads that query cold, they have to mentally reconstruct the join condition before they can understand the intent. The query is about the machinery, not the idea.

OVER: the clause that changes everything

A window function is any aggregate, ranking, or offset function followed by an OVER clause. The OVER clause defines a “window” — a subset of rows that the function can see when computing each output value. Critically, the window is defined per output row, and the output row itself is preserved in the result.

SELECT
    customer_id,
    transaction_date,
    amount,
    SUM(amount) OVER (PARTITION BY customer_id ORDER BY transaction_date) AS running_total
FROM transactions;

Every row in the transactions table appears in the output. The running_total column holds the cumulative sum of amount for that customer, ordered by date, up to and including that row. The original amount is still there. The transaction_date is still there. Nothing was destroyed.

OVER takes up to three sub-clauses:

  • PARTITION BY divides the rows into independent groups. The function resets at each partition boundary. Think of it as GROUP BY for the window — but without collapsing. If you omit it, the entire result set is one partition.
  • ORDER BY establishes a sort order within each partition. For ranking and cumulative functions, this is what makes the result meaningful. Without ORDER BY, SUM is just a total across the whole partition, not a running total.
  • A frame specification (ROWS BETWEEN, RANGE BETWEEN) defines which rows within the ordered partition are “in scope” for the current row’s calculation. Most databases default to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW when you specify ORDER BY, which gives you the running total behavior. You rarely need to override this, but knowing it exists matters when you want moving averages.
GROUP BYcollapses 5 rows → 1 summarySUM(…) OVER (…)keeps all 5 rows, adds running totaldateamountcustomerJan 1$40AJan 3$60AJan 5$30AJan 6$80AJan 9$50A↓ collapsedcustomer A · $2605 rows gone. Only the total remains.dateamountrunning_totalJan 1$40$40Jan 3$60$100Jan 5$30$130Jan 6$80$210Jan 9$50$260↓ all rows keptDetail intact. Aggregate attached. Both in one pass.
GROUP BY destroys the five source rows to produce one total. SUM with OVER keeps all five rows and appends a running total computed within the window.

Ranking: ROW_NUMBER, RANK, and DENSE_RANK

Ranking functions are where window functions go from “convenient” to “essential.” The classic problem: find the most recent order for each customer. With GROUP BY alone, you cannot do it without a subquery, because you need the order detail alongside the grouped maximum date. With ROW_NUMBER, you just rank and filter.

SELECT *
FROM (
    SELECT
        customer_id,
        order_id,
        order_date,
        amount,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id
            ORDER BY order_date DESC
        ) AS rn
    FROM orders
) ranked
WHERE rn = 1;

ROW_NUMBER assigns 1, 2, 3… with no ties. If two orders share the same date, one gets 1 and the other gets 2 — which one is arbitrary unless you add a tiebreaker to the ORDER BY. RANK, by contrast, assigns the same rank to ties but skips the next rank: 1, 1, 3. DENSE_RANK does not skip: 1, 1, 2.

The choice matters for business logic. In a competition leaderboard you might want RANK so the audience understands someone tied for first. In a “pick exactly one row per customer” deduplication you want ROW_NUMBER with a deterministic ORDER BY so the result is stable across runs.

A pattern that shows up constantly in analytics engineering is top-N per group. Give me the top three products by revenue in each category. The naive approach involves correlated subqueries that become exponentially slower as the table grows. The window approach is linear:

SELECT *
FROM (
    SELECT
        category,
        product_name,
        revenue,
        DENSE_RANK() OVER (
            PARTITION BY category
            ORDER BY revenue DESC
        ) AS rnk
    FROM product_sales
) ranked
WHERE rnk <= 3;

This single scan of product_sales, combined with a filter in the outer query, replaces what would otherwise be a three-way join or a nested EXISTS check.

LAG and LEAD: accessing adjacent rows without a self-join

An analyst working on a subscription business spends a lot of time comparing a value to its prior-period equivalent. Month-over-month revenue change. Previous login date. Days between consecutive purchases. All of these require looking at a different row from the current one — and before window functions, that meant a self-join on the same table, typically joining on some calculated offset that was fragile and hard to read.

LAG and LEAD are the honest solution. LAG reaches backward into the ordered window; LEAD reaches forward.

SELECT
    customer_id,
    order_date,
    amount,
    LAG(amount) OVER (
        PARTITION BY customer_id
        ORDER BY order_date
    ) AS prev_amount,
    amount - LAG(amount) OVER (
        PARTITION BY customer_id
        ORDER BY order_date
    ) AS delta
FROM orders;

Each row now carries the previous order’s amount and the difference. No join. No subquery. The first order per customer returns NULL for prev_amount, which is the correct answer — there is no prior row, and NULL is SQL’s way of saying “unknown or absent,” not “zero.”

LAG and LEAD accept an optional offset argument (defaulting to 1) and an optional default value for when the offset falls outside the partition. LAG(amount, 2, 0) looks two rows back and returns 0 if no such row exists. These details matter when you are computing multi-period comparisons — quarterly vs. year-ago-quarter — and you want the query to be self-contained rather than requiring a calendar table join.

The frame clause: moving averages and the ROWS vs RANGE subtlety

The frame clause is where most analysts stop reading the documentation, and it is the source of a genuinely confusing class of bugs.

When you write SUM(amount) OVER (ORDER BY transaction_date), the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. RANGE is logical — it includes all rows whose ORDER BY value is less than or equal to the current row’s value. If two transactions happen on the same date, both get the same running total (the total including both). That is usually what you want.

For a 7-day moving average, you want ROWS, not RANGE, and you want a specific number:

SELECT
    transaction_date,
    daily_revenue,
    AVG(daily_revenue) OVER (
        ORDER BY transaction_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS moving_avg_7d
FROM daily_revenue;

ROWS counts literal row positions, not logical ranges of values. If there are gaps in the date sequence — weekends, for example, with no transactions — ROWS will include the preceding 6 rows regardless of the date gap, while a RANGE-based approach keyed on actual date arithmetic would require a different technique entirely (generating a date spine and left-joining). Know which behavior you need before you write the frame.

FUNCTIONWHAT IT COMPUTESOLD WORKAROUNDSUM / AVG OVERORDER BY …Running total or movingaverage per partitionSelf-join + correlatedsubquery in SELECTROW_NUMBER OVERPARTITION BY …Unique sequential rank;no ties, deduplicationSubquery with MAX date+ join back to detailRANK / DENSE_RANKOVER ORDER BY …Top-N per group; handlesties correctly by designNested EXISTS orthree-way joinLAG / LEAD OVERPARTITION BY ORDER BYPrior/next row value;period-over-period deltaSelf-join onoffset row number
Four core window function patterns, what each computes, and the SQL workaround it replaces in a single-pass query.

The before-and-after: a real example

Consider this requirement, which is genuinely common in product analytics: for each user session, show the session’s revenue and the percentage of that user’s total revenue that session represents. Under the old model:

SELECT
    s.user_id,
    s.session_id,
    s.session_revenue,
    s.session_revenue / t.total_revenue AS pct_of_total
FROM sessions s
JOIN (
    SELECT user_id, SUM(session_revenue) AS total_revenue
    FROM sessions
    GROUP BY user_id
) t ON s.user_id = t.user_id;

Two scans. A derived table. A join. The query plan has to hash-aggregate the inner query before the outer join can proceed. Now:

SELECT
    user_id,
    session_id,
    session_revenue,
    session_revenue / SUM(session_revenue) OVER (
        PARTITION BY user_id
    ) AS pct_of_total
FROM sessions;

One scan. No join. The window function computes the per-user total alongside each row. The optimizer in virtually every modern database — BigQuery, Redshift, Snowflake, Postgres, DuckDB — handles this more efficiently than the subquery join, because it can compute the window aggregate in a single pass over the sorted partitions.

The percentage here, session_revenue / SUM(session_revenue) OVER (PARTITION BY user_id), is a fraction whose denominator is the full partition total. It does not need ORDER BY because there is no accumulation — every row in the partition should see the same denominator. This is one of the more useful intuitions about window functions: ORDER BY is only needed when the answer depends on which rows come before the current one. Totals and averages over the whole partition do not need it; running totals and ranks do.

Where this lives in industry

Analysts and data engineers use window functions constantly. In analytics engineering — the discipline of building data models in SQL transform tools like dbt — window functions are the primary tool for:

  • Session attribution: assigning each event to its originating marketing source using LAG to find the last touch before conversion
  • Cohort retention: computing ROW_NUMBER over each user’s event sequence to identify “nth event” behavior
  • Slowly changing dimensions (tracking how a value like price changes over time and which version was active at any point): LEAD to find the effective end date of each version
  • Deduplication: ROW_NUMBER() OVER (PARTITION BY natural_key ORDER BY updated_at DESC) to pick the most recent version of each record in a source system that lacks proper primary keys

In streaming SQL systems — Flink, Spark Structured Streaming — window functions translate into sliding and tumbling windows over event time, with PARTITION BY mapping to stream partitions. The semantics change slightly because unbounded preceding is often unsafe in an infinite stream, but the intuition carries over: you are defining a scope for computation relative to each event, not collapsing events into summaries.

The ubiquity is not accidental. The relational model is fundamentally about sets of rows, and any interesting analysis requires comparing each row to some context: its group, its neighbors in time, its rank among peers. Window functions are the algebraic primitive for expressing that context without destroying the rows themselves.

The mental model worth keeping

GROUP BY is a lossy transform. It is the right choice when the output is genuinely a summary. Window functions are a lossless annotation. They are the right choice when you want the detail row to carry derived context. The two are complementary, not competing — you will use both in the same query when you need to, say, annotate each order with a running total and then group those annotated rows by month to build a chart.

The moment this clicks, you will stop writing self-joins and correlated subqueries for the class of problems they were never the right answer to. That sixty-line query I reviewed in 2018 came down to fourteen lines when I rewrote it. The analyst who wrote it had not learned window functions yet — not because they were incurious, but because the teaching material of the time was dominated by GROUP BY intuitions and most tutorials treated OVER as a footnote. It is not a footnote. It is the sentence.

Skip to content