GroupBy is a three-act play: split, apply, combine

In Q3 last year, a sales dataset landed in my inbox: 84 000 transaction rows, four regions, twelve months. The ask was simple — revenue by region, month over month growth, flag any region that underperformed by more than fifteen percent. Three people on the team wrote three different scripts. They all got the same numbers. Two of them were wrong about why their code worked, and one of them hit an off-by-one error in month ordering because they did not understand what GroupBy was actually doing under the hood. The model in their head was broken, so their code was brittle.

The model that fixes all of it is six words: split, apply, combine.

It was named and formalized by Hadley Wickham in his 2011 paper on the plyr R package, and Wes McKinney built Pandas GroupBy directly on top of the same abstraction. SQL GROUP BY predates both by decades and does the same thing, though it hides the machinery. Once you internalize this three-act structure, you stop asking “what does GroupBy do?” and start asking “which act am I in, and what should come out?”

Act One: Split

The split step takes your full table — one flat rectangle of rows — and partitions it into sub-tables, one per unique value of the key column (or columns). Nothing is computed yet. Nothing is reduced. You are just drawing invisible dividing lines through your data.

With a sales table that has a region column containing “North,” “South,” and “West,” a groupby("region") call mentally produces three independent DataFrames:

North: all rows where region equals “North”
South: all rows where region equals “South”
West: all rows where region equals “West”

In Pandas this split is lazy — it does not materialize those sub-frames in memory the moment you call groupby. It stores the grouping logic and waits. SQL does the same thing; GROUP BY is a declaration, not an execution order. The actual partition happens when you commit to an operation. But conceptually, the data is already divided.

This laziness is not a footnote. It is why you can chain .groupby("region") and then choose any of the three operations independently. The split step does not presuppose what you will do next.

Eight rows split by region into three independent groups, a sum applied to each, and the results combined into a three-row summary.

Act Two: Apply

The apply step is where the real thinking happens — and where most people stop being precise. A function is called once for each group, on just that group’s rows, as if the rest of the table does not exist. The group is independent. The function sees only its own rows.

This is the part that separates beginners from practitioners: understanding what shape the function must return, because that determines which operation you are actually doing.

Three possibilities, three names:

Aggregation (agg) returns a scalar per group. You feed in a column of numbers; you get back one number. Sum, mean, max, count, standard deviation — anything that collapses a vector to a point. The output will have one row per group. The original rows are gone; only the summaries survive.

Transformation (transform) returns a value for every row in the group, and that value must be the same length as the group it came from. The canonical use is broadcasting a group-level statistic back to every row — replacing each sale amount with the group’s mean, or computing each row’s share of its group’s total. The output is the same shape as the input.

Filtration (filter) takes a group and returns a boolean: keep this whole group, or discard it. The unit of decision is the group, not the row. You cannot keep three rows from North and drop two; you either keep all of North or none of it.

These three are not arbitrary variants. They map exactly onto three types of questions:

What is the total / average / max for each group? — that is agg.
For each row, how does it compare to its group? — that is transform.
Which groups pass a threshold? — that is filter.

When someone says “GroupBy is confusing,” what they usually mean is that they did not know which of these three questions they were asking.

Act Three: Combine

The combine step takes the independent per-group results and assembles them into a coherent return value. Pandas handles this automatically; you never call it explicitly. But understanding what combine does — and what it cannot do — prevents a class of subtle mistakes.

For agg, combine stacks the per-group scalar results into a new DataFrame with the group keys as the index. If you grouped by ["region", "month"], the combined output has a two-level MultiIndex — one label per unique (region, month) pair, one row per combination.

For transform, combine re-aligns the per-group results back to the original row positions using the same index the split step remembered. This is why transform output has exactly the same length and index as the input DataFrame. The alignment is not magic; it is the split step’s bookkeeping paying off.

For filter, combine simply drops the discarded groups and concatenates what remains. The output index still refers to the original row labels from the surviving groups.

The implication: after agg, you cannot access the original rows without another join. After transform, you can access both the original values and the group statistics in the same row, side by side. After filter, you have a subset of the original rows, untouched.

The sales example, concretely

Take 84 000 transaction rows. Columns: region, month, revenue.

Aggregation gives you total revenue by region. One row per region:

df.groupby("region")["revenue"].sum()
# North    9_800_000
# South    9_400_000
# West    11_500_000

Transformation gives you each transaction’s share of its region total — without collapsing:

df["region_share"] = (
    df["revenue"]
    / df.groupby("region")["revenue"].transform("sum")
)

Every row in the North group now has the same denominator (9 800 000). The row count is still 84 000. You can plot individual transactions alongside their normalized position in their group. This is impossible to do cleanly with agg alone.

Filtration drops any region whose total revenue falls below a threshold:

df.groupby("region").filter(
    lambda g: g["revenue"].sum() >= 10_000_000
)

South (9 400 000) and North (9 800 000) are dropped entirely. Only West survives — all its original rows, with their original values, intact.

The SQL parallel

SQL GROUP BY is the agg case only, and it is baked into the language syntax. Every column in the SELECT must either be in GROUP BY or wrapped in an aggregate function — because the engine can only return one row per group, and every non-key column must resolve to a scalar.

SQL does not have a built-in transform or filter at the group level. You get them through HAVING (which is group-level filtering, exactly like Pandas filter) and through window functions (which are transform without the GroupBy machinery). Window functions, which retain every row while attaching group statistics, are the SQL equivalent of transform — and their relative obscurity explains why so many analysts reach for subquery joins when they want per-row group statistics.

The three GroupBy operations mapped by return shape, SQL analog, and practical use case.

Where the model breaks — intentionally

There is a fourth method on GroupBy objects that does not fit the three-act frame: apply. The lowercase apply (as opposed to the conceptual “apply” step) is an escape hatch. You pass it an arbitrary function that receives a sub-DataFrame and can return anything — a scalar, a same-shape frame, a differently-shaped frame, or a series. Pandas tries to figure out how to recombine the results and sometimes guesses wrong.

This flexibility is precisely why experienced practitioners avoid apply unless they have no alternative. It is slow (it bypasses vectorized operations), it is unpredictable (combine behavior depends on the return type), and it signals that you have not found the right question yet. If you can express your intent as an aggregation, a transformation, or a filtration, you should. apply is the last resort, not the first tool.

Why the model matters beyond syntax

A wrong mental model does not just produce wrong code. It produces code that is right today and breaks silently when the data changes.

The team member who got the off-by-one error in month ordering had modeled GroupBy as “sort and summarize.” He thought the groups would come out in chronological order because he had written sort_values("month") before the groupby. In fact, Pandas GroupBy preserves insertion order of first occurrence by default — not sort order of the key column. His sort was irrelevant; the groups came out in the order the regions first appeared in the raw file. When the raw file was re-exported with a different row ordering, his month-over-month calculation broke.

If he had been thinking in split-apply-combine terms, the question would have been immediate: does the split step respect sort order? The answer is no. The fix is trivial: sort_index() on the result, not sort_values on the input.

The mental model is not just a conceptual nicety. It is a debugging tool. It tells you which act to look at when something goes wrong. Wrong number of rows in the output? Check the combine step — which operation did you use? Group statistics not matching expectations? Check the split step — is your key the right granularity? A transform result that is all NaN? Check the apply step — did your function return the right length?

Split, apply, combine. Three acts. Every GroupBy operation you will ever write lives somewhere in that structure. Find where, and the rest is just syntax.