datarekha
Data Engineering · Medium · Asked at Databricks, Amazon, Airbnb, LinkedIn

What is the difference between repartition and coalesce in Spark?

The short answer

repartition triggers a full shuffle to produce exactly N partitions, evenly sized when no partitioning column is given, and can either increase or decrease the partition count. coalesce merges existing partitions, preferring ones on the same executor, without a full shuffle. It is cheaper, but it can only decrease the partition count and may leave partitions uneven.

How to think about it

Both operations change the number of partitions, but they have fundamentally different cost profiles and valid use cases.

repartition(n)

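Uses a full shuffle to redistribute data across exactly n partitions: round-robin when no column is given, hash partitioning when one or more columns are passed. Round-robin output partitions end up roughly equal in size, which is ideal before writing output files. Hash partitioning by a key co-locates rows that share the key, which helps an expensive downstream join or aggregation on that key, but a skewed key produces skewed partitions.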

# Before writing: ensure 100 evenly sized Parquet files
df.repartition(100).write.parquet("output/")

# Hash-partition by a key for a downstream join on the same key
df.repartition(200, "user_id")

repartition can increase or decrease the partition count. It does not sort rows within or across partitions.
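
To make the difference between the two repartition modes concrete, here is a toy model in plain Python. It is not Spark code: the row layout, the helper names, and the use of crc32 as a stand-in for Spark's hash function are all assumptions made for illustration. It only counts how many rows would land in each partition.

```python
import zlib

def round_robin_sizes(rows, n):
    """Deal rows to n partitions in turn, as a keyless repartition(n) roughly does."""
    sizes = [0] * n
    for i, _ in enumerate(rows):
        sizes[i % n] += 1
    return sizes

def hash_sizes(rows, n, key):
    """Send each row to the partition chosen by hashing its key (crc32 is a stand-in)."""
    sizes = [0] * n
    for row in rows:
        sizes[zlib.crc32(str(row[key]).encode()) % n] += 1
    return sizes

# Hypothetical data: 1,000 rows, 700 of which belong to a single user.
rows = [{"user_id": "u0" if i < 700 else f"u{i}"} for i in range(1000)]

rr = round_robin_sizes(rows, 4)
hp = hash_sizes(rows, 4, "user_id")

print("round-robin:", rr)  # [250, 250, 250, 250]
print("hash by key:", hp)  # one partition holds at least the 700 rows of "u0"
```

Round-robin gives equal sizes regardless of content, while hashing keeps every row of a key together, so a hot key makes one partition much larger than the rest.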

coalesce(n)

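Merges existing partitions into fewer, larger ones without a full shuffle. Each output partition is the union of several input partitions, preferably ones already on the same executor. Some data may still cross the network when the merged partitions live on different executors, but the all-to-all transfer is avoided. Because input partitions are never split, the output partitions may be uneven.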

# After a heavy filter that left 200 tiny partitions, reduce to 20
df.filter("active = true").coalesce(20).write.parquet("output/")

coalesce can only decrease the partition count. Calling coalesce(500) on a 100-partition DataFrame is a silent no-op: the result still has 100 partitions.
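
The unevenness and the no-op behaviour can be sketched with another toy model in plain Python. This is a simplification under stated assumptions: the partition sizes are made up, and merging index neighbours stands in for Spark's locality-aware grouping, which this sketch does not reproduce.

```python
def coalesce_sizes(sizes, n):
    """Merge whole partitions into at most n groups of neighbours; never split one."""
    if n >= len(sizes):
        return list(sizes)  # asking for more partitions changes nothing
    merged = [0] * n
    for i, size in enumerate(sizes):
        merged[i * n // len(sizes)] += size
    return merged

def repartition_sizes(sizes, n):
    """Spread all rows evenly over n partitions, as a round-robin shuffle would."""
    base, extra = divmod(sum(sizes), n)
    return [base + (1 if i < extra else 0) for i in range(n)]

# Hypothetical row counts per partition after a selective filter.
after_filter = [900, 5, 5, 10, 40, 40]

print(coalesce_sizes(after_filter, 2))     # [910, 90]
print(repartition_sizes(after_filter, 2))  # [500, 500]
print(coalesce_sizes(after_filter, 10))    # [900, 5, 5, 10, 40, 40]
```

The merged result inherits whatever imbalance the input had, whereas the shuffle-based model evens it out at the cost of moving every row.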

When to use each

| Scenario | Use |
| --- | --- |
| Increase partition count | repartition |
| Even output files | repartition |
| Partition by a join/group key | repartition(n, col) |
| Reduce partitions after heavy filter | coalesce |
| Save on shuffle cost when shrinking | coalesce |

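# Single output file: repartition(1) pays for a full shuffle and still ends in
# one large task, so prefer coalesce(1) when the upstream work is cheap.
# Caveat: coalesce(1) can collapse the upstream stage into a single task too;
# if the upstream work is heavy, repartition(1) keeps it parallel.
# Spark writes "report.csv" as a directory containing one part file.
df.coalesce(1).write.csv("report.csv")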
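
The scenarios in the table can be read as a small decision rule. The helper below is a hypothetical sketch in plain Python, not part of any Spark API: the function name and parameters are invented here, and it only returns the call it would suggest as a string.

```python
from typing import Optional

def pick_partition_op(current: int, target: int,
                      need_even: bool = False,
                      key: Optional[str] = None) -> str:
    """Suggest a call based on the scenarios in the table (illustrative only)."""
    if key is not None:
        # Partitioning by a join/group key always needs a shuffle.
        return f'repartition({target}, "{key}")'
    if target > current or need_even:
        # Growing the count or balancing sizes also needs a shuffle.
        return f"repartition({target})"
    if target == current:
        return "no change needed"
    # Shrinking without a need for balance: skip the shuffle.
    return f"coalesce({target})"

print(pick_partition_op(200, 20))                  # coalesce(20)
print(pick_partition_op(100, 500))                 # repartition(500)
print(pick_partition_op(200, 100, need_even=True)) # repartition(100)
print(pick_partition_op(200, 200, key="user_id"))  # repartition(200, "user_id")
```

A real job would weigh more than these inputs, for example how expensive the upstream stage is, so treat the rule as a starting point.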
