What is the difference between repartition and coalesce in Spark?
repartition triggers a full shuffle to produce exactly N partitions and can both increase and decrease the partition count; when no column is given, rows are spread roughly evenly. coalesce merges existing partitions without a full shuffle, preferring partitions that are already co-located, so it is cheaper, but on a DataFrame it can only decrease the partition count and may produce uneven partitions.
How to think about it
Both operations change the number of partitions, but they have fundamentally different cost profiles and valid use cases.
repartition(n)
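Uses a full shuffle to redistribute data across exactly n partitions: round-robin when no columns are given, hash partitioning when you pass columns. Round-robin output partitions end up roughly equal in size, which is ideal before writing output files. Hash partitioning puts rows that share a key in the same partition, which helps before an expensive downstream join or aggregation on that key, but a skewed key still yields skewed partitions.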
# Before writing: ensure 100 evenly sized Parquet files
df.repartition(100).write.parquet("output/")
# Hash-partition by a key for a downstream join on the same key
df.repartition(200, "user_id")
Can increase or decrease the partition count. Sorting is not implied.
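To make the round-robin versus hash distinction concrete, here is a toy model in plain Python. It is only a sketch: the function names and the sample data are hypothetical, and Python's built-in hash stands in for Spark's own hash function, so the exact partition assignments will differ from what Spark produces.

```python
def round_robin_partition(rows, n):
    """Deal rows out to n partitions in turn, like dealing cards."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts


def hash_partition(rows, n, key):
    """Send each row to the partition chosen by hashing its key column."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[key]) % n].append(row)
    return parts


# Hypothetical skewed data: user 1 owns 70 of the 100 rows.
rows = [{"user_id": 1}] * 70 + [{"user_id": u} for u in range(2, 32)]

rr_sizes = [len(p) for p in round_robin_partition(rows, 4)]
hp_sizes = [len(p) for p in hash_partition(rows, 4, "user_id")]

print(rr_sizes)  # [25, 25, 25, 25]
print(hp_sizes)  # [7, 77, 8, 8]
```

Round-robin ignores row content, so the sizes come out equal. Hashing keeps every row for a given user_id together, which is what a join on that key wants, at the price of one oversized partition for the hot key.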
coalesce(n)
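Merges existing partitions into fewer, larger ones without a full shuffle. It is a narrow transformation: each output partition is the union of whole parent partitions, and Spark tries to group parents that are already co-located. Some data may still cross the network when merged parents live on different executors, but the all-to-all transfer is avoided. Because parent partitions are never split or rebalanced, output partitions may be uneven.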
# After a heavy filter that left 200 tiny partitions, reduce to 20
df.filter("active = true").coalesce(20).write.parquet("output/")
On a DataFrame, coalesce can only decrease the partition count. Calling coalesce(500) on a 100-partition DataFrame is a silent no-op: the result still has 100 partitions.
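The merge-without-rebalancing behaviour can be sketched in plain Python by working only with partition sizes. This is a simplified model, not Spark's implementation: the function name coalesce_sizes and the sample sizes are hypothetical, and it groups neighbouring partitions by position, whereas Spark also takes data locality into account.

```python
def coalesce_sizes(partition_sizes, n):
    """Merge whole parent partitions into at most n contiguous groups."""
    current = len(partition_sizes)
    if n >= current:
        # Asking for more partitions than exist changes nothing.
        return list(partition_sizes)
    merged = []
    for g in range(n):
        start = g * current // n
        end = (g + 1) * current // n
        # Parents are combined as-is; no rows are moved to even things out.
        merged.append(sum(partition_sizes[start:end]))
    return merged


# Hypothetical row counts per partition after a selective filter.
sizes = [900, 10, 10, 10, 10, 10]

print(coalesce_sizes(sizes, 2))    # [920, 30]
print(coalesce_sizes(sizes, 500))  # [900, 10, 10, 10, 10, 10]
```

The first call shows why coalesce output can be lopsided: the large parent stays whole. The second shows the no-op when the requested count exceeds the current one.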
When to use each
| Scenario | Use |
|---|---|
| Increase partition count | repartition |
| Even output files | repartition |
| Partition by a join/group key | repartition(n, col) |
| Reduce partitions after heavy filter | coalesce |
| Save on shuffle cost when shrinking | coalesce |
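
The table can be restated as a small decision helper. This is an illustrative sketch only: choose_operation and its parameters are hypothetical names, not part of any Spark API, and it returns the suggested call as a string instead of touching a DataFrame.

```python
def choose_operation(current, target, need_even=False, key=None):
    """Suggest a partitioning call following the table's rules of thumb."""
    if key is not None:
        # Partitioning by a join/group key always needs a shuffle.
        return f"repartition({target}, {key!r})"
    if target > current or need_even:
        # Growing the count, or evening out sizes, also needs a shuffle.
        return f"repartition({target})"
    if target == current:
        return "no change"
    # Shrinking without a need for even sizes: skip the shuffle.
    return f"coalesce({target})"


print(choose_operation(100, 500))                 # repartition(500)
print(choose_operation(200, 20))                  # coalesce(20)
print(choose_operation(200, 20, need_even=True))  # repartition(20)
print(choose_operation(100, 200, key="user_id"))  # repartition(200, 'user_id')
```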
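# Single output file: coalesce(1) skips the shuffle, but because it is a narrow
# transformation, the upstream work in the same stage also collapses into one task.
# If that upstream work is expensive, use repartition(1) so it stays parallel.
# Either way, Spark writes a directory named "report.csv" holding one part file.
df.coalesce(1).write.csv("report.csv")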