What is the difference between narrow and wide transformations in Spark?
Narrow transformations compute each output partition from at most one input partition, so no data moves across the network. Wide transformations need data from multiple input partitions, which forces a shuffle across the network. The shuffle is typically the most expensive step in a Spark job.
How to think about it
Understanding this split is essential for diagnosing slow Spark jobs and writing efficient pipelines.
Narrow transformations
Each output partition depends on at most one input partition. No shuffle, no network I/O.
Examples: map, filter, flatMap, select, withColumn, union.
from pyspark.sql.functions import col
# Narrow: each partition is filtered independently, no data movement
df.filter("country = 'IN'").withColumn("score", col("value") * 2)
Narrow transformations can be pipelined within a single stage: Spark runs the whole chain as one task per partition, without materializing intermediate results between steps.
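To make the pipelining idea concrete, here is a minimal pure-Python sketch. It is a toy model, not Spark's implementation: partitions are plain lists, and the helper names (`filter_country`, `add_score`, `run_task`) are hypothetical. Each "task" chains the narrow steps lazily and makes a single pass over its own partition.

```python
from typing import Callable, Iterable, Iterator, List

Row = dict


def filter_country(rows: Iterable[Row]) -> Iterator[Row]:
    # Narrow step 1: keep only rows for India.
    return (r for r in rows if r["country"] == "IN")


def add_score(rows: Iterable[Row]) -> Iterator[Row]:
    # Narrow step 2: derive a new column from an existing one.
    return ({**r, "score": r["value"] * 2} for r in rows)


def run_task(partition: List[Row], steps: List[Callable]) -> List[Row]:
    # One "task": chain every narrow step lazily, then make a single pass
    # over this partition. No other partition is ever read.
    rows: Iterable[Row] = partition
    for step in steps:
        rows = step(rows)
    return list(rows)


# Two input partitions (hypothetical sample data).
partitions = [
    [{"country": "IN", "value": 1}, {"country": "US", "value": 2}],
    [{"country": "IN", "value": 3}],
]

# One task per partition; the number of partitions is unchanged.
result = [run_task(p, [filter_country, add_score]) for p in partitions]
print(result)
```

Because the steps are generators, a row flows through the filter and the column derivation before the next row is read, which is roughly what pipelining within a stage means.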
Wide transformations
Each output partition may require data from many input partitions. Spark must redistribute rows across executors so that related rows end up together. This redistribution is the shuffle.
Examples: groupBy, reduceByKey, join (non-broadcast), distinct, repartition, orderBy.
from pyspark.sql.functions import sum as sum_
# Wide: triggers a shuffle; all rows with the same key must land in the same partition
df.groupBy("country").agg(sum_("revenue"))
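The following pure-Python sketch imitates what a shuffle has to accomplish for a grouped sum. It is a simplified model under stated assumptions: `target_partition` uses CRC32 as a stand-in for a hash partitioner, and `shuffle_sum` is a hypothetical helper, not a Spark API. Real Spark also writes shuffle files and moves them over the network, which this sketch leaves out.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple


def target_partition(key: str, num_partitions: int) -> int:
    # Deterministic stand-in for a hash partitioner (assumption: CRC32).
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle_sum(
    inputs: List[List[Tuple[str, int]]], num_out: int
) -> List[Dict[str, int]]:
    # Map side: every input partition routes each row to the output
    # partition chosen by its key.
    # Reduce side: each output partition sums the rows it received.
    outputs = [defaultdict(int) for _ in range(num_out)]
    for part in inputs:
        for country, revenue in part:
            outputs[target_partition(country, num_out)][country] += revenue
    return [dict(o) for o in outputs]


# Three input partitions; rows for the same country are scattered across them.
inputs = [
    [("IN", 10), ("US", 5)],
    [("IN", 7), ("DE", 3)],
    [("US", 1), ("DE", 4)],
]

outputs = shuffle_sum(inputs, num_out=2)
totals = {k: v for part in outputs for k, v in part.items()}
print(totals)
```

Every country ends up in exactly one output partition, which is why a correct total is only possible after rows have been moved between partitions.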
Visualizing the difference
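One way to picture the difference is to print, for each output partition, which input partitions it reads from. The sketch below is a toy model: keys are small integers, the partitioner is assumed to be `key % num_out`, and `narrow_deps`, `wide_deps`, and `render` are hypothetical helpers.

```python
from typing import Dict, List, Set


def narrow_deps(num_partitions: int) -> Dict[int, Set[int]]:
    # map/filter style: output partition i reads only input partition i.
    return {i: {i} for i in range(num_partitions)}


def wide_deps(input_keys: List[List[int]], num_out: int) -> Dict[int, Set[int]]:
    # groupBy style: output partition (key % num_out) reads every input
    # partition that holds at least one row with a matching key.
    deps: Dict[int, Set[int]] = {o: set() for o in range(num_out)}
    for in_idx, keys in enumerate(input_keys):
        for key in keys:
            deps[key % num_out].add(in_idx)
    return deps


def render(deps: Dict[int, Set[int]]) -> List[str]:
    # One text line per output partition, listing its source partitions.
    return [f"out[{o}] <- in{sorted(srcs)}" for o, srcs in sorted(deps.items())]


# Keys held by each of three input partitions (hypothetical sample data).
input_keys = [[0, 1], [1, 2], [0, 3]]

narrow = render(narrow_deps(3))
wide = render(wide_deps(input_keys, num_out=2))

print("narrow:", *narrow, sep="\n  ")
# narrow:
#   out[0] <- in[0]
#   out[1] <- in[1]
#   out[2] <- in[2]
print("wide:", *wide, sep="\n  ")
# wide:
#   out[0] <- in[0, 1, 2]
#   out[1] <- in[0, 1, 2]
```

In the narrow case the arrows are one-to-one; in the wide case every output partition fans in from all three inputs.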
Stage boundaries
Each shuffle creates a new stage. Spark cannot pipeline work across a shuffle: it must write intermediate results to disk, transfer them, and start a new set of tasks. This is why reducing the number and size of shuffles is usually one of the highest-impact Spark optimizations.
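The rule "one new stage per shuffle" can be sketched as a small function that cuts a list of transformation names at every wide operation. This is an illustration only: `WIDE_OPS` reuses the examples listed in this article (with `join` assumed to be a non-broadcast join), `split_into_stages` is a hypothetical helper, and placing the wide operation at the start of the new stage is a simplification of how Spark actually splits the work around a shuffle.

```python
from typing import List

# Assumed classification, taken from the examples in this article.
WIDE_OPS = {"groupBy", "reduceByKey", "join", "distinct", "repartition", "orderBy"}


def split_into_stages(lineage: List[str]) -> List[List[str]]:
    # Start with one stage; every wide operation closes the current stage
    # and opens a new one (simplification: the wide op is listed first in
    # the stage that follows the shuffle).
    stages: List[List[str]] = [[]]
    for op in lineage:
        if op in WIDE_OPS:
            stages.append([op])
        else:
            stages[-1].append(op)
    return stages


lineage = ["filter", "withColumn", "groupBy", "select", "orderBy", "map"]
stages = split_into_stages(lineage)

for i, ops in enumerate(stages):
    print(f"stage {i}: {ops}")
# stage 0: ['filter', 'withColumn']
# stage 1: ['groupBy', 'select']
# stage 2: ['orderBy', 'map']
```

Two wide operations produce three stages, and the narrow operations between them are grouped into the same stage, matching the pipelining behavior described above.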