Data Engineering | Medium | Asked at: Databricks, Amazon, Google, Netflix, Meta

What is the difference between narrow and wide transformations in Spark?

For: Data Engineer, MLOps Engineer, Data Scientist

The short answer

In a narrow transformation, each input partition feeds at most one output partition, so every task works on data it already holds and nothing is shuffled. In a wide transformation, an output partition can need rows from many input partitions, which forces a shuffle: rows are redistributed across executors, usually over the network. Shuffles are typically among the most expensive operations in a Spark job.

How to think about it

Understanding this split is essential for diagnosing slow Spark jobs and writing efficient pipelines.

Narrow transformations

Each input partition contributes to at most one output partition, so a task needs only the data it is already processing. No shuffle and no exchange of rows between executors.

Examples: map, filter, flatMap, select, withColumn, union.

from pyspark.sql.functions import col

# Narrow: each partition is filtered and transformed independently, with no shuffle
df.filter("country = 'IN'").withColumn("score", col("value") * 2)

Narrow transformations can be pipelined within a single stage: Spark fuses consecutive narrow transformations so that one task per partition applies the whole chain.
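To make the pipelining idea concrete, here is a toy model in plain Python. It is only a sketch of the concept, not how Spark is implemented: partitions are modeled as lists of dicts, and `keep_india`, `add_score`, and `run_task` are hypothetical names invented for this illustration.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List

Row = Dict[str, Any]
Op = Callable[[Iterable[Row]], Iterator[Row]]

def keep_india(rows: Iterable[Row]) -> Iterator[Row]:
    # Stand-in for filter("country = 'IN'")
    return (r for r in rows if r["country"] == "IN")

def add_score(rows: Iterable[Row]) -> Iterator[Row]:
    # Stand-in for withColumn("score", col("value") * 2)
    return ({**r, "score": r["value"] * 2} for r in rows)

def run_task(partition: List[Row], ops: List[Op]) -> List[Row]:
    # One "task" per partition: the ops are chained lazily, so each row
    # flows through the whole chain without an intermediate dataset.
    rows: Iterable[Row] = partition
    for op in ops:
        rows = op(rows)
    return list(rows)

input_partitions = [
    [{"country": "IN", "value": 1}, {"country": "US", "value": 2}],
    [{"country": "IN", "value": 3}],
]

# Output partition i is computed from input partition i alone.
narrow_result = [run_task(p, [keep_india, add_score]) for p in input_partitions]
```

Because no task ever reads another task's partition, the two tasks could run on different machines without exchanging any rows.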

Wide transformations

Each output partition may require data from many input partitions. Spark must redistribute rows across executors so that related rows end up in the same partition. This redistribution is the shuffle.

Examples: groupBy, reduceByKey, join (non-broadcast), distinct, repartition, orderBy.

from pyspark.sql import functions as F

# Wide: triggers a shuffle; all rows with the same key must land in the same partition
df.groupBy("country").agg(F.sum("revenue"))
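The data movement behind a grouped aggregation can be sketched with a toy hash-partitioned shuffle in plain Python. This is an illustration under simplifying assumptions, not Spark's shuffle implementation: `target_partition`, `shuffle_rows`, and `sum_by_key` are hypothetical helpers, CRC32 stands in for Spark's own hash function, and the partial aggregation Spark performs before shuffling is skipped.

```python
import zlib
from collections import defaultdict
from typing import Any, Dict, List

Row = Dict[str, Any]

def target_partition(key: str, num_partitions: int) -> int:
    # Deterministic hash so that equal keys always map to the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def shuffle_rows(partitions: List[List[Row]], key_col: str,
                 num_partitions: int) -> List[List[Row]]:
    # Every input partition may send rows to every output partition.
    out: List[List[Row]] = [[] for _ in range(num_partitions)]
    for part in partitions:
        for row in part:
            out[target_partition(row[key_col], num_partitions)].append(row)
    return out

def sum_by_key(partition: List[Row], key_col: str, value_col: str) -> Dict[str, int]:
    # After the shuffle, each partition can aggregate its keys locally.
    totals: Dict[str, int] = defaultdict(int)
    for row in partition:
        totals[row[key_col]] += row[value_col]
    return dict(totals)

revenue_partitions = [
    [{"country": "IN", "revenue": 10}, {"country": "US", "revenue": 5}],
    [{"country": "IN", "revenue": 7}, {"country": "DE", "revenue": 3}],
    [{"country": "US", "revenue": 1}],
]

shuffled = shuffle_rows(revenue_partitions, "country", num_partitions=2)
totals_per_partition = [sum_by_key(p, "country", "revenue") for p in shuffled]
```

The "IN" rows start out in two different input partitions, so no single task could compute their total until the shuffle has brought them together.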

Visualizing the difference

[Figure: Narrow (left): each output partition maps to one input. Wide (right): partitions from all inputs cross the network to form new output partitions.]

Stage boundaries

Each shuffle marks a stage boundary. Spark cannot pipeline work across a shuffle: the upstream tasks write their shuffle output to local disk, the downstream tasks fetch it (often over the network), and a new set of tasks starts in the next stage. This is why reducing shuffles is often one of the highest-impact Spark optimizations.
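As a rough mental model, stage boundaries can be found by cutting a chain of transformations at every wide operation. The sketch below is a deliberate simplification, not Spark's scheduler: `WIDE_OPS` and `split_into_stages` are hypothetical names, operations are plain strings, and `join` is treated as always wide even though a broadcast join avoids the shuffle.

```python
from typing import List

# Wide operations taken from the examples in this article.
WIDE_OPS = {"groupBy", "reduceByKey", "join", "distinct", "repartition", "orderBy"}

def split_into_stages(ops: List[str]) -> List[List[str]]:
    stages: List[List[str]] = [[]]
    for op in ops:
        # A wide op needs shuffled input, so it opens a new stage;
        # narrow ops are pipelined into the current stage.
        if op in WIDE_OPS and stages[-1]:
            stages.append([])
        stages[-1].append(op)
    return stages

pipeline = ["read", "filter", "withColumn", "groupBy", "select", "orderBy"]
stages = split_into_stages(pipeline)
# stages == [["read", "filter", "withColumn"], ["groupBy", "select"], ["orderBy"]]
```

Two wide operations give three stages here, so removing or combining one shuffle would remove one full round of writing, transferring, and re-reading intermediate data.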