Data Engineering Medium Asked at DatabricksAsked at AmazonAsked at GoogleAsked at NetflixAsked at Stripe

What is a broadcast join in Spark and when should you use it?

For Data Engineer MLOps Engineer Data Scientist

The short answer

A broadcast join sends a complete copy of the smaller table to every executor, so the join is done locally without any shuffle. It is the most effective single optimization for joins where one side is small enough to fit in executor memory, eliminating the most expensive network operation in a join.

How to think about it

A standard sort-merge join shuffles both sides of the join across the network. A broadcast join avoids this entirely for small tables.

How it works

The driver collects the small table from all its partitions.
The driver serializes it and broadcasts it to every executor.
Each executor joins its partition of the large table against its local copy of the small table — no network transfer of the large table, no shuffle write.

from pyspark.sql import functions as F
from pyspark.sql.functions import broadcast

large = spark.read.parquet("events/")
small = spark.read.parquet("country_codes/")  # a few thousand rows

# Explicit hint — Spark uses the hint even if the table is above the auto threshold
result = large.join(broadcast(small), "country_code")

Auto-broadcast threshold

Spark automatically broadcasts tables smaller than spark.sql.autoBroadcastJoinThreshold (default 10 MB). Raise this for trusted cluster memory:

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))  # 50 MB

When to use it

Dimension tables (countries, products, status codes) joined with a large fact table
Lookup tables used repeatedly in the same job — cache + broadcast
One side of a skewed join that is small enough after filtering

When not to use it

Both sides are large — the broadcast data must fit in each executor’s memory
The join produces a massive output that still needs to be shuffled downstream

# Anti-pattern: broadcasting a 2 GB table to 100 executors = 200 GB of driver serialization
# and executor memory pressure. Use sort-merge join instead.
result = large.join(broadcast(also_large), "id")

What is a broadcast join in Spark and when should you use it?

How it works

Auto-broadcast threshold

When to use it

When not to use it

Keep practising

Explore further