datarekha
Data Engineering Medium Asked at DatabricksAsked at AmazonAsked at GoogleAsked at NetflixAsked at Stripe

What is a broadcast join in Spark and when should you use it?

The short answer

A broadcast join sends a complete copy of the smaller table to every executor, so the join is done locally without any shuffle. It is the most effective single optimization for joins where one side is small enough to fit in executor memory, eliminating the most expensive network operation in a join.

How to think about it

A standard sort-merge join shuffles both sides of the join across the network. A broadcast join avoids this entirely for small tables.

How it works

  1. The driver collects the small table from all its partitions.
  2. The driver serializes it and broadcasts it to every executor.
  3. Each executor joins its partition of the large table against its local copy of the small table — no network transfer of the large table, no shuffle write.
from pyspark.sql import functions as F
from pyspark.sql.functions import broadcast

large = spark.read.parquet("events/")
small = spark.read.parquet("country_codes/")  # a few thousand rows

# Explicit hint — Spark uses the hint even if the table is above the auto threshold
result = large.join(broadcast(small), "country_code")

Auto-broadcast threshold

Spark automatically broadcasts tables smaller than spark.sql.autoBroadcastJoinThreshold (default 10 MB). Raise this for trusted cluster memory:

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))  # 50 MB

When to use it

  • Dimension tables (countries, products, status codes) joined with a large fact table
  • Lookup tables used repeatedly in the same job — cache + broadcast
  • One side of a skewed join that is small enough after filtering

When not to use it

  • Both sides are large — the broadcast data must fit in each executor’s memory
  • The join produces a massive output that still needs to be shuffled downstream
# Anti-pattern: broadcasting a 2 GB table to 100 executors = 200 GB of driver serialization
# and executor memory pressure. Use sort-merge join instead.
result = large.join(broadcast(also_large), "id")

Keep practising

All Data Engineering questions

Explore further

Skip to content