What is lazy evaluation in Spark, and how do transformations differ from actions?
Spark does not execute any computation when you call a transformation — it builds a DAG of logical steps. Only when you call an action does Spark compile that DAG into physical tasks and execute them. This design lets Catalyst optimize the full query plan before touching any data.
How to think about it
Lazy evaluation is Spark’s core execution model. It separates the description of a computation from its execution.
Transformations — building the plan
A transformation takes a DataFrame and returns a new DataFrame without triggering any computation. Spark records the operation in the logical plan.
df = spark.read.parquet("events.parquet") # no data scan yet (only file metadata is read to infer the schema)
filtered = df.filter("event_type = 'click'") # no computation
grouped = filtered.groupBy("user_id").count() # still no computation
Common transformations: filter, select, groupBy, join, withColumn, map (on RDDs).
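The idea of recording steps without running them can be sketched in plain Python. This is a toy model only, not how Spark is implemented: the `LazyPlan` class, its methods, and `load_events` are hypothetical names made up for illustration.

```python
class LazyPlan:
    """Toy stand-in for a DataFrame: records steps, runs them only on an action."""

    def __init__(self, source, steps=()):
        self._source = source        # callable that returns the input rows
        self._steps = tuple(steps)   # recorded transformations, in order

    # Transformations: return a new plan and touch no data.
    def filter(self, predicate):
        return LazyPlan(self._source, self._steps + (("filter", predicate),))

    def map(self, fn):
        return LazyPlan(self._source, self._steps + (("map", fn),))

    # Action: read the source and apply every recorded step.
    def collect(self):
        rows = self._source()
        for kind, fn in self._steps:
            rows = filter(fn, rows) if kind == "filter" else map(fn, rows)
        return list(rows)


reads = []  # grows each time the source is actually read


def load_events():
    reads.append("read")
    return [{"user": "a", "type": "click"}, {"user": "b", "type": "view"}]


plan = (
    LazyPlan(load_events)
    .filter(lambda r: r["type"] == "click")
    .map(lambda r: r["user"])
)
reads_before_action = len(reads)  # 0: building the plan read nothing
result = plan.collect()           # the action triggers the read and both steps
reads_after_action = len(reads)   # 1
```

Each transformation returns a new plan object rather than mutating the old one, which mirrors how a Spark DataFrame is immutable and every transformation yields a new DataFrame.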
Actions — triggering execution
An action causes Spark to evaluate the entire DAG accumulated so far, optimize it, and run tasks across the cluster.
grouped.show() # action — executes everything above
grouped.count() # action
grouped.write.parquet("out/") # action
Common actions: show, count, collect, take, first, foreach, and writes such as write.parquet or write.save (df.write on its own only returns a DataFrameWriter).
Why lazy evaluation matters
- Optimization window — Catalyst sees the full plan before execution, enabling filter push-down, column pruning, and join reordering.
- Fault tolerance — the DAG is the lineage record. On partition failure, Spark replays only the relevant transformations.
- Efficiency — unreferenced columns are pruned before data is read. Reading a 1 TB Parquet file but selecting two columns only scans those column chunks.
# Eager execution would materialize the result of each step in turn.
# With lazy evaluation, Catalyst collapses both selects and reads only name + age.
df.select("name", "age").filter("age > 30").select("name").show()