datarekha
Data Engineering · Easy · Asked at Amazon, Databricks, Microsoft, Uber

What is lazy evaluation in Spark, and how does it relate to transformations and actions?

The short answer

Spark does not execute any computation when you call a transformation — it builds a DAG of logical steps. Only when you call an action does Spark compile that DAG into physical tasks and execute them. This design lets Catalyst optimize the full query plan before touching any data.

How to think about it

Lazy evaluation is Spark’s core execution model. It separates the description of a computation from its execution.

Transformations — building the plan

A transformation takes a DataFrame and returns a new DataFrame without triggering any computation. Spark records the operation in the logical plan.

df = spark.read.parquet("events.parquet")   # no rows read yet (Spark may inspect file metadata to get the schema)
filtered = df.filter("event_type = 'click'")  # no computation
grouped = filtered.groupBy("user_id").count()  # still no computation

Common transformations: filter, select, groupBy, join, withColumn, map (on RDDs).
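
To make "recording without executing" concrete, here is a toy model in plain Python. It is only a sketch and not how Spark is implemented: `LazyFrame`, `steps`, and `executed` are hypothetical names invented for this illustration. Each transformation appends a step to a plan, and nothing runs until the `collect` action is called.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Records which steps actually ran, so laziness is observable.
executed: List[str] = []


@dataclass(frozen=True)
class LazyFrame:
    """Toy stand-in for a DataFrame: source rows plus a recorded plan."""
    rows: Tuple[dict, ...]
    steps: Tuple[Tuple[str, Callable], ...] = ()

    def filter(self, predicate):
        # Transformation: return a new frame with one more recorded step.
        return LazyFrame(self.rows, self.steps + (("filter", predicate),))

    def select(self, *cols):
        # Transformation: still nothing is computed here.
        def project(row):
            return {c: row[c] for c in cols}
        return LazyFrame(self.rows, self.steps + (("select", project),))

    def collect(self):
        # Action: only now is the recorded plan applied to the data.
        data = list(self.rows)
        for name, fn in self.steps:
            executed.append(name)
            if name == "filter":
                data = [r for r in data if fn(r)]
            else:
                data = [fn(r) for r in data]
        return data


events = LazyFrame((
    {"user_id": 1, "event_type": "click"},
    {"user_id": 2, "event_type": "view"},
))
clicks = events.filter(lambda r: r["event_type"] == "click").select("user_id")
ran_before_action = list(executed)  # still empty: only a plan exists
result = clicks.collect()           # the action runs both recorded steps
```

Each transformation returns a new immutable object instead of mutating the old one, which loosely mirrors how a DataFrame transformation returns a new DataFrame.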

Actions — triggering execution

An action causes Spark to take the plan accumulated so far, optimize it, and run the resulting tasks across the cluster.

grouped.show()      # action — executes everything above
grouped.count()     # action
grouped.write.parquet("out/")  # action

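Common actions: show, count, collect, take, first, foreach, and writes such as write.parquet (the save call is the action; write on its own only returns a writer object).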
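
One consequence worth keeping in mind: a DataFrame is a plan rather than stored data, so each action generally evaluates the plan again unless the result has been cached (Spark provides `cache()` and `persist()` for that). The toy sketch below, in plain Python, counts how often the source is scanned across three actions. `LazyPlan`, `read_events`, and `scans` are hypothetical names for this illustration; it models the behavior, not Spark's internals.

```python
from typing import Callable, Iterable, List

# Mutable counter so the source function can record each scan.
scans = {"count": 0}


def read_events() -> List[dict]:
    """Hypothetical data source; every call stands for one full scan."""
    scans["count"] += 1
    return [
        {"user_id": 1, "event_type": "click"},
        {"user_id": 2, "event_type": "view"},
        {"user_id": 3, "event_type": "click"},
    ]


class LazyPlan:
    """Toy plan: a source function plus recorded row-level filters."""

    def __init__(self, source: Callable[[], Iterable[dict]], filters=()):
        self._source = source
        self._filters = tuple(filters)

    def filter(self, predicate):
        # Transformation: record the predicate, do not touch the source.
        return LazyPlan(self._source, self._filters + (predicate,))

    def _run(self) -> List[dict]:
        # Evaluate the whole plan from the source, with no caching.
        return [r for r in self._source() if all(p(r) for p in self._filters)]

    # Actions: each one evaluates the plan from scratch.
    def count(self) -> int:
        return len(self._run())

    def take(self, n: int) -> List[dict]:
        return self._run()[:n]

    def collect(self) -> List[dict]:
        return self._run()


clicks = LazyPlan(read_events).filter(lambda r: r["event_type"] == "click")
scans_after_transformations = scans["count"]  # no action yet, so no scan

n = clicks.count()
first = clicks.take(1)
all_rows = clicks.collect()
# Three actions led to three scans of the source.
```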

Why lazy evaluation matters

  1. Optimization window — Catalyst sees the full plan before execution, enabling filter push-down, column pruning, and join reordering.
  2. Fault tolerance — the DAG is the lineage record. On partition failure, Spark replays only the relevant transformations.
  3. Efficiency — unreferenced columns are pruned before data is read. Reading a 1 TB Parquet file but selecting two columns only scans those column chunks.
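# An eager engine would materialize the result of each step in turn.
# With lazy evaluation, Catalyst collapses both selects and pushes the filter
# toward the scan, which then reads at most the name and age columns.
df.select("name", "age").filter("age > 30").select("name").count()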
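
The column-pruning argument can be sketched in plain Python. This is a simplified illustration, not Catalyst's actual rule set: `required_columns` and the tuple-based plan format are hypothetical. The function walks a recorded plan from the last step back to the scan and works out which source columns are actually needed.

```python
from typing import List, Optional, Set, Tuple

# A plan step is ("select", [columns kept]) or
# ("filter", [columns the predicate reads]).
Step = Tuple[str, List[str]]


def required_columns(plan: List[Step], all_columns: List[str]) -> Set[str]:
    """Walk the plan backwards to find the columns the scan must read."""
    needed: Optional[Set[str]] = None  # None: no select seen yet, keep everything
    for op, cols in reversed(plan):
        if op == "select":
            # A select only has to produce what later steps still need.
            needed = set(cols) if needed is None else needed & set(cols)
        elif op == "filter":
            # A filter passes rows through and also reads its own columns.
            if needed is not None:
                needed = needed | set(cols)
        else:
            raise ValueError(f"unknown step: {op}")
    return set(all_columns) if needed is None else needed


source_columns = ["name", "age", "email", "city"]
plan = [
    ("select", ["name", "age"]),
    ("filter", ["age"]),
    ("select", ["name"]),
]
to_scan = required_columns(plan, source_columns)
```

For this plan the result contains only name and age, so email and city never need to be read. The analysis is only possible because the whole plan is known before anything executes.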
