Data Engineering Medium Asked at DatabricksAsked at AmazonAsked at NetflixAsked at Meta

How does caching and persist work in Spark, and when should you use each storage level?

For Data Engineer Data Scientist ML Engineer

The short answer

cache() stores a DataFrame in executor memory using the default MEMORY_AND_DISK storage level. persist() lets you choose the storage level — memory-only, disk-only, serialized, or replicated. Use caching when a DataFrame is reused multiple times in the same application; without it, Spark recomputes the entire lineage from scratch on each action.

How to think about it

Without caching, every action triggers a full recomputation from the source. Caching trades memory for recomputation cost.

cache() vs persist()

cache() is a shortcut for persist(StorageLevel.MEMORY_AND_DISK). persist() lets you specify a storage level:

from pyspark import StorageLevel

df = spark.read.parquet("large_features/").filter("active = true")

# Default: memory first, spill to disk if needed
df.cache()

# Explicit: memory only, deserialized (fastest access, most memory)
df.persist(StorageLevel.MEMORY_ONLY)

# Serialized in memory (smaller footprint, CPU overhead to deserialize)
df.persist(StorageLevel.MEMORY_ONLY_SER)

# Disk only (slowest, but doesn't consume executor heap)
df.persist(StorageLevel.DISK_ONLY)

# 2 replicas in memory for fault tolerance
df.persist(StorageLevel.MEMORY_AND_DISK_2)

When caching pays off

# Without cache: features is computed twice — once for model A, once for model B
features = spark.read.parquet("raw/").withColumn("norm", col("value") / 100)
model_a_result = features.filter("segment = 'A'").groupBy("id").agg(...)
model_b_result = features.filter("segment = 'B'").groupBy("id").agg(...)

# With cache: features is computed once and reused
features.cache()
features.count()  # trigger the cache population
model_a_result = features.filter("segment = 'A'").groupBy("id").agg(...)
model_b_result = features.filter("segment = 'B'").groupBy("id").agg(...)
features.unpersist()  # release memory when done

Storage level trade-offs

Level	Memory use	Speed	Fault tolerant
MEMORY_ONLY	High	Fastest	No
MEMORY_AND_DISK	Medium	Fast (disk fallback)	No
MEMORY_ONLY_SER	Lower	Slower (deserialize)	No
DISK_ONLY	None	Slowest	No
MEMORY_AND_DISK_2	2x	Fast	Yes

unpersist()

Always call unpersist() when a cached DataFrame is no longer needed. Spark’s LRU eviction will eventually remove it, but explicit unpersist frees memory immediately for subsequent stages.

How does caching and persist work in Spark, and when should you use each storage level?

cache() vs persist()

When caching pays off

Storage level trade-offs

unpersist()

Keep practising

Explore further