datarekha
Data Engineering Medium Asked at DatabricksAsked at AmazonAsked at NetflixAsked at Meta

How does caching and persist work in Spark, and when should you use each storage level?

The short answer

cache() stores a DataFrame in executor memory using the default MEMORY_AND_DISK storage level. persist() lets you choose the storage level — memory-only, disk-only, serialized, or replicated. Use caching when a DataFrame is reused multiple times in the same application; without it, Spark recomputes the entire lineage from scratch on each action.

How to think about it

Without caching, every action triggers a full recomputation from the source. Caching trades memory for recomputation cost.

cache() vs persist()

cache() is a shortcut for persist(StorageLevel.MEMORY_AND_DISK). persist() lets you specify a storage level:

from pyspark import StorageLevel

df = spark.read.parquet("large_features/").filter("active = true")

# Default: memory first, spill to disk if needed
df.cache()

# Explicit: memory only, deserialized (fastest access, most memory)
df.persist(StorageLevel.MEMORY_ONLY)

# Serialized in memory (smaller footprint, CPU overhead to deserialize)
df.persist(StorageLevel.MEMORY_ONLY_SER)

# Disk only (slowest, but doesn't consume executor heap)
df.persist(StorageLevel.DISK_ONLY)

# 2 replicas in memory for fault tolerance
df.persist(StorageLevel.MEMORY_AND_DISK_2)

When caching pays off

# Without cache: features is computed twice — once for model A, once for model B
features = spark.read.parquet("raw/").withColumn("norm", col("value") / 100)
model_a_result = features.filter("segment = 'A'").groupBy("id").agg(...)
model_b_result = features.filter("segment = 'B'").groupBy("id").agg(...)

# With cache: features is computed once and reused
features.cache()
features.count()  # trigger the cache population
model_a_result = features.filter("segment = 'A'").groupBy("id").agg(...)
model_b_result = features.filter("segment = 'B'").groupBy("id").agg(...)
features.unpersist()  # release memory when done

Storage level trade-offs

LevelMemory useSpeedFault tolerant
MEMORY_ONLYHighFastestNo
MEMORY_AND_DISKMediumFast (disk fallback)No
MEMORY_ONLY_SERLowerSlower (deserialize)No
DISK_ONLYNoneSlowestNo
MEMORY_AND_DISK_22xFastYes

unpersist()

Always call unpersist() when a cached DataFrame is no longer needed. Spark’s LRU eviction will eventually remove it, but explicit unpersist frees memory immediately for subsequent stages.

Keep practising

All Data Engineering questions

Explore further

Skip to content