How do cache() and persist() work in Spark, and when should you use each storage level?
cache() marks a DataFrame for storage on the executors using the default MEMORY_AND_DISK storage level: partitions are kept in memory and spill to disk when they don't fit. persist() lets you choose the storage level: memory-only, disk-only, serialized, or replicated. Both are lazy, so nothing is stored until the first action runs. Use caching when a DataFrame is reused multiple times in the same application; without it, Spark recomputes the lineage on each action.
How to think about it
Without caching, every action re-runs the DataFrame's lineage, starting from the source read. Caching trades executor memory (and, depending on the storage level, disk) for that recomputation cost.
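The trade-off can be illustrated without a cluster. The sketch below is a toy model in plain Python, not Spark's implementation: ToyFrame, expensive_source, and the computations counter are hypothetical names invented for this example. It only counts how often the "lineage" is re-run with and without a cache.

```python
class ToyFrame:
    """Toy stand-in for a lazily evaluated DataFrame (not Spark's implementation)."""

    def __init__(self, compute):
        self._compute = compute        # function that rebuilds the rows from the source
        self._cache_requested = False  # set by cache(); nothing is stored yet
        self._cached = None            # materialized rows once an action has run
        self.computations = 0          # how many times the lineage was re-run

    def cache(self):
        # Lazy: this only marks the frame for caching.
        self._cache_requested = True
        return self

    def _rows(self):
        if self._cached is not None:
            return self._cached
        self.computations += 1
        rows = self._compute()
        if self._cache_requested:
            self._cached = rows
        return rows

    def count(self):
        # An "action": forces evaluation of the lineage.
        return len(self._rows())

    def total(self):
        # A second "action" over the same data.
        return sum(self._rows())


def expensive_source():
    # Stands in for reading and transforming a large dataset.
    return [value / 100 for value in range(1_000)]


uncached = ToyFrame(expensive_source)
uncached.count()
uncached.total()
print(uncached.computations)  # 2: each action re-ran the lineage

cached = ToyFrame(expensive_source).cache()
cached.count()                # first action populates the cache
cached.total()                # served from the cache
print(cached.computations)    # 1
```

In this model the first action still pays the full cost; the saving only appears from the second action onward, so caching a frame that is used once gains nothing.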
cache() vs persist()
For DataFrames, cache() is shorthand for persist() with the default storage level, MEMORY_AND_DISK (RDD.cache() defaults to MEMORY_ONLY instead). persist() lets you specify a storage level:
from pyspark import StorageLevel
df = spark.read.parquet("large_features/").filter("active = true")
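# The calls below are alternatives: pick one per DataFrame. Spark ignores a
# second persist() on already-cached data, so unpersist() before switching levels.
# Default: memory first, spill to disk if needed
df.cache()
# Explicit: memory only; partitions that don't fit are recomputed when needed
df.persist(StorageLevel.MEMORY_ONLY)
# Serialized in memory (smaller footprint, CPU overhead to deserialize).
# MEMORY_ONLY_SER exists in the Scala/Java API only; current PySpark's
# StorageLevel does not define it, so this line is left commented out.
# df.persist(StorageLevel.MEMORY_ONLY_SER)
# Disk only (slowest, but doesn't consume executor heap)
df.persist(StorageLevel.DISK_ONLY)
# Memory with disk spill, each partition replicated on 2 executors
df.persist(StorageLevel.MEMORY_AND_DISK_2)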
When caching pays off
from pyspark.sql.functions import col
# Without cache: features is computed twice, once for model A and once for model B
features = spark.read.parquet("raw/").withColumn("norm", col("value") / 100)
model_a_result = features.filter("segment = 'A'").groupBy("id").agg(...)
model_b_result = features.filter("segment = 'B'").groupBy("id").agg(...)
# With cache: features is computed once and reused
features.cache()
features.count() # trigger the cache population
model_a_result = features.filter("segment = 'A'").groupBy("id").agg(...)
model_b_result = features.filter("segment = 'B'").groupBy("id").agg(...)
features.unpersist() # release memory when done
Storage level trade-offs
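| Level | Memory use | Speed | Replicated |
|---|---|---|---|
| MEMORY_ONLY | High | Fastest | No |
| MEMORY_AND_DISK | High (overflow spills to disk) | Fast (disk fallback) | No |
| MEMORY_ONLY_SER (Scala/Java only) | Lower | Slower (deserialize) | No |
| DISK_ONLY | None | Slowest | No |
| MEMORY_AND_DISK_2 | 2x | Fast | Yes |
Every level is fault tolerant through lineage: a lost partition is simply recomputed. The replicated (_2) levels keep a second copy so that recovery does not have to wait for that recomputation.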
unpersist()
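Call unpersist() as soon as a cached DataFrame is no longer needed. Spark evicts cached blocks in least-recently-used order, but only when it needs the space, so an unused cache can occupy memory until the application ends. An explicit unpersist() releases it for subsequent stages; the call is non-blocking by default, so pass blocking=True to wait until the blocks are actually removed.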
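The difference between waiting for LRU eviction and calling unpersist() can be sketched with a toy block store. This is a simplified model, not Spark's memory manager: ToyBlockStore is a hypothetical class, and capacity is counted in whole blocks rather than bytes.

```python
from collections import OrderedDict


class ToyBlockStore:
    """Toy LRU store for cached blocks (capacity counted in blocks, not bytes)."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._blocks = OrderedDict()  # ordered from least to most recently used

    def put(self, name):
        """Cache a block; return the names evicted to make room for it."""
        if name in self._blocks:
            self._blocks.move_to_end(name)
            return []
        evicted = []
        while len(self._blocks) >= self.capacity:
            oldest, _ = self._blocks.popitem(last=False)
            evicted.append(oldest)
        self._blocks[name] = True
        return evicted

    def unpersist(self, name):
        """Drop a block right away, whether or not the store is under pressure."""
        self._blocks.pop(name, None)

    def cached(self):
        return list(self._blocks)


store = ToyBlockStore(capacity=2)
store.put("features")
store.put("labels")
before_pressure = store.cached()   # ['features', 'labels']: nothing is evicted yet
evicted = store.put("scores")      # ['features']: evicted only because space ran out
store.unpersist("labels")          # explicit release, no memory pressure required
remaining = store.cached()         # ['scores']
print(before_pressure, evicted, remaining)
```

In this model "features" stays in the store, unused, until a third block forces it out, whereas "labels" is released the moment unpersist() is called.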