Caching and Optimization
Caching stores a DataFrame's computed results in memory or on disk, so repeated actions on it do not re-execute the full lineage. It is essential for iterative algorithms and whenever you reuse a DataFrame multiple times. Note that caching is lazy: nothing is stored until the first action runs after `cache()` or `persist()` is called.
Caching a DataFrame
df.cache()
# or
df.persist()
`cache()` uses the default storage level (MEMORY_AND_DISK for DataFrames since Spark 2.0). `persist()` lets you choose the storage level explicitly (MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, and others).
Storage Levels
- MEMORY_ONLY: store in memory; partitions that do not fit are recomputed on the fly when needed.
- MEMORY_AND_DISK: store in memory; partitions that do not fit are spilled to disk.
- DISK_ONLY: store only on disk.
Checking Cached Data
spark.catalog.isCached("table_name")
spark.catalog.clearCache()
Uncaching
df.unpersist()
When to Cache
- The DataFrame will be used multiple times (e.g., in a loop).
- The same aggregation is performed repeatedly.
- After expensive transformations like joins or aggregations.
- Avoid caching small or cheap-to-recompute DataFrames; the storage overhead may outweigh the benefit.
Two Minute Drill
- `.cache()` stores the DataFrame at the default storage level for reuse.
- `.persist(StorageLevel)` offers more control.
- Use cache when the same DataFrame is accessed multiple times.
- Call `.unpersist()` to free resources when done.
Need more clarification?
Drop us an email at career@quipoinfotech.com
