
Caching and Optimization

Caching stores intermediate results in memory or on disk, speeding up repeated actions on the same DataFrame. It is essential for iterative algorithms and whenever a DataFrame is reused multiple times.

Caching a DataFrame

df.cache()
# or
df.persist()
`cache()` uses the default storage level (MEMORY_AND_DISK for DataFrames; MEMORY_ONLY for RDDs). `persist()` lets you choose the storage level explicitly (MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, and others). Both are lazy: the data is only materialized the first time an action runs on the DataFrame.

Storage Levels

  • MEMORY_ONLY: store in memory; partitions that do not fit are recomputed from the lineage when needed.
  • MEMORY_AND_DISK: store in memory; spill to disk if needed.
  • DISK_ONLY: store only on disk.

Checking Cached Data

spark.catalog.isCached("table_name")  # True if the named table/view is currently cached
spark.catalog.clearCache()            # removes ALL cached tables and DataFrames

Uncaching

df.unpersist()  # remove df from the cache and free memory/disk

When to Cache

  • The DataFrame will be used multiple times (e.g., in a loop).
  • The same aggregation is performed repeatedly.
  • After expensive transformations like joins or aggregations.
  • Do not cache small DataFrames – overhead may outweigh benefit.


Two Minute Drill
  • `.cache()` stores a DataFrame for reuse (MEMORY_AND_DISK by default).
  • `.persist(StorageLevel)` offers more control.
  • Use cache when the same DataFrame is accessed multiple times.
  • Call `.unpersist()` to free resources when done.
