
Caching and Optimization

Caching stores intermediate results in memory or on disk, speeding up repeated actions on the same DataFrame. It is essential for iterative algorithms and whenever a DataFrame is reused multiple times.

Caching a DataFrame

df.cache()
# or
df.persist()
`cache()` uses the default storage level (MEMORY_AND_DISK for DataFrames; MEMORY_ONLY for RDDs). `persist()` lets you choose the storage level explicitly (MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, and others). Both are lazy: the data is only materialized the first time an action runs on the DataFrame.

Storage Levels

  • MEMORY_ONLY: store in memory; partitions that do not fit are recomputed from the lineage when needed.
  • MEMORY_AND_DISK: store in memory; spill to disk if needed.
  • DISK_ONLY: store only on disk.

Checking Cached Data

spark.catalog.isCached("table_name")  # True if the named table/view is currently cached
spark.catalog.clearCache()            # removes ALL cached tables and DataFrames

Uncaching

df.unpersist()  # remove df from the cache and free memory/disk

When to Cache

  • The DataFrame will be used multiple times (e.g., in a loop).
  • The same aggregation is performed repeatedly.
  • After expensive transformations like joins or aggregations.
  • Do not cache small DataFrames – overhead may outweigh benefit.


Two Minute Drill
  • `.cache()` stores a DataFrame for reuse (MEMORY_AND_DISK by default).
  • `.persist(StorageLevel)` offers more control.
  • Use cache when the same DataFrame is accessed multiple times.
  • Call `.unpersist()` to free resources when done.
