Window Functions
Window functions perform calculations across a set of rows related to the current row, without collapsing those rows into a single output row. They are used for ranking, running totals, moving averages, and similar per-row aggregations.
Creating a Window Specification
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, rank, dense_rank, lag, lead
window_spec = Window.partitionBy("department").orderBy(df["salary"].desc())
Row Number
df.withColumn("row_num", row_number().over(window_spec)).show()
Rank and Dense Rank
df.withColumn("rank", rank().over(window_spec)).show()
df.withColumn("dense_rank", dense_rank().over(window_spec)).show()
Lag and Lead (Previous / Next Row)
df.withColumn("prev_salary", lag("salary").over(window_spec)).show()
df.withColumn("next_salary", lead("salary").over(window_spec)).show()
Running Total (Sum)
from pyspark.sql.functions import sum as spark_sum  # alias to avoid shadowing Python's built-in sum
running_total = Window.partitionBy("department").orderBy("date").rowsBetween(Window.unboundedPreceding, Window.currentRow)
df.withColumn("running_total", spark_sum("amount").over(running_total)).show()
Two Minute Drill
- Window functions operate on a set of rows without collapsing them.
- Use `Window.partitionBy()` to group, `orderBy()` to sort.
- Common functions: `row_number`, `rank`, `dense_rank`, `lag`, `lead`.
- Define range with `rowsBetween()` for running totals.
Need more clarification?
Drop us an email at career@quipoinfotech.com
