Window Functions
Window functions perform calculations across a set of rows related to the current row, without collapsing those rows into a single output row. They are used for ranking, running totals, moving averages, and similar per-row aggregations.
Creating a Window Specification
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, rank, dense_rank, lag, lead
window_spec = Window.partitionBy("department").orderBy(df["salary"].desc())
Row Number
df.withColumn("row_num", row_number().over(window_spec)).show()
Rank and Dense Rank
df.withColumn("rank", rank().over(window_spec)).show()
df.withColumn("dense_rank", dense_rank().over(window_spec)).show()
Lag and Lead (Previous / Next Row)
df.withColumn("prev_salary", lag("salary").over(window_spec)).show()
df.withColumn("next_salary", lead("salary").over(window_spec)).show()
Running Total (Sum)
from pyspark.sql.functions import sum as spark_sum  # alias to avoid shadowing Python's built-in sum
running_total = Window.partitionBy("department").orderBy("date").rowsBetween(Window.unboundedPreceding, Window.currentRow)
df.withColumn("running_total", spark_sum("amount").over(running_total)).show()
Two Minute Drill
- Window functions operate on a set of rows without collapsing them.
- Use `Window.partitionBy()` to group, `orderBy()` to sort.
- Common functions: `row_number`, `rank`, `dense_rank`, `lag`, `lead`.
- Define range with `rowsBetween()` for running totals.
Need more clarification?
Drop us an email at career@quipoinfotech.com
