GroupBy and Aggregations

Grouping data and computing aggregates is essential for summarising large datasets. PySpark provides `groupBy()`, which returns a `GroupedData` object, on which aggregation functions are applied via `agg()`.

Basic GroupBy Aggregations

from pyspark.sql.functions import count, sum, avg, max, min  # note: these shadow Python's built-in sum/max/min

df.groupBy("department").agg(avg("salary")).show()              # average salary per department
df.groupBy("department").agg(sum("salary"), count("*")).show()  # total salary and head count
df.groupBy("department", "gender").agg(avg("age")).show()       # grouping by multiple columns

Using `agg()` with Multiple Aliases

from pyspark.sql.functions import count, avg, sum

df.groupBy("department").agg(
    avg("salary").alias("avg_salary"),
    count("*").alias("emp_count")
).show()

Using `groupBy` with Multiple Functions

df.groupBy("department").agg(
    avg("salary"),
    max("salary"),
    min("salary")
).show()

Rollup and Cube (Advanced)

df.rollup("department", "gender").agg(avg("salary")).show()  # hierarchical subtotals plus grand total
df.cube("department", "gender").agg(avg("salary")).show()    # subtotals for every column combination


Two Minute Drill
  • `groupBy()` groups rows by the values of one or more columns and returns a `GroupedData` object.
  • Use `agg()` with aggregation functions like `sum`, `avg`, `count`, `max`, `min`.
  • Alias the result columns with `.alias()`.
  • `rollup()` produces hierarchical subtotals and a grand total; `cube()` adds subtotals for every combination of the grouping columns.

Need more clarification?

Drop us an email at career@quipoinfotech.com