GroupBy and Aggregations
Grouping data and computing aggregates is essential for summarising large datasets. PySpark provides `groupBy()` to group rows, with aggregation functions applied through `agg()`.
Basic GroupBy Aggregations
```python
from pyspark.sql.functions import count, sum, avg, max, min

# Average salary per department
df.groupBy("department").agg(avg("salary")).show()

# Total salary and employee count per department
df.groupBy("department").agg(sum("salary"), count("*")).show()

# Group by multiple columns
df.groupBy("department", "gender").agg(avg("age")).show()
```

Using `agg()` with Multiple Aliases
```python
from pyspark.sql.functions import count, avg

df.groupBy("department").agg(
    avg("salary").alias("avg_salary"),
    count("*").alias("emp_count")
).show()
```

Using `groupBy` with Multiple Functions
```python
from pyspark.sql.functions import avg, max, min

df.groupBy("department").agg(
    avg("salary"),
    max("salary"),
    min("salary")
).show()
```

Rollup and Cube (Advanced)
```python
from pyspark.sql.functions import avg

# rollup: hierarchical subtotals (by department, then department + gender) plus a grand total
df.rollup("department", "gender").agg(avg("salary")).show()

# cube: subtotals for every combination of the grouping columns, plus a grand total
df.cube("department", "gender").agg(avg("salary")).show()
```

Two Minute Drill
- `groupBy()` groups rows by one or more columns.
- Use `agg()` with aggregation functions like `sum`, `avg`, `count`, `max`, `min`.
- Alias the result columns with `.alias()`.
- `rollup()` produces hierarchical subtotals plus a grand total; `cube()` covers every combination of the grouping columns.
Need more clarification?
Drop us an email at career@quipoinfotech.com
