
Sorting and Limiting

Sorting orders rows by one or more columns; limiting restricts output to a fixed number of rows. PySpark provides the `orderBy()` and `limit()` DataFrame methods for these tasks.

Sorting (Ordering)

df_sorted = df.orderBy("age") # ascending
df_sorted = df.orderBy(df["age"].desc()) # descending
df_sorted = df.orderBy("department", df["salary"].desc()) # multiple columns: department ascending, salary descending

Sorting with SQL Expression

from pyspark.sql.functions import desc
df_sorted = df.orderBy(desc("salary"))

Limiting Rows

df_top = df.limit(10) # first 10 rows

Combining Sort and Limit

df_top_10_by_salary = df.orderBy(desc("salary")).limit(10)

Note on Performance

Without an `orderBy()`, the rows returned by `limit()` are arbitrary and may differ between runs, since Spark takes rows from whichever partitions respond first. For random sampling, use `sample()` instead of `limit()`.


Two Minute Drill
  • `orderBy()` sorts rows; use `desc()` for descending order.
  • `limit(n)` returns at most n rows; which rows is arbitrary unless the DataFrame is sorted.
  • Combine `orderBy()` and `limit()` to get top N rows.
  • Sorting on large datasets can be expensive; use when necessary.

Need more clarification?

Drop us an email at career@quipoinfotech.com