Sorting and Limiting
Sorting and limiting are used to order data and view top or bottom rows. PySpark provides `orderBy()` and `limit()` methods.
Sorting (Ordering)
df_sorted = df.orderBy("age") # ascending
df_sorted = df.orderBy(df["age"].desc()) # descending
df_sorted = df.orderBy("department", df["salary"].desc()) # multiple columns
Sorting with SQL Expression
from pyspark.sql.functions import desc
df_sorted = df.orderBy(desc("salary"))
Limiting Rows
df_top = df.limit(10) # first 10 rows
Combining Sort and Limit
df_top_10_by_salary = df.orderBy(desc("salary")).limit(10)
Note on Performance
The rows returned by `limit()` are not deterministic unless the DataFrame is first sorted with `orderBy()`. For random sampling, use `sample()` instead.
Two Minute Drill
- `orderBy()` sorts rows; use `desc()` for descending order.
- `limit()` returns the first N rows.
- Combine `orderBy()` and `limit()` to get top N rows.
- Sorting on large datasets can be expensive; use it only when necessary.
Need more clarification?
Drop us an email at career@quipoinfotech.com
