Selecting and Filtering

Selecting specific columns and filtering rows are the most common DataFrame operations. PySpark provides the `select()`, `filter()`, and `where()` methods; `where()` is simply an alias for `filter()`.

Selecting Columns

```python
from pyspark.sql.functions import col

df.select("name", "age").show()
df.select(df.name, df.age).show()
df.select(col("name"), col("age")).show()
```

Filtering Rows

```python
df.filter(df["age"] > 30).show()
df.where(df["age"] > 30).show()   # where() is an alias for filter()
df.filter("age > 30").show()      # SQL-style string expression
```

Multiple Conditions (AND, OR)

```python
df.filter((df["age"] > 30) & (df["salary"] > 50000)).show()
df.filter((df["age"] > 30) | (df["salary"] > 50000)).show()
```

Using `isin()` for Multiple Values

```python
df.filter(df["department"].isin("IT", "HR", "Sales")).show()
```


Two Minute Drill
  • `select()` chooses columns; `filter()` / `where()` choose rows.
  • Use `&` for AND, `|` for OR (wrap each condition in parentheses).
  • `isin()` checks membership in a list.
  • Column expressions can be written with `col()` or attribute syntax.

Need more clarification?

Drop us an email at career@quipoinfotech.com