Selecting and Filtering
Selecting specific columns and filtering rows are the most common DataFrame operations. PySpark provides the `select()`, `filter()`, and `where()` methods; `where()` is simply an alias for `filter()`.
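The snippets in this section assume a SparkSession and a DataFrame with `name`, `age`, `salary`, and `department` columns; a minimal setup sketch (the sample values are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("select-filter-demo").getOrCreate()

# Illustrative rows; any DataFrame with these columns behaves the same
data = [
    ("Alice", 34, 62000, "IT"),
    ("Bob", 28, 48000, "HR"),
    ("Cara", 41, 75000, "Sales"),
]
df = spark.createDataFrame(data, ["name", "age", "salary", "department"])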
Selecting Columns
df.select("name", "age").show()
df.select(df.name, df.age).show()
from pyspark.sql.functions import col
df.select(col("name"), col("age")).show()Filtering Rows
df.filter(df["age"] > 30).show()
df.where(df["age"] > 30).show()
df.filter("age > 30").show() # SQL syntaxMultiple Conditions (AND, OR)
df.filter((df["age"] > 30) & (df["salary"] > 50000)).show()
df.filter((df["age"] > 30) | (df["salary"] > 50000)).show()Using `isin()` for Multiple Values
df.filter(df["department"].isin("IT", "HR", "Sales")).show()Two Minute Drill
Two Minute Drill
- `select()` chooses columns; `filter()` / `where()` choose rows.
- Use `&` for AND, `|` for OR (wrap each condition in parentheses).
- `isin()` checks membership in a list.
- Column expressions can be written with `col()` or attribute syntax.
Need more clarification?
Drop us an email at career@quipoinfotech.com
