Handling Missing Data
Real datasets almost always contain missing values. PySpark DataFrames provide `dropna()` to discard incomplete rows and `fillna()` to replace nulls with default values.
Detecting Missing Values
from pyspark.sql.functions import isnan, isnull

df.filter(df["column"].isNull()).show()  # rows where the column is null
df.select(isnull("column").alias("is_null")).show()  # per-row boolean flag
# Note: isnull() detects nulls; isnan() detects floating-point NaN values.
# They are not the same thing.

Dropping Rows with Missing Values
df_clean = df.dropna() # drop rows with any null
df_clean = df.dropna(subset=["age", "salary"]) # only check these columns
df_clean = df.dropna(thresh=3)  # keep rows with at least 3 non-null values

Filling Missing Values
df_filled = df.fillna(0) # fill all numeric nulls with 0
df_filled = df.fillna("unknown", subset=["name"]) # fill specific column
df_filled = df.fillna({"age": 30, "city": "Unknown"})  # per-column defaults via dict

Using Mean/Median for Imputation
from pyspark.sql.functions import mean
mean_age = df.select(mean("age")).collect()[0][0]
df_filled = df.fillna(mean_age, subset=["age"])

Two Minute Drill
- `dropna()` removes rows with nulls.
- `fillna()` replaces nulls with a value.
- Use `subset` to restrict operations to specific columns.
- Compute mean/median from the DataFrame for imputation.
