
Handling Missing Data

Real-world datasets almost always contain missing values. PySpark DataFrames provide `dropna()` to remove rows with nulls and `fillna()` to replace them.
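The snippets below assume a SparkSession and a small DataFrame containing some nulls. A minimal, illustrative setup (the column names and values here are made up for this tutorial):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("missing-data").getOrCreate()

# Illustrative rows with nulls scattered across columns
df = spark.createDataFrame(
    [("Alice", 34, 50000.0, "NY"), ("Bob", None, None, None), (None, 41, 62000.0, "LA")],
    ["name", "age", "salary", "city"],
)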

Detecting Missing Values

df.filter(df["column"].isNull()).show()
from pyspark.sql.functions import isnan, isnull
df.select(isnull("column").alias("is_null")).show()
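To see how many nulls each column holds at once, a common idiom combines `count` with a conditional (`count` skips nulls, so it tallies only the rows where the column is null):

from pyspark.sql.functions import count, when, isnull

# Null count per column, in a single pass
df.select([count(when(isnull(c), c)).alias(c) for c in df.columns]).show()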

Dropping Rows with Missing Values

df_clean = df.dropna() # drop rows with any null
df_clean = df.dropna(subset=["age", "salary"]) # only check these columns
df_clean = df.dropna(thresh=3) # keep rows with at least 3 non-null values
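`dropna()` also takes a `how` argument controlling when a row is dropped:

df_clean = df.dropna(how="any") # drop if any column is null (the default)
df_clean = df.dropna(how="all") # drop only if every column is null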

Filling Missing Values

# A fill value only touches columns of a matching type:
df_filled = df.fillna(0)                              # fills nulls in all numeric columns
df_filled = df.fillna("unknown", subset=["name"])     # fill a specific string column
df_filled = df.fillna({"age": 30, "city": "Unknown"}) # per-column values via a dict

Using Mean/Median for Imputation

from pyspark.sql.functions import mean

# mean() ignores nulls; collect the single aggregate value and use it as the fill
mean_age = df.select(mean("age")).collect()[0][0]
df_filled = df.fillna(mean_age, subset=["age"])
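For a median, `approxQuantile()` computes it directly on the DataFrame, and `pyspark.ml.feature.Imputer` can fill several columns in one pass. A sketch, assuming `age` and `salary` are numeric columns:

# Median via an approximate quantile (last argument is the relative error)
median_age = df.approxQuantile("age", [0.5], 0.01)[0]
df_filled = df.fillna(median_age, subset=["age"])

from pyspark.ml.feature import Imputer

# Imputer learns a mean/median per input column and writes filled copies;
# it needs numeric inputs (cast to double first on older Spark versions)
imputer = Imputer(strategy="median", inputCols=["age", "salary"], outputCols=["age_imp", "salary_imp"])
df_imputed = imputer.fit(df).transform(df)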


Two Minute Drill
  • `dropna()` removes rows with nulls.
  • `fillna()` replaces nulls with a value.
  • Use `subset` to restrict operations to specific columns.
  • Compute mean/median from the DataFrame for imputation.
