Handling Missing Data

Missing data is a common problem. A dataset may contain blank cells, `NaN` values, or placeholders such as `?`. Most ML models cannot handle missing values directly, so you must address them before training.
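Placeholders such as `?` are not recognised as missing unless pandas is told about them. The sketch below is one way to do that at load time; the inline CSV text and the column names are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV text in which "?" marks a missing value
raw = io.StringIO("age,city\n25,Pune\n?,Delhi\n31,?\n")

# na_values tells read_csv to treat "?" as NaN while parsing
df = pd.read_csv(raw, na_values=["?"])

# Each column should now report one missing value
print(df.isna().sum())
```

Without `na_values`, the `age` column would be read as text because of the `?` entry, and the placeholder would not show up in a missing-value count.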

How Missing Data Occurs

Human error during data entry.
Sensor failure or data collection issues.
Incomplete records (e.g., survey participants skip questions).

Option 1: Remove Missing Values

import pandas as pd

df = pd.read_csv('data.csv')
# Drop rows that contain any missing value
df_rows_dropped = df.dropna()
# Drop columns that contain any missing value
df_cols_dropped = df.dropna(axis=1)

Use this option when only a small share of the data is missing (as a rule of thumb, less than 5%). Dropping too many rows throws away useful information.
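One way to apply the 5% rule of thumb is to measure the share of affected rows before dropping anything. This is a minimal sketch on made-up data; the `THRESHOLD` name and the fallback branch are assumptions for illustration, not a fixed recipe.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 1 of 40 rows has a missing 'age'
df = pd.DataFrame({
    "age": [30.0] * 39 + [np.nan],
    "score": list(range(40)),
})

# Fraction of rows that contain at least one missing value
missing_row_share = df.isna().any(axis=1).mean()

# Assumed cutoff, taken from the 5% rule of thumb
THRESHOLD = 0.05

if missing_row_share < THRESHOLD:
    df_clean = df.dropna()   # few rows affected: dropping is cheap
else:
    df_clean = df            # too many rows affected: consider imputing instead

print(f"{missing_row_share:.1%} of rows affected, {len(df_clean)} rows kept")
```

To drop rows only when specific columns are missing, `dropna` also accepts a `subset` argument, for example `df.dropna(subset=['age'])`.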

Option 2: Impute (Fill) Missing Values

Replace missing values with a sensible estimate:

Mean/median for numerical features.
Mode (most frequent) for categorical features.
Forward fill / backward fill for time series.

# Fill numerical column with mean (assign the result back to the column)
df['age'] = df['age'].fillna(df['age'].mean())

# Using scikit-learn's SimpleImputer
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
df[['age']] = imputer.fit_transform(df[['age']])
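The mean example covers only one of the strategies listed above. The following sketch shows the median, mode, and forward-fill variants on a small made-up DataFrame; the column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [30.0, 32.0, np.nan, 500.0],    # numerical, with an outlier
    "city": ["Pune", None, "Pune", "Delhi"],  # categorical
    "temp": [21.0, np.nan, np.nan, 24.0],     # ordered readings
})

# Median is less sensitive to the outlier than the mean
df["income"] = df["income"].fillna(df["income"].median())

# mode() returns a Series (ties are possible), so take the first entry
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Forward fill carries the last observed value forward
df["temp"] = df["temp"].ffill()

print(df)
```

Here the mean of `income` would be pulled far upward by the single value of 500, which is why the median is often the safer choice for skewed columns.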

Which Strategy to Choose?

Small dataset, few missing values → drop rows.
Large dataset, many missing values → impute.
Time series → forward/backward fill or interpolation.
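For time series, the difference between forward fill and interpolation is easiest to see side by side. This is a small sketch on a made-up daily series; the dates and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with a two-day gap
idx = pd.date_range("2024-01-01", periods=5, freq="D")
s = pd.Series([10.0, np.nan, np.nan, 16.0, 18.0], index=idx)

# Forward fill repeats the last known value across the gap
filled_forward = s.ffill()

# Linear interpolation draws a straight line between the known neighbours
filled_linear = s.interpolate(method="linear")

print(pd.DataFrame({"original": s, "ffill": filled_forward, "linear": filled_linear}))
```

Forward fill suits values that stay constant until they change (such as a status or a price list), while interpolation is usually a better fit for smoothly varying measurements.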

Detect Missing Values First

print(df.isnull().sum())   # count missing per column
df.info()                  # prints non-null counts (info() returns None, so no print() is needed)
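Raw counts are hard to compare across datasets of different sizes, so it can help to look at percentages as well. The sketch below uses a made-up DataFrame; the column names are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, np.nan],
    "city": ["Pune", "Delhi", None, "Pune"],
    "score": [1, 2, 3, 4],
})

# isna() is the same check as isnull(); the mean of booleans gives the share
missing_pct = df.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)

# Rows that contain at least one missing value, for manual inspection
rows_with_gaps = df[df.isna().any(axis=1)]
print(rows_with_gaps)
```

Sorting the percentages puts the worst-affected columns first, which makes it easier to decide between dropping and imputing column by column.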

Two Minute Drill

Missing data must be handled before training; most models cannot process `NaN`.
Remove rows or columns with dropna() when only a few values are missing.
Impute with the mean, median, or mode using fillna() or SimpleImputer.
Always check missing values with isnull().sum().

Need more clarification?

Drop us an email at career@quipoinfotech.com
