Handling Missing Data

Missing data is a common problem. A dataset may contain blank cells, `NaN` values, or placeholders such as `?`. Most ML models cannot handle missing values, so you must address them before training.
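Placeholders such as `?` are not recognised as missing unless pandas is told about them. The sketch below is one way to do that with the `na_values` argument of `read_csv`. The CSV content and the name `df_demo` are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content: "?" and an empty cell both mark missing entries.
raw = io.StringIO(
    "age,city\n"
    "25,Pune\n"
    "?,Delhi\n"
    "31,\n"
)

# na_values adds "?" to the markers that read_csv already treats as missing.
df_demo = pd.read_csv(raw, na_values=["?"])

# Count missing values per column after loading.
missing_counts = df_demo.isnull().sum()
print(missing_counts)
```

Without `na_values`, the `age` column would be read as text because of the `?`, and the gap would go unnoticed by `isnull()`.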

How Missing Data Occurs

  • Human error during data entry.
  • Sensor failure or data collection issues.
  • Incomplete records (e.g., survey participants skip questions).

Option 1: Remove Missing Values

import pandas as pd

df = pd.read_csv('data.csv')
# Drop rows with any missing value
df_rows_dropped = df.dropna()
# Drop columns with any missing value
df_cols_dropped = df.dropna(axis=1)

Use this option when only a small share of the data is missing (as a rule of thumb, less than 5%). Dropping too many rows throws away useful information.
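Before dropping anything, it can help to measure how much is missing and drop selectively. The sketch below uses made-up data, and the 50% column threshold is an assumption chosen for illustration, not a rule from this tutorial.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'income' is mostly missing, 'age' has a single gap.
df_demo = pd.DataFrame({
    "age":    [25, 32, np.nan, 41, 29],
    "income": [np.nan, np.nan, 52000, np.nan, np.nan],
    "city":   ["Pune", "Delhi", "Mumbai", "Pune", "Delhi"],
})

# Fraction of missing values per column (0.0 to 1.0).
missing_frac = df_demo.isnull().mean()

# Drop columns that are more than half empty (threshold is an assumption).
sparse_cols = missing_frac[missing_frac > 0.5].index
df_cols_dropped = df_demo.drop(columns=sparse_cols)

# Drop only the rows where 'age' is missing, instead of any column.
df_rows_dropped = df_cols_dropped.dropna(subset=["age"])

print(df_rows_dropped)
```

Using `subset` keeps rows that are complete in the columns you care about, so fewer rows are lost than with a plain `dropna()`.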

Option 2: Impute (Fill) Missing Values

Replace missing values with a sensible estimate:
  • Mean/median for numerical features.
  • Mode (most frequent) for categorical features.
  • Forward fill / backward fill for time series.

# Fill numerical column with mean
# (assign the result back; inplace=True on a selected column is deprecated in recent pandas)
df['age'] = df['age'].fillna(df['age'].mean())

# Using scikit-learn's SimpleImputer
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
df[['age']] = imputer.fit_transform(df[['age']])
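The same pattern extends to the median and the mode. The sketch below uses made-up data with pandas only. In scikit-learn, `SimpleImputer(strategy='median')` and `SimpleImputer(strategy='most_frequent')` are the corresponding options.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'age' contains an outlier (100), 'city' is categorical.
df_demo = pd.DataFrame({
    "age":  [20.0, 30.0, np.nan, 100.0],
    "city": ["Pune", np.nan, "Pune", "Delhi"],
})

# The median is less affected by the outlier than the mean would be.
df_demo["age"] = df_demo["age"].fillna(df_demo["age"].median())

# mode() returns a Series (there can be ties); take the first entry.
most_common_city = df_demo["city"].mode()[0]
df_demo["city"] = df_demo["city"].fillna(most_common_city)

print(df_demo)
```

Here the mean of the known ages is 50, while the median is 30, which is why the median is often preferred for skewed columns.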

Which Strategy to Choose?

  • Small dataset, few missing values → drop rows.
  • Large dataset, many missing values → impute.
  • Time series → forward/backward fill or interpolation.
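For the time series case, here is a minimal sketch of the three fills on a made-up daily series. The dates and readings are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily temperature readings with gaps.
idx = pd.date_range("2024-01-01", periods=5, freq="D")
temps = pd.Series([10.0, np.nan, 14.0, 15.0, np.nan], index=idx)

# Forward fill: carry the last known value forward.
forward = temps.ffill()

# Backward fill: pull the next known value backward.
# The final gap stays missing because no later value exists.
backward = temps.bfill()

# Linear interpolation: place gaps on a straight line between neighbours.
linear = temps.interpolate()

print(pd.DataFrame({"ffill": forward, "bfill": backward, "linear": linear}))
```

Forward fill is usually the safer default when future values must not leak into the past, for example in forecasting.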

Detect Missing Values First

print(df.isnull().sum())  # count missing values per column
df.info()  # prints non-null counts; info() returns None, so no print() is needed
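Raw counts are easier to compare against a threshold when expressed as percentages. The sketch below, on made-up data, shows one way to get the percentage missing per column and to list the affected rows.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in both columns.
df_demo = pd.DataFrame({
    "age":  [25, np.nan, 31, np.nan],
    "city": ["Pune", "Delhi", None, "Pune"],
})

# isnull().mean() is the fraction missing per column; scale to percent.
missing_pct = df_demo.isnull().mean() * 100

# Rows that contain at least one missing value.
rows_with_gaps = df_demo[df_demo.isnull().any(axis=1)]

print(missing_pct)
print(rows_with_gaps)
```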


Two Minute Drill

  • Missing data must be handled – most models cannot process `NaN`.
  • Remove rows/columns with dropna() when only a few values are missing.
  • Impute with the mean, median, or mode using fillna() or SimpleImputer.
  • Always check for missing values first with isnull().sum().
