Handling Missing Data
Missing data is a common problem. A dataset may have blank cells, `NaN`, or placeholders like `?`. Most ML models cannot handle missing values, so you usually have to address them before training.
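Placeholders such as `?` are not recognised as missing until they are converted to `NaN`. The following is a minimal sketch of two ways to do that with pandas; the CSV text, column names, and the `df_demo` / `df_loaded` variables are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV content where "?" and empty cells mark missing data.
raw_csv = "age,city\n25,Pune\n?,Delhi\n31,\n"

# na_values tells read_csv to treat "?" as missing (empty cells already are).
df_demo = pd.read_csv(io.StringIO(raw_csv), na_values=["?"])

# For a column that was already loaded as text, to_numeric with
# errors="coerce" turns anything that is not a number (here "?") into NaN.
df_loaded = pd.DataFrame({"age": ["25", "?", "31"]})
df_loaded["age"] = pd.to_numeric(df_loaded["age"], errors="coerce")
```

Note that `errors="coerce"` converts every unparsable entry, not only `?`, so it is worth checking which values were affected before relying on it.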
How Missing Data Occurs
- Human error during data entry.
- Sensor failure or data collection issues.
- Incomplete records (e.g., survey participants skip questions).
Option 1: Remove Missing Values
import pandas as pd
df = pd.read_csv('data.csv')
# Drop rows with any missing value
df_clean = df.dropna()
# Drop columns with any missing value
df_clean = df.dropna(axis=1)
Use this option when only a small share of the data is missing (less than 5%). Dropping too many rows throws away useful information.
Option 2: Impute (Fill) Missing Values
Replace missing values with a sensible estimate:
- Mean/median for numerical features.
- Mode (most frequent) for categorical features.
- Forward fill / backward fill for time series.
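The median, mode, and forward/backward fill options listed above can be sketched as follows. The `df_demo` frame and its `income`, `city`, and `temp` columns are hypothetical and exist only to make the example self-contained.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a skewed numeric column, a categorical column,
# and a time-ordered sensor reading.
df_demo = pd.DataFrame({
    "income": [30.0, 32.0, np.nan, 35.0, 500.0],
    "city": ["Pune", None, "Pune", "Delhi", "Pune"],
    "temp": [np.nan, 21.0, np.nan, np.nan, 24.0],
})

# Median is less sensitive to outliers (such as 500.0) than the mean.
df_demo["income"] = df_demo["income"].fillna(df_demo["income"].median())

# mode() returns a Series (ties are possible), so take the first entry.
df_demo["city"] = df_demo["city"].fillna(df_demo["city"].mode()[0])

# Forward fill carries the last observation forward; the leading gap has
# no earlier value, so a backward fill covers it afterwards.
df_demo["temp"] = df_demo["temp"].ffill().bfill()
```

Forward and backward fill only make sense when the rows are in time order, so sort by the timestamp first if the data is not already ordered.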
# Fill numerical column with mean
df['age'] = df['age'].fillna(df['age'].mean())
# Using scikit-learn's SimpleImputer
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
df[['age']] = imputer.fit_transform(df[['age']])
Which Strategy to Choose?
- Small dataset, few missing values → drop rows.
- Large dataset, many missing values → impute.
- Time series → forward/backward fill or interpolation.
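One way to turn these rules of thumb into code is sketched below. The `handle_missing` helper and its `max_drop_fraction` parameter are hypothetical names, and the 5% default simply reuses the threshold mentioned earlier; treat it as a starting point rather than a fixed rule.

```python
import numpy as np
import pandas as pd


def handle_missing(df, max_drop_fraction=0.05):
    """Drop incomplete rows if they are rare; otherwise impute.

    Hypothetical helper: the 5% default mirrors the rule of thumb above.
    """
    # Fraction of rows that contain at least one missing value.
    incomplete_fraction = df.isna().any(axis=1).mean()
    if incomplete_fraction <= max_drop_fraction:
        return df.dropna()

    filled = df.copy()
    for column in filled.columns:
        if filled[column].isna().all():
            continue  # nothing to estimate from
        if pd.api.types.is_numeric_dtype(filled[column]):
            fill_value = filled[column].median()
        else:
            fill_value = filled[column].mode()[0]
        filled[column] = filled[column].fillna(fill_value)
    return filled


# Time series: interpolate between neighbouring observations instead.
readings = pd.Series([10.0, np.nan, 14.0])
readings_filled = readings.interpolate()  # linear by default
```

The helper uses the median for numeric columns and the mode for everything else; swapping in the mean or a `SimpleImputer` would follow the same structure.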
Detect Missing Values First
print(df.isnull().sum()) # count missing per column
df.info() # prints non-null counts per column (no print() needed)
Two Minute Drill
- Missing data must be handled: most models cannot process `NaN`.
- Remove rows/columns with `dropna()` (if few missing).
- Impute with mean, median, mode using `fillna()` or `SimpleImputer`.
- Always check missing values with `isnull().sum()`.
