Data Cleaning
Real-world data is messy: it contains missing values, duplicates, and inconsistently formatted entries. Cleaning is often said to take up most of the effort in AI projects (a commonly quoted figure is around 80%). Pandas provides powerful tools to handle these issues.
Handling Missing Values
Missing values are usually represented as NaN (Not a Number).
import pandas as pd
import numpy as np
df = pd.DataFrame({
'A': [1,2,np.nan,4],
'B': [5,np.nan,np.nan,8]
})
# Detect missing values
print(df.isnull())
# Drop rows with any missing values
df_clean = df.dropna()
# Fill missing values with a specific value
df_filled = df.fillna(0)
# Fill with column mean
df['A'] = df['A'].fillna(df['A'].mean())
Removing Duplicates
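Which rows count as duplicates depends on the columns being compared. As a minimal sketch on made-up data (the `people` frame is purely illustrative), `duplicated()` shows which rows would be dropped, and the `subset` and `keep` parameters of `drop_duplicates()` control which columns are compared and which copy survives:

```python
import pandas as pd

# Hypothetical sample data: row 2 repeats row 0 exactly, row 3 repeats only the Name
people = pd.DataFrame({
    'Name': ['A', 'B', 'A', 'A'],
    'Value': [1, 2, 1, 9],
})

# Boolean mask: True for every row that exactly repeats an earlier row
dup_mask = people.duplicated()

# Default behaviour: compare all columns, keep the first occurrence
no_dup = people.drop_duplicates()

# Compare only 'Name' and keep the last occurrence of each name
last_per_name = people.drop_duplicates(subset=['Name'], keep='last')

print(dup_mask.tolist())
print(no_dup)
print(last_per_name)
```

Checking the mask first is a cheap way to confirm that only the rows you expect will be removed.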
df = pd.DataFrame({'Name': ['A','B','A','C'], 'Value': [1,2,1,4]})
df_no_dup = df.drop_duplicates()
Changing Data Types
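Converting with `astype(int)` fails when a column contains missing values or text that is not a number. One possible way around this, sketched here on invented data (the `ages` frame is an assumption), is `pd.to_numeric` with `errors='coerce'`, which turns unparseable entries into NaN, followed by either the nullable `Int64` dtype or a fill:

```python
import pandas as pd

# Hypothetical raw column: numbers stored as strings, one bad entry, one missing
ages = pd.DataFrame({'Age': ['25', '32', 'unknown', None]})

# Unparseable or missing entries become NaN instead of raising an error
ages['Age'] = pd.to_numeric(ages['Age'], errors='coerce')

# Option 1: the nullable integer dtype keeps the gaps as <NA>
age_nullable = ages['Age'].astype('Int64')

# Option 2: fill the gaps first, then convert to a plain integer dtype
age_filled = ages['Age'].fillna(0).astype(int)

print(age_nullable)
print(age_filled)
```

Filling with 0 is only a placeholder choice here; whether a fill value or `<NA>` is appropriate depends on what the column means.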
# Assumes df has an 'Age' column with no missing values
df['Age'] = df['Age'].astype(int)
Renaming Columns
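When many column names need tidying at once, an alternative to listing every old and new pair is to transform the whole `columns` index with string methods. A small sketch, using hypothetical column names:

```python
import pandas as pd

# Hypothetical frame with inconsistent column names
raw = pd.DataFrame({' First Name ': ['Ann'], 'AGE': [30]})

# Rename one column explicitly; without inplace=True, rename() returns a new DataFrame
renamed = raw.rename(columns={'AGE': 'age'})

# Normalise every name: trim spaces, lowercase, replace spaces with underscores
tidy = raw.copy()
tidy.columns = (
    tidy.columns.str.strip().str.lower().str.replace(' ', '_', regex=False)
)

print(renamed.columns.tolist())
print(tidy.columns.tolist())
```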
df.rename(columns={'OldName': 'NewName'}, inplace=True)
Why Cleaning Matters for AI
Most ML algorithms cannot handle missing values or strings directly: they expect numeric input with no gaps. Cleaning, together with converting text columns to numbers, makes your data numeric, complete, and ready for training.
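To make "numeric and complete" concrete, here is a minimal sketch of a check you might run before training. The helper name `cleaning_report` and the sample data are assumptions for illustration and not part of pandas, and encoding text as category codes is just one simple option among several:

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

def cleaning_report(frame):
    """Hypothetical helper: list the columns that still block training."""
    missing = frame.isnull().sum()
    return {
        # Columns that still contain at least one missing value
        'has_missing': missing[missing > 0].index.tolist(),
        # Columns whose dtype is not numeric (for example, strings)
        'non_numeric': [c for c in frame.columns
                        if not is_numeric_dtype(frame[c])],
    }

# Made-up sample data: one missing number and one text column
data = pd.DataFrame({
    'height': [1.7, np.nan, 1.8],
    'city': ['Pune', 'Delhi', 'Pune'],
})
before = cleaning_report(data)

# Fill the gap with the column mean and encode the text column as integer codes
cleaned = data.copy()
cleaned['height'] = cleaned['height'].fillna(cleaned['height'].mean())
cleaned['city'] = cleaned['city'].astype('category').cat.codes
after = cleaning_report(cleaned)

print(before)
print(after)
```

Integer codes imply an ordering between categories, so for some models a different encoding may be a better fit.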
Two Minute Drill
- Detect missing: df.isnull().
- Remove rows with missing: df.dropna().
- Fill missing: df.fillna(value).
- Remove duplicates: df.drop_duplicates().
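The four calls in the drill can be tried together on one small frame. This is a sketch on made-up data; the name `sample` is used only for illustration:

```python
import numpy as np
import pandas as pd

# Made-up data: one missing value (row 2) and one exact duplicate (row 3 repeats row 1)
sample = pd.DataFrame({
    'A': [1.0, 2.0, np.nan, 2.0],
    'B': [5.0, 6.0, 7.0, 6.0],
})

# Detect missing: count NaN values per column
missing_per_column = sample.isnull().sum()

# Remove rows with missing values
dropped = sample.dropna()

# Fill missing values with a fixed value
filled = sample.fillna(0)

# Remove duplicates (keeps the first of each identical pair)
deduped = sample.drop_duplicates()

print(missing_per_column.to_dict())
print(len(dropped), len(filled), len(deduped))
```

Each call returns a new DataFrame and leaves `sample` unchanged, so the steps can be compared side by side.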
