Data Cleaning

Real-world data is messy: it often contains missing values, duplicates, and inconsistent formats. Cleaning is commonly said to take up most of the work in AI projects (80% is the figure often quoted). Pandas provides tools to handle each of these issues.

Handling Missing Values

In pandas, missing values are usually represented as NaN (Not a Number). None and NaT (for dates) are also treated as missing.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8]
})

# Detect missing values
print(df.isnull())

# Drop rows with any missing values
df_clean = df.dropna()

# Fill missing values with a specific value
df_filled = df.fillna(0)

# Fill with column mean
df['A'] = df['A'].fillna(df['A'].mean())
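
As a small sketch of how these calls behave on the same sample data, the snippet below counts the missing values per column, drops incomplete rows, and fills every column with its own mean in one step. The variable names are chosen for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
})

# Count missing values per column (True counts as 1 when summed)
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Keep only the rows where every column has a value
complete_rows = df.dropna()
print(complete_rows)

# Fill each column with its own mean; mean() skips NaN by default
filled_with_means = df.fillna(df.mean())
print(filled_with_means)
```

Counting first is a cheap way to decide between dropping and filling: dropping rows here would discard half of the data.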

Removing Duplicates

# Rows 0 and 2 are identical
df = pd.DataFrame({'Name': ['A', 'B', 'A', 'C'], 'Value': [1, 2, 1, 4]})
# Drop repeated rows, keeping the first occurrence
df_no_dup = df.drop_duplicates()
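
It can help to inspect duplicates before removing them. The sketch below uses the same sample data, flags repeated rows with duplicated(), and shows the subset and keep parameters of drop_duplicates(). Treating 'Name' alone as the key is an assumption made for the example.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'B', 'A', 'C'], 'Value': [1, 2, 1, 4]})

# Boolean mask: True for rows that repeat an earlier row
dup_mask = df.duplicated()
print(dup_mask.tolist())

# Default keeps the first occurrence; reset_index renumbers the rows
df_no_dup = df.drop_duplicates().reset_index(drop=True)
print(df_no_dup)

# Judge duplicates by 'Name' only and keep the last occurrence
df_by_name = df.drop_duplicates(subset=['Name'], keep='last')
print(df_by_name)
```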

Changing Data Types

# Assumes df has an 'Age' column with no missing values; astype(int) raises an error on NaN
df['Age'] = df['Age'].astype(int)
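
Real columns often arrive as text with a few unparseable entries, where a plain astype(int) would fail. The following is a minimal sketch, assuming a hypothetical 'Age' column read as strings: pd.to_numeric with errors='coerce' turns bad entries into NaN, and the nullable 'Int64' dtype keeps the rest as integers.

```python
import pandas as pd

# Hypothetical data: ages read as text, with one unparseable entry
df = pd.DataFrame({'Age': ['25', '31', 'unknown', '40']})

# errors='coerce' turns unparseable entries into NaN instead of raising
age_numeric = pd.to_numeric(df['Age'], errors='coerce')

# The nullable 'Int64' dtype can hold integers alongside missing values
df['Age'] = age_numeric.astype('Int64')
print(df['Age'])
print(df.dtypes)
```

The missing entry can then be handled with dropna() or fillna() like any other missing value.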

Renaming Columns

# rename returns a new DataFrame; assigning it back is generally preferred over inplace=True
df = df.rename(columns={'OldName': 'NewName'})
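
Renaming usually comes up when headers are inconsistent. The sketch below uses made-up column names: it renames one column explicitly, then normalizes the remaining headers by stripping whitespace and lowercasing them.

```python
import pandas as pd

# Hypothetical column names, chosen for illustration
df = pd.DataFrame({'Cust Name': ['Ann', 'Bob'], 'AGE ': [30, 41]})

# Rename one column explicitly via a mapping of old name to new name
df = df.rename(columns={'Cust Name': 'name'})

# Normalize the remaining headers: strip whitespace, then lowercase
df.columns = df.columns.str.strip().str.lower()
print(df.columns.tolist())
```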

Why Cleaning Matters for AI

Most ML algorithms expect numeric input and cannot handle missing values or raw strings directly. Cleaning gets your data numeric, complete, and ready for training.
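
To illustrate what "numeric and complete" can look like in practice, here is a minimal sketch that chains the steps from this tutorial. The helper clean_for_training and the sample columns are hypothetical, and filling with the column mean is only one possible strategy.

```python
import numpy as np
import pandas as pd

def clean_for_training(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: deduplicate, coerce to numeric, fill gaps with column means."""
    out = df.drop_duplicates()
    # Convert every column to numbers; unparseable entries become NaN
    out = out.apply(pd.to_numeric, errors='coerce')
    # Replace remaining NaN values with each column's mean
    out = out.fillna(out.mean())
    return out.reset_index(drop=True)

# Made-up raw data: a duplicate row, a text placeholder, and a missing value
raw = pd.DataFrame({
    'height': ['170', '165', '170', 'n/a'],
    'weight': [65.0, np.nan, 65.0, 80.0],
})

clean = clean_for_training(raw)
print(clean)
print(clean.dtypes)
```

After this step every column is numeric and no values are missing, which is the form most training routines expect.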


Two Minute Drill
  • Detect missing: df.isnull().
  • Remove rows with missing: df.dropna().
  • Fill missing: df.fillna(value).
  • Remove duplicates: df.drop_duplicates().
