Handling Categorical Data
Most machine learning models operate on numbers, so text categories like "red", "blue", "green" or "low", "medium", "high" must be converted to numeric form before training. This chapter covers the two main techniques: label encoding and one‑hot encoding.
Label Encoding
Assign each unique category an integer (0, 1, 2, ...).
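from sklearn.preprocessing import LabelEncoder
# df is assumed to be a pandas DataFrame with a 'color' column
le = LabelEncoder()
df['color_encoded'] = le.fit_transform(df['color'])
# Classes are sorted alphabetically: 'blue' → 0, 'green' → 1, 'red' → 2
Use when categories have a natural order (ordinal), e.g., "low" < "medium" < "high". Note that LabelEncoder assigns integers in alphabetical order, not by rank ("high" → 0, "low" → 1, "medium" → 2), and scikit-learn intends it for target labels. To preserve a meaningful order, map the categories explicitly or use OrdinalEncoder with an explicit categories list. Avoid integer codes for nominal categories (no order), because many models will treat 0 < 1 < 2 as meaningful.
One‑Hot Encoding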
Create a new binary column for each category. For each row, the column is 1 if the row belongs to that category, else 0.
# Using pandas get_dummies
import pandas as pd
df_encoded = pd.get_dummies(df, columns=['color'])
# Using scikit-learn (sparse_output needs scikit-learn 1.2+; older versions use sparse=False)
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(df[['color']])
Best for nominal categories without order. It creates one column per category, so the dataset can grow considerably when a feature has many unique values.
When to Use Which?
| Scenario | Recommended |
|---|---|
| Ordinal categories (small, medium, large) | Label Encoding (integers in rank order) |
| Nominal categories (colors, countries) | One‑Hot Encoding |
| High cardinality (many unique values) | Frequency encoding or embeddings |
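Frequency encoding, mentioned in the last row of the table, can be sketched as follows. This is a minimal illustration, not a definitive recipe: the `city` column and its values are made up for the example, and each category is simply replaced by the share of rows in which it appears.

```python
import pandas as pd

# Hypothetical data: 'city' stands in for a high-cardinality feature
df = pd.DataFrame({'city': ['Pune', 'Delhi', 'Pune', 'Mumbai', 'Pune', 'Delhi']})

# Share of rows belonging to each category
freq = df['city'].value_counts(normalize=True)

# Replace each category with its relative frequency (a single numeric column)
df['city_freq'] = df['city'].map(freq)
```

One trade-off of this approach is that categories with the same frequency receive the same value, so they become indistinguishable to the model. In practice the frequencies would usually be computed on the training data only and then applied to new data.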
Example: Predicting House Prices
Neighborhood (nominal) → one‑hot encode. Quality rating (ordinal: poor < average < good) → label encode, keeping the order poor = 0, average = 1, good = 2.
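The house price example might look like the sketch below. The DataFrame, its column names and the neighborhood values are hypothetical. The quality rating is mapped with an explicit dictionary so that the integers follow the rank order rather than alphabetical order.

```python
import pandas as pd

# Hypothetical housing data; names and values are illustrative only
houses = pd.DataFrame({
    'neighborhood': ['Northside', 'Downtown', 'Northside', 'Riverside'],
    'quality': ['good', 'poor', 'average', 'good'],
})

# Ordinal feature: map categories to integers in rank order
quality_order = {'poor': 0, 'average': 1, 'good': 2}
houses['quality_encoded'] = houses['quality'].map(quality_order)

# Nominal feature: one binary column per neighborhood
# (the original text column 'quality' is dropped once it has been encoded)
houses_encoded = pd.get_dummies(
    houses.drop(columns=['quality']),
    columns=['neighborhood'],
)
```

The result has one integer column for quality and one binary column for each neighborhood, so every feature is numeric.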
Two Minute Drill
- Categorical data must be numeric for ML models.
- Label encoding assigns integers (for ordinal data).
- One‑hot encoding creates binary columns (for nominal data).
- Use pd.get_dummies() or OneHotEncoder for one‑hot encoding.
