Feature Scaling
Features often have different scales: for example, age (0‑100) vs. income (20,000‑200,000). Algorithms that rely on distances or gradient magnitudes (k‑NN, SVM, neural networks) are sensitive to scale, because features with larger values can dominate those with smaller ones. Feature scaling brings all features into a similar range.
Normalization (Min‑Max Scaling)
Scales values to a fixed range, usually [0, 1]. Formula: `(x - min) / (max - min)`.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_scaled = scaler.fit_transform(df[['age', 'income']])

Useful when features have bounded ranges (e.g., pixel intensities 0‑255).

Standardization (Z‑Score)
Centers data to mean 0 and standard deviation 1. Formula: `(x - mean) / std`.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df[['age', 'income']])

Preferred for many algorithms (linear regression, SVM, PCA). Does not assume a bounded range.

Which One to Choose?
- Normalization (MinMaxScaler): When you need values in a fixed range (e.g., neural networks with sigmoid activation).
- Standardization (StandardScaler): When features contain outliers (an outlier stretches the min‑max range and squashes the rest of the data, while standardization is less distorted), or when you use PCA, SVM, or linear regression.
- For tree‑based models (Random Forest, XGBoost), scaling is not needed: trees split on feature thresholds, and monotonic rescaling does not change which side of a split a value falls on.
Why Scaling Matters
Without scaling, a feature with large values (e.g., income) would dominate distance calculations, even if it is less important than age. Scaling ensures each feature contributes proportionally.
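This domination effect can be seen directly with Euclidean distances; the two sample points below are hypothetical:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two hypothetical people: [age, income]
X = np.array([[25.0, 50_000.0],
              [60.0, 52_000.0]])

# Raw distance is driven almost entirely by the income gap (2,000),
# even though the relative difference in age (35 years) is far larger.
raw_dist = np.linalg.norm(X[0] - X[1])

# After standardization, each feature contributes on a comparable scale.
X_scaled = StandardScaler().fit_transform(X)
scaled_dist = np.linalg.norm(X_scaled[0] - X_scaled[1])
```

Before scaling, the distance is roughly 2,000 (income units); after scaling, age and income contribute equally to it.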
Two Minute Drill
- Feature scaling brings all features to similar ranges.
- Normalization (MinMaxScaler) → [0,1] range.
- Standardization (StandardScaler) → mean 0, std 1.
- Tree‑based models do not require scaling.
Need more clarification?
Drop us an email at career@quipoinfotech.com
