Data Splitting
You cannot evaluate a model on the same data used to train it – the model would simply memorize the answers. Data splitting divides your dataset into separate sets for training, validation, and testing.
Training set: teaches the model. Test set: evaluates final performance. Validation set: tunes hyperparameters.
Standard Split: Train / Test
Typically 70‑80% for training, 20‑30% for testing. The model never sees the test set until the end.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)

`random_state` ensures reproducibility.

Train / Validation / Test Split
When you need to tune hyperparameters, split into three: training (60%), validation (20%), test (20%).
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)

Why Not Use All Data for Training?
- Overfitting: Model memorizes instead of learning patterns.
- Unrealistic performance: Test on unseen data simulates real‑world.
- Hyperparameter tuning: Validation set prevents peeking at test set.
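To see why the validation set matters, here is a minimal sketch of the tuning workflow: candidate hyperparameters are compared on the validation set, and the test set is consulted exactly once at the end. The dataset (iris), the model (k-nearest neighbors), and the candidate values of `k` are illustrative choices, not prescribed by this article.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 60/20/20 split: carve off 40%, then halve it into validation and test.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)

# Tune on the validation set -- the test set stays untouched.
best_k, best_score = None, -1.0
for k in (1, 3, 5, 7):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_k, best_score = k, score

# Only now do we look at the test set, exactly once.
final = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print(f"best k={best_k}, test accuracy={final.score(X_test, y_test):.3f}")
```

Because the test set played no role in choosing `k`, its accuracy is an honest estimate of real-world performance.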
Important: Shuffle Before Splitting
Always shuffle data unless it is time series. `train_test_split` shuffles by default. This prevents order bias (e.g., all early samples being one class).
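A related safeguard for ordered, imbalanced data is stratified splitting: passing `stratify=y` to `train_test_split` preserves the class proportions in both splits. A small sketch with synthetic labels (the 90/10 class ratio is an illustrative assumption):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ordered, imbalanced data: 90 samples of class 0, then 10 of class 1.
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

# stratify=y keeps the 10% minority share in both train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
print(y_train.mean(), y_test.mean())  # both class ratios come out to 0.1
```

Without stratification, a small test set could easily end up with too few (or zero) minority-class samples.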
Two Minute Drill
- Split data into train (60‑80%) and test (20‑40%).
- Use validation set for hyperparameter tuning.
- Use `train_test_split` from scikit‑learn.
- Set `random_state` for reproducible splits.
Need more clarification?
Drop us an email at career@quipoinfotech.com
