
Data Splitting

You cannot evaluate a model on the same data used to train it – the model would simply memorize the answers. Data splitting divides your dataset into separate sets for training, validation, and testing.

  • Training set: teaches the model.
  • Validation set: tunes hyperparameters.
  • Test set: evaluates final performance.

Standard Split: Train / Test

Typically 70‑80% for training, 20‑30% for testing. The model never sees the test set until the end.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
Setting random_state ensures the split is reproducible across runs.
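To make the proportions concrete, here is a minimal runnable sketch using a toy dataset (the arrays below are illustrative, not from the tutorial). With 100 samples and test_size=0.2, the split yields 80 training and 20 test samples:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy dataset: 100 samples, 3 features each
X = np.arange(300).reshape(100, 3)
y = np.arange(100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(len(X_train), len(X_test))  # 80 20
```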

Train / Validation / Test Split

When you need to tune hyperparameters, split into three: training (60%), validation (20%), test (20%).
# Hold out 40%, then split that half-and-half → 20% validation, 20% test
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
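A self-contained sketch of the two-step split, with toy data standing in for a real dataset, confirms the 60/20/20 proportions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy dataset: 100 samples, 2 features each
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# Step 1: carve off 40% as a temporary holdout
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, random_state=42
)
# Step 2: split the holdout in half → validation and test
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```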

Why Not Use All Data for Training?

  • Overfitting: the model memorizes the training data instead of learning generalizable patterns.
  • Unrealistic performance estimates: testing on unseen data simulates how the model will behave on real-world inputs.
  • Hyperparameter tuning: a separate validation set prevents you from peeking at the test set while tuning.
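The overfitting point can be seen directly: an unconstrained decision tree fits its training data perfectly, yet scores worse on held-out data. This is a minimal sketch on a synthetic, deliberately noisy dataset (the data and model choice here are illustrative, not part of the tutorial):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic dataset with ~10% label noise, for illustration only
X, y = make_classification(n_samples=200, n_features=10, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# An unconstrained tree memorizes the training set...
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)

print(f"train accuracy: {train_acc:.2f}")  # typically 1.00
print(f"test accuracy:  {test_acc:.2f}")   # noticeably lower
```

Evaluating only on training data would have reported near-perfect accuracy, which is exactly the trap the test set exists to catch.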

Important: Shuffle Before Splitting

Always shuffle data unless it is time series. `train_test_split` shuffles by default. This prevents order bias (e.g., all early samples being one class).
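For time series, pass shuffle=False so the split respects chronological order. A small sketch with toy sequential data (illustrative only) shows the difference:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy sequential data: 10 ordered samples
X = np.arange(10).reshape(10, 1)

# Default: shuffled split, avoiding order bias
shuffled_train, shuffled_test = train_test_split(X, test_size=0.3, random_state=0)

# Time series: shuffle=False keeps the last samples as the test set
ts_train, ts_test = train_test_split(X, test_size=0.3, shuffle=False)

print(ts_train.ravel().tolist())  # [0, 1, 2, 3, 4, 5, 6]
print(ts_test.ravel().tolist())   # [7, 8, 9]
```

This way the model is always evaluated on data that comes after its training period, mirroring real forecasting.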


Two Minute Drill
  • Split data into train (60‑80%) and test (20‑30%).
  • Use a validation set for hyperparameter tuning.
  • Use `train_test_split` from scikit‑learn.
  • Set random_state for reproducible splits.

Need more clarification?

Drop us an email at career@quipoinfotech.com