Datasets and Train-Test Split

Before building a model, you need data and a way to evaluate the model trained on it. Scikit-learn provides built‑in datasets and a function, train_test_split, that splits data into training and testing sets.

Built‑in Datasets

from sklearn.datasets import load_iris, load_diabetes, load_digits

# Classification dataset (iris flowers)
iris = load_iris()
X, y = iris.data, iris.target # features and labels

# Regression dataset (diabetes progression)
diabetes = load_diabetes()
X_d, y_d = diabetes.data, diabetes.target
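
The loaders also accept return_X_y=True, which returns the features and labels directly instead of a container object. Below is a minimal sketch using load_digits, which is imported above but not otherwise used. The variable names X_digits and y_digits are illustrative only.

```python
from sklearn.datasets import load_digits

# return_X_y=True skips the container object and returns (data, target) directly
X_digits, y_digits = load_digits(return_X_y=True)

print(X_digits.shape)  # (1797, 64): 1797 images, each flattened from 8x8 pixels
print(y_digits.shape)  # (1797,): one digit label per image
```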

Train/Test Split

To evaluate your model, you need data it has not seen during training. Split your dataset into a training set (to learn from) and a testing set (to evaluate on).

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 80% train, 20% test
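
As a quick check on the split, the sketch below prints the resulting sizes. It also passes stratify=y, an optional train_test_split parameter that the call above does not use, so that each class keeps the same proportion in both sets. It reloads iris to stay self-contained.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y is an addition for illustration; it preserves class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
print(np.bincount(y_test))          # [10 10 10]: 10 test samples per class
```

Without stratify, the split is purely random, so class counts in the test set can be uneven on small datasets.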

Why Split?

  • Training set: Used to fit the model (learn patterns).
  • Testing set: Used to evaluate how well the model generalizes to new data.
  • Never use test data during training: the model would then be scored on examples it has already seen, which gives overly optimistic results.
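
The points above can be illustrated by scoring one model on both sets. This is a rough sketch, not a recommended workflow: the choice of DecisionTreeClassifier is an assumption made for illustration, picked because an unconstrained tree tends to fit its training data very closely.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training set only; the test set stays untouched until scoring
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # accuracy on data the model has seen
test_acc = model.score(X_test, y_test)     # accuracy on unseen data

print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```

The training score only shows how well the model fits data it was trained on. The test score is the one that estimates performance on new data.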

Exploring the Data

print(X.shape) # (n_samples, n_features)
print(y.shape) # (n_samples,), one label per sample
print(iris.feature_names) # column names
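
Beyond shapes, it often helps to look at the class names and how many samples each class has. Below is a small sketch using NumPy's bincount. The variable name class_counts is illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

print(iris.target_names)  # names of the three iris species

# Count how many samples carry each integer label (0, 1, 2)
class_counts = np.bincount(iris.target)
print(class_counts)  # [50 50 50]: the classes are perfectly balanced
```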


Two Minute Drill
  • Built‑in datasets: load_iris(), load_diabetes(), load_digits().
  • Use train_test_split(X, y, test_size=0.2).
  • The model learns from the training set and is evaluated on the testing set.
  • Always set random_state for reproducibility.
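
To see what random_state buys you, the sketch below runs the same split twice with the same seed and checks that both runs select identical test samples. The names X_test_a, X_test_b and same_split are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Two independent calls with the same seed
_, X_test_a, _, y_test_a = train_test_split(X, y, test_size=0.2, random_state=42)
_, X_test_b, _, y_test_b = train_test_split(X, y, test_size=0.2, random_state=42)

# Identical seeds produce identical shuffles, hence identical test sets
same_split = np.array_equal(X_test_a, X_test_b) and np.array_equal(y_test_a, y_test_b)
print(same_split)  # True
```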
