Datasets and Train-Test Split
Before building a model, you need data and a way to evaluate the model. Scikit-learn provides built‑in datasets and a function, train_test_split, for splitting data into training and testing sets.
Built‑in Datasets
from sklearn.datasets import load_iris, load_diabetes, load_digits
# Classification dataset (iris flowers)
iris = load_iris()
X, y = iris.data, iris.target # features and labels
# Regression dataset (diabetes progression)
diabetes = load_diabetes()
X_d, y_d = diabetes.data, diabetes.target
Train/Test Split
To evaluate your model, you need data it has not seen during training. Split your dataset into a training set (to learn from) and a testing set (to evaluate on).
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# 80% train, 20% test
Why Split?
- Training set: Used to fit the model (learn patterns).
- Testing set: Used to evaluate how well the model generalizes to new data.
- Never use test data during training: the model would already have seen it, so the evaluation would give overly optimistic results.
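The difference between the two sets can be seen by scoring a model on both. The sketch below is only an illustration: the choice of DecisionTreeClassifier is an assumption made for this example, not something the tutorial prescribes. An unpruned tree tends to fit its training data almost perfectly, so the training score alone says little about how the model handles new data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier  # illustrative model choice (assumption)

# Load the iris data and hold out 20% of it for testing
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training set only; the test set is never shown to fit()
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# score() returns mean accuracy for classifiers
train_acc = model.score(X_train, y_train)  # measured on data the model has seen
test_acc = model.score(X_test, y_test)     # measured on held-out data

print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```

The test accuracy is the number to report, because it comes from samples the model never saw while fitting.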
Exploring the Data
print(X.shape) # number of samples, features
print(y.shape) # number of labels
print(iris.feature_names) # column names
Two Minute Drill
- Built‑in datasets: load_iris(), load_diabetes(), load_digits().
- Use train_test_split(X, y, test_size=0.2).
- Training set learns, testing set evaluates.
- Always set random_state for reproducibility.
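As a quick check of the last point, the sketch below splits the iris data twice with the same random_state and confirms that both calls return the same test set. The stratify=y argument at the end is an addition not covered above: it is an optional parameter of train_test_split that keeps class proportions the same in both sets, shown here only as a possible refinement.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Two calls with the same random_state produce the same split
_, X_test_a, _, y_test_a = train_test_split(X, y, test_size=0.2, random_state=42)
_, X_test_b, _, y_test_b = train_test_split(X, y, test_size=0.2, random_state=42)
same_split = np.array_equal(X_test_a, X_test_b) and np.array_equal(y_test_a, y_test_b)
print("identical splits:", same_split)

# Optional (not part of the tutorial above): stratify=y preserves class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
class_counts = np.bincount(y_test)  # number of test samples per class
print("test samples per class:", class_counts)
```

Without a fixed random_state, each run shuffles the data differently, so scores can change from run to run even when the code does not.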
