
python-for-ai / Datasets and Train-Test Split (interview)

Q1. Scenario: You have a dataset of 1000 samples. Split it into training (80%) and testing (20%) using train_test_split. Set random_state for reproducibility.
from sklearn.model_selection import train_test_split; X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42). Setting random_state ensures the same split on every run; the stratify parameter preserves class proportions in classification problems.
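A minimal sketch of this split using synthetic stand-in data (the 1000×5 shape and binary labels are assumptions, not part of the question):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 samples, 5 features, binary labels (assumed).
rng = np.random.default_rng(0)
X = rng.random((1000, 5))
y = rng.integers(0, 2, size=1000)

# 80/20 split; random_state=42 makes the split reproducible,
# stratify=y keeps the class ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)  # (800, 5) (200, 5)
```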

Q2. Scenario: Why do we need separate training and test sets? What happens if we evaluate the model on the same data it was trained on?
Evaluating on the training data gives an over-optimistic score: the model can simply memorize the samples, so the score says nothing about performance on unseen data and overfitting goes undetected. A held-out test set provides an unbiased estimate of generalization performance.
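A small demonstration of the gap, assuming an unpruned decision tree (a model that can memorize its training set) on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data (assumed sizes for illustration).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# An unpruned tree fits the training set perfectly (memorization).
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

print("train accuracy:", model.score(X_train, y_train))  # typically 1.0
print("test accuracy: ", model.score(X_test, y_test))    # lower: the honest estimate
```

The training score is misleadingly perfect; only the test score reflects how the model will behave on new data.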

Q3. Scenario: For small datasets, a single train-test split might be unreliable. What technique can you use?
Use k-fold cross-validation: from sklearn.model_selection import cross_val_score, KFold; kfold = KFold(n_splits=5, shuffle=True, random_state=42); scores = cross_val_score(model, X, y, cv=kfold). Averaging the fold scores gives a more robust performance estimate than any single split.
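A runnable sketch of 5-fold cross-validation; the small dataset and the choice of LogisticRegression are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# A deliberately small dataset, where one split would be noisy.
X, y = make_classification(n_samples=120, n_features=10, random_state=42)

# 5 folds, shuffled reproducibly; each fold serves once as the test set.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kfold)

print(scores)                      # one accuracy score per fold
print(scores.mean(), scores.std()) # mean +/- spread across folds
```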

Q4. Scenario: Load the diabetes dataset and split it into train/test sets. When would StratifiedKFold be appropriate?
from sklearn.datasets import load_diabetes; diabetes = load_diabetes(); X, y = diabetes.data, diabetes.target; X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42). Note that the diabetes target is continuous (regression), so stratification does not apply here; for a classification dataset, pass stratify=y to train_test_split or use StratifiedKFold to preserve class proportions in each fold.
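The same steps as a self-contained script (the printed shapes follow from the dataset's 442 samples and sklearn's rounding of the 20% test fraction):

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

# load_diabetes: 442 samples, 10 features, continuous target (regression).
diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target

# 80/20 split; no stratify because the target is continuous.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (353, 10) (89, 10)
```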

Q5. Scenario: Split the data into training, validation, and test sets (e.g., 60/20/20). Use train_test_split twice.
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42); X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42). The second split takes 25% of the remaining 80%, i.e. 20% of the total, giving 60/20/20; set random_state on both calls for reproducibility.
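The two-step split in full, on synthetic stand-in data of 1000 samples (an assumed size chosen so the 60/20/20 counts come out as round numbers):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 1000 samples (assumed for illustration).
X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000)

# Step 1: hold out 20% of the total as the final test set.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 2: 25% of the remaining 80% = 20% of the total for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```

Train on X_train, tune hyperparameters against X_val, and touch X_test only once for the final report.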