Cross-Validation
Cross‑validation is a robust technique for evaluating model performance by splitting data into multiple training/validation folds. It reduces the variance of performance estimates and helps detect overfitting.
K‑fold cross‑validation divides data into k folds, trains on k‑1 folds, validates on the remaining fold, and repeats.
K‑Fold Cross‑Validation (Standard)
1. Shuffle and split data into k equal folds.
2. For each fold i: train on all folds except i, validate on fold i.
3. Average the k validation scores.
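The three steps above can be sketched with scikit-learn's KFold splitter directly. The model and data here are placeholders (a logistic regression on synthetic data), chosen only to make the loop runnable:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Placeholder data and model for illustration.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = LogisticRegression()

kf = KFold(n_splits=5, shuffle=True, random_state=0)  # step 1: shuffle and split
fold_scores = []
for train_idx, val_idx in kf.split(X):  # step 2: one pass per fold
    model.fit(X[train_idx], y[train_idx])  # train on the other k-1 folds
    fold_scores.append(model.score(X[val_idx], y[val_idx]))  # validate on fold i

print(f"Mean accuracy: {np.mean(fold_scores):.3f}")  # step 3: average the scores
```

In practice cross_val_score (shown below) wraps this loop for you; writing it out by hand is mainly useful when you need custom per-fold logic.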
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5) # 5-fold
print(f"Mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")Stratified K‑Fold (for Classification)
Preserves class proportions in each fold. Use for imbalanced datasets.
from sklearn.model_selection import StratifiedKFold, cross_val_score
skf = StratifiedKFold(n_splits=5)
scores = cross_val_score(model, X, y, cv=skf)
When to Use Which?
- K‑fold: default choice, works well for most cases.
- Stratified K‑fold: for classification with imbalanced classes.
- Leave‑One‑Out (LOO): k = n, very computationally expensive.
- TimeSeriesSplit: for time series data (no shuffling).
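The last two splitters in the list are easy to misread, so here is a small sketch of how they partition a toy index array (six samples, chosen just to keep the printed splits short):

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, TimeSeriesSplit

X = np.arange(6).reshape(-1, 1)  # six samples, indices 0..5

# Leave-One-Out: k equals n, so one sample is held out per split.
loo = LeaveOneOut()
print(loo.get_n_splits(X))  # 6 splits for 6 samples

# TimeSeriesSplit: the training window always precedes the validation
# window in time, so there is no shuffling and no leakage from the future.
tscv = TimeSeriesSplit(n_splits=3)
for train_idx, val_idx in tscv.split(X):
    print(train_idx, val_idx)
# [0 1 2] [3]
# [0 1 2 3] [4]
# [0 1 2 3 4] [5]
```

Note how each TimeSeriesSplit training window grows while validation always stays strictly after it; that ordering is exactly why this splitter, not shuffled K‑fold, is the right choice for time series.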
Two Minute Drill
- Cross‑validation reduces variance in performance estimates.
- K‑fold (k=5 or 10) is standard.
- Stratified K‑fold preserves class balance.
- Use cross_val_score for easy implementation.
