Model Evaluation & Tuning
Evaluating and tuning models is crucial for building reliable ML systems. MLlib provides evaluators and cross‑validation tools for hyperparameter tuning.
Evaluators for Different Tasks
- Regression: `RegressionEvaluator` (metrics: rmse, mae, r2).
- Binary classification: `BinaryClassificationEvaluator` (areaUnderROC, areaUnderPR).
- Multiclass classification: `MulticlassClassificationEvaluator` (accuracy, f1, weightedPrecision, weightedRecall).
- Clustering: `ClusteringEvaluator` (silhouette score).
```python
from pyspark.ml.evaluation import RegressionEvaluator

# Score a DataFrame of (label, prediction) rows with RMSE
evaluator = RegressionEvaluator(labelCol="label", predictionCol="prediction",
                                metricName="rmse")
rmse = evaluator.evaluate(predictions)
```
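The other evaluators follow the same `evaluate()` pattern; a minimal sketch for the classification evaluators listed above, assuming a `predictions` DataFrame with the usual `label`, `prediction`, and `rawPrediction` columns:

```python
from pyspark.ml.evaluation import (BinaryClassificationEvaluator,
                                   MulticlassClassificationEvaluator)

# Binary metrics are computed from the raw prediction (score) column
auc = BinaryClassificationEvaluator(labelCol="label",
                                    rawPredictionCol="rawPrediction",
                                    metricName="areaUnderROC").evaluate(predictions)

# Multiclass metrics use the discrete prediction column
f1 = MulticlassClassificationEvaluator(labelCol="label",
                                       predictionCol="prediction",
                                       metricName="f1").evaluate(predictions)
```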
Cross‑Validation (Train‑Validation Split)
`TrainValidationSplit` evaluates each candidate hyperparameter combination on a single train‑validation split, which keeps tuning cheap. For full k‑fold cross‑validation, use `CrossValidator`; it is more robust but roughly k times more expensive (see the sketch after the fit below).
```python
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit, CrossValidator

# Estimator to tune (assumed here: the linear regression from the RMSE example)
lr = LinearRegression(featuresCol="features", labelCol="label")

# 3 x 3 = 9 hyperparameter combinations
paramGrid = (ParamGridBuilder()
             .addGrid(lr.regParam, [0.01, 0.1, 1.0])
             .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
             .build())

tvs = TrainValidationSplit(estimator=lr,
                           estimatorParamMaps=paramGrid,
                           evaluator=evaluator,
                           trainRatio=0.8)
cv_model = tvs.fit(train)
```
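For k‑fold tuning, `CrossValidator` is a near drop‑in replacement; a sketch with 3 folds, reusing the grid and evaluator defined above (variable names are illustrative):

```python
# 3-fold cross-validation: fits 3 x 9 = 27 models instead of 9,
# scoring each grid point on every fold
cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=paramGrid,
                    evaluator=evaluator,
                    numFolds=3)
cv_model_3fold = cv.fit(train)
```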
Best Model and Parameters

```python
best_model = cv_model.bestModel

# Recover the winning ParamMap from the per-grid-point validation metrics
# (min, since lower RMSE is better)
metrics = cv_model.validationMetrics
best_index = metrics.index(min(metrics))
best_params = cv_model.getEstimatorParamMaps()[best_index]
```
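The winning values can also be read directly off the fitted best model; a quick sketch, assuming Spark 3.x (where fitted models expose getters for their estimator's params) and the `LinearRegression` estimator above:

```python
# In Spark 3.x, the fitted model carries the params it was trained with
print("regParam:", best_model.getRegParam())
print("elasticNetParam:", best_model.getElasticNetParam())
```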
Two Minute Drill
- Evaluators measure performance for regression, classification, and clustering tasks.
- `ParamGridBuilder` defines hyperparameter search space.
- `TrainValidationSplit` and `CrossValidator` automate tuning.
- Always tune on training data only; evaluate the final model once on a held‑out test set (see the sketch below).
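A one‑line sketch of that last point, assuming a held‑out `test` DataFrame that was never seen during tuning:

```python
# Score the tuned model exactly once on the held-out test set;
# the fitted TrainValidationSplitModel transforms with its bestModel
test_rmse = evaluator.evaluate(cv_model.transform(test))
```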
