Decision Trees & Random Forest
Decision trees and random forests are powerful non‑linear models for classification and regression. They are easy to interpret and handle both numerical and categorical features.
Decision Tree Classifier
from pyspark.ml.classification import DecisionTreeClassifier
dt = DecisionTreeClassifier(featuresCol="features", labelCol="label", maxDepth=5)
dt_model = dt.fit(train)
predictions = dt_model.transform(test)

Feature Importance (Decision Tree)
After training, you can inspect which features contributed most to the splits:
importance = dt_model.featureImportances
print(importance)

Random Forest Classifier
A random forest builds many decision trees on random subsets of the data and combines their outputs (majority vote for classification, averaging for regression), which reduces overfitting.
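The combining step itself is simple. A toy illustration in plain Python (independent of Spark, with made-up per-tree predictions) of majority voting across three trees:

```python
from collections import Counter

# Hypothetical class predictions from three trees for four test rows.
tree_preds = [
    [0, 1, 1, 0],  # tree 1
    [0, 1, 0, 0],  # tree 2
    [1, 1, 1, 0],  # tree 3
]

# Majority vote per row: the most common label across the trees wins.
ensemble = [Counter(votes).most_common(1)[0][0] for votes in zip(*tree_preds)]
print(ensemble)  # -> [0, 1, 1, 0]
```

Even though tree 2 and tree 3 each disagree with the ensemble on one row, the vote smooths out their individual errors.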
from pyspark.ml.classification import RandomForestClassifier
rf = RandomForestClassifier(featuresCol="features", labelCol="label", numTrees=100, maxDepth=5)
rf_model = rf.fit(train)
predictions = rf_model.transform(test)

Evaluation (Using MulticlassClassificationEvaluator)
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print(f"Accuracy: {accuracy}")

Two Minute Drill
- Decision trees are interpretable but prone to overfitting.
- Random forest averages many trees for better generalization.
- Access feature importance with `.featureImportances`.
- Use `MulticlassClassificationEvaluator` for accuracy.
Need more clarification?
Drop us an email at career@quipoinfotech.com
