
Decision Trees & Random Forest

Decision trees and random forests are powerful non‑linear models for classification and regression. They are relatively easy to interpret, handle both numerical and categorical features, and are available in PySpark through the `pyspark.ml` API.

Decision Tree Classifier

from pyspark.ml.classification import DecisionTreeClassifier

# train and test are DataFrames with a "features" vector column and a "label" column
dt = DecisionTreeClassifier(featuresCol="features", labelCol="label", maxDepth=5)
dt_model = dt.fit(train)                # learn split rules from the training data
predictions = dt_model.transform(test)  # adds prediction/probability columns

Feature Importance (Decision Tree)

After training, you can inspect which features contributed most to the tree's splits. `featureImportances` returns a sparse vector with one weight per feature, and the weights sum to 1:
importance = dt_model.featureImportances
print(importance)

Random Forest Classifier

Random forest trains many decision trees on random subsets of the data and features, then combines their predictions (majority vote for classification, averaging for regression). This reduces the overfitting that a single deep tree is prone to.
from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(featuresCol="features", labelCol="label", numTrees=100, maxDepth=5)
rf_model = rf.fit(train)
predictions = rf_model.transform(test)

Evaluation (Using MulticlassClassificationEvaluator)

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print(f"Accuracy: {accuracy}")


Two Minute Drill
  • Decision trees are interpretable but prone to overfitting.
  • Random forest combines many trees (voting/averaging) for better generalization.
  • Access feature importance with `.featureImportances`.
  • Use `MulticlassClassificationEvaluator` for accuracy.

Need more clarification?

Drop us an email at career@quipoinfotech.com