Logistic Regression
Logistic regression is a classification algorithm that models the probability of each class and is used for binary or multiclass prediction. Spark MLlib provides a distributed implementation that scales to large datasets.
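To make the idea concrete, here is a minimal plain-Python sketch of how a fitted binary logistic regression model turns a feature vector into a probability and then a class. The weights, bias, and feature values are made up for illustration, and this is not MLlib's implementation; the 0.5 cutoff mirrors the default threshold in Spark's `LogisticRegression`.

```python
import math

def sigmoid(z: float) -> float:
    """Map a real-valued score to a probability in (0, 1)."""
    # Branch on the sign of z so math.exp never overflows
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(weights, bias, features):
    """Probability of the positive class for one feature vector."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(score)

def predict(weights, bias, features, threshold=0.5):
    """Return 1.0 if the probability exceeds the threshold, else 0.0."""
    return 1.0 if predict_proba(weights, bias, features) > threshold else 0.0

# Hypothetical weights and feature vector, for illustration only
weights, bias = [0.8, -0.4], 0.1
row = [2.0, 1.0]
p = predict_proba(weights, bias, row)
print(f"probability={p:.3f}, prediction={predict(weights, bias, row)}")
```

Fitting the model amounts to finding the weights and bias that make these probabilities match the training labels as closely as possible.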
Preparing Data
Assume you already have a DataFrame `df` with a `features` column (a vector built with `VectorAssembler`) and a numeric `label` column.
# Split data into 70% training and 30% test sets (seed makes the split reproducible)
train, test = df.randomSplit([0.7, 0.3], seed=42)
Create and Fit Logistic Regression Model
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(featuresCol="features", labelCol="label")
lr_model = lr.fit(train)
Make Predictions
predictions = lr_model.transform(test)
predictions.select("features", "label", "prediction").show()
Model Evaluation
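AUC (area under the ROC curve), the metric used in this section, can be read as the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The plain-Python sketch below computes it that way on made-up labels and scores. It is only meant to show what the number means: it compares every pair, so it is quadratic in the number of rows, and it is not how Spark computes the metric internally.

```python
def area_under_roc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up labels and scores, for illustration only
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.3, 0.6, 0.7, 0.8]
auc_value = area_under_roc(labels, scores)
print(f"AUC: {auc_value:.3f}")
```

A value of 1.0 means every positive outranks every negative, while 0.5 is what random scoring would give.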
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator(labelCol="label", metricName="areaUnderROC")
auc = evaluator.evaluate(predictions)
print(f"AUC: {auc}")
Hyperparameters
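You can adjust `regParam` (regularization strength), `elasticNetParam` (the mix between L2 and L1 penalties, from 0 to 1), and `maxIter` (the maximum number of optimizer iterations) to improve performance.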
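As a rough illustration of how `regParam` and `elasticNetParam` interact, the plain-Python sketch below evaluates the elastic net penalty in the form given in the Spark documentation: `regParam * (a * L1 + (1 - a) / 2 * L2^2)`, where `a` is `elasticNetParam`. The function name and the coefficient values are hypothetical, and intercept handling and feature standardization are left out.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Penalty added to the loss: reg_param * (a * L1 + (1 - a) / 2 * L2^2)."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    a = elastic_net_param
    return reg_param * (a * l1 + (1.0 - a) / 2.0 * l2_squared)

weights = [0.5, -1.0, 2.0]  # hypothetical coefficients
for a in (0.0, 0.5, 1.0):
    # a = 0.0 is a pure L2 (ridge) penalty, a = 1.0 is a pure L1 (lasso) penalty
    print(a, elastic_net_penalty(weights, reg_param=0.1, elastic_net_param=a))
```

In PySpark these values are passed to the estimator, for example `LogisticRegression(regParam=0.1, elasticNetParam=0.5, maxIter=100)`. A larger `regParam` shrinks the coefficients more strongly, and an `elasticNetParam` closer to 1 pushes more of them to exactly zero.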
Two Minute Drill
- LogisticRegression is for binary/multiclass classification.
- Use `randomSplit` to create train/test sets.
- Evaluate binary models with `BinaryClassificationEvaluator` for AUC (use `MulticlassClassificationEvaluator` for multiclass models).
- Tune `regParam`, `maxIter`, etc. for better performance.
