Logistic Regression

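Logistic regression is a classification algorithm that estimates the probability of each class and predicts the most likely one. It handles both binary and multiclass problems. Spark MLlib provides a distributed implementation, `LogisticRegression`, that trains on DataFrames too large for a single machine.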
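
To make the idea concrete, here is a minimal pure-Python sketch of how a binary logistic regression model turns one row of features into a prediction: a weighted sum is passed through the sigmoid function and the resulting probability is compared with a threshold. The weights, intercept, and 0.5 threshold are made-up illustrative values, not output from MLlib.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real number to a probability in (0, 1)."""
    # Two branches keep math.exp from overflowing for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(features, weights, intercept):
    """Probability of the positive class for one example."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def predict_label(features, weights, intercept, threshold=0.5):
    """Return 1.0 when the probability exceeds the threshold, else 0.0."""
    return 1.0 if predict_proba(features, weights, intercept) > threshold else 0.0


# Hypothetical coefficients, chosen only for illustration.
weights = [0.8, -0.4]
intercept = -0.2

p = predict_proba([2.0, 1.0], weights, intercept)
print(f"P(label = 1) = {p:.3f}")
print("prediction:", predict_label([2.0, 1.0], weights, intercept))
```

Training amounts to finding the weights and intercept that make these probabilities match the labels in the training data as closely as possible.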

Preparing Data

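Assume you already have a DataFrame `df` with a `features` column (a vector, typically built with `VectorAssembler`) and a numeric `label` column.

# Randomly split the rows into roughly 70% training and 30% test data;
# the seed fixes the random sampling
train, test = df.randomSplit([0.7, 0.3], seed=42)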

Create and Fit Logistic Regression Model

from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(featuresCol="features", labelCol="label")
lr_model = lr.fit(train)

Make Predictions

predictions = lr_model.transform(test)
predictions.select("features", "label", "prediction").show()

Model Evaluation

from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator(labelCol="label", metricName="areaUnderROC")
auc = evaluator.evaluate(predictions)
print(f"AUC: {auc}")
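
To build intuition for the number printed above: area under the ROC curve can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting as half. The pure-Python sketch below computes it that way on made-up labels and scores. It illustrates the metric only; MLlib derives the value from the ROC curve and does not use this pairwise loop.

```python
def auc_from_scores(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted scores for six test rows.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.3, 0.6, 0.4, 0.5, 0.2]

auc_sketch = auc_from_scores(labels, scores)
print(f"AUC: {auc_sketch:.3f}")
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of positives above negatives.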

Hyperparameters

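You can adjust `regParam` (regularization strength, default 0.0), `elasticNetParam` (the mix between an L2 penalty at 0.0 and an L1 penalty at 1.0, default 0.0), and `maxIter` (maximum number of optimizer iterations, default 100) to improve performance.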
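
As a rough illustration of how `regParam` and `elasticNetParam` interact, the sketch below computes the elastic net penalty that is added to the training loss, following the formula in the Spark documentation: `regParam * (elasticNetParam * L1 + (1 - elasticNetParam) / 2 * L2²)`. The weight values are hypothetical, and the function is plain Python for explanation only, not part of the MLlib API.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Penalty added to the loss for a given weight vector."""
    l1 = sum(abs(w) for w in weights)          # L1 norm
    l2_squared = sum(w * w for w in weights)   # squared L2 norm
    return reg_param * (
        elastic_net_param * l1 + (1.0 - elastic_net_param) / 2.0 * l2_squared
    )


# Hypothetical weights, chosen only for illustration.
weights = [0.5, -1.0, 2.0]

for alpha in (0.0, 0.5, 1.0):
    penalty = elastic_net_penalty(weights, reg_param=0.1, elastic_net_param=alpha)
    print(f"elasticNetParam={alpha}: penalty={penalty:.4f}")
```

In PySpark the same settings are passed to the estimator, for example `LogisticRegression(regParam=0.1, elasticNetParam=0.5, maxIter=50)`. Larger `regParam` values shrink the weights more strongly, and values of `elasticNetParam` closer to 1.0 tend to push some weights to exactly zero.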


Two Minute Drill
  • LogisticRegression is for binary/multiclass classification.
  • Use `randomSplit` to create train/test sets.
  • Evaluate binary models with `BinaryClassificationEvaluator` for AUC; for multiclass labels, use `MulticlassClassificationEvaluator`.
  • Tune `regParam`, `maxIter`, etc. for better performance.
