Project 3: Predictive Model

In this project, you will build a logistic regression model that classifies customer churn using PySpark MLlib. The steps cover feature engineering, training, evaluation, and saving the model.

Step 1: Load and Prepare Data

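from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, StringIndexer

# Reuse the active session, or create one when running as a standalone script
spark = SparkSession.builder.getOrCreate()

# Read the CSV, taking column names from the header row and inferring column types
df = spark.read.csv("customer_data.csv", header=True, inferSchema=True)

# Convert the categorical "gender" column to a numeric index
indexer = StringIndexer(inputCol="gender", outputCol="genderIndex")
df = indexer.fit(df).transform(df)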
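
To make the indexing step less of a black box, here is a minimal plain-Python sketch of how StringIndexer chooses indices under its default "frequencyDesc" ordering: the most frequent label gets 0.0, the next gets 1.0, and so on, with ties broken alphabetically in recent Spark versions. The function name string_index_mapping and the sample values are hypothetical and only illustrate the idea; this is not Spark's implementation.

```python
from collections import Counter


def string_index_mapping(values):
    """Mimic StringIndexer's default "frequencyDesc" ordering in plain Python."""
    counts = Counter(values)
    # Most frequent label first; equal counts are ordered alphabetically.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    # StringIndexer writes indices as doubles, so floats are used here too.
    return {label: float(i) for i, label in enumerate(ordered)}


# Hypothetical sample of the "gender" column (not taken from customer_data.csv).
genders = ["Female", "Male", "Female", "Male", "Female"]
mapping = string_index_mapping(genders)
print(mapping)
```

Because the index depends on label frequency in the data the indexer was fitted on, the same category can receive a different index on a different dataset. Reusing the fitted indexer at prediction time keeps the mapping consistent.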

Step 2: Assemble Features

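# Numeric columns plus the indexed gender column
feature_cols = ["age", "tenure", "monthly_charges", "total_charges", "genderIndex"]

# Combine the feature columns into one vector column.
# By default VectorAssembler raises an error on null values, so clean or drop them first.
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
df = assembler.transform(df)

# Keep only what the model needs. LogisticRegression expects a numeric label,
# so this assumes "churn" is already encoded as 0/1.
df = df.select("features", "churn")
df = df.withColumnRenamed("churn", "label")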

Step 3: Train/Test Split

train, test = df.randomSplit([0.7, 0.3], seed=42)

Step 4: Train Logistic Regression

from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(featuresCol="features", labelCol="label")
lr_model = lr.fit(train)
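
As a rough illustration of what the fitted model does with each row, the sketch below scores one feature vector by hand: a weighted sum of the features plus an intercept, passed through the sigmoid function, then compared against the default 0.5 threshold. The coefficient values, the sample row, and the function name predict_churn_probability are made up for illustration; after training, the fitted values are available as lr_model.coefficients and lr_model.intercept.

```python
import math


def predict_churn_probability(features, coefficients, intercept):
    """Return P(label = 1) for one row under a binary logistic regression model."""
    # Linear part: intercept plus the dot product of weights and features.
    margin = intercept + sum(w * x for w, x in zip(coefficients, features))
    # The sigmoid maps the margin to a probability between 0 and 1.
    return 1.0 / (1.0 + math.exp(-margin))


# Hypothetical coefficients in feature_cols order; these are NOT fitted values.
coefficients = [0.01, -0.05, 0.02, 0.0, 0.1]
intercept = -1.0

# Hypothetical customer: age, tenure, monthly_charges, total_charges, genderIndex.
row = [40.0, 12.0, 70.0, 840.0, 1.0]

probability = predict_churn_probability(row, coefficients, intercept)
# Predict churn (1.0) when the probability exceeds the 0.5 threshold.
prediction = 1.0 if probability > 0.5 else 0.0
print(f"P(churn) = {probability:.3f}, prediction = {prediction}")
```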

Step 5: Evaluate

from pyspark.ml.evaluation import BinaryClassificationEvaluator

predictions = lr_model.transform(test)
evaluator = BinaryClassificationEvaluator(labelCol="label", metricName="areaUnderROC")
auc = evaluator.evaluate(predictions)
print(f"AUC: {auc}")
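
To give the AUC number some intuition, the following plain-Python sketch computes it from its ranking interpretation: the fraction of (churner, non-churner) pairs in which the churner received the higher score, with ties counted as half. The labels, scores, and the function name area_under_roc are hypothetical. Spark computes the metric from a ROC curve rather than by enumerating pairs, so its result may differ slightly from this pairwise version.

```python
def area_under_roc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical test labels (1 = churned) and model scores.
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.2, 0.6, 0.7, 0.4]
auc_value = area_under_roc(labels, scores)
print(f"AUC: {auc_value:.4f}")
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why AUC is often preferred over plain accuracy when churners are a small minority of customers.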

Step 6: Save Model

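# save() raises an error if the path exists; lr_model.write().overwrite().save("churn_model") replaces it
lr_model.save("churn_model")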


Two Minute Drill
  • StringIndexer converts categorical columns to numeric indices.
  • VectorAssembler combines feature columns into a single vector.
  • Train a logistic regression model using MLlib.
  • Evaluate with AUC (area under the ROC curve).
  • Save the model for later use.
