Project 3: Predictive Model
In this project, you will build a logistic regression model with MLlib to classify customer churn. You will work through feature engineering, training, evaluation, and model persistence.
Step 1: Load and Prepare Data
from pyspark.ml.feature import VectorAssembler, StringIndexer
df = spark.read.csv("customer_data.csv", header=True, inferSchema=True)
# Convert categorical columns
indexer = StringIndexer(inputCol="gender", outputCol="genderIndex")
df = indexer.fit(df).transform(df)
Step 2: Assemble Features
feature_cols = ["age", "tenure", "monthly_charges", "total_charges", "genderIndex"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
df = assembler.transform(df)
df = df.select("features", "churn")
# Rename the target to "label", the default column name MLlib estimators expect.
# Note: if churn is stored as a string (e.g. "Yes"/"No"), index or cast it to 0.0/1.0 first.
df = df.withColumnRenamed("churn", "label")
Step 3: Train/Test Split
train, test = df.randomSplit([0.7, 0.3], seed=42)
Step 4: Train Logistic Regression
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(featuresCol="features", labelCol="label")
lr_model = lr.fit(train)
Step 5: Evaluate
from pyspark.ml.evaluation import BinaryClassificationEvaluator
predictions = lr_model.transform(test)
evaluator = BinaryClassificationEvaluator(labelCol="label", metricName="areaUnderROC")
auc = evaluator.evaluate(predictions)
print(f"AUC: {auc}")
Step 6: Save Model
lr_model.save("churn_model")
Two Minute Drill
- StringIndexer converts categorical columns to numeric indices.
- VectorAssembler combines feature columns into a single vector.
- Train a logistic regression model with MLlib's LogisticRegression estimator.
- Evaluate with AUC (area under ROC curve).
- Save model for later use.
Need more clarification?
Drop us an email at career@quipoinfotech.com
