K-Means Clustering
K‑means is an unsupervised clustering algorithm that groups data points into k clusters based on feature similarity. MLlib provides a distributed version.
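To make "groups data points into k clusters" concrete, here is a minimal single-machine sketch of the classic assign-and-update loop (Lloyd's iterations). It is an illustration only and not how MLlib implements its distributed version; the function names, the naive random initialization, and the sample rows are all assumptions made for this example.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_sketch(points, k, iterations=20, seed=42):
    """Toy k-means for illustration; hypothetical helper, not MLlib's algorithm."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # naive init: k distinct input points
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its assigned points.
        new_centers = []
        for i, members in enumerate(clusters):
            if members:
                new_centers.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centers.append(centers[i])  # leave an empty cluster's center in place
        if new_centers == centers:
            break  # centers stopped moving, so assignments are stable
        centers = new_centers
    labels = [min(range(k), key=lambda i: squared_distance(p, centers[i])) for p in points]
    return centers, labels

# Hypothetical (age, income in thousands) rows forming two well-separated groups.
points = [(25, 30), (27, 32), (26, 31), (55, 90), (57, 95), (56, 92)]
centers, labels = kmeans_sketch(points, k=2)
print(centers, labels)
```

Which group receives label 0 depends on the random initialization, so only the grouping itself is meaningful, not the label numbers.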
Preparing Data for Clustering
No label column needed – only feature vectors.
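from pyspark.ml.feature import VectorAssembler
# Combine the numeric columns into a single vector column named "features"
assembler = VectorAssembler(inputCols=["age", "income"], outputCol="features")
df_features = assembler.transform(df)
Creating and Fitting K‑Means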
from pyspark.ml.clustering import KMeans
kmeans = KMeans(featuresCol="features", k=3, seed=42)
model = kmeans.fit(df_features)
predictions = model.transform(df_features)
predictions.show()
Evaluating Clusters (Within Set Sum of Squared Errors)
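WSSSE is the sum, over all points, of the squared Euclidean distance from each point to its nearest cluster center. The sketch below computes that quantity by hand on a tiny made-up dataset; the `wssse` function and the sample points and centers are hypothetical and exist only to show what the number measures.

```python
def wssse(points, centers):
    """Sum of squared Euclidean distances from each point to its nearest center."""
    total = 0.0
    for p in points:
        total += min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
    return total

# Hypothetical 2-D points forming two tight pairs.
points = [(1.0, 1.0), (1.0, 3.0), (9.0, 1.0), (9.0, 3.0)]

two_centers = [(1.0, 2.0), (9.0, 2.0)]  # one center per pair
one_center = [(5.0, 2.0)]               # a single center in the middle

cost_two = wssse(points, two_centers)
cost_one = wssse(points, one_center)
print(cost_two)  # 4.0
print(cost_one)  # 68.0
```

Lower values mean tighter clusters, but the cost tends to keep shrinking as k grows, so it is mainly useful for comparing models rather than as an absolute score. In recent Spark versions the model's training summary reports this cost for the training data: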
wssse = model.summary.trainingCost
print(f"WSSSE: {wssse}")
Finding Optimal k (Elbow Method)
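Fit a model for each candidate k, record its WSSSE, and plot WSSSE against k. The elbow is the k beyond which adding more clusters reduces the cost only slightly.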
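Reading the elbow off a plot is a judgment call. One simple heuristic, sketched here, picks the k where the drop in cost shrinks the most (the largest second difference). The `elbow_k` helper and the cost values are hypothetical, and the heuristic assumes evenly spaced k values; treat its answer as a starting point rather than a definitive choice.

```python
def elbow_k(ks, costs):
    """Return the k with the largest second difference in cost (one simple heuristic)."""
    if len(ks) != len(costs) or len(ks) < 3:
        raise ValueError("need at least three (k, cost) pairs of equal length")
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(ks) - 1):
        # How much the improvement shrinks when moving past ks[i].
        bend = (costs[i - 1] - costs[i]) - (costs[i] - costs[i + 1])
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k

# Hypothetical WSSSE values for k = 2..6, invented for illustration.
ks = [2, 3, 4, 5, 6]
costs = [1000.0, 400.0, 350.0, 320.0, 300.0]
best = elbow_k(ks, costs)
print(best)  # 3
```

The same function could be applied to real costs gathered from MLlib, which the loop that follows collects: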
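wssse_list = []
for k in range(2, 10):
    # Separate names keep the k=3 `kmeans` and `model` fitted earlier intact
    candidate = KMeans(featuresCol="features", k=k, seed=42)
    candidate_model = candidate.fit(df_features)
    wssse_list.append(candidate_model.summary.trainingCost)
Cluster Centers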
centers = model.clusterCenters()
for idx, center in enumerate(centers):
    print(f"Cluster {idx}: {center}")
Two Minute Drill
- K‑means groups data into k clusters without labels.
- Use the `KMeans` class and set the `k` parameter.
- Evaluate with WSSSE (within‑set sum of squared errors).
- Use the elbow method to choose k.
