
K-Means Clustering

K‑means is an unsupervised clustering algorithm that groups data points into k clusters based on feature similarity. MLlib provides a distributed version.
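Before looking at the distributed version, it can help to see what the algorithm does on one machine. The sketch below is a toy plain-Python implementation of Lloyd's iterations (assign each point to its nearest center, then move each center to the mean of its cluster) on made-up 2-D points; MLlib's `KMeans` performs the same two steps in parallel across partitions.

```python
import random

def kmeans(points, k, iters=20, seed=42):
    """Toy single-machine k-means (Lloyd's algorithm) on 2-D points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # pick k initial centers from the data
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2
                                + (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        # Update step: each center moves to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = (sum(p[0] for p in cl) / len(cl),
                              sum(p[1] for p in cl) / len(cl))
    return centers

# Two well-separated groups; the centers converge to the group means.
pts = [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (8.2, 7.9)]
print(sorted(kmeans(pts, k=2)))
```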

Preparing Data for Clustering

Clustering is unsupervised, so no label column is needed – only a feature vector column.
from pyspark.ml.feature import VectorAssembler

# Combine the numeric columns into a single "features" vector column
assembler = VectorAssembler(inputCols=["age", "income"], outputCol="features")
df_features = assembler.transform(df)

Creating and Fitting K‑Means

from pyspark.ml.clustering import KMeans

kmeans = KMeans(featuresCol="features", k=3, seed=42)
model = kmeans.fit(df_features)
predictions = model.transform(df_features)
predictions.show()

Evaluating Clusters (Within Set Sum of Squared Errors)

wssse = model.summary.trainingCost
print(f"WSSSE: {wssse}")
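To see what `trainingCost` measures, the snippet below computes the same quantity by hand: the sum of squared Euclidean distances from each point to its nearest cluster center. The points and centers here are made-up illustrative values, not Spark output.

```python
# Made-up points and centers, purely to illustrate the WSSSE formula.
points  = [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (8.2, 7.9)]
centers = [(1.1, 0.9), (8.1, 7.95)]

def wssse(points, centers):
    total = 0.0
    for p in points:
        # Squared distance from the point to its nearest center.
        total += min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
    return total

print(wssse(points, centers))
```

Tighter clusters give smaller squared distances, so lower WSSSE means points sit closer to their assigned centers.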

Finding Optimal k (Elbow Method)

Loop over k values, compute WSSSE, and plot to find the elbow.
wssse_list = []
for k in range(2, 10):
    kmeans = KMeans(featuresCol="features", k=k, seed=42)
    model = kmeans.fit(df_features)
    wssse_list.append(model.summary.trainingCost)
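Instead of eyeballing a plot, the elbow can also be estimated numerically: it is the k where the WSSSE curve stops dropping sharply, i.e. where the drop between consecutive k values shrinks the most. The WSSSE values below are hypothetical placeholders; in practice they come from the loop above.

```python
# Hypothetical WSSSE values for k = 2..9 (in practice, use wssse_list
# from the loop above).
ks = list(range(2, 10))
wssse_list = [520.0, 180.0, 95.0, 80.0, 72.0, 66.0, 61.0, 57.0]

# How much WSSSE drops from each k to the next.
drops = [wssse_list[i] - wssse_list[i + 1] for i in range(len(wssse_list) - 1)]
# Second difference: how sharply the drop itself shrinks at each interior k.
curvature = [drops[i] - drops[i + 1] for i in range(len(drops) - 1)]
best_k = ks[curvature.index(max(curvature)) + 1]
print(best_k)  # elbow at k = 3 for these made-up values
```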

Cluster Centers

centers = model.clusterCenters()
for idx, center in enumerate(centers):
    print(f"Cluster {idx}: {center}")


Two Minute Drill
  • K‑means groups data into k clusters without labels.
  • Use `KMeans` class with `k` parameter.
  • Evaluate with WSSSE (within‑set sum of squared errors).
  • Use elbow method to choose optimal k.

Need more clarification?

Drop us an email at career@quipoinfotech.com