
K-Means Clustering

K‑means is an unsupervised clustering algorithm that groups data points into k clusters based on feature similarity. MLlib provides a distributed version.
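Before looking at the distributed version, it can help to see what the algorithm does on one machine. The sketch below is a toy plain-Python implementation of Lloyd's iterations (assign each point to its nearest center, then move each center to the mean of its cluster) on made-up 2-D points; MLlib's `KMeans` performs the same two steps in parallel across partitions.

```python
import random

def kmeans(points, k, iters=20, seed=42):
    """Toy single-machine k-means (Lloyd's algorithm) on 2-D points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # pick k initial centers from the data
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2
                                + (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        # Update step: each center moves to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = (sum(p[0] for p in cl) / len(cl),
                              sum(p[1] for p in cl) / len(cl))
    return centers

# Two well-separated groups; the centers converge to the group means.
pts = [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (8.2, 7.9)]
print(sorted(kmeans(pts, k=2)))
```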

Preparing Data for Clustering

Clustering is unsupervised, so no label column is needed – only a feature vector column.
from pyspark.ml.feature import VectorAssembler

# Combine the numeric columns into a single "features" vector column
assembler = VectorAssembler(inputCols=["age", "income"], outputCol="features")
df_features = assembler.transform(df)

Creating and Fitting K‑Means

from pyspark.ml.clustering import KMeans

kmeans = KMeans(featuresCol="features", k=3, seed=42)
model = kmeans.fit(df_features)
predictions = model.transform(df_features)
predictions.show()

Evaluating Clusters (Within Set Sum of Squared Errors)

wssse = model.summary.trainingCost
print(f"WSSSE: {wssse}")
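To see what `trainingCost` measures, the snippet below computes the same quantity by hand: the sum of squared Euclidean distances from each point to its nearest cluster center. The points and centers here are made-up illustrative values, not Spark output.

```python
# Made-up points and centers, purely to illustrate the WSSSE formula.
points  = [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (8.2, 7.9)]
centers = [(1.1, 0.9), (8.1, 7.95)]

def wssse(points, centers):
    total = 0.0
    for p in points:
        # Squared distance from the point to its nearest center.
        total += min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
    return total

print(wssse(points, centers))
```

Tighter clusters give smaller squared distances, so lower WSSSE means points sit closer to their assigned centers.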

Finding Optimal k (Elbow Method)

Loop over k values, compute WSSSE, and plot to find the elbow.
wssse_list = []
for k in range(2, 10):
    kmeans = KMeans(featuresCol="features", k=k, seed=42)
    model = kmeans.fit(df_features)
    wssse_list.append(model.summary.trainingCost)
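Instead of eyeballing a plot, the elbow can also be estimated numerically: it is the k where the WSSSE curve stops dropping sharply, i.e. where the drop between consecutive k values shrinks the most. The WSSSE values below are hypothetical placeholders; in practice they come from the loop above.

```python
# Hypothetical WSSSE values for k = 2..9 (in practice, use wssse_list
# from the loop above).
ks = list(range(2, 10))
wssse_list = [520.0, 180.0, 95.0, 80.0, 72.0, 66.0, 61.0, 57.0]

# How much WSSSE drops from each k to the next.
drops = [wssse_list[i] - wssse_list[i + 1] for i in range(len(wssse_list) - 1)]
# Second difference: how sharply the drop itself shrinks at each interior k.
curvature = [drops[i] - drops[i + 1] for i in range(len(drops) - 1)]
best_k = ks[curvature.index(max(curvature)) + 1]
print(best_k)  # elbow at k = 3 for these made-up values
```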

Cluster Centers

centers = model.clusterCenters()
for idx, center in enumerate(centers):
    print(f"Cluster {idx}: {center}")


Two Minute Drill
  • K‑means groups data into k clusters without labels.
  • Use `KMeans` class with `k` parameter.
  • Evaluate with WSSSE (within‑set sum of squared errors).
  • Use elbow method to choose optimal k.

Need more clarification?

Drop us an email at career@quipoinfotech.com