
Feature Transformation

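Most MLlib algorithms expect their input as a single numeric vector column (named "features" by default), so raw columns must be converted before training. The pyspark.ml.feature module provides transformers and estimators for this. The examples below assume a DataFrame df with a string column category and numeric columns age and income.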

StringIndexer – Convert Categories to Numbers

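from pyspark.ml.feature import StringIndexer

# StringIndexer is an estimator: fit() scans the column to learn the
# label-to-index mapping, and transform() appends the categoryIndex column.
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
df_indexed = indexer.fit(df).transform(df)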
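
To make the mapping concrete, here is a minimal pure-Python sketch of the default ordering (most frequent label gets index 0.0). It is an illustration only, not Spark code; the helper name fit_string_index and the sample data are hypothetical.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct label to a float index, most frequent first."""
    counts = Counter(values)
    # Sort by descending frequency; break ties alphabetically
    # (the tie-break is an assumption based on the Spark 3.x documentation).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

categories = ["a", "b", "c", "a", "a", "c"]
mapping = fit_string_index(categories)
indexed = [mapping[value] for value in categories]

print(mapping)  # {'a': 0.0, 'c': 1.0, 'b': 2.0}
print(indexed)  # [0.0, 2.0, 1.0, 0.0, 0.0, 1.0]
```

Note that the indices carry no numeric meaning: "b" is not "larger" than "c". That is why the index is usually one-hot encoded before being fed to a linear model.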

OneHotEncoder – Create Binary Columns

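from pyspark.ml.feature import OneHotEncoder

# In Spark 3.0+, OneHotEncoder is an estimator: fit it first to learn the
# number of categories, then transform with the resulting model.
encoder = OneHotEncoder(inputCol="categoryIndex", outputCol="categoryVec")
df_encoded = encoder.fit(df_indexed).transform(df_indexed)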
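
The encoder drops the last category by default (dropLast=True), so n categories produce a vector of length n - 1 and the last category is encoded as all zeros. The sketch below illustrates that rule in plain Python; the one_hot helper is hypothetical, and it returns a dense list for readability, whereas Spark stores the result as a sparse vector.

```python
def one_hot(index, num_categories, drop_last=True):
    """Encode a category index as a binary vector."""
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    position = int(index)
    # With drop_last=True the final category has no slot and stays all zeros.
    if position < size:
        vector[position] = 1.0
    return vector

print(one_hot(0.0, 3))                   # [1.0, 0.0]
print(one_hot(1.0, 3))                   # [0.0, 1.0]
print(one_hot(2.0, 3))                   # [0.0, 0.0]
print(one_hot(2.0, 3, drop_last=False))  # [0.0, 0.0, 1.0]
```

Dropping one category keeps the encoded columns from summing to a constant, which matters for models that are sensitive to collinear features.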

VectorAssembler – Combine Features into One Vector

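from pyspark.ml.feature import VectorAssembler

# VectorAssembler is a plain transformer (no fit step). It accepts numeric,
# boolean, and vector columns and concatenates them in the order given.
assembler = VectorAssembler(inputCols=["age", "income", "categoryVec"], outputCol="features")
df_features = assembler.transform(df_encoded)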
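
Conceptually, the assembler flattens each row's input columns into one list of numbers, keeping the column order. A minimal plain-Python sketch of that idea follows; the assemble helper and the sample row are hypothetical, and a Python list stands in for a Spark vector.

```python
def assemble(row, input_cols):
    """Concatenate scalar and vector columns into one flat feature list."""
    features = []
    for col in input_cols:
        value = row[col]
        if isinstance(value, (list, tuple)):
            # Vector column: append every element.
            features.extend(float(x) for x in value)
        else:
            # Scalar column: append the single value.
            features.append(float(value))
    return features

row = {"age": 34, "income": 52000.0, "categoryVec": [1.0, 0.0]}
features = assemble(row, ["age", "income", "categoryVec"])
print(features)  # [34.0, 52000.0, 1.0, 0.0]
```

Because the order of inputCols determines the position of each feature, keep that list stable between training and prediction.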

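StandardScaler – Standardize Features

from pyspark.ml.feature import StandardScaler

# By default (withStd=True, withMean=False) each feature is divided by its
# standard deviation but not centered. Pass withMean=True to also subtract
# the mean, which produces a dense output vector.
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures")
scaler_model = scaler.fit(df_features)
df_scaled = scaler_model.transform(df_features)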
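
The arithmetic behind the scaler can be sketched in plain Python. This is an illustration under stated assumptions, not Spark's implementation: the helpers fit_scaler and scale are hypothetical, the sample standard deviation (n - 1 denominator) is used, and a feature with zero deviation is mapped to 0.0.

```python
import statistics

def fit_scaler(rows):
    """Learn the per-feature mean and sample standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.stdev(col) for col in columns]
    return means, stds

def scale(rows, means, stds, with_mean=False, with_std=True):
    """Apply optional centering and unit-deviation scaling to every row."""
    scaled = []
    for row in rows:
        out = []
        for x, mean, std in zip(row, means, stds):
            if with_mean:
                x = x - mean
            if with_std:
                # Constant features would divide by zero; emit 0.0 instead.
                x = x / std if std != 0 else 0.0
            out.append(x)
        scaled.append(out)
    return scaled

rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
means, stds = fit_scaler(rows)
scaled_default = scale(rows, means, stds)
scaled_centered = scale(rows, means, stds, with_mean=True)

print(scaled_default)   # [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
print(scaled_centered)  # [[-1.0, -1.0], [0.0, 0.0], [1.0, 1.0]]
```

After scaling, both features share the same range even though the raw second feature was ten times larger, which is the point of standardizing before distance-based or gradient-based training.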


Two Minute Drill
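  • StringIndexer maps each distinct category string to a numeric index, most frequent first by default.
  • OneHotEncoder turns category indices into sparse binary vectors; in Spark 3.0+ it must be fitted before it can transform.
  • VectorAssembler combines numeric and vector columns into a single feature vector.
  • StandardScaler scales each feature to unit standard deviation and, with withMean=True, also centers it at zero.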
