Feature Transformation
Before feeding data into an ML algorithm, you must transform raw columns into feature vectors. MLlib provides several feature transformers.
StringIndexer – Convert Categories to Numbers
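With its default settings, StringIndexer orders labels by descending frequency, so the most common category gets index 0. Indices are stored as doubles, and Spark 3.x documents that ties are broken alphabetically. The plain-Python sketch below only illustrates that mapping and is not MLlib code; the helper `fit_string_index` and the sample data are made up for this example.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct label to an index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending frequency; break ties alphabetically.
    # Assumption: this mirrors the default stringOrderType="frequencyDesc".
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Hypothetical sample column
categories = ["books", "toys", "books", "games", "books", "toys"]
mapping = fit_string_index(categories)
indexed = [mapping[c] for c in categories]
print(mapping)
print(indexed)
```

Because the mapping is learned from the data, the real StringIndexer needs a `fit` step before `transform`.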
from pyspark.ml.feature import StringIndexer
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
df_indexed = indexer.fit(df).transform(df)
OneHotEncoder – Create Binary Columns
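OneHotEncoder turns each category index into a binary vector with at most one 1. With the default `dropLast=True`, the last category is encoded as an all-zeros vector, so the vector has one slot fewer than the number of categories. The sketch below shows that layout in plain Python using dense lists, whereas MLlib returns sparse vectors. The `one_hot` helper is hypothetical and does not validate out-of-range indices.

```python
def one_hot(index, num_categories, drop_last=True):
    """Return a dense one-hot list for a category index."""
    # Assumption: drop_last mirrors OneHotEncoder's dropLast parameter.
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[int(index)] = 1.0  # the dropped last category stays all zeros
    return vec

# Three categories indexed 0.0, 1.0, 2.0 (as StringIndexer would emit)
encoded = [one_hot(i, num_categories=3) for i in (0.0, 1.0, 2.0)]
print(encoded)
```

Dropping the last slot avoids a redundant column, since the last category is implied when all other slots are zero.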
from pyspark.ml.feature import OneHotEncoder
encoder = OneHotEncoder(inputCol="categoryIndex", outputCol="categoryVec")
# In Spark 3.x, OneHotEncoder is an estimator, so fit it before transforming
df_encoded = encoder.fit(df_indexed).transform(df_indexed)
VectorAssembler – Combine Features into One Vector
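VectorAssembler concatenates its input columns, in the order given by `inputCols`, into one vector. Numeric columns contribute a single slot and vector columns are flattened in place. The plain-Python sketch below imitates this for a single row. The `assemble` helper and the sample row are hypothetical, and plain lists stand in for MLlib vectors.

```python
def assemble(row, input_cols):
    """Concatenate scalar and list-valued columns into one flat list."""
    features = []
    for col in input_cols:
        value = row[col]
        if isinstance(value, (list, tuple)):
            # Vector column: append every element
            features.extend(float(v) for v in value)
        else:
            # Numeric column: occupies a single slot
            features.append(float(value))
    return features

# Hypothetical row with two numeric columns and one encoded vector column
row = {"age": 34, "income": 52000.0, "categoryVec": [0.0, 1.0]}
features = assemble(row, ["age", "income", "categoryVec"])
print(features)
```

The order of `inputCols` fixes the position of each feature in the output, which matters when reading model coefficients later.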
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=["age", "income", "categoryVec"], outputCol="features")
df_features = assembler.transform(df_encoded)
StandardScaler – Normalize Features
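By default, StandardScaler divides each feature by its sample standard deviation (`withStd=True`) and does not subtract the mean (`withMean=False`). The sketch below reproduces that arithmetic in plain Python. The `fit_standard_scaler` helper and the sample rows are hypothetical, and mapping zero-variance features to 0.0 is an assumption of this sketch.

```python
import statistics

def fit_standard_scaler(rows, with_mean=False, with_std=True):
    """Learn per-feature mean and sample std; return a transform function."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    stds = [statistics.stdev(c) for c in columns]  # divides by n - 1

    def transform(row):
        out = []
        for x, m, s in zip(row, means, stds):
            if with_mean:
                x = x - m
            if with_std:
                # Assumption: constant features become 0.0
                x = x / s if s != 0 else 0.0
            out.append(x)
        return out

    return transform

# Hypothetical feature rows
rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
scale = fit_standard_scaler(rows)
scaled = [scale(r) for r in rows]
print(scaled)
```

After scaling, both columns have a standard deviation of 1 even though the second was originally ten times larger. Passing `with_mean=True` would also center each column at zero.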
from pyspark.ml.feature import StandardScaler
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures")
scaler_model = scaler.fit(df_features)
df_scaled = scaler_model.transform(df_features)
Two Minute Drill
- StringIndexer converts categories to numbers.
- OneHotEncoder turns indexed categories into binary (one-hot) vectors.
- VectorAssembler combines multiple columns into a single feature vector.
- StandardScaler rescales each feature to unit standard deviation and, with withMean=True, also centers it at zero mean.
