Project 1: Data Cleaning
Clean a real-world dataset (e.g., sales or customer data) using PySpark. In this project, you will load a large CSV file, perform data cleaning operations (handling missing values, removing duplicates, standardizing formats), and save the cleaned dataset as Parquet for efficiency.
Step 1: Load the Data
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("DataCleaning").getOrCreate()
df = spark.read.csv("large_dataset.csv", header=True, inferSchema=True)
df.printSchema()
df.show(5)
Step 2: Handle Missing Values
from pyspark.sql.functions import col, count, mean, when
# Count missing values per column (isNull alone would print one boolean per row)
df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns]).show()
# Fill numeric columns with their mean (fillna casts the value to the column's type)
means = {c: df.select(mean(col(c))).first()[0] for c in df.columns
         if df.schema[c].dataType.typeName() in ('integer', 'long', 'float', 'double')}
df_clean = df.fillna(means)
# Fill categorical columns with 'Unknown'
for c in df.columns:
    if df.schema[c].dataType.typeName() == 'string':
        df_clean = df_clean.fillna({c: "Unknown"})
Step 3: Remove Duplicates
df_clean = df_clean.dropDuplicates()
Step 4: Standardize Text Columns (e.g., trim, lower)
from pyspark.sql.functions import trim, lower
for c in df_clean.columns:
    if df_clean.schema[c].dataType.typeName() == 'string':
        df_clean = df_clean.withColumn(c, lower(trim(col(c))))
Step 5: Save Cleaned Data
df_clean.write.parquet("cleaned_data.parquet", mode="overwrite")
Two Minute Drill
- Load CSV with `inferSchema=True` for automatic types.
- Fill missing values using column means or constants.
- Drop duplicates with `dropDuplicates()`.
- Standardize text with `trim()` and `lower()`.
- Save cleaned data as Parquet for performance.
Need more clarification?
Drop us an email at career@quipoinfotech.com
