Project 1: Data Cleaning
Clean a real-world dataset (e.g., sales or customer data) using PySpark. In this project, you will load a large CSV file, perform data cleaning operations (handling missing values, removing duplicates, standardizing formats), and save the cleaned dataset as Parquet for efficiency.
Step 1: Load the Data
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("DataCleaning").getOrCreate()
df = spark.read.csv("large_dataset.csv", header=True, inferSchema=True)
df.printSchema()
df.show(5)
Step 2: Handle Missing Values
from pyspark.sql.functions import col, count, mean, when
# Count missing values per column (isNull alone would print one boolean per row)
df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns]).show()
# Fill numeric columns with their mean (fillna casts the value to the column's type)
means = {c: df.select(mean(col(c))).first()[0] for c in df.columns
         if df.schema[c].dataType.typeName() in ('integer', 'long', 'float', 'double')}
df_clean = df.fillna(means)
# Fill categorical columns with 'Unknown'
for c in df.columns:
    if df.schema[c].dataType.typeName() == 'string':
        df_clean = df_clean.fillna({c: "Unknown"})
Step 3: Remove Duplicates
df_clean = df_clean.dropDuplicates()
Step 4: Standardize Text Columns (e.g., trim, lower)
from pyspark.sql.functions import trim, lower
for c in df_clean.columns:
    if df_clean.schema[c].dataType.typeName() == 'string':
        df_clean = df_clean.withColumn(c, lower(trim(col(c))))
Step 5: Save Cleaned Data
df_clean.write.parquet("cleaned_data.parquet", mode="overwrite")
Two Minute Drill
- Load CSV with `inferSchema=True` for automatic types.
- Fill missing values using column means or constants.
- Drop duplicates with `dropDuplicates()`.
- Standardize text with `trim()` and `lower()`.
- Save cleaned data as Parquet for performance.
Need more clarification?
Drop us an email at career@quipoinfotech.com
