Project 2: Sales Insights

In this project, you will analyze sales data to compute key metrics: total sales per product, top-selling products, monthly revenue, and customer segmentation, using joins, groupBy aggregations, and window functions.
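
The snippets below assume an active SparkSession bound to the name spark. If you run the project as a standalone script, a minimal setup sketch (the application name is just a placeholder) looks like this:

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession; the app name is arbitrary.
spark = SparkSession.builder.appName("SalesInsights").getOrCreate()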

Step 1: Load Sales Data

sales = spark.read.parquet("sales_data.parquet")
products = spark.read.parquet("products.parquet")

sales.show(5)
products.show(5)
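
Before joining, it can help to confirm that the columns used in the later steps (product_id, quantity, price, order_date, customer_id) actually exist with the expected types:

# Print the schema of each DataFrame to verify column names and types.
sales.printSchema()
products.printSchema()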

Step 2: Join Sales with Products

joined = sales.join(products, on="product_id", how="inner")
joined.select("product_name", "quantity", "price", "order_date").show(5)
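
If the products table is small compared to the sales table, you can optionally hint a broadcast join so Spark ships the product lookup to every executor instead of shuffling both sides. A sketch of the same join with the hint:

from pyspark.sql.functions import broadcast

# Broadcast the (presumably small) products dimension to avoid a shuffle.
joined = sales.join(broadcast(products), on="product_id", how="inner")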

Step 3: Total Sales per Product

from pyspark.sql.functions import sum as spark_sum, round as spark_round, col

# Revenue for each line item is quantity * price; sum it per product.
product_sales = joined.groupBy("product_name").agg(
    spark_sum(col("quantity") * col("price")).alias("total_sales")
)
product_sales = product_sales.withColumn("total_sales", spark_round(col("total_sales"), 2))
product_sales.orderBy(col("total_sales").desc()).show(10)
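
The project description also asks for top-selling products; ranking by units sold instead of revenue is a small variation on the same aggregation (reusing the imports from Step 3):

# Top 10 products by total units sold rather than by revenue.
top_by_quantity = joined.groupBy("product_name").agg(
    spark_sum("quantity").alias("units_sold")
)
top_by_quantity.orderBy(col("units_sold").desc()).show(10)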

Step 4: Monthly Revenue

from pyspark.sql.functions import date_format

# Derive a "yyyy-MM" month bucket from the order date, then total revenue per month.
joined = joined.withColumn("month", date_format("order_date", "yyyy-MM"))
monthly_revenue = joined.groupBy("month").agg(
    spark_sum(col("quantity") * col("price")).alias("revenue")
)
monthly_revenue.orderBy("month").show()
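
To bring in an actual window function, as mentioned in the project summary, you can compute the month-over-month revenue change with lag over a window ordered by month. This is a sketch built on monthly_revenue; a window without partitionBy pulls all rows into one partition, which is fine for a small monthly summary:

from pyspark.sql.functions import lag
from pyspark.sql.window import Window

# Order months chronologically so lag() returns the previous month's revenue.
w = Window.orderBy("month")
monthly_trend = (monthly_revenue
    .withColumn("prev_revenue", lag("revenue").over(w))
    .withColumn("revenue_change", col("revenue") - col("prev_revenue")))
monthly_trend.orderBy("month").show()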

Step 5: Customer Segmentation (RFM – Recency, Frequency, Monetary)

from pyspark.sql.functions import datediff, current_date, max as spark_max

# Use the joined DataFrame so quantity, price, and order_date are all available in one place.
# Recency: days since the customer's most recent order.
# Frequency: total units purchased (a simple proxy for purchase frequency).
# Monetary: total amount spent.
rfm = joined.groupBy("customer_id").agg(
    datediff(current_date(), spark_max("order_date")).alias("recency"),
    spark_sum("quantity").alias("frequency"),
    spark_sum(col("quantity") * col("price")).alias("monetary")
)
rfm.show(10)
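
To turn the raw RFM values into segments, one common approach is to score each dimension into quartiles with the ntile window function and combine the scores. This is only a sketch: the 4-bucket split and the combined rfm_segment label are arbitrary choices, and the global windows are acceptable here because the per-customer summary is small.

from pyspark.sql.functions import ntile, concat_ws
from pyspark.sql.window import Window

# Lower recency is better (order ascending); higher frequency and monetary are better (order descending).
r_window = Window.orderBy(col("recency").asc())
f_window = Window.orderBy(col("frequency").desc())
m_window = Window.orderBy(col("monetary").desc())

rfm_scored = (rfm
    .withColumn("r_score", ntile(4).over(r_window))
    .withColumn("f_score", ntile(4).over(f_window))
    .withColumn("m_score", ntile(4).over(m_window))
    .withColumn("rfm_segment", concat_ws("",
        col("r_score").cast("string"),
        col("f_score").cast("string"),
        col("m_score").cast("string"))))
rfm_scored.show(10)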


Two Minute Drill
  • Join DataFrames to enrich sales data with product details.
  • Use `groupBy` and `agg` for total sales per product.
  • Use `date_format` to extract month for time‑based aggregations.
  • Build RFM metrics for customer segmentation.

Need more clarification?

Drop us an email at career@quipoinfotech.com