Project 2: Sales Insights
In this project, you will analyze sales data with joins, groupBy aggregations, and window functions to compute key metrics: total sales per product, top-selling products, monthly revenue, and customer segmentation (RFM).
Step 1: Load Sales Data
```python
sales = spark.read.parquet("sales_data.parquet")
products = spark.read.parquet("products.parquet")

sales.show(5)
products.show(5)
```

Step 2: Join Sales with Products
```python
joined = sales.join(products, on="product_id", how="inner")
joined.select("product_name", "quantity", "price", "order_date").show(5)
```

Step 3: Total Sales per Product
```python
from pyspark.sql.functions import sum as spark_sum, round as spark_round, col

# col() is required here: "quantity" * "price" would multiply two Python strings.
product_sales = joined.groupBy("product_name").agg(
    spark_sum(col("quantity") * col("price")).alias("total_sales")
)
product_sales = product_sales.withColumn("total_sales", spark_round(col("total_sales"), 2))
product_sales.orderBy(col("total_sales").desc()).show(10)
```

Step 4: Monthly Revenue (using window functions)
```python
from pyspark.sql.functions import date_format

joined = joined.withColumn("month", date_format("order_date", "yyyy-MM"))
monthly_revenue = joined.groupBy("month").agg(
    spark_sum(col("quantity") * col("price")).alias("revenue")
)
monthly_revenue.orderBy("month").show()
```

Step 5: Customer Segmentation (RFM – Recency, Frequency, Monetary)
```python
from pyspark.sql.functions import datediff, current_date, max as spark_max

# Recency needs the most recent order date (max), not a sum of dates.
# Group the joined DataFrame so the price column from products is available.
rfm = joined.groupBy("customer_id").agg(
    datediff(current_date(), spark_max("order_date")).alias("recency"),
    spark_sum("quantity").alias("frequency"),  # total units as a simple frequency proxy
    spark_sum(col("quantity") * col("price")).alias("monetary"),
)
rfm.show(10)
```

Two Minute Drill
- Join DataFrames to enrich sales data with product details.
- Use `groupBy` and `agg` for total sales per product.
- Use `date_format` to extract month for time‑based aggregations.
- Build RFM metrics for customer segmentation.
Need more clarification?
Drop us an email at career@quipoinfotech.com
