Joining DataFrames
Data often comes from multiple sources, and joining DataFrames combines them based on a common column. PySpark supports inner, left, right, full outer, and cross joins.
Sample DataFrames
employees = spark.createDataFrame([(1,"Alice"),(2,"Bob")], ["id", "name"])
salaries = spark.createDataFrame([(1,50000),(3,60000)], ["id", "salary"])
Inner Join
inner = employees.join(salaries, on="id", how="inner")
inner.show()
# Only rows with matching id (id 1) appear
Left Join
left = employees.join(salaries, on="id", how="left")
left.show()
# All employees, salary null for id 2
Right Join
right = employees.join(salaries, on="id", how="right")
right.show()
# All salary rows, name null for id 3
Outer (Full) Join
outer = employees.join(salaries, on="id", how="outer")
outer.show()
# All rows from both sides, nulls where missing
Cross Join (Cartesian Product)
cross = employees.crossJoin(salaries)
# Use carefully – result has rows(left) x rows(right) rows!
Two Minute Drill
- Use `join()` with `how` parameter for different join types.
- Inner: only matching rows.
- Left/Right: keep all rows from one side.
- Outer: keep all rows from both sides.
- Cross join is expensive – avoid unless necessary.
Need more clarification?
Drop us an email at career@quipoinfotech.com
