Joining DataFrames
Data often comes from multiple sources, and joining DataFrames combines them based on a common column. PySpark supports inner, left, right, full outer, and cross joins.
Sample DataFrames
employees = spark.createDataFrame([(1,"Alice"),(2,"Bob")], ["id", "name"])
salaries = spark.createDataFrame([(1,50000),(3,60000)], ["id", "salary"])
Inner Join
inner = employees.join(salaries, on="id", how="inner")
inner.show()
# Only rows with matching id (id 1) appear
Left Join
left = employees.join(salaries, on="id", how="left")
left.show()
# All employees, salary null for id 2
Right Join
right = employees.join(salaries, on="id", how="right")
right.show()
# All salary rows, name null for id 3
Outer (Full) Join
outer = employees.join(salaries, on="id", how="outer")
outer.show()
# All rows from both sides, nulls where missing
Cross Join (Cartesian Product)
cross = employees.crossJoin(salaries)
# Use carefully – result has rows(left) x rows(right) rows!
Two Minute Drill
- Use `join()` with `how` parameter for different join types.
- Inner: only matching rows.
- Left/Right: keep all rows from one side.
- Outer: keep all rows from both sides.
- Cross join is expensive – avoid unless necessary.
Need more clarification?
Drop us an email at career@quipoinfotech.com
