Reading Data

PySpark can read data from various formats: CSV, JSON, Parquet, text, and more. The `spark.read` property returns a `DataFrameReader`; its format methods such as `.csv()`, `.json()`, and `.parquet()` return a DataFrame, the main data structure in PySpark.
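If no SparkSession is active yet (for example, in a standalone script rather than a notebook or spark-shell), create one first. A minimal sketch; the app name "ReadingData" is arbitrary:

from pyspark.sql import SparkSession

# Build or reuse a SparkSession; this is the entry point that exposes spark.read
spark = SparkSession.builder.appName("ReadingData").getOrCreate()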

Reading CSV

df = spark.read.csv("data.csv", header=True, inferSchema=True)
Parameters: `header=True` treats the first row as column names; `inferSchema=True` automatically detects column data types, at the cost of an extra pass over the file.
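Because schema inference requires that extra pass, large or repeatedly read datasets are often loaded with an explicit schema instead. A sketch assuming the CSV has hypothetical `name` and `age` columns:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Explicit schema skips the inference pass; the columns here are assumptions for illustration
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])
df = spark.read.csv("data.csv", header=True, schema=schema)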

Reading JSON

df = spark.read.json("data.json")
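By default, `spark.read.json` expects JSON Lines input (one JSON object per line). If the file holds a single multi-line document or a top-level array, enable the `multiLine` option:

# multiLine=True parses a JSON document that spans multiple lines
df = spark.read.json("data.json", multiLine=True)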

Reading Parquet (Recommended for Performance)

Parquet is a columnar format that stores the schema alongside the data and compresses well, so it is typically faster to read and smaller on disk than CSV or JSON.
df = spark.read.parquet("data.parquet")
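Because the format is columnar, Spark reads only the columns a query actually touches. A sketch assuming the file contains hypothetical `name` and `age` columns:

# Column pruning: only the selected columns are read from disk
df = spark.read.parquet("data.parquet").select("name", "age")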

Reading Multiple Files

Pass a folder path, a glob pattern, or a list of paths.
df = spark.read.csv("folder/*.csv", header=True)

Writing DataFrames to Files

df.write.csv("output.csv", header=True, mode="overwrite")
Mode can be `overwrite`, `append`, `ignore`, or `error` (the default, which fails if the path already exists). Note that the path names a directory of part files, not a single CSV file, because each partition is written in parallel.
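The shorthand writers are equivalent to the generic `format().save()` form noted in the drill below. A sketch with a hypothetical output path:

# Generic form: pick the format explicitly, then save to a path
df.write.format("parquet").mode("overwrite").save("output_parquet")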


Two Minute Drill
  • Use `spark.read.csv/json/parquet` to load data.
  • Pass `header=True` and `inferSchema=True` (or an explicit schema) when reading CSV.
  • Parquet is typically the most efficient format for analytics workloads.
  • Write with `df.write.format().save()` or shorthand methods.
