Reading Data

PySpark can read data from various formats: CSV, JSON, Parquet, text, and more. The `spark.read` property returns a `DataFrameReader`; its format methods such as `.csv()`, `.json()`, and `.parquet()` return a DataFrame, the main data structure in PySpark.
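If no SparkSession is active yet (for example, in a standalone script rather than a notebook or spark-shell), create one first. A minimal sketch; the app name "ReadingData" is arbitrary:

from pyspark.sql import SparkSession

# Build or reuse a SparkSession; this is the entry point that exposes spark.read
spark = SparkSession.builder.appName("ReadingData").getOrCreate()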

Reading CSV

df = spark.read.csv("data.csv", header=True, inferSchema=True)
Parameters: `header=True` treats the first row as column names; `inferSchema=True` automatically detects column data types, at the cost of an extra pass over the file.
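Because schema inference requires that extra pass, large or repeatedly read datasets are often loaded with an explicit schema instead. A sketch assuming the CSV has hypothetical `name` and `age` columns:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Explicit schema skips the inference pass; the columns here are assumptions for illustration
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])
df = spark.read.csv("data.csv", header=True, schema=schema)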

Reading JSON

df = spark.read.json("data.json")
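By default, `spark.read.json` expects JSON Lines input (one JSON object per line). If the file holds a single multi-line document or a top-level array, enable the `multiLine` option:

# multiLine=True parses a JSON document that spans multiple lines
df = spark.read.json("data.json", multiLine=True)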

Reading Parquet (Recommended for Performance)

Parquet is a columnar format that stores the schema alongside the data and compresses well, so it is typically faster to read and smaller on disk than CSV or JSON.
df = spark.read.parquet("data.parquet")
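Because the format is columnar, Spark reads only the columns a query actually touches. A sketch assuming the file contains hypothetical `name` and `age` columns:

# Column pruning: only the selected columns are read from disk
df = spark.read.parquet("data.parquet").select("name", "age")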

Reading Multiple Files

Pass a folder path, a glob pattern, or a list of paths.
df = spark.read.csv("folder/*.csv", header=True)

Writing DataFrames to Files

df.write.csv("output.csv", header=True, mode="overwrite")
Mode can be `overwrite`, `append`, `ignore`, or `error` (the default, which fails if the path already exists). Note that the path names a directory of part files, not a single CSV file, because each partition is written in parallel.
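The shorthand writers are equivalent to the generic `format().save()` form noted in the drill below. A sketch with a hypothetical output path:

# Generic form: pick the format explicitly, then save to a path
df.write.format("parquet").mode("overwrite").save("output_parquet")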


Two Minute Drill
  • Use `spark.read.csv/json/parquet` to load data.
  • Pass `header=True` and `inferSchema=True` (or an explicit schema) when reading CSV.
  • Parquet is typically the most efficient format for analytics workloads.
  • Write with `df.write.format().save()` or shorthand methods.
