Reading Data
PySpark can read data from various formats: CSV, JSON, Parquet, plain text, and more. The `spark.read` interface exposes reader methods such as `csv`, `json`, and `parquet`, each of which returns a DataFrame, the main data structure in PySpark.
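All examples below assume an active `SparkSession` named `spark`. In a standalone script you create it yourself; a minimal sketch (the application name is just an example):

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession; `spark` is the entry point used in the examples below
spark = SparkSession.builder.appName("reading-data-demo").getOrCreate()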
Reading CSV
df = spark.read.csv("data.csv", header=True, inferSchema=True)

Parameters: `header=True` uses the first row as column names; `inferSchema=True` automatically detects data types.
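Note that `inferSchema=True` makes Spark scan the file an extra time to guess types. For large or production datasets you can pass an explicit schema instead; a minimal sketch, assuming hypothetical columns `id`, `name`, and `age`:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Declare the expected columns and types up front (column names here are assumptions)
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])

df = spark.read.csv("data.csv", header=True, schema=schema)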
Reading JSON

df = spark.read.json("data.json")
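By default `spark.read.json` expects JSON Lines, with one JSON object per line. If the file holds a single JSON array or pretty-printed records, enable multiline parsing; a minimal sketch:

# multiLine=True lets Spark parse a file containing one JSON array or pretty-printed objects
df = spark.read.option("multiLine", True).json("data.json")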
Reading Parquet (Recommended for Performance)

Parquet is a columnar format with an embedded schema and built-in compression, so it is typically faster to read and smaller on disk than CSV or JSON.
df = spark.read.parquet("data.parquet")
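Because the schema is stored in the Parquet file itself, no `inferSchema` step is needed, and Spark can read only the columns you actually use. A short sketch (the column names are assumptions):

df = spark.read.parquet("data.parquet")
df.printSchema()   # schema comes straight from the Parquet file

# Selecting columns lets Spark read just those columns from disk (column pruning)
subset = df.select("name", "age")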
Reading Multiple Files

Use a folder path or a wildcard pattern; all matching files are loaded into a single DataFrame.
df = spark.read.csv("folder/*.csv", header=True)Writing DataFrames to Files
Writing DataFrames to Files

df.write.csv("output.csv", header=True, mode="overwrite")

The `mode` option can be `overwrite`, `append`, `ignore`, or `error` (the default, which fails if the path already exists). Note that Spark writes a directory of part files at the given path rather than a single file.
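The same writer works for other formats, and the longhand `format().save()` form mentioned in the drill below is equivalent to the shorthand methods. A minimal sketch, assuming the DataFrame has a `year` column to partition by:

# Shorthand: write Parquet, partitioned by an assumed `year` column
df.write.mode("overwrite").partitionBy("year").parquet("output_parquet")

# Equivalent longhand form
df.write.format("parquet").mode("overwrite").partitionBy("year").save("output_parquet")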
Two Minute Drill

- Use `spark.read.csv`, `spark.read.json`, or `spark.read.parquet` to load data.
- `header=True` and `inferSchema=True` for CSV.
- Parquet is generally the most efficient format for Spark workloads.
- Write with `df.write.format().save()` or shorthand methods.
Need more clarification?
Drop us an email at career@quipoinfotech.com
