PySpark Read and Write CSV and Parquet: Reliable IO Guide
Data projects stall when files load incorrectly, schemas drift, or writes overwrite good data. PySpark solves these problems with consistent read/write options across CSV and Parquet, but the defaults can surprise teams handling varied datasets.
PySpark’s DataFrameReader/DataFrameWriter APIs provide predictable ingestion and export. By choosing the right format and save mode, adding schema control, and partitioning smartly, teams keep pipelines stable and performant.
Want to quickly create data visualizations from a Python pandas DataFrame with no code?
PyGWalker is a Python library for exploratory data analysis with visualization. It can simplify your Jupyter Notebook analysis and visualization workflow by turning your pandas DataFrame (or Polars DataFrame) into a Tableau-style user interface for visual exploration.
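For context, a minimal sketch of how PyGWalker is typically invoked in a notebook; the file name here is just a placeholder for any pandas DataFrame you already have:

import pandas as pd
import pygwalker as pyg

df = pd.read_csv("users.csv")  # placeholder: any pandas DataFrame works
walker = pyg.walk(df)          # renders the drag-and-drop exploration UI in the notebook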
CSV vs Parquet at a glance
| Topic | CSV | Parquet |
|---|---|---|
| Schema | Inferred unless provided; type drift common | Schema stored in file; stable types |
| Compression | None by default; larger files | Columnar + compression; smaller, faster |
| Read speed | Row-based; slower for wide scans | Columnar; faster column selection |
| Best for | Interchange with external tools | Analytics, repeat reads, partitioned lakes |
Quickstart setup
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("io-guide").getOrCreate()

Reading CSV with control
df_csv = (
spark.read
.option("header", True)
.option("inferSchema", False)
.schema("id INT, name STRING, ts TIMESTAMP")
.option("delimiter", ",")
.option("mode", "DROPMALFORMED")
.csv("s3://bucket/raw/users/*.csv")
)

- `header` reads column names from the first row.
- Provide a schema to avoid type surprises and speed reads.
- `mode` choices: `PERMISSIVE` (default), `DROPMALFORMED`, `FAILFAST`; the sketch below shows the difference in practice.
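To make the modes concrete, here is a minimal sketch that keeps malformed rows for inspection instead of dropping them. It reuses the path above; the `_corrupt_record` column name is the conventional default, declared explicitly here:

permissive_df = (
    spark.read
    .option("header", True)
    .schema("id INT, name STRING, ts TIMESTAMP, _corrupt_record STRING")
    .option("mode", "PERMISSIVE")                            # keep bad rows instead of dropping or failing
    .option("columnNameOfCorruptRecord", "_corrupt_record")  # park their raw text in this column
    .csv("s3://bucket/raw/users/*.csv")
)

bad_rows = permissive_df.where("_corrupt_record IS NOT NULL")
bad_rows.show(truncate=False)  # DROPMALFORMED would silently discard these rows; FAILFAST raises on the first one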
Reading Parquet
df_parquet = spark.read.parquet("s3://bucket/curated/users/")

- Parquet stores schema and stats; `inferSchema` is unnecessary.
- Column pruning and predicate pushdown improve performance, as the sketch below shows.
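A short sketch of what that buys you, assuming the curated files carry the same `id`, `name`, and `ts` columns as the CSV example:

pruned = (
    spark.read.parquet("s3://bucket/curated/users/")
    .where("ts >= timestamp'2024-01-01 00:00:00'")  # predicate pushdown: row groups outside the range can be skipped
    .select("id", "name")                           # column pruning: unneeded columns are not decoded (ts is still read for the filter)
)
pruned.explain()  # look for PushedFilters and a narrowed ReadSchema in the scan node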
Writing CSV safely
(
df_csv
.repartition(1) # reduce output shards when needed
.write
.option("header", True)
.option("delimiter", ",")
.mode("overwrite")
.csv("s3://bucket/exports/users_csv")
)

- Use `repartition` to control file counts; avoid single huge files.
- Choose `mode` carefully (see table below).
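Compression is another easy win for CSV exports; a sketch using the writer's built-in codecs, with an illustrative output path:

(
    df_csv
    .repartition(4)                  # a few evenly sized shards instead of one giant file
    .write
    .option("header", True)
    .option("compression", "gzip")   # CSV writer codecs include gzip, bzip2, and deflate
    .mode("overwrite")
    .csv("s3://bucket/exports/users_csv_gz")
)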
Writing Parquet with partitions
(
df_parquet
.write
.mode("append")
.partitionBy("country", "ingest_date")
.parquet("s3://bucket/warehouse/users_parquet")
)

- `partitionBy` creates directories for common filters to speed queries; the read-back sketch below shows the resulting partition pruning.
- Keep partition keys low-cardinality to avoid many tiny files.
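Reading the partitioned output back with a filter on the partition columns is where the layout pays off; a sketch with illustrative filter values:

recent_us = (
    spark.read.parquet("s3://bucket/warehouse/users_parquet")
    .where("country = 'US' AND ingest_date = '2024-06-01'")  # only matching country=/ingest_date= directories are scanned
)
recent_us.explain()  # PartitionFilters in the plan confirm the pruning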
Save mode behaviors
| mode | Behavior |
|---|---|
| error / errorifexists | Default; fail if path exists |
| overwrite | Replace existing data at path |
| append | Add new files to existing path |
| ignore | Skip write if path exists |
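One nuance worth noting: with the default static behavior, overwrite on a partitioned path replaces everything under it. A sketch of dynamic partition overwrite (available since Spark 2.3), which replaces only the partitions present in the incoming DataFrame:

spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

(
    df_parquet
    .write
    .mode("overwrite")                      # now only touches partitions that df_parquet contains
    .partitionBy("country", "ingest_date")
    .parquet("s3://bucket/warehouse/users_parquet")
)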
Schema evolution tips
- Prefer Parquet when schema evolution is expected; leverage `mergeSchema` for compatible updates (sketch after this list).
- For CSV, version schemas explicitly and validate columns before writes.
- When adding columns, write Parquet with the new schema and ensure readers use the latest table definition.
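A minimal `mergeSchema` sketch, assuming older and newer files at the same path differ only by compatibly added columns:

merged = (
    spark.read
    .option("mergeSchema", True)   # reconcile compatible Parquet schemas across files
    .parquet("s3://bucket/warehouse/users_parquet")
)
merged.printSchema()  # columns added later appear as nullable for the older files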
Common pitfalls and fixes
- Mismatched delimiters: set `delimiter` explicitly; avoid relying on defaults.
- Wrong types: supply a `schema` instead of `inferSchema=True` for production.
- Too many small files: coalesce or repartition before write; consider `maxRecordsPerFile` (sketch after this list).
- Overwrite accidents: stage outputs to a temp path, then atomically move them if your storage layer supports it.
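A sketch of the small-files guard mentioned above; the row threshold is arbitrary and should be tuned to your target file size:

(
    df_parquet
    .repartition("country", "ingest_date")    # collapse each partition's rows into as few tasks as possible
    .write
    .option("maxRecordsPerFile", 1_000_000)   # ...but cap rows per file so no single output balloons
    .mode("append")
    .partitionBy("country", "ingest_date")
    .parquet("s3://bucket/warehouse/users_parquet")
)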
Minimal end-to-end recipe
schema = "user_id BIGINT, event STRING, ts TIMESTAMP, country STRING"
df = (
spark.read
.option("header", True)
.schema(schema)
.csv("s3://bucket/raw/events/*.csv")
)
cleaned = df.dropna(subset=["user_id", "event"]).withColumnRenamed("ts", "event_ts")
(
cleaned
.write
.mode("append")
.partitionBy("country")
.parquet("s3://bucket/warehouse/events_parquet")
)
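A quick sanity check on the written table, assuming the warehouse path from the recipe; the `country` filter value is just an example:

events = spark.read.parquet("s3://bucket/warehouse/events_parquet")
events.printSchema()
events.where("country = 'US'").groupBy("event").count().show()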