PySpark Read and Write CSV and Parquet: Reliable IO Guide
Data projects stall when files load incorrectly, schemas drift, or writes overwrite good data. PySpark solves these problems with consistent read/write options across CSV and Parquet, but the defaults can surprise teams handling varied datasets.
PySpark’s DataFrameReader/DataFrameWriter APIs provide predictable ingestion and export. By choosing the right format and save mode, adding schema control, and partitioning smartly, teams keep pipelines stable and performant.
Want to quickly create data visualizations from a Python pandas DataFrame with no code?
PyGWalker is a Python library for exploratory data analysis with visualization. It can simplify your Jupyter Notebook analysis and visualization workflow by turning your pandas DataFrame (or Polars DataFrame) into a Tableau-style user interface for visual exploration.
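For context, a minimal sketch of how PyGWalker is typically invoked in a notebook; the file name here is just a placeholder for any pandas DataFrame you already have:

import pandas as pd
import pygwalker as pyg

df = pd.read_csv("users.csv")  # placeholder: any pandas DataFrame works
walker = pyg.walk(df)          # renders the drag-and-drop exploration UI in the notebook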
CSV vs Parquet at a glance
| Topic | CSV | Parquet |
|---|---|---|
| Schema | Inferred unless provided; type drift common | Schema stored in file; stable types |
| Compression | None by default; larger files | Columnar + compression; smaller, faster |
| Read speed | Row-based; slower for wide scans | Columnar; faster column selection |
| Best for | Interchange with external tools | Analytics, repeat reads, partitioned lakes |
Quickstart setup
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("io-guide").getOrCreate()

Reading CSV with control
df_csv = (
spark.read
.option("header", True)
.option("inferSchema", False)
.schema("id INT, name STRING, ts TIMESTAMP")
.option("delimiter", ",")
.option("mode", "DROPMALFORMED")
.csv("s3://bucket/raw/users/*.csv")
)

- `header` reads column names from the first row.
- Provide a schema to avoid type surprises and speed reads.
- `mode` choices: `PERMISSIVE` (default), `DROPMALFORMED`, `FAILFAST`; the sketch below shows the difference in practice.
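To make the modes concrete, here is a minimal sketch that keeps malformed rows for inspection instead of dropping them. It reuses the path above; the `_corrupt_record` column name is the conventional default, declared explicitly here:

permissive_df = (
    spark.read
    .option("header", True)
    .schema("id INT, name STRING, ts TIMESTAMP, _corrupt_record STRING")
    .option("mode", "PERMISSIVE")                            # keep bad rows instead of dropping or failing
    .option("columnNameOfCorruptRecord", "_corrupt_record")  # park their raw text in this column
    .csv("s3://bucket/raw/users/*.csv")
)

bad_rows = permissive_df.where("_corrupt_record IS NOT NULL")
bad_rows.show(truncate=False)  # DROPMALFORMED would silently discard these rows; FAILFAST raises on the first one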
Reading Parquet
df_parquet = spark.read.parquet("s3://bucket/curated/users/")

- Parquet stores schema and stats; `inferSchema` is unnecessary.
- Column pruning and predicate pushdown improve performance, as the sketch below shows.
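A short sketch of what that buys you, assuming the curated files carry the same `id`, `name`, and `ts` columns as the CSV example:

pruned = (
    spark.read.parquet("s3://bucket/curated/users/")
    .where("ts >= timestamp'2024-01-01 00:00:00'")  # predicate pushdown: row groups outside the range can be skipped
    .select("id", "name")                           # column pruning: unneeded columns are not decoded (ts is still read for the filter)
)
pruned.explain()  # look for PushedFilters and a narrowed ReadSchema in the scan node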
Writing CSV safely
(
df_csv
.repartition(1) # reduce output shards when needed
.write
.option("header", True)
.option("delimiter", ",")
.mode("overwrite")
.csv("s3://bucket/exports/users_csv")
)

- Use `repartition` to control file counts; avoid single huge files.
- Choose `mode` carefully (see table below).
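Compression is another easy win for CSV exports; a sketch using the writer's built-in codecs, with an illustrative output path:

(
    df_csv
    .repartition(4)                  # a few evenly sized shards instead of one giant file
    .write
    .option("header", True)
    .option("compression", "gzip")   # CSV writer codecs include gzip, bzip2, and deflate
    .mode("overwrite")
    .csv("s3://bucket/exports/users_csv_gz")
)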
Writing Parquet with partitions
(
df_parquet
.write
.mode("append")
.partitionBy("country", "ingest_date")
.parquet("s3://bucket/warehouse/users_parquet")
)

- `partitionBy` creates directories for common filters to speed queries; the read-back sketch below shows the resulting partition pruning.
- Keep partition keys low-cardinality to avoid many tiny files.
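Reading the partitioned output back with a filter on the partition columns is where the layout pays off; a sketch with illustrative filter values:

recent_us = (
    spark.read.parquet("s3://bucket/warehouse/users_parquet")
    .where("country = 'US' AND ingest_date = '2024-06-01'")  # only matching country=/ingest_date= directories are scanned
)
recent_us.explain()  # PartitionFilters in the plan confirm the pruning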
Save mode behaviors
| mode | Behavior |
|---|---|
| error / errorifexists | Default; fail if path exists |
| overwrite | Replace existing data at path |
| append | Add new files to existing path |
| ignore | Skip write if path exists |
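One nuance worth noting: with the default static behavior, overwrite on a partitioned path replaces everything under it. A sketch of dynamic partition overwrite (available since Spark 2.3), which replaces only the partitions present in the incoming DataFrame:

spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

(
    df_parquet
    .write
    .mode("overwrite")                      # now only touches partitions that df_parquet contains
    .partitionBy("country", "ingest_date")
    .parquet("s3://bucket/warehouse/users_parquet")
)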
Schema evolution tips
- Prefer Parquet when schema evolution is expected; leverage `mergeSchema` for compatible updates (sketch after this list).
- For CSV, version schemas explicitly and validate columns before writes.
- When adding columns, write Parquet with the new schema and ensure readers use the latest table definition.
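A minimal `mergeSchema` sketch, assuming older and newer files at the same path differ only by compatibly added columns:

merged = (
    spark.read
    .option("mergeSchema", True)   # reconcile compatible Parquet schemas across files
    .parquet("s3://bucket/warehouse/users_parquet")
)
merged.printSchema()  # columns added later appear as nullable for the older files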
Common pitfalls and fixes
- Mismatched delimiters: set `delimiter` explicitly; avoid relying on defaults.
- Wrong types: supply a `schema` instead of `inferSchema=True` for production.
- Too many small files: coalesce or repartition before write; consider `maxRecordsPerFile` (sketch after this list).
- Overwrite accidents: stage outputs to a temp path, then atomically move them if your storage layer supports it.
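A sketch of the small-files guard mentioned above; the row threshold is arbitrary and should be tuned to your target file size:

(
    df_parquet
    .repartition("country", "ingest_date")    # collapse each partition's rows into as few tasks as possible
    .write
    .option("maxRecordsPerFile", 1_000_000)   # ...but cap rows per file so no single output balloons
    .mode("append")
    .partitionBy("country", "ingest_date")
    .parquet("s3://bucket/warehouse/users_parquet")
)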
Minimal end-to-end recipe
schema = "user_id BIGINT, event STRING, ts TIMESTAMP, country STRING"
df = (
spark.read
.option("header", True)
.schema(schema)
.csv("s3://bucket/raw/events/*.csv")
)
cleaned = df.dropna(subset=["user_id", "event"]).withColumnRenamed("ts", "event_ts")
(
cleaned
.write
.mode("append")
.partitionBy("country")
.parquet("s3://bucket/warehouse/events_parquet")
)
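A quick sanity check on the written table, assuming the warehouse path from the recipe; the `country` filter value is just an example:

events = spark.read.parquet("s3://bucket/warehouse/events_parquet")
events.printSchema()
events.where("country = 'US'").groupBy("event").count().show()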