
PySpark Read and Write CSV and Parquet: Reliable IO Guide


Data projects stall when files load incorrectly, schemas drift, or writes overwrite good data. PySpark solves these problems with consistent read/write options across CSV and Parquet, but the defaults can surprise teams handling varied datasets.

PySpark’s DataFrameReader/DataFrameWriter APIs provide predictable ingestion and export. By choosing the right format and save mode, adding schema control, and partitioning smartly, teams keep pipelines stable and performant.


CSV vs Parquet at a glance

| Topic | CSV | Parquet |
| --- | --- | --- |
| Schema | Inferred unless provided; type drift is common | Schema stored in the file; stable types |
| Compression | None by default; larger files | Columnar layout plus compression; smaller and faster |
| Read speed | Row-based; slower for wide scans | Columnar; faster column selection |
| Best for | Interchange with external tools | Analytics, repeat reads, partitioned lakes |

Quickstart setup

from pyspark.sql import SparkSession
 
spark = SparkSession.builder.appName("io-guide").getOrCreate()

Reading CSV with control

df_csv = (
    spark.read
    .option("header", True)
    .option("inferSchema", False)
    .schema("id INT, name STRING, ts TIMESTAMP")
    .option("delimiter", ",")
    .option("mode", "DROPMALFORMED")
    .csv("s3://bucket/raw/users/*.csv")
)
  • header reads column names from the first row.
  • Provide an explicit schema to avoid type surprises and speed up reads.
  • mode choices: PERMISSIVE (default), DROPMALFORMED, FAILFAST; a PERMISSIVE example that captures corrupt records is sketched below.
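
If you want to keep every row but still inspect the bad ones, PERMISSIVE mode can route malformed records into a dedicated column. A minimal sketch, assuming the same illustrative path as above; the _corrupt_record column name is chosen here and must also appear in the schema.

df_permissive = (
    spark.read
    .option("header", True)
    .schema("id INT, name STRING, ts TIMESTAMP, _corrupt_record STRING")
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "_corrupt_record")
    .csv("s3://bucket/raw/users/*.csv")
)

# Cache first: Spark disallows queries that reference only the corrupt-record
# column on a raw CSV source, so materialize the DataFrame before filtering.
df_permissive.cache()
bad_rows = df_permissive.filter("_corrupt_record IS NOT NULL")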

Reading Parquet

df_parquet = spark.read.parquet("s3://bucket/curated/users/")
  • Parquet stores schema and stats; inferSchema is unnecessary.
  • Column pruning and predicate pushdown improve performance, as illustrated below.
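
Because the format is columnar, reading only what you need pays off. A minimal sketch, assuming the curated data contains the id, name, and ts columns used earlier:

from pyspark.sql import functions as F

active_users = (
    spark.read.parquet("s3://bucket/curated/users/")
    .filter(F.col("ts") >= "2024-01-01")  # predicate pushdown can skip whole row groups
    .select("id", "name")                 # column pruning reads only these columns from disk
)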

Writing CSV safely

(
    df_csv
    .repartition(1)  # reduce output shards when needed
    .write
    .option("header", True)
    .option("delimiter", ",")
    .mode("overwrite")
    .csv("s3://bucket/exports/users_csv")
)
  • Use repartition to control file counts; avoid a single huge file for large exports (a compressed, multi-shard variant is sketched below).
  • Choose mode carefully (see table below).
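
For large exports, a handful of compressed shards is usually friendlier than one giant file. A minimal sketch; the shard count, gzip codec, and output path are illustrative choices, not defaults.

(
    df_csv
    .repartition(8)                    # a few evenly sized shards instead of one large file
    .write
    .option("header", True)
    .option("compression", "gzip")     # consumers must expect .csv.gz part files
    .mode("overwrite")
    .csv("s3://bucket/exports/users_csv_gz")
)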

Writing Parquet with partitions

(
    df_parquet
    .write
    .mode("append")
    .partitionBy("country", "ingest_date")
    .parquet("s3://bucket/warehouse/users_parquet")
)
  • partitionBy writes one directory per key value, so filters on those columns can skip whole directories.
  • Keep partition keys low-cardinality to avoid many tiny files; deriving a date column, as sketched below, is a common approach.
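
If the data only carries an event timestamp, derive the date partition column before writing. A minimal sketch, assuming a ts timestamp column exists on the DataFrame:

from pyspark.sql import functions as F

(
    df_parquet
    .withColumn("ingest_date", F.to_date("ts"))  # one directory per day instead of per timestamp
    .write
    .mode("append")
    .partitionBy("country", "ingest_date")
    .parquet("s3://bucket/warehouse/users_parquet")
)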

Save mode behaviors

| mode | Behavior |
| --- | --- |
| error / errorifexists | Default; fail if the path exists |
| overwrite | Replace existing data at the path |
| append | Add new files to the existing path |
| ignore | Skip the write if the path exists |
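
As a quick illustration (the target path is hypothetical), the default mode guards existing data, while ignore makes re-runs a no-op:

# Fails with an AnalysisException if the path already contains data
df_parquet.write.mode("errorifexists").parquet("s3://bucket/warehouse/users_once")

# Silently skips the write on re-runs instead of failing or duplicating data
df_parquet.write.mode("ignore").parquet("s3://bucket/warehouse/users_once")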

Schema evolution tips

  • Prefer Parquet when schema evolution is expected; leverage mergeSchema for compatible updates (sketched below).
  • For CSV, version schemas explicitly and validate columns before writes.
  • When adding columns, write Parquet with the new schema and ensure readers use the latest table definition.
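
A minimal sketch of reading a Parquet path whose files were written with compatible but different schemas; mergeSchema reconciles them into one superset schema:

evolved = (
    spark.read
    .option("mergeSchema", "true")
    .parquet("s3://bucket/warehouse/users_parquet")
)
evolved.printSchema()  # shows the union of columns across all files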

Common pitfalls and fixes

  • Mismatched delimiters: set delimiter explicitly; avoid relying on defaults.
  • Wrong types: supply an explicit schema instead of relying on inferSchema=True in production.
  • Too many small files: coalesce or repartition before the write; consider maxRecordsPerFile (sketched below).
  • Overwrite accidents: stage outputs to a temp path, then atomically move if your storage layer supports it.
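
To cap output file sizes, the writer accepts a maxRecordsPerFile option. The sketch below uses an illustrative limit and a staging path; tune both to your dataset and storage layer.

(
    df_parquet
    .write
    .option("maxRecordsPerFile", 1_000_000)  # split oversized partitions into bounded files
    .mode("append")
    .parquet("s3://bucket/warehouse/users_parquet_tmp")  # stage, then move once the write succeeds
)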

Minimal end-to-end recipe

schema = "user_id BIGINT, event STRING, ts TIMESTAMP, country STRING"
 
df = (
    spark.read
    .option("header", True)
    .schema(schema)
    .csv("s3://bucket/raw/events/*.csv")
)
 
cleaned = df.dropna(subset=["user_id", "event"]).withColumnRenamed("ts", "event_ts")
 
(
    cleaned
    .write
    .mode("append")
    .partitionBy("country")
    .parquet("s3://bucket/warehouse/events_parquet")
)