Pandas read_csv: The Definitive Guide to pd.read_csv() in Python
Every data analysis project starts with loading data. In the Python ecosystem, pd.read_csv() is the single most-used function for bringing external data into a pandas DataFrame. Whether you are importing a 50-row lookup table or streaming through a 20 GB log file, pandas.read_csv() has parameters to handle it.
The problem is that read_csv() has over 50 parameters, and the defaults do not always match your data. Wrong encoding turns text into garbage. Uncontrolled type inference silently converts IDs to floats. Skipping the usecols parameter loads columns you never need, wasting memory and time. These issues add up fast in production pipelines.
This guide covers the parameters that matter most, with working code examples, performance benchmarks, and solutions to every common error. All examples target Pandas 2.0+ and are tested against the latest stable release.
Basic Usage of pd.read_csv()
At its simplest, pd.read_csv() takes a file path and returns a DataFrame:
import pandas as pd
df = pd.read_csv("sales_data.csv")
print(df.shape)
df.head()
This single line handles comma detection, header inference, and basic type guessing. For clean CSV files, it works out of the box. For everything else, you need the parameters described below.
Complete Parameters Reference
The table below covers the 20 most useful parameters. These handle roughly 95% of real-world CSV loading scenarios.
| Parameter | Type | Default | Description |
|---|---|---|---|
| filepath_or_buffer | str / path / file-like | Required | Path to the CSV file, URL, or file-like object |
| sep | str | ',' | Column delimiter. Use '\t' for TSV, ';' for European CSV |
| header | int / list of int / None | 0 (first row) | Row number(s) to use as column names |
| names | list | None | Explicit column names. Overrides header row |
| index_col | int / str / list / False | None | Column(s) to set as the DataFrame index |
| usecols | list / callable | None | Subset of columns to load. Reduces memory significantly |
| dtype | dict / type | None | Data types for columns. Prevents unwanted inference |
| converters | dict | None | Functions to apply to specific columns during parsing |
| parse_dates | bool / list / dict | False | Columns to parse as datetime objects |
| date_format | str / dict | None | Datetime format string (Pandas 2.0+). Replaces deprecated date_parser |
| na_values | str / list / dict | None | Additional strings to treat as NaN |
| keep_default_na | bool | True | Whether to include default NaN markers (empty string, NA, NULL, etc.) |
| encoding | str | None | File encoding. Common values: 'utf-8', 'latin-1', 'cp1252' |
| encoding_errors | str | 'strict' | How to handle encoding errors: 'strict', 'ignore', 'replace' |
| engine | str | None | Parser engine: 'c', 'python', or 'pyarrow' |
| chunksize | int | None | Number of rows per chunk. Returns an iterator instead of a DataFrame |
| nrows | int | None | Number of rows to read from the start of the file |
| skiprows | int / list / callable | None | Rows to skip at the beginning of the file |
| on_bad_lines | str | 'error' | Action for malformed rows: 'error', 'warn', 'skip' |
| compression | str | 'infer' | Decompression: 'gzip', 'bz2', 'zip', 'xz', 'zstd', or 'infer' |
Setting Index Columns
By default, pandas creates a numeric index (0, 1, 2...). If your CSV has a natural identifier column, set it as the index during import to avoid doing it later.
# Set a single column as the index
df = pd.read_csv("employees.csv", index_col="employee_id")
# Set a multi-level index
df = pd.read_csv("sales.csv", index_col=["region", "store_id"])
Setting index_col during import is more concise than calling df.set_index() afterward, and typically slightly faster, since pandas assigns the index while parsing rather than in a separate post-processing step.
Selecting Specific Columns with usecols
Loading only the columns you need is one of the easiest performance wins:
# Select columns by name
df = pd.read_csv("transactions.csv", usecols=["date", "amount", "category"])
# Select columns by position
df = pd.read_csv("transactions.csv", usecols=[0, 3, 5])
# Select columns using a function
df = pd.read_csv("transactions.csv", usecols=lambda col: col.startswith("price"))
On a file with 100 columns where you only need 5, usecols can reduce load time by 60-80% and memory by a similar margin.
Data Type Control with dtype and converters
Pandas infers types by scanning the data, but this inference is not always correct. Zip codes become integers, losing leading zeros. ID columns become float64 when any value is missing.
Specifying dtypes directly
df = pd.read_csv("customers.csv", dtype={
"zip_code": "str",
"customer_id": "Int64", # Nullable integer (capital I)
"revenue": "float32",
"is_active": "boolean", # Nullable boolean
})
Using converters for custom logic
df = pd.read_csv("products.csv", converters={
"price": lambda x: float(x.replace("$", "").replace(",", "")),
"sku": lambda x: x.strip().upper(),
})
The key difference: dtype is applied at the engine level and is fast. converters call Python functions per row and are slower but more flexible. Use dtype when you can, converters when you must.
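As a runnable sketch of the converters behavior above -- the in-memory buffer stands in for a hypothetical products.csv:

```python
import io
import pandas as pd

# Hypothetical data: a currency-formatted price and an untidy SKU column
raw = io.StringIO('sku,price\n ab-1 ,"$1,234.50"\nxy-2,$99.00\n')

df = pd.read_csv(raw, converters={
    # Strip the dollar sign and thousands separator, then cast to float
    "price": lambda x: float(x.replace("$", "").replace(",", "")),
    # Normalize SKUs: trim whitespace, uppercase
    "sku": lambda x: x.strip().upper(),
})
print(df["price"].tolist())  # [1234.5, 99.0]
print(df["sku"].tolist())    # ['AB-1', 'XY-2']
```

Each converter receives the raw string for its column, which is why this handles formatting that dtype alone cannot.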
Date Parsing (parse_dates and date_format)
Parsing dates during import is cleaner than converting columns afterward with pd.to_datetime().
Basic date parsing
df = pd.read_csv("orders.csv", parse_dates=["order_date", "ship_date"])
Combining multiple columns into one date (Pandas 1.x style)
# If your CSV has separate year, month, day columns
df = pd.read_csv("log.csv", parse_dates={"timestamp": ["year", "month", "day"]})
Note that this dict form of parse_dates was deprecated in later Pandas 2.x releases and removed in Pandas 3.0. In current versions, combine the columns after loading with pd.to_datetime() instead.
Specifying date format (Pandas 2.0+)
The date_parser parameter was deprecated in Pandas 2.0. Use date_format instead:
df = pd.read_csv("events.csv",
parse_dates=["event_date"],
date_format="%d/%m/%Y" # day/month/year format
)
# Different formats for different columns
df = pd.read_csv("mixed.csv",
parse_dates=["created", "updated"],
date_format={"created": "%Y-%m-%d", "updated": "%m/%d/%Y %H:%M"}
)
Specifying the format explicitly is 5-10x faster than letting pandas guess, because it skips the format inference step.
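Since the Pandas 1.x dict style of combining date-part columns no longer works in newer releases, here is a sketch of the post-load equivalent; the in-memory buffer stands in for a hypothetical log.csv with year/month/day columns:

```python
import io
import pandas as pd

# Stand-in for a CSV with separate date-part columns
raw = io.StringIO("year,month,day,event\n2024,3,15,login\n2024,3,16,logout\n")

df = pd.read_csv(raw)
# pd.to_datetime accepts a DataFrame with year/month/day columns
df["timestamp"] = pd.to_datetime(df[["year", "month", "day"]])
df = df.drop(columns=["year", "month", "day"])
print(df["timestamp"].dt.strftime("%Y-%m-%d").tolist())
# ['2024-03-15', '2024-03-16']
```

This is also easier to debug than in-parser combining, since the intermediate columns are inspectable before assembly.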
Handling Missing Values
Custom NA markers
df = pd.read_csv("survey.csv", na_values=["N/A", "n/a", "-", "missing", "?", "999"])
Per-column NA values
df = pd.read_csv("data.csv", na_values={
"age": ["unknown", "-1"],
"salary": ["N/A", "0"],
})
Disabling default NA detection
By default, pandas treats over a dozen strings as NaN, including empty strings, "NA", "NULL", and "NaN". If your data legitimately contains these strings, disable the defaults:
df = pd.read_csv("data.csv", keep_default_na=False, na_values=[""])
This tells pandas to treat only empty strings as NaN and keep everything else as-is. (Any NaN values that do get through can still be filled afterward with fillna().)
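A small demonstration of why this matters -- the made-up data below contains Namibia's ISO country code, which is literally "NA":

```python
import io
import pandas as pd

# Hypothetical data where the string "NA" is a valid value, not missing data
raw = io.StringIO("country,code\nNamibia,NA\nFrance,FR\nUnknown,\n")

# Keep "NA" as a real string; treat only the empty field as missing
df = pd.read_csv(raw, keep_default_na=False, na_values=[""])
print(df["code"].tolist())  # ['NA', 'FR', nan]
```

With the defaults, Namibia's code would silently become NaN, which is a classic source of data-quality bugs.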
Encoding Issues (encoding and encoding_errors)
Encoding problems are the most common read_csv error. The error message usually looks like: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9.
Standard fix: try common encodings
# Most files are UTF-8
df = pd.read_csv("data.csv", encoding="utf-8")
# European/Latin systems often use latin-1
df = pd.read_csv("data.csv", encoding="latin-1")
# Windows systems often use cp1252
df = pd.read_csv("data.csv", encoding="cp1252")
Handling errors gracefully (Pandas 1.3+)
# Replace undecodable bytes with a replacement character
df = pd.read_csv("data.csv", encoding="utf-8", encoding_errors="replace")
# Skip undecodable bytes entirely
df = pd.read_csv("data.csv", encoding="utf-8", encoding_errors="ignore")
Detecting encoding automatically
import chardet
with open("unknown.csv", "rb") as f:
result = chardet.detect(f.read(100000))
print(result) # {'encoding': 'ISO-8859-1', 'confidence': 0.73}
df = pd.read_csv("unknown.csv", encoding=result["encoding"])
Different Delimiters (sep and delimiter)
Not all "CSV" files use commas. Tab-separated, semicolon-separated, and pipe-separated files are common.
# Tab-separated
df = pd.read_csv("data.tsv", sep="\t")
# Semicolon (common in European locales)
df = pd.read_csv("data.csv", sep=";")
# Pipe-delimited
df = pd.read_csv("data.txt", sep="|")
# Auto-detect delimiter (uses the Python engine, slower)
df = pd.read_csv("mystery.csv", sep=None, engine="python")
When using sep=None, pandas switches to the Python engine and uses Python's built-in csv.Sniffer to detect the delimiter. This is convenient but significantly slower on large files.
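If you want auto-detection without giving up the fast C engine, one possible pattern is to sniff the delimiter yourself on a small sample and pass the result to sep. The sample file written here is made up for the demo:

```python
import csv
import pandas as pd

# Write a small stand-in file (semicolon-delimited) for demonstration
sample_path = "mystery_sample.csv"
with open(sample_path, "w", newline="") as f:
    f.write("id;name;score\n1;alice;90\n2;bob;85\n")

# Sniff the delimiter from the first few KB only, restricting to
# plausible candidates to keep the sniffer reliable
with open(sample_path, newline="") as f:
    dialect = csv.Sniffer().sniff(f.read(4096), delimiters=";,|\t")

# Now parse the whole file with the default (fast) C engine
df = pd.read_csv(sample_path, sep=dialect.delimiter)
print(dialect.delimiter, df.shape)  # ';' (2, 3)
```

This keeps the convenience of detection while paying the Python-engine cost only on a 4 KB sample, not the full file.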
Reading Large Files (chunksize, nrows, skiprows)
When a file is too large to fit in memory, or you only need a sample, these parameters help.
Preview the first N rows
# Load only the first 1,000 rows for exploration
df_sample = pd.read_csv("big_file.csv", nrows=1000)
Stream in chunks
# Process 100,000 rows at a time
chunk_iter = pd.read_csv("big_file.csv", chunksize=100_000)
results = []
for chunk in chunk_iter:
filtered = chunk[chunk["status"] == "active"]
results.append(filtered)
df = pd.concat(results, ignore_index=True)
Skip rows
# Skip the first 5 rows (e.g., metadata or comments)
df = pd.read_csv("data.csv", skiprows=5)
# Skip specific rows by index
df = pd.read_csv("data.csv", skiprows=[0, 2, 4])
# Skip rows using a condition (skip all odd rows)
df = pd.read_csv("data.csv", skiprows=lambda i: i > 0 and i % 2 == 1)
Combined strategy for very large files
df = pd.read_csv(
"huge_file.csv",
usecols=["user_id", "event", "timestamp"],
dtype={"user_id": "int32", "event": "category"},
parse_dates=["timestamp"],
engine="pyarrow",
dtype_backend="pyarrow",
)
Engine Selection: C vs Python vs PyArrow
Pandas offers three parsing engines, each with different tradeoffs.
Performance Comparison
| Engine | Speed | Memory | Regex sep | Auto-detect sep | PyArrow dtypes | Best For |
|---|---|---|---|---|---|---|
| 'c' (default) | Fast | Moderate | No | No | No | Standard CSV files |
| 'python' | Slow | High | Yes | Yes | No | Irregular formats, regex delimiters |
| 'pyarrow' | Fastest | Low | No | No | Yes | Large files, Pandas 2.0+ |
Benchmark: 1 GB CSV file with 10 million rows
| Engine + Configuration | Load Time | Peak Memory |
|---|---|---|
| engine='c' (default) | 18.2s | 3.8 GB |
| engine='c' + dtype specified | 14.6s | 2.9 GB |
| engine='pyarrow' | 6.1s | 2.4 GB |
| engine='pyarrow' + dtype_backend='pyarrow' | 5.3s | 1.6 GB |
The PyArrow engine is the best choice for most workloads in 2026. It requires pyarrow to be installed (pip install pyarrow).
# Fastest configuration for large files
df = pd.read_csv(
"large_data.csv",
engine="pyarrow",
dtype_backend="pyarrow",
)
Note: The PyArrow engine does not support all parameters. In particular, it does not support converters, chunksize, skipfooter, regex sep, or on_bad_lines='warn'. Fall back to the C or Python engine when you need those features.
Reading Compressed Files
Pandas can read compressed CSV files directly without manual decompression. By default, it infers the compression from the file extension.
# Gzip -- most common for large CSV files
df = pd.read_csv("data.csv.gz")
# Zip archive (reads the first file inside the archive)
df = pd.read_csv("data.zip")
# Bzip2
df = pd.read_csv("data.csv.bz2")
# Zstandard (fast compression/decompression)
df = pd.read_csv("data.csv.zst")
# Explicit compression when extension does not match
df = pd.read_csv("data.dat", compression="gzip")
Reading compressed files is slightly slower than uncompressed due to decompression overhead, but the reduced disk I/O usually compensates. For archival storage, gzip or zstandard are strongly recommended.
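A quick round-trip sketch showing the inference in both directions (the file path and data are made up for the demo):

```python
import os
import tempfile
import pandas as pd

# Write a gzipped CSV: compression is inferred from the .gz extension
path = os.path.join(tempfile.gettempdir(), "demo.csv.gz")
df_out = pd.DataFrame({"id": [1, 2, 3], "value": [10.5, 20.1, 30.7]})
df_out.to_csv(path, index=False)

# Read it back: decompression is inferred the same way
df_in = pd.read_csv(path)
print(df_in.equals(df_out))  # True
```

The same inference applies to .bz2, .zip, .xz, and .zst extensions on both to_csv and read_csv.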
Reading CSV from URLs
pd.read_csv() accepts any URL that returns CSV content:
# Read directly from a URL
url = "https://raw.githubusercontent.com/datasets/covid-19/main/data/countries-aggregated.csv"
df = pd.read_csv(url)
# Read from S3 (requires s3fs package)
df = pd.read_csv("s3://my-bucket/data/sales.csv")
# Read from Google Cloud Storage (requires gcsfs)
df = pd.read_csv("gs://my-bucket/data/sales.csv")
# Read from Azure Blob Storage (requires adlfs)
df = pd.read_csv("abfs://container@account.dfs.core.windows.net/data.csv")
For authentication with cloud storage, install the corresponding filesystem library and configure credentials through environment variables or the library's configuration.
Header Handling (header and names)
Files with no header row
df = pd.read_csv("data.csv", header=None, names=["id", "name", "value"])
Files with multi-row headers
# Use two rows as a multi-level header
df = pd.read_csv("report.csv", header=[0, 1])
Skipping the existing header and providing your own
df = pd.read_csv("data.csv", header=0, names=["col_a", "col_b", "col_c"])
When you specify both header=0 and names, pandas reads row 0 as the header, then replaces those names with the ones you provide. For renaming columns after loading, see the pandas rename column guide.
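A minimal in-memory demonstration of that header=0 + names interaction, using made-up data:

```python
import io
import pandas as pd

# The file's own header row ("a,b,c") is consumed and then discarded
raw = io.StringIO("a,b,c\n1,2,3\n4,5,6\n")

df = pd.read_csv(raw, header=0, names=["col_a", "col_b", "col_c"])
print(list(df.columns))  # ['col_a', 'col_b', 'col_c']
print(len(df))           # 2 -- the original header row did not become data
```

Contrast this with header=None plus names, where the first file row would be treated as data instead of being discarded.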
Performance Optimization
Loading CSV files is often the slowest step in a data pipeline. These techniques can reduce load time by 50-90%.
1. Use the PyArrow engine with Arrow-backed dtypes
df = pd.read_csv(
"data.csv",
engine="pyarrow",
dtype_backend="pyarrow",
)
print(df.dtypes)
# Output: columns show types like int64[pyarrow], string[pyarrow]
Arrow-backed dtypes use less memory than NumPy defaults and handle missing values natively (no more float64 for integer columns with NaN).
2. Load only the columns you need
# Instead of loading 50 columns and selecting later:
df = pd.read_csv("wide_data.csv", usecols=["id", "date", "metric_1", "metric_2"])
3. Specify dtypes to skip inference
df = pd.read_csv("data.csv", dtype={
"id": "int32",
"category": "category", # Categorical uses far less memory for repeated strings
"amount": "float32",
"flag": "bool",
})
The category dtype is especially effective for columns with few unique values. A column with 10 million rows but only 50 unique strings can drop from 500 MB to 5 MB.
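You can verify the savings yourself with memory_usage(deep=True); this sketch uses a synthetic one-million-row column with only three distinct strings:

```python
import pandas as pd

# One million rows drawn from just three distinct strings
s_obj = pd.Series(["red", "green", "blue"] * 333_334)
s_cat = s_obj.astype("category")

# deep=True accounts for the actual string payloads, not just pointers
obj_mb = s_obj.memory_usage(deep=True) / 1e6
cat_mb = s_cat.memory_usage(deep=True) / 1e6
print(f"object: {obj_mb:.1f} MB, category: {cat_mb:.1f} MB")
```

The categorical version stores each row as a small integer code plus one copy of each unique string, which is where the order-of-magnitude reduction comes from.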
4. Stream with chunksize for out-of-memory files
import pandas as pd
agg_results = {}
for chunk in pd.read_csv("huge.csv", chunksize=500_000, usecols=["region", "sales"]):
partial = chunk.groupby("region")["sales"].sum()
for region, total in partial.items():
agg_results[region] = agg_results.get(region, 0) + total
summary = pd.Series(agg_results)
print(summary)
5. Consider Parquet for repeated reads
If you read the same CSV file more than once, convert it to Parquet first. (For writing DataFrames back to CSV, see the pandas DataFrame to CSV guide.)
# One-time conversion
df = pd.read_csv("data.csv", engine="pyarrow", dtype_backend="pyarrow")
df.to_parquet("data.parquet", engine="pyarrow")
# All future reads (5-10x faster, 3-5x smaller)
df = pd.read_parquet("data.parquet")
Common Errors and Fixes
UnicodeDecodeError
Error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 42
Cause: The file uses a different encoding than UTF-8 (commonly latin-1, cp1252, or shift_jis).
# Fix: Try common encodings
df = pd.read_csv("data.csv", encoding="latin-1")
# Or handle gracefully
df = pd.read_csv("data.csv", encoding="utf-8", encoding_errors="replace")
ParserError: Error tokenizing data
Error: ParserError: Error tokenizing data. C error: Expected 5 fields in line 42, saw 7
Cause: Some rows have more columns than the header, usually due to unquoted commas in text fields.
# Fix 1: Skip bad lines
df = pd.read_csv("data.csv", on_bad_lines="skip")
# Fix 2: Use the Python engine, which handles more edge cases
df = pd.read_csv("data.csv", engine="python", on_bad_lines="skip")
# Fix 3: If the issue is a different delimiter
df = pd.read_csv("data.csv", sep=";")
MemoryError
Error: MemoryError or the process gets killed by the OS.
Cause: The file is too large to load into memory as a single DataFrame.
# Fix 1: Load only needed columns with efficient types
df = pd.read_csv("huge.csv",
usecols=["id", "date", "value"],
dtype={"id": "int32", "value": "float32"},
engine="pyarrow",
dtype_backend="pyarrow",
)
# Fix 2: Process in chunks
for chunk in pd.read_csv("huge.csv", chunksize=200_000):
process(chunk)
# Fix 3: Use a sample
df = pd.read_csv("huge.csv", nrows=100_000)
DtypeWarning: Columns have mixed types
Warning: DtypeWarning: Columns (4,7) have mixed types. Specify dtype option on import.
Cause: A column has both numeric and string values (e.g., an ID column where one row contains "N/A").
# Fix: Specify the dtype explicitly
df = pd.read_csv("data.csv", dtype={"column_4": "str", "column_7": "str"})
# Or set low_memory=False to read the entire column before inferring
df = pd.read_csv("data.csv", low_memory=False)
read_csv vs read_excel vs read_parquet
Choosing the right reader depends on your data source and requirements.
| Feature | read_csv() | read_excel() | read_parquet() |
|---|---|---|---|
| Speed (1M rows) | Moderate (2-5s) | Slow (15-30s) | Fast (0.3-1s) |
| File size (1M rows) | 80-200 MB | 50-150 MB | 15-40 MB |
| Type preservation | No (re-inferred) | Partial | Full |
| Schema included | No | No | Yes |
| Human readable | Yes | Yes (with Excel) | No |
| Streaming (chunksize) | Yes | No | Yes (row groups) |
| Column selection before load | Yes (usecols) | Yes (usecols) | Yes (columns) |
| Dependencies | None | openpyxl | pyarrow or fastparquet |
| Best for | Data exchange, exports | Business reports, shared files | Analytics pipelines, data lakes |
Rule of thumb: Use CSV for data exchange with non-Python systems. Use Parquet for internal analytics pipelines. Use read_excel() only when required by stakeholders who work in spreadsheets.
Visualize Your Data After Loading
Once your CSV data is loaded into a DataFrame, the next step is usually exploration and visualization. Instead of writing matplotlib or seaborn code from scratch, PyGWalker turns any DataFrame into an interactive visual interface directly inside Jupyter Notebook.
import pandas as pd
import pygwalker as pyg
df = pd.read_csv("sales.csv", engine="pyarrow", dtype_backend="pyarrow")
walker = pyg.walk(df)
Drag and drop columns to create charts instantly -- no chart code needed. This is especially useful for quick data profiling right after import. PyGWalker works with both pandas and Polars DataFrames.
For more advanced data analysis workflows in Jupyter, RunCell provides AI-powered assistance that can help you write data loading code, fix errors, and automate repetitive data preparation tasks.
FAQ
What does pd.read_csv() do?
pd.read_csv() is a function in the pandas library that reads a comma-separated values (CSV) file and loads it into a pandas DataFrame. It supports dozens of parameters for controlling delimiters, data types, date parsing, encoding, and memory usage. It can also read from URLs, compressed files, and cloud storage.
How do I read a CSV file with a different delimiter?
Use the sep parameter. For tab-separated files, use sep='\t'. For semicolons, use sep=';'. For pipes, use sep='|'. You can also set sep=None with engine='python' to let pandas auto-detect the delimiter, though this is slower.
Why is pd.read_csv() slow on large files?
The default C engine and NumPy-based type inference add overhead on large files. To speed it up: use engine='pyarrow' with dtype_backend='pyarrow' for 2-3x faster parsing; specify usecols to load only needed columns; set explicit dtype to skip type inference; and use chunksize for streaming processing of files that exceed available memory.
How do I fix UnicodeDecodeError in pd.read_csv()?
This error means the file uses a different encoding than the default UTF-8. Try encoding='latin-1' or encoding='cp1252' as these are the most common alternatives. You can also use encoding_errors='replace' to substitute unreadable characters with a placeholder instead of raising an error.
Should I use read_csv or read_parquet for data analysis?
Use read_csv() when receiving data from external systems or when the source file is in CSV format. For repeated reads in an analytics pipeline, convert the CSV to Parquet once with df.to_parquet() and then use read_parquet() going forward. Parquet is 5-10x faster to read, 3-5x smaller on disk, and preserves column types without re-inference.
Related Guides
- Pandas to_datetime: Parse and Convert Dates -- convert date columns after loading
- Pandas fillna: Handle Missing Values -- fill NaN values from CSV imports
- Pandas read_excel: Load Excel Files -- load .xlsx files instead of CSV
- Pandas DataFrame to CSV -- write DataFrames back to CSV files
- Pandas Rename Column -- rename columns after import