Pandas Read Excel: How to Import Excel Files in Python

Excel files remain the most common format for storing tabular data in business environments, yet importing them into Python for analysis often creates friction. You need to handle multiple sheets, inconsistent data types, missing values, and memory constraints, all while maintaining data integrity. The pandas read_excel() function solves these challenges, but only if you know how to configure its more than two dozen parameters correctly.

This guide shows you how to import Excel files into pandas DataFrames efficiently, from basic single-sheet imports to advanced scenarios like reading multiple sheets simultaneously, handling large files, and troubleshooting common errors.


Basic Usage of pandas read_excel()

The simplest way to read an Excel file into a pandas DataFrame requires just the file path:

import pandas as pd
 
# Read the first sheet of an Excel file
df = pd.read_excel('data.xlsx')
print(df.head())

This reads the first sheet by default, treating the first row as column headers. Pandas infers the file format from the extension (.xlsx, .xls, .xlsb, .ods) and selects the appropriate engine automatically.

For files in different directories, use absolute or relative paths:

# Absolute path
df = pd.read_excel('/Users/username/Documents/sales_data.xlsx')
 
# Relative path
df = pd.read_excel('../data/sales_data.xlsx')
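
pathlib.Path objects also work anywhere a path string is accepted, which keeps paths portable across operating systems:

from pathlib import Path
import pandas as pd
 
# Build the path with pathlib instead of hard-coding separators
data_file = Path('..') / 'data' / 'sales_data.xlsx'
df = pd.read_excel(data_file)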

Understanding Key Parameters

The read_excel() function provides extensive control over how data is imported. Here's a reference table of the most important parameters:

| Parameter | Type | Description | Example |
|---|---|---|---|
| sheet_name | str, int, list, None | Specifies which sheet(s) to read | sheet_name='Sales' or sheet_name=0 |
| header | int, list, None | Row(s) to use as column names | header=0 (default) or header=None |
| usecols | str, list | Columns to parse | usecols='A:D' or usecols=[0, 1, 2] |
| dtype | dict, Type | Data type for columns | dtype={'ID': str, 'Age': int} |
| skiprows | int, list | Rows to skip at start | skiprows=2 or skiprows=[0, 2, 5] |
| nrows | int | Number of rows to read | nrows=1000 |
| na_values | scalar, str, list, dict | Values to recognize as NaN | na_values=['NA', 'N/A', ''] |
| parse_dates | bool, list, dict | Columns to parse as dates | parse_dates=['Date'] |
| engine | str | Excel engine to use | engine='openpyxl' or engine='xlrd' |
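
These parameters combine freely. As a sketch, assuming a hypothetical report.xlsx whose 'Summary' sheet has two banner rows above the header and an ID column with leading zeros:

import pandas as pd
 
# Hypothetical file layout: banner rows 0-1, header on row 2
df = pd.read_excel('report.xlsx',
                   sheet_name='Summary',      # select the sheet by name
                   header=2,                  # third row holds the column names
                   usecols='A:E',             # first five columns only
                   dtype={'ID': str},         # preserve leading zeros
                   na_values=['N/A', '-'],    # extra missing-value markers
                   nrows=500)                 # cap the number of rows read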

Working with Multiple Sheets

Reading a Specific Sheet

Specify sheets by name or index (0-based):

# Read by sheet name
df = pd.read_excel('financial_report.xlsx', sheet_name='Q1 Revenue')
 
# Read by index (0 = first sheet)
df = pd.read_excel('financial_report.xlsx', sheet_name=0)

Reading Multiple Sheets at Once

Pass a list of sheet names or indices to import multiple sheets into a dictionary of DataFrames:

# Read multiple specific sheets
sheets_dict = pd.read_excel('annual_report.xlsx',
                            sheet_name=['Q1', 'Q2', 'Q3', 'Q4'])
 
# Access individual DataFrames
q1_df = sheets_dict['Q1']
q2_df = sheets_dict['Q2']
 
print(f"Q1 shape: {q1_df.shape}")
print(f"Q2 shape: {q2_df.shape}")

Reading All Sheets

Use sheet_name=None to import every sheet in the workbook:

# Read all sheets into a dictionary
all_sheets = pd.read_excel('company_data.xlsx', sheet_name=None)
 
# Iterate through all sheets
for sheet_name, df in all_sheets.items():
    print(f"\nSheet: {sheet_name}")
    print(f"Shape: {df.shape}")
    print(f"Columns: {df.columns.tolist()}")

This is particularly useful when you need to process multiple sheets with similar structures:

# Combine all sheets into a single DataFrame
all_sheets = pd.read_excel('monthly_sales.xlsx', sheet_name=None)
combined_df = pd.concat(all_sheets.values(), ignore_index=True)
print(f"Combined shape: {combined_df.shape}")

Controlling Data Import with Headers and Columns

Specifying Header Rows

Control which row becomes the column names:

# Skip first 2 rows, use row 3 as header
df = pd.read_excel('data.xlsx', header=2)
 
# No header row - pandas assigns numeric column names
df = pd.read_excel('data.xlsx', header=None)
 
# Multi-level column headers
df = pd.read_excel('data.xlsx', header=[0, 1])

Skipping Rows

Remove unwanted rows from the top of the file:

# Skip first 3 rows
df = pd.read_excel('data.xlsx', skiprows=3)
 
# Skip specific rows (0-indexed)
df = pd.read_excel('data.xlsx', skiprows=[0, 2, 5])
 
# Skip rows using a function (skips even-indexed rows, including row 0, the header)
df = pd.read_excel('data.xlsx', skiprows=lambda x: x % 2 == 0)

Selecting Specific Columns

Import only the columns you need to reduce memory usage:

# Select columns by letter range
df = pd.read_excel('data.xlsx', usecols='A:D')
 
# Select specific columns by letter
df = pd.read_excel('data.xlsx', usecols='A,C,E,G')
 
# Select columns by index
df = pd.read_excel('data.xlsx', usecols=[0, 2, 4, 6])
 
# Select columns by name
df = pd.read_excel('data.xlsx', usecols=['Name', 'Age', 'Salary'])
 
# Select columns using a function
df = pd.read_excel('data.xlsx', usecols=lambda x: 'total' in x.lower())

Data Type Specification and Conversion

Setting Column Data Types

Explicitly define data types to prevent automatic inference errors:

# Specify dtypes for multiple columns
df = pd.read_excel('customer_data.xlsx',
                   dtype={
                       'CustomerID': str,
                       'ZipCode': str,
                       'Age': int,
                       'Revenue': float
                   })
 
# Keep numeric IDs as strings to preserve leading zeros
df = pd.read_excel('products.xlsx', dtype={'ProductID': str})

Handling Dates

Parse date columns automatically:

# Parse single date column
df = pd.read_excel('orders.xlsx', parse_dates=['OrderDate'])
 
# Parse multiple date columns
df = pd.read_excel('employee_data.xlsx',
                   parse_dates=['HireDate', 'TerminationDate'])
 
# Combine columns into a single datetime (note: this dict form is
# deprecated in recent pandas; combining after import with
# pd.to_datetime is the forward-compatible route)
df = pd.read_excel('logs.xlsx',
                   parse_dates={'DateTime': ['Date', 'Time']})
 
# Check the result
print(df.dtypes)

Managing Missing Values

Control how pandas identifies and handles missing data:

# Recognize custom missing value indicators
df = pd.read_excel('survey_data.xlsx',
                   na_values=['NA', 'N/A', 'null', 'None', '-'])
 
# Different NA values per column
df = pd.read_excel('mixed_data.xlsx',
                   na_values={
                       'Age': ['Unknown', 'N/A'],
                       'Salary': [0, -1, 'Not Disclosed']
                   })
 
# Keep default NA values and add more
df = pd.read_excel('data.xlsx',
                   na_values=['Missing'],
                   keep_default_na=True)
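
To treat only your own markers as missing and keep strings such as 'NA' as literal data, disable the built-in defaults:

# Only the listed values become NaN; pandas' default NA strings are kept as data
df = pd.read_excel('data.xlsx',
                   na_values=['Missing'],
                   keep_default_na=False)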

Handling Large Excel Files

Excel files consume significant memory when loaded entirely into RAM. Use these strategies for large datasets:

Limiting Rows

Read only the rows you need:

# Read first 10,000 rows
df = pd.read_excel('large_file.xlsx', nrows=10000)
 
# Read a specific row range: range(1, 1001) skips data rows 1-1000
# while keeping row 0 as the header
df = pd.read_excel('large_file.xlsx', skiprows=range(1, 1001), nrows=5000)

Using Converters During Import

Apply transformations as values are parsed, so the data lands in its final form without a second pass over the DataFrame:

# Convert columns during read
def categorize_status(value):
    return 'Active' if value == 1 else 'Inactive'
 
df = pd.read_excel('users.xlsx',
                   converters={
                       'Status': categorize_status,
                       'Category': lambda x: x.lower().strip()
                   })

Processing Sheets Iteratively

For workbooks with multiple large sheets, process one sheet at a time:

import pandas as pd
 
# Get all sheet names first
xls = pd.ExcelFile('massive_workbook.xlsx')
sheet_names = xls.sheet_names
 
# Process each sheet separately
for sheet in sheet_names:
    df = pd.read_excel(xls, sheet_name=sheet, nrows=10000)
    # Process df here
    result = df.describe()
    print(f"\n{sheet} Summary:\n{result}")

Alternative: Using openpyxl for Streaming

For extremely large files, consider reading row-by-row with openpyxl:

import pandas as pd
from openpyxl import load_workbook
 
# Load workbook in read-only mode
wb = load_workbook('huge_file.xlsx', read_only=True)
ws = wb['Sheet1']
 
# Process rows iteratively
data = []
for row in ws.iter_rows(min_row=2, max_row=1000, values_only=True):
    data.append(row)
 
# Convert to DataFrame
df = pd.DataFrame(data, columns=['Col1', 'Col2', 'Col3'])
wb.close()

Understanding Excel Engines

Pandas uses different libraries to read Excel files depending on the format and installed packages:

| Engine | File Format | Installation | Use Case |
|---|---|---|---|
| openpyxl | .xlsx | pip install openpyxl | Modern Excel files (Excel 2007+); default for .xlsx |
| xlrd | .xls | pip install xlrd | Legacy .xls files (Excel 97-2003); .xlsx support removed in xlrd 2.0 |
| pyxlsb | .xlsb | pip install pyxlsb | Binary Excel files |
| odf | .ods | pip install odfpy | OpenDocument spreadsheets |

Specify the engine explicitly:

# Force openpyxl engine
df = pd.read_excel('data.xlsx', engine='openpyxl')
 
# Read legacy .xls file
df = pd.read_excel('old_data.xls', engine='xlrd')
 
# Read binary Excel file
df = pd.read_excel('data.xlsb', engine='pyxlsb')

Pandas read_excel vs read_csv: Key Differences

While both functions create DataFrames, they have important differences:

| Feature | read_excel() | read_csv() |
|---|---|---|
| File format | Binary Excel files (.xlsx, .xls) | Plain text CSV files |
| Speed | Slower (complex format parsing) | Faster (simple text parsing) |
| Dependencies | Requires openpyxl/xlrd | No additional dependencies |
| Multiple sheets | Yes (sheet_name parameter) | No (single file only) |
| Formatted data | Preserves some Excel formatting | No formatting (plain text) |
| Memory usage | Higher (binary parsing overhead) | Lower (text streaming) |
| Formulas | Reads calculated values only | N/A |
| Date parsing | Automatic Excel date conversion | Requires parse_dates |
| Large files | Memory-intensive | Supports chunksize parameter |
| Best for | Business data, multi-sheet reports | Large datasets, data exports |

Recommendation: Use read_csv() for large-scale data processing and read_excel() when you need to work with existing Excel workbooks or multiple sheets.
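
The chunksize parameter mentioned in the table is what makes read_csv practical at scale; a sketch of chunked processing on a hypothetical large_file.csv export:

import pandas as pd
 
# Stream the CSV in 100,000-row chunks instead of loading it whole;
# read_excel has no equivalent option
total_rows = 0
for chunk in pd.read_csv('large_file.csv', chunksize=100_000):
    total_rows += len(chunk)
print(f"Rows processed: {total_rows}")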

Visualize Excel Data Instantly with PyGWalker

After importing Excel data with pandas, you often need to explore it visually. PyGWalker transforms your DataFrame into an interactive Tableau-like visualization interface directly in your Jupyter notebook, with no need to write plotting code.

import pandas as pd
import pygwalker as pyg
 
# Read Excel data
df = pd.read_excel('sales_data.xlsx')
 
# Launch interactive visualization
pyg.walk(df)

This opens a drag-and-drop interface where you can:

  • Create charts by dragging columns to visual channels
  • Switch between chart types (bar, line, scatter, heatmap) instantly
  • Apply filters and aggregations interactively
  • Export visualizations as images or code

PyGWalker is particularly useful for exploratory data analysis after importing Excel files, letting you understand data distributions and relationships without writing matplotlib or seaborn code.

Common Errors and Troubleshooting

ModuleNotFoundError: No module named 'openpyxl'

This is the most common error when reading .xlsx files:

ImportError: Missing optional dependency 'openpyxl'. Use pip or conda to install openpyxl.

Solution: Install the required engine:

# Install openpyxl
pip install openpyxl
 
# Or install pandas together with its Excel extras (pandas 2.0+)
pip install "pandas[excel]"

XLRDError: Excel xlsx file; not supported

This error means xlrd is being used on a .xlsx file; since version 2.0, xlrd supports only the legacy .xls format.

Solution: Install openpyxl or specify the engine:

df = pd.read_excel('data.xlsx', engine='openpyxl')

FileNotFoundError

Check your file path and working directory:

import os
 
# Print current directory
print(os.getcwd())
 
# List files in directory
print(os.listdir('.'))
 
# Use absolute path
df = pd.read_excel('/absolute/path/to/file.xlsx')

Memory Errors with Large Files

If pandas crashes with memory errors:

# Reduce memory by selecting only needed columns; use a single string
# for Excel letter ranges (a list of strings is treated as column names)
df = pd.read_excel('large_file.xlsx', usecols='A:C')
 
# Limit rows
df = pd.read_excel('large_file.xlsx', nrows=50000)
 
# Specify dtypes to reduce memory
df = pd.read_excel('large_file.xlsx',
                   dtype={'ID': 'int32', 'Category': 'category'})

ValueError: Excel file format cannot be determined

This occurs when the file extension doesn't match the actual format:

Solution: Specify the engine explicitly or rename the file with the correct extension:

# Try different engines
df = pd.read_excel('data.xlsx', engine='openpyxl')
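
If renaming doesn't help, a minimal sketch that inspects the file's leading bytes can reveal the true format (assumption: .xlsx/.xlsb/.ods are ZIP archives starting with b'PK', while legacy .xls uses the OLE2 signature):

# Check the file signature to see what the file really is
with open('data.xlsx', 'rb') as f:
    signature = f.read(8)
 
if signature[:2] == b'PK':
    print("ZIP-based format (.xlsx/.xlsb/.ods)")
elif signature == b'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1':
    print("OLE2 format (legacy .xls)")
else:
    print("Not an Excel signature; possibly CSV or HTML with a misleading extension")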

Date Parsing Issues

Excel stores dates as serial numbers. If dates aren't parsed correctly:

# Force date parsing
df = pd.read_excel('data.xlsx', parse_dates=['DateColumn'])
 
# Manual conversion
df['DateColumn'] = pd.to_datetime(df['DateColumn'])
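
If a column arrived as raw serial numbers (for example 45000 instead of a date), they can be converted manually; a sketch assuming the workbook uses Excel's default 1900 date system:

# Excel's 1900 date system counts days from 1899-12-30
df['DateColumn'] = pd.to_datetime(df['DateColumn'], unit='D', origin='1899-12-30')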

Advanced Techniques

Reading Password-Protected Files

While pandas doesn't support password-protected files directly, use msoffcrypto-tool:

import msoffcrypto
import io
import pandas as pd
 
# Decrypt the file
decrypted = io.BytesIO()
with open('protected.xlsx', 'rb') as f:
    file = msoffcrypto.OfficeFile(f)
    file.load_key(password='yourpassword')
    file.decrypt(decrypted)
 
# Read the decrypted file (rewind the in-memory buffer first)
decrypted.seek(0)
df = pd.read_excel(decrypted)

Combining read_excel with Context Managers

For better resource management:

from contextlib import closing
import pandas as pd
 
with closing(pd.ExcelFile('data.xlsx')) as xls:
    df1 = pd.read_excel(xls, sheet_name='Sheet1')
    df2 = pd.read_excel(xls, sheet_name='Sheet2')
    # File is automatically closed after this block
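
pd.ExcelFile also implements the context-manager protocol itself, so contextlib.closing is optional in recent pandas versions:

# pd.ExcelFile can be used directly in a with statement
with pd.ExcelFile('data.xlsx') as xls:
    df1 = pd.read_excel(xls, sheet_name='Sheet1')
    df2 = pd.read_excel(xls, sheet_name='Sheet2')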

Working with URLs

Read Excel files directly from URLs:

url = 'https://example.com/data.xlsx'
df = pd.read_excel(url)
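
If the server requires authentication or custom headers, download the bytes first and hand pandas an in-memory buffer; a sketch assuming the requests package is installed and a hypothetical bearer token:

import io
import requests
import pandas as pd
 
# Fetch the file with custom headers, then read from memory
response = requests.get('https://example.com/data.xlsx',
                        headers={'Authorization': 'Bearer YOUR_TOKEN'})
response.raise_for_status()
df = pd.read_excel(io.BytesIO(response.content))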

Using RunCell for Excel Data Debugging

When working with complex Excel imports in Jupyter notebooks, RunCell provides AI-powered assistance for debugging import errors, optimizing parameters, and fixing data type issues. The AI agent can:

  • Diagnose why read_excel() isn't parsing dates correctly
  • Suggest optimal dtype specifications for memory efficiency
  • Debug engine compatibility issues
  • Help troubleshoot encoding and formatting problems

This is particularly valuable when importing messy real-world Excel files with inconsistent formatting or complex multi-sheet structures.

FAQ

How do I read an Excel file without headers in pandas?

Set header=None to tell pandas the first row is data, not column names. Pandas will assign numeric column names (0, 1, 2...):

df = pd.read_excel('data.xlsx', header=None)

You can then rename columns manually:

df.columns = ['Name', 'Age', 'Salary']
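
Alternatively, the names parameter assigns the column names in the same call:

df = pd.read_excel('data.xlsx', header=None, names=['Name', 'Age', 'Salary'])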

Can pandas read multiple Excel files at once?

Yes, use a loop or list comprehension to read multiple files and concatenate them:

import pandas as pd
from glob import glob
 
# Read all Excel files in a directory
files = glob('data/*.xlsx')
dfs = [pd.read_excel(f) for f in files]
combined_df = pd.concat(dfs, ignore_index=True)

How do I read only specific rows from an Excel file?

Combine skiprows and nrows parameters:

# Read 100 rows after skipping the first 100 data rows;
# range(1, 101) keeps row 0 as the header
df = pd.read_excel('data.xlsx', skiprows=range(1, 101), nrows=100)

Why is read_excel so slow compared to read_csv?

Excel files are binary formats that require complex parsing, while CSV files are plain text. For large datasets, convert Excel to CSV first:

# One-time conversion
df = pd.read_excel('large_file.xlsx')
df.to_csv('large_file.csv', index=False)
 
# Future reads use CSV
df = pd.read_csv('large_file.csv')  # Much faster

How do I handle merged cells in Excel files?

Pandas reads the top-left value of merged cells and fills the rest with NaN. You can forward-fill these values:

df = pd.read_excel('data.xlsx')
df['MergedColumn'] = df['MergedColumn'].ffill()  # fillna(method='ffill') is deprecated

Conclusion

The pandas read_excel() function provides comprehensive control over importing Excel files into Python, from simple single-sheet imports to complex scenarios involving multiple sheets, custom data types, and large file handling. Master the key parameters—sheet_name, usecols, dtype, parse_dates, and skiprows—to import data efficiently and accurately.

For most workflows, the basic pd.read_excel('file.xlsx') suffices, but understanding advanced options like engine selection, converter functions, and memory optimization techniques becomes critical when working with real-world business data. Remember to install the appropriate engine (openpyxl for .xlsx files) and consider converting large Excel files to CSV for better performance.

The combination of pandas for data import, PyGWalker for instant visualization, and RunCell for debugging creates a powerful workflow for Excel data analysis in Python.
