Pandas Concat: How to Concatenate DataFrames in Python
Real-world data rarely lives in a single file. You pull January sales from one CSV, February from another, and Q1 targets from a third. You scrape multiple web pages into separate DataFrames. You split a large dataset for parallel processing and need to reassemble the pieces. In every case, you need a reliable way to combine DataFrames without losing rows, scrambling columns, or corrupting your index.
The pandas concat function (pd.concat()) is the standard tool for this job. It stacks DataFrames vertically (adding rows) or horizontally (adding columns), handles mismatched columns gracefully, and scales to any number of DataFrames in a single call. This guide covers every parameter you need, with working code examples you can paste directly into your notebook.
What pd.concat() Does -- Basic Syntax
pd.concat() takes a list (or dict) of DataFrames and glues them together along a specified axis. Think of it as stacking building blocks -- vertically to add more rows, or side by side to add more columns.
import pandas as pd

result = pd.concat(
    objs,                    # list or dict of DataFrames
    axis=0,                  # 0 = vertical (rows), 1 = horizontal (columns)
    join='outer',            # 'outer' or 'inner'
    ignore_index=False,
    keys=None,
    sort=False,
    verify_integrity=False
)

Key Parameters at a Glance
| Parameter | Description | Default |
|---|---|---|
| objs | A list or dict of DataFrames (or Series) to concatenate | Required |
| axis | 0 to stack rows (vertical), 1 to stack columns (horizontal) | 0 |
| join | 'outer' keeps all columns; 'inner' keeps only shared columns | 'outer' |
| ignore_index | If True, resets the index to 0, 1, 2, ... in the result | False |
| keys | Labels to identify which original DataFrame each row came from | None |
| sort | Sort non-concatenation axis (column names for axis=0) | False |
| verify_integrity | Raise error if the result has duplicate index values | False |
Sample Data for All Examples
Every example below uses these DataFrames:
import pandas as pd

df_jan = pd.DataFrame({
    'product': ['Widget', 'Gadget', 'Sprocket'],
    'units_sold': [150, 200, 80],
    'revenue': [1500, 3000, 960]
})

df_feb = pd.DataFrame({
    'product': ['Widget', 'Gadget', 'Sprocket'],
    'units_sold': [170, 180, 95],
    'revenue': [1700, 2700, 1140]
})

print(df_jan)
print(df_feb)

Output:
product units_sold revenue
0 Widget 150 1500
1 Gadget 200 3000
2 Sprocket 80 960
product units_sold revenue
0 Widget 170 1700
1 Gadget 180 2700
2 Sprocket 95 1140

Concatenating DataFrames Vertically (axis=0)
Vertical concatenation is the most common use case. It stacks one DataFrame on top of another, adding rows. This is what you use when you have monthly files, batch results, or any data split across multiple tables with the same columns.
combined = pd.concat([df_jan, df_feb])
print(combined)

Output:
product units_sold revenue
0 Widget 150 1500
1 Gadget 200 3000
2 Sprocket 80 960
0 Widget 170 1700
1 Gadget 180 2700
2 Sprocket 95 1140

Notice the index: both DataFrames kept their original index values (0, 1, 2), so the result has duplicate index values. This is usually not what you want. The fix is the ignore_index parameter, covered next.
The ignore_index Parameter -- Resetting the Index
Setting ignore_index=True discards the original index and assigns a fresh sequential index starting from 0:
combined = pd.concat([df_jan, df_feb], ignore_index=True)
print(combined)

Output:
product units_sold revenue
0 Widget 150 1500
1 Gadget 200 3000
2 Sprocket 80 960
3 Widget 170 1700
4 Gadget 180 2700
5 Sprocket 95 1140

When to use it: Almost always when concatenating vertically. Unless your index carries meaningful information (like timestamps or unique IDs), reset it to avoid confusion later.
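As a counterexample, here is a case where you would keep the original index. The data and dates below are illustrative, not from the samples above: each DataFrame is indexed by date, and resetting would throw that information away.

```python
import pandas as pd

# When the index is meaningful (here, dates), do NOT reset it
jan = pd.DataFrame({'revenue': [100, 110]},
                   index=pd.to_datetime(['2024-01-01', '2024-01-02']))
feb = pd.DataFrame({'revenue': [120, 130]},
                   index=pd.to_datetime(['2024-02-01', '2024-02-02']))

# Omit ignore_index so the DatetimeIndex survives the concat
combined = pd.concat([jan, feb])

# Date-based lookups still work on the combined result
print(combined.loc['2024-02-01', 'revenue'])
```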
The keys Parameter -- Creating a Hierarchical Index
The keys parameter adds a level to the index that labels which source each row came from. This creates a MultiIndex (hierarchical index):
combined = pd.concat([df_jan, df_feb], keys=['January', 'February'])
print(combined)

Output:
product units_sold revenue
January 0 Widget 150 1500
1 Gadget 200 3000
2 Sprocket 80 960
February 0 Widget 170 1700
1 Gadget 180 2700
2 Sprocket 95 1140You can then select data from a specific source using .loc:
# Get only January data
jan_data = combined.loc['January']
print(jan_data)

Output:
product units_sold revenue
0 Widget 150 1500
1 Gadget 200 3000
2 Sprocket 80 960

When to use it: Use keys when you need to track which original DataFrame each row belongs to -- for example, data from different experiments, time periods, or data sources.
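pd.concat() also accepts a names argument that labels the levels of the MultiIndex created by keys. A sketch using the sample data (the level names 'month' and 'row' are arbitrary choices, not required by the API); reset_index then turns the source label into an ordinary column, which is often easier to filter and group on:

```python
import pandas as pd

df_jan = pd.DataFrame({'product': ['Widget', 'Gadget'], 'units_sold': [150, 200]})
df_feb = pd.DataFrame({'product': ['Widget', 'Gadget'], 'units_sold': [170, 180]})

# names= labels the MultiIndex levels created by keys=
combined = pd.concat([df_jan, df_feb],
                     keys=['January', 'February'],
                     names=['month', 'row'])

# Turn the 'month' index level into a regular column
flat = combined.reset_index(level='month')
print(flat)
```

With the label as a column, a plain boolean mask (flat[flat['month'] == 'February']) replaces the MultiIndex .loc lookup.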
Concatenating Horizontally (axis=1) -- Side by Side
Setting axis=1 places DataFrames next to each other, adding columns. Pandas aligns rows by their index values.
targets = pd.DataFrame({
    'target_units': [160, 190, 90],
    'target_revenue': [1600, 2850, 1080]
})

result = pd.concat([df_jan, targets], axis=1)
print(result)

Output:
product units_sold revenue target_units target_revenue
0 Widget 150 1500 160 1600
1 Gadget 200 3000 190 2850
2 Sprocket 80 960 90 1080

This works cleanly because both DataFrames share the same index (0, 1, 2). If the indexes do not align, you get NaN values for the mismatched rows:
df_a = pd.DataFrame({'value_a': [10, 20, 30]}, index=[0, 1, 2])
df_b = pd.DataFrame({'value_b': [40, 50, 60]}, index=[1, 2, 3])
result = pd.concat([df_a, df_b], axis=1)
print(result)

Output:
value_a value_b
0 10.0 NaN
1 20.0 40.0
2 30.0 50.0
3 NaN 60.0

Row 0 has no value_b (no matching index in df_b), and row 3 has no value_a (no matching index in df_a).
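If you would rather drop the mismatched rows than keep NaN placeholders, join='inner' works along the index when axis=1. A short sketch reusing df_a and df_b from above:

```python
import pandas as pd

df_a = pd.DataFrame({'value_a': [10, 20, 30]}, index=[0, 1, 2])
df_b = pd.DataFrame({'value_b': [40, 50, 60]}, index=[1, 2, 3])

# join='inner' keeps only index values present in BOTH DataFrames
result = pd.concat([df_a, df_b], axis=1, join='inner')
print(result)
```

Rows 1 and 2 survive; rows 0 and 3, which exist in only one DataFrame, are dropped entirely.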
The join Parameter -- Inner vs Outer
The join parameter controls what happens when DataFrames have different columns (for axis=0) or different index values (for axis=1).
outer join (default) -- Keep Everything
df_with_extra = pd.DataFrame({
    'product': ['Widget', 'Gadget'],
    'units_sold': [200, 250],
    'region': ['East', 'West']
})

result = pd.concat([df_jan, df_with_extra], join='outer', ignore_index=True)
print(result)

Output:
product units_sold revenue region
0 Widget 150 1500.0 NaN
1 Gadget 200 3000.0 NaN
2 Sprocket 80 960.0 NaN
3 Widget 200 NaN East
4 Gadget 250 NaN West

All columns from both DataFrames appear. Missing values are filled with NaN.
inner join -- Keep Only Shared Columns
result = pd.concat([df_jan, df_with_extra], join='inner', ignore_index=True)
print(result)

Output:
product units_sold
0 Widget 150
1 Gadget 200
2 Sprocket 80
3 Widget 200
4 Gadget 250

Only columns that exist in both DataFrames survive. The revenue column (missing from df_with_extra) and region column (missing from df_jan) are both dropped.
When to use inner join: When you want a clean result with no NaN values and are willing to lose columns that do not appear in every DataFrame.
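An equivalent, more explicit alternative is to compute the shared columns yourself before concatenating. This sketch reuses df_jan and df_with_extra from above and produces the same result as join='inner', but makes the dropped columns visible in your code:

```python
import pandas as pd

df_jan = pd.DataFrame({
    'product': ['Widget', 'Gadget', 'Sprocket'],
    'units_sold': [150, 200, 80],
    'revenue': [1500, 3000, 960]
})
df_with_extra = pd.DataFrame({
    'product': ['Widget', 'Gadget'],
    'units_sold': [200, 250],
    'region': ['East', 'West']
})

# Intersection of the two column sets -- the columns join='inner' would keep
shared = df_jan.columns.intersection(df_with_extra.columns)
result = pd.concat([df_jan[shared], df_with_extra[shared]], ignore_index=True)
print(result)
```

Inspecting shared before concatenating lets you log or assert exactly which columns are about to be discarded.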
Concatenating a List of DataFrames
One of the biggest advantages of pd.concat() over other combining methods is that it handles any number of DataFrames in a single call. This is the standard pattern for combining files loaded in a loop:
import pandas as pd
# Simulate loading monthly CSV files
months = {
    'Jan': {'product': ['Widget', 'Gadget'], 'units': [150, 200]},
    'Feb': {'product': ['Widget', 'Gadget'], 'units': [170, 180]},
    'Mar': {'product': ['Widget', 'Gadget'], 'units': [190, 210]},
}

dfs = []
for month, data in months.items():
    df = pd.DataFrame(data)
    df['month'] = month
    dfs.append(df)

all_data = pd.concat(dfs, ignore_index=True)
print(all_data)

Output:
product units month
0 Widget 150 Jan
1 Gadget 200 Jan
2 Widget 170 Feb
3 Gadget 180 Feb
4 Widget 190 Mar
5 Gadget 210 Mar

This pattern -- build a list of DataFrames, then concat once at the end -- is much faster than appending one DataFrame at a time in a loop. Each append creates a full copy, but a single pd.concat() call allocates memory once.
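In practice the DataFrames usually come from files rather than in-memory dicts. The sketch below simulates that with throwaway CSVs in a temporary directory (the file names and columns are made up for the example); the pattern itself -- glob the paths, read each file, concat once -- is the standard one:

```python
import glob
import os
import tempfile

import pandas as pd

# Create three small CSV files to stand in for real monthly exports
tmp = tempfile.mkdtemp()
for month, units in [('jan', 150), ('feb', 170), ('mar', 190)]:
    pd.DataFrame({'product': ['Widget'], 'units': [units]}).to_csv(
        os.path.join(tmp, f'sales_{month}.csv'), index=False)

# The pattern: collect paths, read each file, concat exactly once
paths = sorted(glob.glob(os.path.join(tmp, 'sales_*.csv')))
dfs = [pd.read_csv(p) for p in paths]
all_data = pd.concat(dfs, ignore_index=True)
print(all_data)
```

With real data you would replace the temporary directory with your own folder of exports; everything after the glob call stays the same.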
concat vs merge vs append -- Comparison Table
Pandas offers several ways to combine DataFrames. Choosing the right one depends on how you want to combine them:
| Feature | pd.concat() | pd.merge() | DataFrame.append() |
|---|---|---|---|
| Primary use | Stack DataFrames (rows or columns) | Join on shared column values (like SQL) | Add rows to a DataFrame |
| Number of inputs | Any number at once | Two at a time | Two at a time |
| Matching logic | Aligns by index (or column names) | Matches on key column values | Aligns by column names |
| Join types | outer, inner | inner, left, right, outer, cross | Outer only |
| Default behavior | Outer join, stack vertically | Inner join on shared columns | Outer join, append rows |
| Best for | Combining monthly files, batch results, same-schema tables | Relational joins (customer + orders) | Nothing -- removed in pandas 2.0 |
| Performance | Fast for many DataFrames | Optimized for two-table joins | Slow (copies data each call) |
When to Use Each
- Use pd.concat() when your DataFrames share the same structure (same columns) and you want to stack them. Also use it for horizontal concatenation when aligning by index.
- Use pd.merge() when you need to match rows based on column values -- like joining a sales table with a products table on product_id. See our pandas merge guide for details.
- Avoid DataFrame.append() -- it was deprecated in pandas 1.4 and removed in pandas 2.0. Use pd.concat([df1, df2]) instead.
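The concat/merge distinction is easiest to see side by side on a toy pair of tables (the column names here are illustrative):

```python
import pandas as pd

sales = pd.DataFrame({'product_id': [1, 2], 'units': [150, 200]})
products = pd.DataFrame({'product_id': [1, 2], 'name': ['Widget', 'Gadget']})

# merge: match rows on a key column, pulling in columns from the other table
merged = pd.merge(sales, products, on='product_id')

# concat: stack rows of same-schema tables, no matching involved
more_sales = pd.DataFrame({'product_id': [1, 2], 'units': [170, 180]})
stacked = pd.concat([sales, more_sales], ignore_index=True)

print(merged)   # 2 rows, 3 columns: product_id, units, name
print(stacked)  # 4 rows, 2 columns: product_id, units
```

merge widens the table by matching keys; concat lengthens it by stacking rows.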
Common Errors and Fixes
1. Columns Do Not Align
When concatenating DataFrames with different column names, the default outer join fills missing values with NaN. If you did not expect this, check your column names:
# Diagnose: compare column names
print(df1.columns.tolist())
print(df2.columns.tolist())
# Fix: rename columns to match before concatenating
df2 = df2.rename(columns={'sales': 'revenue', 'qty': 'units_sold'})
combined = pd.concat([df1, df2], ignore_index=True)

2. Dtype Mismatches After Concatenation
If one DataFrame stores a column as int64 and another stores it as float64, pandas upcasts to float. Worse, if one stores it as a string, you get an object dtype column:
# Check dtypes after concat
combined = pd.concat([df1, df2], ignore_index=True)
print(combined.dtypes)
# Fix: cast before concatenating
df2['units_sold'] = df2['units_sold'].astype(int)
combined = pd.concat([df1, df2], ignore_index=True)

3. Duplicated Index Values
Without ignore_index=True, vertical concatenation preserves original indexes, leading to duplicate values. This causes problems with .loc lookups:
combined = pd.concat([df1, df2])
# combined.loc[0] returns TWO rows, not one
# Fix option 1: use ignore_index
combined = pd.concat([df1, df2], ignore_index=True)
# Fix option 2: use verify_integrity to catch the issue early
combined = pd.concat([df1, df2], verify_integrity=True)  # raises ValueError

4. Accidentally Concatenating Along the Wrong Axis
If your result has twice the columns instead of twice the rows (or vice versa), check the axis parameter:
# Wrong: this adds columns side by side
wrong = pd.concat([df1, df2], axis=1)
# Right: this stacks rows vertically
right = pd.concat([df1, df2], axis=0)

Visualize Your Concatenated DataFrames with PyGWalker
After concatenating data from multiple sources, you often need to verify the result and explore patterns across the combined dataset. Instead of writing manual plotting code with matplotlib or seaborn, you can use PyGWalker -- an open-source Python library that turns any pandas DataFrame into an interactive, Tableau-like visual exploration interface directly inside Jupyter Notebook.
import pandas as pd
import pygwalker as pyg

# Combine monthly sales data
df_jan = pd.DataFrame({
    'product': ['Widget', 'Gadget', 'Sprocket'],
    'units_sold': [150, 200, 80],
    'revenue': [1500, 3000, 960],
    'month': ['Jan', 'Jan', 'Jan']
})

df_feb = pd.DataFrame({
    'product': ['Widget', 'Gadget', 'Sprocket'],
    'units_sold': [170, 180, 95],
    'revenue': [1700, 2700, 1140],
    'month': ['Feb', 'Feb', 'Feb']
})

combined = pd.concat([df_jan, df_feb], ignore_index=True)

# Launch interactive visualization
walker = pyg.walk(combined)

With PyGWalker, you can drag product to the x-axis and revenue to the y-axis, then split by month to instantly compare revenue trends across periods -- no chart code required. You can create bar charts, scatter plots, line charts, and more just by dragging fields. It is especially useful for verifying that your concatenation worked correctly and that data from different sources aligns as expected.
Install PyGWalker with pip install pygwalker, or try it in Google Colab or Kaggle.
FAQ
What is the difference between pandas concat and merge?
pd.concat() stacks DataFrames vertically (adding rows) or horizontally (adding columns) by aligning on the index. pd.merge() joins two DataFrames by matching values in specific columns, like a SQL JOIN. Use concat when your DataFrames have the same columns and you want to combine rows. Use merge when you need to match rows based on a shared key column.
Does pd.concat() modify the original DataFrames?
No. pd.concat() always returns a new DataFrame. The original DataFrames remain unchanged. This is consistent with the pandas design principle that operations return new objects rather than modifying data in place.
How do I concatenate DataFrames with different columns?
Use pd.concat() with the default join='outer' -- it keeps all columns from all DataFrames and fills missing values with NaN. If you only want columns that appear in every DataFrame, set join='inner'. You can also rename columns before concatenating to ensure alignment.
Is pd.concat() faster than DataFrame.append()?
Yes. DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0. It called pd.concat() internally but created a copy on every call. When combining many DataFrames, collecting them in a list and calling pd.concat() once is significantly faster because it allocates memory a single time.
How do I reset the index after concatenation?
Pass ignore_index=True to pd.concat(): pd.concat([df1, df2], ignore_index=True). This replaces the original index values with a new sequential index starting from 0. Alternatively, call .reset_index(drop=True) on the result.
Conclusion
The pandas concat() function is the go-to tool for combining DataFrames that share the same structure. Here are the key takeaways:
- Vertical concatenation (axis=0) stacks rows and is the most common use case -- ideal for combining monthly files, batch results, or split datasets.
- Horizontal concatenation (axis=1) places DataFrames side by side, aligning on the index.
- Use ignore_index=True to get a clean sequential index (recommended in most cases).
- Use keys to create a hierarchical index that tracks which source each row came from.
- The join parameter controls how mismatched columns are handled: 'outer' keeps everything, 'inner' keeps only shared columns.
- Always collect DataFrames in a list and call pd.concat() once, rather than appending in a loop.
- Use pd.merge() instead when you need SQL-style joins on column values.
Once your data is concatenated, tools like PyGWalker let you visually explore the combined result without writing chart code, making it faster to verify your data pipeline and discover patterns across sources.