Pandas Merge: The Complete Guide to Merging DataFrames in Python
Updated on
Working with a single DataFrame is rare in the real world. Most data analysis projects require you to combine data from multiple sources -- sales records with customer profiles, sensor readings with device metadata, or survey responses with demographic tables. The pandas merge function is the standard tool Python developers use to join DataFrames together, and understanding it well is the difference between spending minutes or hours on data preparation.
This guide covers every aspect of pd.merge(): the four main join types, merging on multiple columns, handling duplicate names, and common pitfalls. Every example uses working code you can copy directly into your notebook.
What pd.merge() Does
pd.merge() combines two DataFrames by matching rows based on one or more shared columns (called keys). It works like a SQL JOIN statement. You choose which columns to match on and what type of join to perform, and pandas handles the rest.
Here is the basic syntax:
import pandas as pd
result = pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
suffixes=('_x', '_y'), indicator=False, validate=None)Key Parameters Reference
| Parameter | Description | Default |
|---|---|---|
left | The first (left) DataFrame | Required |
right | The second (right) DataFrame | Required |
how | Type of join: 'inner', 'left', 'right', 'outer', 'cross' | 'inner' |
on | Column name(s) to join on (must exist in both DataFrames) | None |
left_on | Column(s) from the left DataFrame to use as keys | None |
right_on | Column(s) from the right DataFrame to use as keys | None |
left_index | Use the left DataFrame's index as the join key | False |
right_index | Use the right DataFrame's index as the join key | False |
suffixes | Suffixes to apply to overlapping column names | ('_x', '_y') |
indicator | Add a column showing the source of each row | False |
validate | Check if the merge is one-to-one, one-to-many, etc. | None |
If you omit on, left_on, and right_on, pandas will automatically join on all columns that share the same name in both DataFrames.
Sample Data for All Examples
Every example below uses these two DataFrames:
import pandas as pd
employees = pd.DataFrame({
'emp_id': [1, 2, 3, 4, 5],
'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'dept_id': [10, 20, 10, 30, 20]
})
departments = pd.DataFrame({
'dept_id': [10, 20, 40],
'dept_name': ['Engineering', 'Marketing', 'Sales']
})
print(employees)
print(departments)Output:
emp_id name dept_id
0 1 Alice 10
1 2 Bob 20
2 3 Charlie 10
3 4 Diana 30
4 5 Eve 20
dept_id dept_name
0 10 Engineering
1 20 Marketing
2 40 SalesNotice that dept_id 30 exists only in employees, and dept_id 40 exists only in departments. This mismatch is intentional -- it makes the difference between join types obvious.
Inner Merge (Default)
An inner merge returns only the rows where the key exists in both DataFrames. Rows without a match are dropped.
inner_result = pd.merge(employees, departments, on='dept_id', how='inner')
print(inner_result)Output:
emp_id name dept_id dept_name
0 1 Alice 10 Engineering
1 3 Charlie 10 Engineering
2 2 Bob 20 Marketing
3 5 Eve 20 MarketingDiana (dept_id=30) is dropped because department 30 does not exist in the departments table. Department 40 (Sales) is also absent because no employee belongs to it.
When to use inner merge: Use it when you only want complete records -- rows that have valid data on both sides. This is the safest default because it never introduces NaN values from unmatched rows.
Left Merge
A left merge keeps all rows from the left DataFrame and only the matching rows from the right. Where there is no match, the right-side columns are filled with NaN.
left_result = pd.merge(employees, departments, on='dept_id', how='left')
print(left_result)Output:
emp_id name dept_id dept_name
0 1 Alice 10 Engineering
1 2 Bob 20 Marketing
2 3 Charlie 10 Engineering
3 4 Diana 30 NaN
4 5 Eve 20 MarketingDiana is now included, but her dept_name is NaN because department 30 has no entry in the departments table.
When to use left merge: Use it when the left DataFrame is your "primary" dataset and you want to enrich it with additional columns, without losing any rows from the primary table. This is the most common merge type in practice.
Right Merge
A right merge is the mirror of a left merge. It keeps all rows from the right DataFrame and only matching rows from the left.
right_result = pd.merge(employees, departments, on='dept_id', how='right')
print(right_result)Output:
emp_id name dept_id dept_name
0 1.0 Alice 10 Engineering
1 3.0 Charlie 10 Engineering
2 2.0 Bob 20 Marketing
3 5.0 Eve 20 Marketing
4 NaN NaN 40 SalesThe Sales department (dept_id=40) now appears even though no employee belongs to it. The employee fields are NaN for that row.
When to use right merge: In practice, right merges are rare. You can always rewrite a right merge as a left merge by swapping the DataFrame order. Most codebases prefer left merges for consistency.
Outer Merge (Full Outer Join)
An outer merge returns all rows from both DataFrames. Where there is no match, missing values are filled with NaN.
outer_result = pd.merge(employees, departments, on='dept_id', how='outer')
print(outer_result)Output:
emp_id name dept_id dept_name
0 1.0 Alice 10 Engineering
1 3.0 Charlie 10 Engineering
2 2.0 Bob 20 Marketing
3 5.0 Eve 20 Marketing
4 4.0 Diana 30 NaN
5 NaN NaN 40 SalesBoth Diana (no matching department) and Sales (no matching employees) appear in the result.
When to use outer merge: Use it when you need a complete picture of both datasets and want to identify which rows failed to match. The indicator parameter is especially helpful here.
Quick Comparison of Merge Types
| Merge Type | Keeps from Left | Keeps from Right | Unmatched Rows |
|---|---|---|---|
inner | Only matched | Only matched | Dropped |
left | All rows | Only matched | Left rows kept, right fills NaN |
right | Only matched | All rows | Right rows kept, left fills NaN |
outer | All rows | All rows | Both sides kept, NaN where no match |
cross | All rows | All rows | Cartesian product (every combination) |
Merging on Multiple Columns
When a single column is not enough to uniquely identify a match, pass a list of column names to on:
sales = pd.DataFrame({
'year': [2024, 2024, 2025, 2025],
'quarter': ['Q1', 'Q2', 'Q1', 'Q2'],
'revenue': [100, 150, 200, 250]
})
targets = pd.DataFrame({
'year': [2024, 2024, 2025, 2025],
'quarter': ['Q1', 'Q2', 'Q1', 'Q2'],
'target': [120, 140, 210, 230]
})
merged = pd.merge(sales, targets, on=['year', 'quarter'])
print(merged)Output:
year quarter revenue target
0 2024 Q1 100 120
1 2024 Q2 150 140
2 2025 Q1 200 210
3 2025 Q2 250 230This is the equivalent of a composite key in SQL. Both year and quarter must match for rows to be joined.
Merging on Different Column Names
Sometimes the two DataFrames use different names for the same concept. Use left_on and right_on instead of on:
orders = pd.DataFrame({
'order_id': [101, 102, 103],
'customer_id': [1, 2, 3],
'amount': [50.0, 75.0, 120.0]
})
customers = pd.DataFrame({
'id': [1, 2, 4],
'name': ['Alice', 'Bob', 'Diana']
})
merged = pd.merge(orders, customers, left_on='customer_id', right_on='id', how='left')
print(merged)Output:
order_id customer_id amount id name
0 101 1 50.0 1.0 Alice
1 102 2 75.0 2.0 Bob
2 103 3 120.0 NaN NaNNote that both customer_id and id columns appear in the result. You can drop the duplicate afterward:
merged = merged.drop(columns=['id'])Handling Duplicate Column Names with Suffixes
When both DataFrames have columns with the same name (besides the key), pandas appends suffixes to distinguish them:
df1 = pd.DataFrame({
'id': [1, 2, 3],
'score': [85, 90, 78]
})
df2 = pd.DataFrame({
'id': [1, 2, 3],
'score': [88, 92, 80]
})
# Default suffixes
merged = pd.merge(df1, df2, on='id')
print(merged)Output:
id score_x score_y
0 1 85 88
1 2 90 92
2 3 78 80You can customize the suffixes to make column names more meaningful:
merged = pd.merge(df1, df2, on='id', suffixes=('_midterm', '_final'))
print(merged)Output:
id score_midterm score_final
0 1 85 88
1 2 90 92
2 3 78 80Using the Indicator Parameter
The indicator parameter adds a _merge column that tells you where each row came from:
result = pd.merge(employees, departments, on='dept_id', how='outer', indicator=True)
print(result)Output:
emp_id name dept_id dept_name _merge
0 1.0 Alice 10 Engineering both
1 3.0 Charlie 10 Engineering both
2 2.0 Bob 20 Marketing both
3 5.0 Eve 20 Marketing both
4 4.0 Diana 30 NaN left_only
5 NaN NaN 40 Sales right_onlyThis is extremely useful for data quality checks -- you can quickly filter rows that failed to match:
unmatched = result[result['_merge'] != 'both']
print(unmatched)merge() vs join() vs concat() -- When to Use Each
Pandas offers three ways to combine DataFrames. Here is when to use each:
| Feature | pd.merge() | df.join() | pd.concat() |
|---|---|---|---|
| Join type | Column-based (like SQL) | Index-based by default | Stack rows or columns |
| Syntax | pd.merge(df1, df2, on='col') | df1.join(df2, on='col') | pd.concat([df1, df2]) |
| Best for | Joining on shared columns | Joining on index | Stacking DataFrames vertically/horizontally |
| Multiple keys | Yes (on=['a','b']) | Limited | N/A |
| Multiple DataFrames | Two at a time | Two at a time | Any number at once |
| Default join | Inner | Left | Outer (axis=0) |
| Flexibility | Highest | Medium | Different use case |
Rule of thumb:
- Use
pd.merge()when you are combining DataFrames based on column values (the most common case). - Use
df.join()when your join key is the index and you want shorter syntax. - Use
pd.concat()when you are stacking DataFrames on top of each other (appending rows) or side by side.
Common Errors and Troubleshooting
1. MergeError: columns overlap but no suffix specified
This happens when both DataFrames have non-key columns with the same name and you set suffixes=(False, False):
# Fix: use meaningful suffixes
merged = pd.merge(df1, df2, on='id', suffixes=('_left', '_right'))2. Unexpected row explosion (many-to-many merge)
If both DataFrames have duplicate values in the join key, pandas creates a Cartesian product of the matching rows. This can cause your result to have far more rows than either input:
# Check for duplicates before merging
print(df1['key'].duplicated().sum())
print(df2['key'].duplicated().sum())
# Use validate to catch this early
merged = pd.merge(df1, df2, on='key', validate='one_to_many')The validate parameter accepts 'one_to_one', 'one_to_many', 'many_to_one', and 'many_to_many'. It raises a MergeError if the data does not match the expected cardinality.
3. Key column has different dtypes
If one DataFrame stores the key as int64 and the other as object (string), the merge will either fail or produce zero matches:
# Check dtypes
print(df1['id'].dtype) # int64
print(df2['id'].dtype) # object
# Fix: cast to the same type
df2['id'] = df2['id'].astype(int)
merged = pd.merge(df1, df2, on='id')4. NaN values in keys
Rows with NaN in the join key column will not match anything (NaN != NaN in pandas). Drop or fill them first:
df1 = df1.dropna(subset=['key'])Visualize Your Merged DataFrames with PyGWalker
After merging your data, the next step is usually exploring the result -- looking at distributions, spotting patterns, and checking for anomalies. Instead of writing dozens of matplotlib or seaborn calls, you can use PyGWalker (opens in a new tab), an open-source Python library that turns any pandas DataFrame into an interactive, Tableau-like visual exploration interface directly inside Jupyter Notebook.
import pandas as pd
import pygwalker as pyg
# Merge your DataFrames
employees = pd.DataFrame({
'emp_id': [1, 2, 3, 4, 5],
'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
'dept_id': [10, 20, 10, 30, 20],
'salary': [95000, 82000, 105000, 78000, 91000]
})
departments = pd.DataFrame({
'dept_id': [10, 20, 30],
'dept_name': ['Engineering', 'Marketing', 'Design']
})
merged = pd.merge(employees, departments, on='dept_id')
# Launch interactive visualization -- drag and drop to explore
walker = pyg.walk(merged)With PyGWalker, you can drag dept_name to the x-axis and salary to the y-axis to instantly see salary distribution by department -- no chart code needed. You can create bar charts, scatter plots, histograms, and more just by dragging fields. It is especially powerful for exploring the results of complex merges where you need to verify that the join worked as expected.
You can try PyGWalker right now in Google Colab (opens in a new tab), Kaggle (opens in a new tab), or install it with
pip install pygwalker.
FAQ
What is the difference between pandas merge and join?
pd.merge() joins DataFrames based on column values by default and supports all join types (inner, left, right, outer, cross). df.join() joins on the index by default and uses a left join. Under the hood, join() calls merge(), so they produce the same results -- merge() simply gives you more control over which columns to match on.
How do I merge two DataFrames on multiple columns in pandas?
Pass a list of column names to the on parameter: pd.merge(df1, df2, on=['col_a', 'col_b']). Both columns must match for rows to be joined. This is equivalent to a composite key in SQL.
Why does my pandas merge create duplicate rows?
Duplicate rows appear when the join key has repeated values in one or both DataFrames. Pandas creates a Cartesian product of all matching rows. Use df.duplicated(subset=['key']).sum() to check for duplicates before merging, or use validate='one_to_one' to catch it early.
What is a cross merge in pandas?
A cross merge (how='cross') produces the Cartesian product of both DataFrames -- every row from the left is paired with every row from the right. If the left has 3 rows and the right has 4 rows, the result has 12 rows. It is useful for generating all possible combinations, such as pairing every product with every store location.
How do I perform a pandas merge on index?
Set left_index=True and/or right_index=True: pd.merge(df1, df2, left_index=True, right_index=True). You can also mix index and column keys: pd.merge(df1, df2, left_on='col_a', right_index=True).
Conclusion
The pandas merge() function is the most flexible and widely used tool for combining DataFrames in Python. To recap the key points:
- Inner merge (default) keeps only rows that match in both DataFrames.
- Left merge keeps all rows from the left DataFrame, filling NaN where the right has no match.
- Right merge keeps all rows from the right DataFrame -- but prefer left merge with swapped order for consistency.
- Outer merge keeps everything from both sides, useful for identifying unmatched records with
indicator=True. - Use
onfor same-name keys,left_on/right_onfor different-name keys, andsuffixesto handle overlapping columns. - Always check for duplicate keys and mismatched dtypes before merging to avoid unexpected results.
Once your data is merged, tools like PyGWalker (opens in a new tab) let you visually explore the result without writing chart code, making the entire analysis workflow faster and more intuitive.