Skip to content
Topics
Pandas
Pandas Merge: The Complete Guide to Merging DataFrames in Python

Pandas Merge: The Complete Guide to Merging DataFrames in Python

Updated on

Working with a single DataFrame is rare in the real world. Most data analysis projects require you to combine data from multiple sources -- sales records with customer profiles, sensor readings with device metadata, or survey responses with demographic tables. The pandas merge function is the standard tool Python developers use to join DataFrames together, and understanding it well is the difference between spending minutes or hours on data preparation.

This guide covers every aspect of pd.merge(): the four main join types, merging on multiple columns, handling duplicate names, and common pitfalls. Every example uses working code you can copy directly into your notebook.

📚

What pd.merge() Does

pd.merge() combines two DataFrames by matching rows based on one or more shared columns (called keys). It works like a SQL JOIN statement. You choose which columns to match on and what type of join to perform, and pandas handles the rest.

Here is the basic syntax:

import pandas as pd
 
result = pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
                  suffixes=('_x', '_y'), indicator=False, validate=None)

Key Parameters Reference

ParameterDescriptionDefault
leftThe first (left) DataFrameRequired
rightThe second (right) DataFrameRequired
howType of join: 'inner', 'left', 'right', 'outer', 'cross''inner'
onColumn name(s) to join on (must exist in both DataFrames)None
left_onColumn(s) from the left DataFrame to use as keysNone
right_onColumn(s) from the right DataFrame to use as keysNone
left_indexUse the left DataFrame's index as the join keyFalse
right_indexUse the right DataFrame's index as the join keyFalse
suffixesSuffixes to apply to overlapping column names('_x', '_y')
indicatorAdd a column showing the source of each rowFalse
validateCheck if the merge is one-to-one, one-to-many, etc.None

If you omit on, left_on, and right_on, pandas will automatically join on all columns that share the same name in both DataFrames.

Sample Data for All Examples

Every example below uses these two DataFrames:

import pandas as pd
 
employees = pd.DataFrame({
    'emp_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'dept_id': [10, 20, 10, 30, 20]
})
 
departments = pd.DataFrame({
    'dept_id': [10, 20, 40],
    'dept_name': ['Engineering', 'Marketing', 'Sales']
})
 
print(employees)
print(departments)

Output:

   emp_id     name  dept_id
0       1    Alice       10
1       2      Bob       20
2       3  Charlie       10
3       4    Diana       30
4       5      Eve       20

   dept_id    dept_name
0       10  Engineering
1       20    Marketing
2       40        Sales

Notice that dept_id 30 exists only in employees, and dept_id 40 exists only in departments. This mismatch is intentional -- it makes the difference between join types obvious.

Inner Merge (Default)

An inner merge returns only the rows where the key exists in both DataFrames. Rows without a match are dropped.

inner_result = pd.merge(employees, departments, on='dept_id', how='inner')
print(inner_result)

Output:

   emp_id     name  dept_id    dept_name
0       1    Alice       10  Engineering
1       3  Charlie       10  Engineering
2       2      Bob       20    Marketing
3       5      Eve       20    Marketing

Diana (dept_id=30) is dropped because department 30 does not exist in the departments table. Department 40 (Sales) is also absent because no employee belongs to it.

When to use inner merge: Use it when you only want complete records -- rows that have valid data on both sides. This is the safest default because it never introduces NaN values from unmatched rows.

Left Merge

A left merge keeps all rows from the left DataFrame and only the matching rows from the right. Where there is no match, the right-side columns are filled with NaN.

left_result = pd.merge(employees, departments, on='dept_id', how='left')
print(left_result)

Output:

   emp_id     name  dept_id    dept_name
0       1    Alice       10  Engineering
1       2      Bob       20    Marketing
2       3  Charlie       10  Engineering
3       4    Diana       30          NaN
4       5      Eve       20    Marketing

Diana is now included, but her dept_name is NaN because department 30 has no entry in the departments table.

When to use left merge: Use it when the left DataFrame is your "primary" dataset and you want to enrich it with additional columns, without losing any rows from the primary table. This is the most common merge type in practice.

Right Merge

A right merge is the mirror of a left merge. It keeps all rows from the right DataFrame and only matching rows from the left.

right_result = pd.merge(employees, departments, on='dept_id', how='right')
print(right_result)

Output:

   emp_id     name  dept_id    dept_name
0     1.0    Alice       10  Engineering
1     3.0  Charlie       10  Engineering
2     2.0      Bob       20    Marketing
3     5.0      Eve       20    Marketing
4     NaN      NaN       40        Sales

The Sales department (dept_id=40) now appears even though no employee belongs to it. The employee fields are NaN for that row.

When to use right merge: In practice, right merges are rare. You can always rewrite a right merge as a left merge by swapping the DataFrame order. Most codebases prefer left merges for consistency.

Outer Merge (Full Outer Join)

An outer merge returns all rows from both DataFrames. Where there is no match, missing values are filled with NaN.

outer_result = pd.merge(employees, departments, on='dept_id', how='outer')
print(outer_result)

Output:

   emp_id     name  dept_id    dept_name
0     1.0    Alice       10  Engineering
1     3.0  Charlie       10  Engineering
2     2.0      Bob       20    Marketing
3     5.0      Eve       20    Marketing
4     4.0    Diana       30          NaN
5     NaN      NaN       40        Sales

Both Diana (no matching department) and Sales (no matching employees) appear in the result.

When to use outer merge: Use it when you need a complete picture of both datasets and want to identify which rows failed to match. The indicator parameter is especially helpful here.

Quick Comparison of Merge Types

Merge TypeKeeps from LeftKeeps from RightUnmatched Rows
innerOnly matchedOnly matchedDropped
leftAll rowsOnly matchedLeft rows kept, right fills NaN
rightOnly matchedAll rowsRight rows kept, left fills NaN
outerAll rowsAll rowsBoth sides kept, NaN where no match
crossAll rowsAll rowsCartesian product (every combination)

Merging on Multiple Columns

When a single column is not enough to uniquely identify a match, pass a list of column names to on:

sales = pd.DataFrame({
    'year': [2024, 2024, 2025, 2025],
    'quarter': ['Q1', 'Q2', 'Q1', 'Q2'],
    'revenue': [100, 150, 200, 250]
})
 
targets = pd.DataFrame({
    'year': [2024, 2024, 2025, 2025],
    'quarter': ['Q1', 'Q2', 'Q1', 'Q2'],
    'target': [120, 140, 210, 230]
})
 
merged = pd.merge(sales, targets, on=['year', 'quarter'])
print(merged)

Output:

   year quarter  revenue  target
0  2024      Q1      100     120
1  2024      Q2      150     140
2  2025      Q1      200     210
3  2025      Q2      250     230

This is the equivalent of a composite key in SQL. Both year and quarter must match for rows to be joined.

Merging on Different Column Names

Sometimes the two DataFrames use different names for the same concept. Use left_on and right_on instead of on:

orders = pd.DataFrame({
    'order_id': [101, 102, 103],
    'customer_id': [1, 2, 3],
    'amount': [50.0, 75.0, 120.0]
})
 
customers = pd.DataFrame({
    'id': [1, 2, 4],
    'name': ['Alice', 'Bob', 'Diana']
})
 
merged = pd.merge(orders, customers, left_on='customer_id', right_on='id', how='left')
print(merged)

Output:

   order_id  customer_id  amount   id   name
0       101            1    50.0  1.0  Alice
1       102            2    75.0  2.0    Bob
2       103            3   120.0  NaN    NaN

Note that both customer_id and id columns appear in the result. You can drop the duplicate afterward:

merged = merged.drop(columns=['id'])

Handling Duplicate Column Names with Suffixes

When both DataFrames have columns with the same name (besides the key), pandas appends suffixes to distinguish them:

df1 = pd.DataFrame({
    'id': [1, 2, 3],
    'score': [85, 90, 78]
})
 
df2 = pd.DataFrame({
    'id': [1, 2, 3],
    'score': [88, 92, 80]
})
 
# Default suffixes
merged = pd.merge(df1, df2, on='id')
print(merged)

Output:

   id  score_x  score_y
0   1       85       88
1   2       90       92
2   3       78       80

You can customize the suffixes to make column names more meaningful:

merged = pd.merge(df1, df2, on='id', suffixes=('_midterm', '_final'))
print(merged)

Output:

   id  score_midterm  score_final
0   1             85           88
1   2             90           92
2   3             78           80

Using the Indicator Parameter

The indicator parameter adds a _merge column that tells you where each row came from:

result = pd.merge(employees, departments, on='dept_id', how='outer', indicator=True)
print(result)

Output:

   emp_id     name  dept_id    dept_name      _merge
0     1.0    Alice       10  Engineering        both
1     3.0  Charlie       10  Engineering        both
2     2.0      Bob       20    Marketing        both
3     5.0      Eve       20    Marketing        both
4     4.0    Diana       30          NaN   left_only
5     NaN      NaN       40        Sales  right_only

This is extremely useful for data quality checks -- you can quickly filter rows that failed to match:

unmatched = result[result['_merge'] != 'both']
print(unmatched)

merge() vs join() vs concat() -- When to Use Each

Pandas offers three ways to combine DataFrames. Here is when to use each:

Featurepd.merge()df.join()pd.concat()
Join typeColumn-based (like SQL)Index-based by defaultStack rows or columns
Syntaxpd.merge(df1, df2, on='col')df1.join(df2, on='col')pd.concat([df1, df2])
Best forJoining on shared columnsJoining on indexStacking DataFrames vertically/horizontally
Multiple keysYes (on=['a','b'])LimitedN/A
Multiple DataFramesTwo at a timeTwo at a timeAny number at once
Default joinInnerLeftOuter (axis=0)
FlexibilityHighestMediumDifferent use case

Rule of thumb:

  • Use pd.merge() when you are combining DataFrames based on column values (the most common case).
  • Use df.join() when your join key is the index and you want shorter syntax.
  • Use pd.concat() when you are stacking DataFrames on top of each other (appending rows) or side by side.

Common Errors and Troubleshooting

1. MergeError: columns overlap but no suffix specified

This happens when both DataFrames have non-key columns with the same name and you set suffixes=(False, False):

# Fix: use meaningful suffixes
merged = pd.merge(df1, df2, on='id', suffixes=('_left', '_right'))

2. Unexpected row explosion (many-to-many merge)

If both DataFrames have duplicate values in the join key, pandas creates a Cartesian product of the matching rows. This can cause your result to have far more rows than either input:

# Check for duplicates before merging
print(df1['key'].duplicated().sum())
print(df2['key'].duplicated().sum())
 
# Use validate to catch this early
merged = pd.merge(df1, df2, on='key', validate='one_to_many')

The validate parameter accepts 'one_to_one', 'one_to_many', 'many_to_one', and 'many_to_many'. It raises a MergeError if the data does not match the expected cardinality.

3. Key column has different dtypes

If one DataFrame stores the key as int64 and the other as object (string), the merge will either fail or produce zero matches:

# Check dtypes
print(df1['id'].dtype)  # int64
print(df2['id'].dtype)  # object
 
# Fix: cast to the same type
df2['id'] = df2['id'].astype(int)
merged = pd.merge(df1, df2, on='id')

4. NaN values in keys

Rows with NaN in the join key column will not match anything (NaN != NaN in pandas). Drop or fill them first:

df1 = df1.dropna(subset=['key'])

Visualize Your Merged DataFrames with PyGWalker

After merging your data, the next step is usually exploring the result -- looking at distributions, spotting patterns, and checking for anomalies. Instead of writing dozens of matplotlib or seaborn calls, you can use PyGWalker (opens in a new tab), an open-source Python library that turns any pandas DataFrame into an interactive, Tableau-like visual exploration interface directly inside Jupyter Notebook.

import pandas as pd
import pygwalker as pyg
 
# Merge your DataFrames
employees = pd.DataFrame({
    'emp_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'dept_id': [10, 20, 10, 30, 20],
    'salary': [95000, 82000, 105000, 78000, 91000]
})
 
departments = pd.DataFrame({
    'dept_id': [10, 20, 30],
    'dept_name': ['Engineering', 'Marketing', 'Design']
})
 
merged = pd.merge(employees, departments, on='dept_id')
 
# Launch interactive visualization -- drag and drop to explore
walker = pyg.walk(merged)

With PyGWalker, you can drag dept_name to the x-axis and salary to the y-axis to instantly see salary distribution by department -- no chart code needed. You can create bar charts, scatter plots, histograms, and more just by dragging fields. It is especially powerful for exploring the results of complex merges where you need to verify that the join worked as expected.

You can try PyGWalker right now in Google Colab (opens in a new tab), Kaggle (opens in a new tab), or install it with pip install pygwalker.

FAQ

What is the difference between pandas merge and join?

pd.merge() joins DataFrames based on column values by default and supports all join types (inner, left, right, outer, cross). df.join() joins on the index by default and uses a left join. Under the hood, join() calls merge(), so they produce the same results -- merge() simply gives you more control over which columns to match on.

How do I merge two DataFrames on multiple columns in pandas?

Pass a list of column names to the on parameter: pd.merge(df1, df2, on=['col_a', 'col_b']). Both columns must match for rows to be joined. This is equivalent to a composite key in SQL.

Why does my pandas merge create duplicate rows?

Duplicate rows appear when the join key has repeated values in one or both DataFrames. Pandas creates a Cartesian product of all matching rows. Use df.duplicated(subset=['key']).sum() to check for duplicates before merging, or use validate='one_to_one' to catch it early.

What is a cross merge in pandas?

A cross merge (how='cross') produces the Cartesian product of both DataFrames -- every row from the left is paired with every row from the right. If the left has 3 rows and the right has 4 rows, the result has 12 rows. It is useful for generating all possible combinations, such as pairing every product with every store location.

How do I perform a pandas merge on index?

Set left_index=True and/or right_index=True: pd.merge(df1, df2, left_index=True, right_index=True). You can also mix index and column keys: pd.merge(df1, df2, left_on='col_a', right_index=True).

Conclusion

The pandas merge() function is the most flexible and widely used tool for combining DataFrames in Python. To recap the key points:

  • Inner merge (default) keeps only rows that match in both DataFrames.
  • Left merge keeps all rows from the left DataFrame, filling NaN where the right has no match.
  • Right merge keeps all rows from the right DataFrame -- but prefer left merge with swapped order for consistency.
  • Outer merge keeps everything from both sides, useful for identifying unmatched records with indicator=True.
  • Use on for same-name keys, left_on/right_on for different-name keys, and suffixes to handle overlapping columns.
  • Always check for duplicate keys and mismatched dtypes before merging to avoid unexpected results.

Once your data is merged, tools like PyGWalker (opens in a new tab) let you visually explore the result without writing chart code, making the entire analysis workflow faster and more intuitive.

📚