Pandas Drop Duplicates: How to Remove Duplicate Rows in Python

Duplicate rows are one of the most common data quality problems in real-world datasets. They sneak in through repeated API calls, overlapping CSV exports, broken ETL pipelines, or simple copy-paste errors during manual data entry. Left unchecked, duplicates inflate row counts, distort averages and sums, and introduce bias into machine learning models. A dataset that looks like it has 10,000 customer records might really have 8,200 unique customers and 1,800 phantom entries silently corrupting every calculation built on top of it.

The pandas drop_duplicates() method is the standard way to detect and remove these redundant rows. This guide walks through every parameter, shows common patterns for real-world deduplication, and covers performance considerations for large datasets. Every code example is copy-ready and includes its expected output.

Why Duplicates Matter

Before jumping into code, it is worth understanding exactly what duplicate rows break:

| Problem | What Happens | Example |
| --- | --- | --- |
| Inflated counts | len(df) and value_counts() overcount | A customer appears 3 times, so "total customers" is 3x too high |
| Wrong averages | mean() weights duplicated rows more heavily | A high-value order counted twice skews average order value upward |
| Broken joins | Merging on a key with duplicates causes row explosion | A many-to-many merge produces a Cartesian product instead of a 1:1 mapping |
| Bad ML models | Training data with repeated samples biases the model | The model memorizes duplicated examples and overfits |
| Wasted storage | Redundant rows consume disk and memory | A 2 GB dataset could be 1.4 GB after deduplication |

The fix is straightforward: find the duplicates, decide which copy to keep (if any), and remove the rest.

Basic Syntax: df.drop_duplicates()

The simplest call removes rows where every column value is identical:

import pandas as pd
 
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
    'age': [30, 25, 30, 35, 25],
    'city': ['NYC', 'LA', 'NYC', 'Chicago', 'LA']
})
 
print("Before:")
print(df)
 
df_clean = df.drop_duplicates()
 
print("\nAfter:")
print(df_clean)

Output:

Before:
      name  age     city
0    Alice   30      NYC
1      Bob   25       LA
2    Alice   30      NYC
3  Charlie   35  Chicago
4      Bob   25       LA

After:
      name  age     city
0    Alice   30      NYC
1      Bob   25       LA
3  Charlie   35  Chicago

Row 2 and row 4 were exact copies of row 0 and row 1, so they were dropped. Notice that the original index values (0, 1, 3) are preserved. If you want a clean sequential index, chain .reset_index(drop=True).
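
As a quick sketch on a smaller toy frame, here is what that chaining looks like, along with the ignore_index shortcut covered in the next section:

import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Alice'],
    'age': [30, 25, 30]
})

# Chain reset_index(drop=True) to renumber the deduplicated result from 0
print(df.drop_duplicates().reset_index(drop=True))

# Equivalent shortcut: let drop_duplicates reset the index itself
print(df.drop_duplicates(ignore_index=True))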

Full Method Signature

DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| subset | column label or list | None (all columns) | Only consider these columns when identifying duplicates |
| keep | 'first', 'last', False | 'first' | Which duplicate to keep; False drops all copies |
| inplace | bool | False | If True, modifies the DataFrame in place and returns None |
| ignore_index | bool | False | If True, resets the index from 0 to n-1 in the result |

Finding Duplicates First with df.duplicated()

Before removing duplicates, you often want to inspect them. The duplicated() method returns a boolean Series marking duplicate rows:

df = pd.DataFrame({
    'order_id': [101, 102, 103, 101, 104, 102],
    'product': ['Widget', 'Gadget', 'Widget', 'Widget', 'Gizmo', 'Gadget'],
    'amount': [29.99, 49.99, 29.99, 29.99, 19.99, 49.99]
})
 
# Show which rows are duplicates
print(df.duplicated())

Output:

0    False
1    False
2    False
3     True
4    False
5     True
dtype: bool

To see the actual duplicate rows:

duplicates = df[df.duplicated(keep=False)]
print(duplicates)

Output:

   order_id product  amount
0       101  Widget   29.99
1       102  Gadget   49.99
3       101  Widget   29.99
5       102  Gadget   49.99

Using keep=False marks all copies as duplicates (not just the later occurrences), so you can see every row involved in duplication.

Counting Duplicates

To get a quick count of how many duplicate rows exist:

num_duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {num_duplicates}")
 
total_rows = len(df)
unique_rows = df.drop_duplicates().shape[0]
print(f"Total: {total_rows}, Unique: {unique_rows}, Duplicates: {total_rows - unique_rows}")

Output:

Number of duplicate rows: 2
Total: 6, Unique: 4, Duplicates: 2

This is a useful sanity check before and after cleaning.
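
If you want that check baked into a pipeline, a small sketch with an assertion (on throwaway data) looks like this:

import pandas as pd

df = pd.DataFrame({'id': [1, 2, 2, 3], 'value': ['a', 'b', 'b', 'c']})

before = len(df)
df_clean = df.drop_duplicates()
print(f"Removed {before - len(df_clean)} duplicate rows")

# Fail loudly if any duplicates somehow survive the cleaning step
assert not df_clean.duplicated().any(), "duplicates remain after cleaning"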

The subset Parameter: Check Specific Columns Only

Often you want to deduplicate based on a key column rather than requiring every column to match. The subset parameter lets you specify which columns determine uniqueness:

df = pd.DataFrame({
    'email': ['alice@mail.com', 'bob@mail.com', 'alice@mail.com', 'charlie@mail.com'],
    'name': ['Alice', 'Bob', 'Alice Smith', 'Charlie'],
    'signup_date': ['2025-01-01', '2025-01-02', '2025-01-15', '2025-01-03']
})
 
print("Original:")
print(df)
 
# Deduplicate by email only
df_deduped = df.drop_duplicates(subset=['email'])
print("\nDeduplicated by email:")
print(df_deduped)

Output:

Original:
              email         name signup_date
0    alice@mail.com        Alice  2025-01-01
1      bob@mail.com          Bob  2025-01-02
2    alice@mail.com  Alice Smith  2025-01-15
3  charlie@mail.com      Charlie  2025-01-03

Deduplicated by email:
              email     name signup_date
0    alice@mail.com    Alice  2025-01-01
1      bob@mail.com      Bob  2025-01-02
3  charlie@mail.com  Charlie  2025-01-03

Row 2 was removed because alice@mail.com already appeared in row 0, even though the name and signup_date values differed. You can also pass multiple columns: subset=['email', 'name'] would only consider rows duplicated when both columns match.
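
For comparison, here is the same data deduplicated on both columns; because no row repeats both the email and the name, nothing is dropped:

import pandas as pd

df = pd.DataFrame({
    'email': ['alice@mail.com', 'bob@mail.com', 'alice@mail.com', 'charlie@mail.com'],
    'name': ['Alice', 'Bob', 'Alice Smith', 'Charlie'],
    'signup_date': ['2025-01-01', '2025-01-02', '2025-01-15', '2025-01-03']
})

# Row 2 survives: its email repeats, but the (email, name) pair does not
print(df.drop_duplicates(subset=['email', 'name']))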

The keep Parameter: first, last, or False

The keep parameter controls which occurrence survives:

df = pd.DataFrame({
    'sensor_id': ['S1', 'S2', 'S1', 'S2', 'S1'],
    'reading': [22.5, 18.3, 23.1, 18.3, 24.0],
    'timestamp': ['08:00', '08:00', '09:00', '09:00', '10:00']
})
 
print("keep='first' (default):")
print(df.drop_duplicates(subset=['sensor_id'], keep='first'))
 
print("\nkeep='last':")
print(df.drop_duplicates(subset=['sensor_id'], keep='last'))
 
print("\nkeep=False (drop all duplicates):")
print(df.drop_duplicates(subset=['sensor_id'], keep=False))

Output:

keep='first' (default):
  sensor_id  reading timestamp
0        S1     22.5     08:00
1        S2     18.3     08:00

keep='last':
  sensor_id  reading timestamp
3        S2     18.3     09:00
4        S1     24.0     10:00

keep=False (drop all duplicates):
Empty DataFrame
Columns: [sensor_id, reading, timestamp]
Index: []

| keep value | Behavior | Use case |
| --- | --- | --- |
| 'first' | Keep the first occurrence, drop later ones | Keep the earliest record |
| 'last' | Keep the last occurrence, drop earlier ones | Keep the most recent record |
| False | Drop all rows that have any duplicate | Find rows that are truly unique (no copies at all) |

In the example above, keep=False returns an empty DataFrame because both sensor IDs appear more than once.
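
Because that example happened to return an empty frame, here is a short sketch on slightly different made-up data where keep=False leaves something behind:

import pandas as pd

df = pd.DataFrame({
    'sensor_id': ['S1', 'S2', 'S3', 'S1'],
    'reading': [22.5, 18.3, 19.7, 24.0]
})

# S1 appears twice, so both copies are dropped; S2 and S3 each appear once and survive
print(df.drop_duplicates(subset=['sensor_id'], keep=False))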

In-place vs Returning a New DataFrame

By default, drop_duplicates() returns a new DataFrame and leaves the original unchanged. Setting inplace=True modifies the original directly:

df = pd.DataFrame({
    'id': [1, 2, 2, 3],
    'value': ['a', 'b', 'b', 'c']
})
 
# Returns new DataFrame (original unchanged)
df_new = df.drop_duplicates()
print(f"Original length: {len(df)}, New length: {len(df_new)}")
 
# Modifies in place (returns None)
df.drop_duplicates(inplace=True)
print(f"After inplace: {len(df)}")

Output:

Original length: 4, New length: 3
After inplace: 3

Modern pandas style recommends avoiding inplace=True and using assignment instead (df = df.drop_duplicates()). This makes code easier to read and debug, especially in chained operations.
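
In practice that style looks like a small pipeline in which every step returns a new DataFrame (a sketch with throwaway data):

import pandas as pd

df = pd.DataFrame({
    'id': [2, 1, 2, 3],
    'value': ['b', 'a', 'b', 'c']
})

# Each call returns a new DataFrame, so the pipeline reads top to bottom
df = (
    df
    .drop_duplicates()
    .sort_values('id')
    .reset_index(drop=True)
)
print(df)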

Case-Insensitive Duplicate Detection

By default, pandas treats "Alice" and "alice" as different values. For case-insensitive deduplication, normalize the column first:

df = pd.DataFrame({
    'name': ['Alice', 'alice', 'ALICE', 'Bob', 'bob'],
    'score': [90, 85, 92, 78, 80]
})
 
# Create a normalized column for comparison
df['name_lower'] = df['name'].str.lower()
 
# Deduplicate on the normalized column, keep the first original-case entry
df_deduped = df.drop_duplicates(subset=['name_lower']).drop(columns=['name_lower'])
print(df_deduped)

Output:

    name  score
0  Alice     90
3    Bob     78

This pattern preserves the original casing while correctly identifying case-insensitive duplicates.
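
The same idea extends to other normalizations. For example, if stray whitespace should also count as "the same" value -- an assumption about your data, not something drop_duplicates decides for you -- strip it in the helper column as well:

import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', ' alice ', 'Bob'],
    'score': [90, 85, 78]
})

# Normalize case and surrounding whitespace in a temporary key column, then drop it
df['name_key'] = df['name'].str.strip().str.lower()
df_deduped = df.drop_duplicates(subset=['name_key']).drop(columns=['name_key'])
print(df_deduped)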

Real-World Example: Cleaning a Customer Database

Here is a realistic scenario. You receive a customer export from a CRM where the same customer was entered multiple times by different sales reps:

import pandas as pd
 
customers = pd.DataFrame({
    'customer_id': [1001, 1002, 1001, 1003, 1002, 1004],
    'name': ['Acme Corp', 'Beta Inc', 'Acme Corp', 'Gamma LLC', 'Beta Inc.', 'Delta Co'],
    'email': ['acme@mail.com', 'beta@mail.com', 'acme@mail.com', 'gamma@mail.com', 'beta@mail.com', 'delta@mail.com'],
    'revenue': [50000, 30000, 52000, 45000, 30000, 20000],
    'last_contact': ['2025-12-01', '2025-11-15', '2026-01-10', '2025-10-20', '2025-11-15', '2026-01-05']
})
 
print(f"Rows before cleaning: {len(customers)}")
print(f"Duplicate customer_ids: {customers.duplicated(subset=['customer_id']).sum()}")
 
# Sort by last_contact descending so the most recent entry is first
customers['last_contact'] = pd.to_datetime(customers['last_contact'])
customers = customers.sort_values('last_contact', ascending=False)
 
# Keep the most recent record for each customer_id
customers_clean = customers.drop_duplicates(subset=['customer_id'], keep='first')
customers_clean = customers_clean.sort_values('customer_id').reset_index(drop=True)
 
print(f"\nRows after cleaning: {len(customers_clean)}")
print(customers_clean)

Output:

Rows before cleaning: 6
Duplicate customer_ids: 2

Rows after cleaning: 4
   customer_id       name           email  revenue last_contact
0         1001  Acme Corp   acme@mail.com    52000   2026-01-10
1         1002   Beta Inc   beta@mail.com    30000   2025-11-15
2         1003  Gamma LLC  gamma@mail.com    45000   2025-10-20
3         1004   Delta Co  delta@mail.com    20000   2026-01-05

The key pattern here is sort first, then deduplicate with keep='first'. By sorting on last_contact descending before deduplicating, the most recent record for each customer survives.
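
If you prefer not to reorder the whole frame, selecting the row with the latest date per key via groupby() and idxmax() gives the same result. Here is a minimal sketch on made-up data (it assumes last_contact is already a datetime column with no missing values):

import pandas as pd

customers = pd.DataFrame({
    'customer_id': [1001, 1001, 1002],
    'revenue': [50000, 52000, 30000],
    'last_contact': pd.to_datetime(['2025-12-01', '2026-01-10', '2025-11-15'])
})

# idxmax() returns the index label of the most recent contact within each customer_id group
latest_idx = customers.groupby('customer_id')['last_contact'].idxmax()
customers_clean = customers.loc[latest_idx].reset_index(drop=True)
print(customers_clean)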

Real-World Example: Deduplicating Web Scraping Results

Web scrapers commonly produce duplicates when pages are crawled multiple times or pagination overlaps:

import pandas as pd
 
scraped = pd.DataFrame({
    'url': [
        'https://shop.com/item/101',
        'https://shop.com/item/102',
        'https://shop.com/item/101',
        'https://shop.com/item/103',
        'https://shop.com/item/102',
        'https://shop.com/item/104',
    ],
    'title': ['Blue Widget', 'Red Gadget', 'Blue Widget', 'Green Gizmo', 'Red Gadget', 'Yellow Thing'],
    'price': [19.99, 29.99, 19.99, 39.99, 31.99, 14.99],
    'scraped_at': ['2026-02-01', '2026-02-01', '2026-02-02', '2026-02-02', '2026-02-02', '2026-02-02']
})
 
print(f"Total scraped rows: {len(scraped)}")
print(f"Unique URLs: {scraped['url'].nunique()}")
 
# Keep the latest scrape for each URL (prices may have changed)
scraped['scraped_at'] = pd.to_datetime(scraped['scraped_at'])
scraped = scraped.sort_values('scraped_at', ascending=False)
products = scraped.drop_duplicates(subset=['url'], keep='first').reset_index(drop=True)
 
print(f"\nCleaned rows: {len(products)}")
print(products)

Output:

Total scraped rows: 6
Unique URLs: 4

Cleaned rows: 4
                         url         title  price scraped_at
0  https://shop.com/item/101   Blue Widget  19.99 2026-02-02
1  https://shop.com/item/102    Red Gadget  31.99 2026-02-02
2  https://shop.com/item/103   Green Gizmo  39.99 2026-02-02
3  https://shop.com/item/104  Yellow Thing  14.99 2026-02-02

Notice that the Red Gadget price updated from 29.99 to 31.99 between scrapes. By keeping the latest entry, we capture the most current price.

drop_duplicates() vs groupby().first(): When to Use Which

Both approaches can deduplicate data, but they work differently:

import pandas as pd
 
df = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 3],
    'action': ['login', 'purchase', 'login', 'login', 'purchase'],
    'timestamp': ['2026-01-01', '2026-01-02', '2026-01-01', '2026-01-03', '2026-01-01']
})
 
# Method 1: drop_duplicates
result1 = df.drop_duplicates(subset=['user_id'], keep='first')
print("drop_duplicates:")
print(result1)
 
# Method 2: groupby().first()
result2 = df.groupby('user_id').first().reset_index()
print("\ngroupby().first():")
print(result2)

Output:

drop_duplicates:
   user_id    action   timestamp
0        1     login  2026-01-01
2        2     login  2026-01-01
4        3  purchase  2026-01-01

groupby().first():
   user_id    action   timestamp
0        1     login  2026-01-01
1        2     login  2026-01-01
2        3  purchase  2026-01-01

The results look similar, but there are important differences:

| Feature | drop_duplicates() | groupby().first() |
| --- | --- | --- |
| Speed | Faster for simple deduplication | Slower due to grouping overhead |
| NaN handling | Keeps NaN values as-is | first() skips NaN by default |
| Index | Preserves original index | Resets to group keys |
| Aggregation | Cannot aggregate other columns | Can combine with agg() for multi-column summaries |
| Memory | Lower memory usage | Creates intermediate GroupBy object |
| Use case | Remove exact or partial duplicates | Deduplicate while also computing aggregates |

Rule of thumb: Use drop_duplicates() when you simply want to remove duplicate rows. Use groupby().first() (or groupby().agg()) when you also need to aggregate values from the duplicate rows -- for example, summing revenue across duplicate customer records while keeping the first name.
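
A brief sketch of that second pattern, with made-up column names:

import pandas as pd

orders = pd.DataFrame({
    'customer_id': [1001, 1001, 1002],
    'name': ['Acme Corp', 'Acme Corp', 'Beta Inc'],
    'revenue': [50000, 52000, 30000]
})

# Deduplicate by customer_id while summing revenue across each customer's duplicate rows
summary = orders.groupby('customer_id', as_index=False).agg(
    name=('name', 'first'),
    revenue=('revenue', 'sum')
)
print(summary)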

Performance Tips for Large DataFrames

When working with millions of rows, deduplication can become a bottleneck. Here are practical ways to speed it up:

1. Specify subset columns

Checking all columns is slower than checking just the key columns:

# Slower: checks every column
df.drop_duplicates()
 
# Faster: checks only the key column
df.drop_duplicates(subset=['user_id'])

2. Use appropriate dtypes

Convert string columns to category dtype before deduplication if they have low cardinality:

df['status'] = df['status'].astype('category')
df['country'] = df['country'].astype('category')
df_clean = df.drop_duplicates(subset=['status', 'country'])

3. Sort before deduplication only when necessary

Sorting a large DataFrame just to control which row keep='first' selects adds significant time. If you do not care which duplicate survives, skip the sort.

4. Consider chunked processing for very large files

For files that do not fit in memory, process in chunks:

# Stream the file in chunks and track IDs already seen across chunks
chunks = pd.read_csv('large_file.csv', chunksize=100_000)
seen = set()
clean_chunks = []

for chunk in chunks:
    # First remove duplicates within the chunk itself
    chunk = chunk.drop_duplicates(subset=['id'])
    # Then drop rows whose id already appeared in an earlier chunk
    new_rows = chunk[~chunk['id'].isin(seen)]
    seen.update(new_rows['id'].tolist())
    clean_chunks.append(new_rows)

df_clean = pd.concat(clean_chunks, ignore_index=True)

5. Benchmark: typical performance

| Rows | Columns checked | Time (approx.) |
| --- | --- | --- |
| 100K | All (10 cols) | ~15 ms |
| 1M | All (10 cols) | ~150 ms |
| 1M | 1 column | ~50 ms |
| 10M | 1 column | ~500 ms |

These numbers vary by hardware and data types, but the key takeaway is that drop_duplicates() scales linearly and handles most datasets in under a second.
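
To measure on your own hardware, a rough timing sketch with synthetic data is enough (the sizes and column names here are arbitrary, and your numbers will differ):

import time
import numpy as np
import pandas as pd

# Synthetic data: 1 million rows with roughly 100K unique IDs
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'user_id': rng.integers(0, 100_000, size=1_000_000),
    'value': rng.random(1_000_000)
})

start = time.perf_counter()
df_clean = df.drop_duplicates(subset=['user_id'])
print(f"drop_duplicates took {time.perf_counter() - start:.3f} s, kept {len(df_clean):,} rows")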

Explore Your Cleaned Data with PyGWalker

After removing duplicates, the next step is usually exploring the cleaned dataset -- checking distributions, spotting outliers, and verifying that deduplication worked as expected. Instead of writing multiple matplotlib or seaborn calls, you can use PyGWalker, an open-source Python library that turns any pandas DataFrame into an interactive, Tableau-like visualization interface directly in Jupyter Notebook.

import pandas as pd
import pygwalker as pyg
 
# Your cleaned customer data
customers_clean = pd.DataFrame({
    'customer_id': [1001, 1002, 1003, 1004],
    'name': ['Acme Corp', 'Beta Inc', 'Gamma LLC', 'Delta Co'],
    'revenue': [52000, 30000, 45000, 20000],
    'region': ['East', 'West', 'East', 'South']
})
 
# Launch interactive visualization
walker = pyg.walk(customers_clean)

With PyGWalker, you can drag region to the x-axis and revenue to the y-axis to instantly see revenue distribution by region. You can create bar charts, scatter plots, histograms, and heatmaps by dragging and dropping fields -- no chart code required. It is especially useful after deduplication when you want to verify that your cleaning logic produced sensible results.

Install PyGWalker with pip install pygwalker or try it in Google Colab.

FAQ

Does drop_duplicates() modify the original DataFrame?

No, by default drop_duplicates() returns a new DataFrame and leaves the original unchanged. To modify in place, pass inplace=True, but the recommended approach is to use assignment: df = df.drop_duplicates().

How do I drop duplicates based on one column only?

Pass the column name to the subset parameter: df.drop_duplicates(subset=['email']). This keeps the first row for each unique email and drops subsequent rows with the same email, regardless of differences in other columns.

What is the difference between duplicated() and drop_duplicates()?

duplicated() returns a boolean Series marking which rows are duplicates (useful for inspection and counting). drop_duplicates() returns a DataFrame with the duplicate rows removed. Use duplicated() first to understand your data, then drop_duplicates() to clean it.

Can I drop duplicates based on a condition?

Not directly with drop_duplicates(). Instead, sort your DataFrame by the condition column first, then call drop_duplicates() with keep='first'. For example, to keep the row with the highest revenue for each customer: df.sort_values('revenue', ascending=False).drop_duplicates(subset=['customer_id'], keep='first').

How do I handle case-insensitive duplicates?

Create a temporary lowercase column, deduplicate on it, then drop the helper column: df['key'] = df['name'].str.lower() followed by df.drop_duplicates(subset=['key']).drop(columns=['key']). This preserves the original casing while correctly identifying duplicates.

Conclusion

Removing duplicate rows is a foundational step in any data cleaning workflow. The pandas drop_duplicates() method handles the vast majority of deduplication tasks with just a few parameters:

  • Use subset to deduplicate on specific columns rather than all columns.
  • Use keep='first' or keep='last' to control which occurrence survives; use keep=False to remove all copies.
  • Use duplicated() to inspect and count duplicates before removing them.
  • Sort before deduplicating when you need to keep a specific row (e.g., the most recent record).
  • For case-insensitive deduplication, normalize to lowercase in a temporary column first.
  • For large datasets, limit the subset columns and use appropriate dtypes to keep performance fast.

Once your data is clean, tools like PyGWalker let you visually explore the result without writing chart code, helping you verify that deduplication worked correctly and move straight into analysis.
