Pandas Drop Duplicates: How to Remove Duplicate Rows in Python
Duplicate rows are one of the most common data quality problems in real-world datasets. They sneak in through repeated API calls, overlapping CSV exports, broken ETL pipelines, or simple copy-paste errors during manual data entry. Left unchecked, duplicates inflate row counts, distort averages and sums, and introduce bias into machine learning models. A dataset that looks like it has 10,000 customer records might really have 8,200 unique customers and 1,800 phantom entries silently corrupting every calculation built on top of it.
The pandas drop_duplicates() method is the standard way to detect and remove these redundant rows. This guide walks through every parameter, shows common patterns for real-world deduplication, and covers performance considerations for large datasets. Every code example is copy-ready and includes its expected output.
Why Duplicates Matter
Before jumping into code, it is worth understanding exactly what duplicate rows break:
| Problem | What Happens | Example |
|---|---|---|
| Inflated counts | len(df) and value_counts() overcount | A customer appears 3 times, so "total customers" is 3x too high |
| Wrong averages | mean() weights duplicated rows more heavily | A high-value order counted twice skews average order value upward |
| Broken joins | Merging on a key with duplicates causes row explosion | A many-to-many merge produces a Cartesian product instead of 1:1 mapping |
| Bad ML models | Training data with repeated samples biases the model | The model memorizes duplicated examples and overfits |
| Wasted storage | Redundant rows consume disk and memory | A 2 GB dataset could be 1.4 GB after deduplication |
The fix is straightforward: find the duplicates, decide which copy to keep (if any), and remove the rest.
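To make the "wrong averages" row concrete, here is a tiny sketch with made-up numbers:

import pandas as pd

orders = pd.DataFrame({'order_id': [1, 2, 2], 'amount': [100.0, 900.0, 900.0]})
print(orders['amount'].mean())                    # ~633.33 -- the duplicated 900 drags the mean up
print(orders.drop_duplicates()['amount'].mean())  # 500.0   -- the true average of the two orders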
Basic Syntax: df.drop_duplicates()
The simplest call removes rows where every column value is identical:
import pandas as pd
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
'age': [30, 25, 30, 35, 25],
'city': ['NYC', 'LA', 'NYC', 'Chicago', 'LA']
})
print("Before:")
print(df)
df_clean = df.drop_duplicates()
print("\nAfter:")
print(df_clean)

Output:
Before:
name age city
0 Alice 30 NYC
1 Bob 25 LA
2 Alice 30 NYC
3 Charlie 35 Chicago
4 Bob 25 LA
After:
name age city
0 Alice 30 NYC
1 Bob 25 LA
3 Charlie 35 ChicagoRow 2 and row 4 were exact copies of row 0 and row 1, so they were dropped. Notice that the original index values (0, 1, 3) are preserved. If you want a clean sequential index, chain .reset_index(drop=True).
Full Method Signature
DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)

| Parameter | Type | Default | Description |
|---|---|---|---|
| subset | column label or list | None (all columns) | Only consider these columns when identifying duplicates |
| keep | 'first', 'last', False | 'first' | Which duplicate to keep; False drops all copies |
| inplace | bool | False | If True, modifies the DataFrame in place and returns None |
| ignore_index | bool | False | If True, resets the index from 0 to n-1 in the result |
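As a quick sketch of ignore_index in action (a throwaway one-column DataFrame, just for illustration):

import pandas as pd

df = pd.DataFrame({'x': [1, 1, 2]})
print(df.drop_duplicates(ignore_index=True))

Output:
x
0 1
1 2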
Finding Duplicates First with df.duplicated()
Before removing duplicates, you often want to inspect them. The duplicated() method returns a boolean Series marking duplicate rows:
df = pd.DataFrame({
'order_id': [101, 102, 103, 101, 104, 102],
'product': ['Widget', 'Gadget', 'Widget', 'Widget', 'Gizmo', 'Gadget'],
'amount': [29.99, 49.99, 29.99, 29.99, 19.99, 49.99]
})
# Show which rows are duplicates
print(df.duplicated())

Output:
0 False
1 False
2 False
3 True
4 False
5 True
dtype: bool

To see the actual duplicate rows:
duplicates = df[df.duplicated(keep=False)]
print(duplicates)

Output:
order_id product amount
0 101 Widget 29.99
1 102 Gadget 49.99
3 101 Widget 29.99
5 102 Gadget 49.99

Using keep=False marks all copies as duplicates (not just the second occurrence), so you can see every row involved in duplication.
Counting Duplicates
To get a quick count of how many duplicate rows exist:
num_duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {num_duplicates}")
total_rows = len(df)
unique_rows = df.drop_duplicates().shape[0]
print(f"Total: {total_rows}, Unique: {unique_rows}, Duplicates: {total_rows - unique_rows}")Output:
Number of duplicate rows: 2
Total: 6, Unique: 4, Duplicates: 2

This is a useful sanity check before and after cleaning.
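If you run this check often, you can wrap it in a small helper; a sketch, with report_duplicates as a hypothetical name:

def report_duplicates(frame, subset=None):
    # Count rows flagged as duplicates (all columns, or just the given subset)
    dupes = frame.duplicated(subset=subset).sum()
    print(f"{dupes} duplicate rows out of {len(frame)}")
    return dupes

report_duplicates(df)

Output:
2 duplicate rows out of 6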
The subset Parameter: Check Specific Columns Only
Often you want to deduplicate based on a key column rather than requiring every column to match. The subset parameter lets you specify which columns determine uniqueness:
df = pd.DataFrame({
'email': ['alice@mail.com', 'bob@mail.com', 'alice@mail.com', 'charlie@mail.com'],
'name': ['Alice', 'Bob', 'Alice Smith', 'Charlie'],
'signup_date': ['2025-01-01', '2025-01-02', '2025-01-15', '2025-01-03']
})
print("Original:")
print(df)
# Deduplicate by email only
df_deduped = df.drop_duplicates(subset=['email'])
print("\nDeduplicated by email:")
print(df_deduped)

Output:
Original:
email name signup_date
0 alice@mail.com Alice 2025-01-01
1 bob@mail.com Bob 2025-01-02
2 alice@mail.com Alice Smith 2025-01-15
3 charlie@mail.com Charlie 2025-01-03
Deduplicated by email:
email name signup_date
0 alice@mail.com Alice 2025-01-01
1 bob@mail.com Bob 2025-01-02
3 charlie@mail.com Charlie 2025-01-03

Row 2 was removed because alice@mail.com already appeared in row 0, even though the name and signup_date values differed. You can also pass multiple columns: subset=['email', 'name'] would only consider rows duplicated when both columns match.
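A quick sketch of that multi-column case, reusing the same df:

# Rows count as duplicates only when both email AND name match
print(df.drop_duplicates(subset=['email', 'name']))

Output:
email name signup_date
0 alice@mail.com Alice 2025-01-01
1 bob@mail.com Bob 2025-01-02
2 alice@mail.com Alice Smith 2025-01-15
3 charlie@mail.com Charlie 2025-01-03

Nothing is dropped this time because row 2 has a different name, so no email/name pair repeats.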
The keep Parameter: first, last, or False
The keep parameter controls which occurrence survives:
df = pd.DataFrame({
'sensor_id': ['S1', 'S2', 'S1', 'S2', 'S1'],
'reading': [22.5, 18.3, 23.1, 18.3, 24.0],
'timestamp': ['08:00', '08:00', '09:00', '09:00', '10:00']
})
print("keep='first' (default):")
print(df.drop_duplicates(subset=['sensor_id'], keep='first'))
print("\nkeep='last':")
print(df.drop_duplicates(subset=['sensor_id'], keep='last'))
print("\nkeep=False (drop all duplicates):")
print(df.drop_duplicates(subset=['sensor_id'], keep=False))

Output:
keep='first' (default):
sensor_id reading timestamp
0 S1 22.5 08:00
1 S2 18.3 08:00
keep='last':
sensor_id reading timestamp
3 S2 18.3 09:00
4 S1 24.0 10:00
keep=False (drop all duplicates):
Empty DataFrame
Columns: [sensor_id, reading, timestamp]
Index: []

| keep value | Behavior | Use case |
|---|---|---|
| 'first' | Keep the first occurrence, drop later ones | Keep the earliest record |
| 'last' | Keep the last occurrence, drop earlier ones | Keep the most recent record |
| False | Drop all rows that have any duplicate | Find rows that are truly unique (no copies at all) |
In the example above, keep=False returns an empty DataFrame because both sensor IDs appear more than once.
In-place vs Returning a New DataFrame
By default, drop_duplicates() returns a new DataFrame and leaves the original unchanged. Setting inplace=True modifies the original directly:
df = pd.DataFrame({
'id': [1, 2, 2, 3],
'value': ['a', 'b', 'b', 'c']
})
# Returns new DataFrame (original unchanged)
df_new = df.drop_duplicates()
print(f"Original length: {len(df)}, New length: {len(df_new)}")
# Modifies in place (returns None)
df.drop_duplicates(inplace=True)
print(f"After inplace: {len(df)}")Output:
Original length: 4, New length: 3
After inplace: 3

Modern pandas style recommends avoiding inplace=True and using assignment instead (df = df.drop_duplicates()). This makes code easier to read and debug, especially in chained operations.
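For example, a minimal sketch of the assignment style inside a method chain:

import pandas as pd

# Preferred: reassign instead of mutating in place
df = (
    pd.DataFrame({'id': [1, 2, 2, 3], 'value': ['a', 'b', 'b', 'c']})
    .drop_duplicates(subset=['id'])
    .reset_index(drop=True)
)
print(len(df))  # 3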
Case-Insensitive Duplicate Detection
By default, pandas treats "Alice" and "alice" as different values. For case-insensitive deduplication, normalize the column first:
df = pd.DataFrame({
'name': ['Alice', 'alice', 'ALICE', 'Bob', 'bob'],
'score': [90, 85, 92, 78, 80]
})
# Create a normalized column for comparison
df['name_lower'] = df['name'].str.lower()
# Deduplicate on the normalized column, keep the first original-case entry
df_deduped = df.drop_duplicates(subset=['name_lower']).drop(columns=['name_lower'])
print(df_deduped)

Output:
name score
0 Alice 90
3 Bob 78

This pattern preserves the original casing while correctly identifying case-insensitive duplicates.
Real-World Example: Cleaning a Customer Database
Here is a realistic scenario. You receive a customer export from a CRM where the same customer was entered multiple times by different sales reps:
import pandas as pd
customers = pd.DataFrame({
'customer_id': [1001, 1002, 1001, 1003, 1002, 1004],
'name': ['Acme Corp', 'Beta Inc', 'Acme Corp', 'Gamma LLC', 'Beta Inc.', 'Delta Co'],
'email': ['acme@mail.com', 'beta@mail.com', 'acme@mail.com', 'gamma@mail.com', 'beta@mail.com', 'delta@mail.com'],
'revenue': [50000, 30000, 52000, 45000, 30000, 20000],
'last_contact': ['2025-12-01', '2025-11-15', '2026-01-10', '2025-10-20', '2025-11-15', '2026-01-05']
})
print(f"Rows before cleaning: {len(customers)}")
print(f"Duplicate customer_ids: {customers.duplicated(subset=['customer_id']).sum()}")
# Sort by last_contact descending so the most recent entry is first
customers['last_contact'] = pd.to_datetime(customers['last_contact'])
customers = customers.sort_values('last_contact', ascending=False)
# Keep the most recent record for each customer_id
customers_clean = customers.drop_duplicates(subset=['customer_id'], keep='first')
customers_clean = customers_clean.sort_values('customer_id').reset_index(drop=True)
print(f"\nRows after cleaning: {len(customers_clean)}")
print(customers_clean)

Output:
Rows before cleaning: 6
Duplicate customer_ids: 2
Rows after cleaning: 4
customer_id name email revenue last_contact
0 1001 Acme Corp acme@mail.com 52000 2026-01-10
1 1002 Beta Inc beta@mail.com 30000 2025-11-15
2 1003 Gamma LLC gamma@mail.com 45000 2025-10-20
3 1004 Delta Co delta@mail.com 20000 2026-01-05

The key pattern here is sort first, then deduplicate with keep='first'. By sorting on last_contact descending before deduplicating, the most recent record for each customer survives.
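One caveat worth hedging on: if two rows for the same customer share the same last_contact, the default sort does not guarantee which one lands first. A sketch using a stable sort (kind='mergesort') keeps the original row order among ties, which makes the result reproducible:

# Stable sort: rows with equal last_contact keep their original relative order,
# so keep='first' picks the earlier-entered record on ties
customers = customers.sort_values('last_contact', ascending=False, kind='mergesort')
customers_clean = customers.drop_duplicates(subset=['customer_id'], keep='first')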
Real-World Example: Deduplicating Web Scraping Results
Web scrapers commonly produce duplicates when pages are crawled multiple times or pagination overlaps:
import pandas as pd
scraped = pd.DataFrame({
'url': [
'https://shop.com/item/101',
'https://shop.com/item/102',
'https://shop.com/item/101',
'https://shop.com/item/103',
'https://shop.com/item/102',
'https://shop.com/item/104',
],
'title': ['Blue Widget', 'Red Gadget', 'Blue Widget', 'Green Gizmo', 'Red Gadget', 'Yellow Thing'],
'price': [19.99, 29.99, 19.99, 39.99, 31.99, 14.99],
'scraped_at': ['2026-02-01', '2026-02-01', '2026-02-02', '2026-02-02', '2026-02-02', '2026-02-02']
})
print(f"Total scraped rows: {len(scraped)}")
print(f"Unique URLs: {scraped['url'].nunique()}")
# Keep the latest scrape for each URL (prices may have changed)
scraped['scraped_at'] = pd.to_datetime(scraped['scraped_at'])
scraped = scraped.sort_values('scraped_at', ascending=False)
products = scraped.drop_duplicates(subset=['url'], keep='first').sort_values('url').reset_index(drop=True)
print(f"\nCleaned rows: {len(products)}")
print(products)

Output:
Total scraped rows: 6
Unique URLs: 4
Cleaned rows: 4
url title price scraped_at
0 https://shop.com/item/101 Blue Widget 19.99 2026-02-02
1 https://shop.com/item/102 Red Gadget 31.99 2026-02-02
2 https://shop.com/item/103 Green Gizmo 39.99 2026-02-02
3 https://shop.com/item/104 Yellow Thing 14.99 2026-02-02

Notice that the Red Gadget price updated from 29.99 to 31.99 between scrapes. By keeping the latest entry, we capture the most current price.
drop_duplicates() vs groupby().first(): When to Use Which
Both approaches can deduplicate data, but they work differently:
import pandas as pd
df = pd.DataFrame({
'user_id': [1, 1, 2, 2, 3],
'action': ['login', 'purchase', 'login', 'login', 'purchase'],
'timestamp': ['2026-01-01', '2026-01-02', '2026-01-01', '2026-01-03', '2026-01-01']
})
# Method 1: drop_duplicates
result1 = df.drop_duplicates(subset=['user_id'], keep='first')
print("drop_duplicates:")
print(result1)
# Method 2: groupby().first()
result2 = df.groupby('user_id').first().reset_index()
print("\ngroupby().first():")
print(result2)

Output:
drop_duplicates:
user_id action timestamp
0 1 login 2026-01-01
2 2 login 2026-01-01
4 3 purchase 2026-01-01
groupby().first():
user_id action timestamp
0 1 login 2026-01-01
1 2 login 2026-01-01
2 3 purchase 2026-01-01

The results look similar, but there are important differences:
| Feature | drop_duplicates() | groupby().first() |
|---|---|---|
| Speed | Faster for simple deduplication | Slower due to grouping overhead |
| NaN handling | Keeps NaN values as-is | first() skips NaN by default |
| Index | Preserves original index | Resets to group keys |
| Aggregation | Cannot aggregate other columns | Can combine with agg() for multi-column summaries |
| Memory | Lower memory usage | Creates intermediate GroupBy object |
| Use case | Remove exact or partial duplicates | Deduplicate while also computing aggregates |
Rule of thumb: Use drop_duplicates() when you simply want to remove duplicate rows. Use groupby().first() (or groupby().agg()) when you also need to aggregate values from the duplicate rows -- for example, summing revenue across duplicate customer records while keeping the first name.
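For instance, a sketch of the aggregate case, summing revenue across duplicate customer records while keeping the first name (column names follow the earlier customer example):

import pandas as pd

orders = pd.DataFrame({
    'customer_id': [1001, 1001, 1002],
    'name': ['Acme Corp', 'Acme Corp', 'Beta Inc'],
    'revenue': [50000, 52000, 30000]
})
# Named aggregation: keep the first name, sum the revenue per customer
summary = orders.groupby('customer_id').agg(
    name=('name', 'first'),
    total_revenue=('revenue', 'sum')
).reset_index()
print(summary)

Output:
customer_id name total_revenue
0 1001 Acme Corp 102000
1 1002 Beta Inc 30000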
Performance Tips for Large DataFrames
When working with millions of rows, deduplication can become a bottleneck. Here are practical ways to speed it up:
1. Specify subset columns
Checking all columns is slower than checking just the key columns:
# Slower: checks every column
df.drop_duplicates()
# Faster: checks only the key column
df.drop_duplicates(subset=['user_id'])

2. Use appropriate dtypes
Convert string columns to category dtype before deduplication if they have low cardinality:
df['status'] = df['status'].astype('category')
df['country'] = df['country'].astype('category')
df_clean = df.drop_duplicates(subset=['status', 'country'])

3. Sort before deduplication only when necessary
Sorting a large DataFrame just to control which row keep='first' selects adds significant time. If you do not care which duplicate survives, skip the sort.
4. Consider chunked processing for very large files
For files that do not fit in memory, process in chunks:
chunks = pd.read_csv('large_file.csv', chunksize=100_000)
seen = set()
clean_chunks = []
for chunk in chunks:
chunk = chunk.drop_duplicates(subset=['id'])
new_rows = chunk[~chunk['id'].isin(seen)]
seen.update(new_rows['id'].tolist())
clean_chunks.append(new_rows)
df_clean = pd.concat(clean_chunks, ignore_index=True)

5. Benchmark: typical performance
| Rows | Columns checked | Time (approx) |
|---|---|---|
| 100K | All (10 cols) | ~15 ms |
| 1M | All (10 cols) | ~150 ms |
| 1M | 1 column | ~50 ms |
| 10M | 1 column | ~500 ms |
These numbers vary by hardware and data types, but the key takeaway is that drop_duplicates() scales linearly and handles most datasets in under a second.
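If you want to verify on your own hardware, here is a rough timing sketch with synthetic data (the 1M-row size and 500K distinct keys are arbitrary assumptions):

import time
import numpy as np
import pandas as pd

# 1M rows with roughly 500K distinct keys
df = pd.DataFrame({
    'user_id': np.random.randint(0, 500_000, size=1_000_000),
    'value': np.random.rand(1_000_000)
})

start = time.perf_counter()
deduped = df.drop_duplicates(subset=['user_id'])
elapsed = time.perf_counter() - start
print(f"{len(df):,} rows -> {len(deduped):,} rows in {elapsed:.3f} s")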
Explore Your Cleaned Data with PyGWalker
After removing duplicates, the next step is usually exploring the cleaned dataset -- checking distributions, spotting outliers, and verifying that deduplication worked as expected. Instead of writing multiple matplotlib or seaborn calls, you can use PyGWalker, an open-source Python library that turns any pandas DataFrame into an interactive, Tableau-like visualization interface directly in Jupyter Notebook.
import pandas as pd
import pygwalker as pyg
# Your cleaned customer data
customers_clean = pd.DataFrame({
'customer_id': [1001, 1002, 1003, 1004],
'name': ['Acme Corp', 'Beta Inc', 'Gamma LLC', 'Delta Co'],
'revenue': [52000, 30000, 45000, 20000],
'region': ['East', 'West', 'East', 'South']
})
# Launch interactive visualization
walker = pyg.walk(customers_clean)

With PyGWalker, you can drag region to the x-axis and revenue to the y-axis to instantly see revenue distribution by region. You can create bar charts, scatter plots, histograms, and heatmaps by dragging and dropping fields -- no chart code required. It is especially useful after deduplication when you want to verify that your cleaning logic produced sensible results.
Install PyGWalker with:

pip install pygwalker

Or try it in Google Colab.
FAQ
Does drop_duplicates() modify the original DataFrame?
No, by default drop_duplicates() returns a new DataFrame and leaves the original unchanged. To modify in place, pass inplace=True, but the recommended approach is to use assignment: df = df.drop_duplicates().
How do I drop duplicates based on one column only?
Pass the column name to the subset parameter: df.drop_duplicates(subset=['email']). This keeps the first row for each unique email and drops subsequent rows with the same email, regardless of differences in other columns.
What is the difference between duplicated() and drop_duplicates()?
duplicated() returns a boolean Series marking which rows are duplicates (useful for inspection and counting). drop_duplicates() returns a DataFrame with the duplicate rows removed. Use duplicated() first to understand your data, then drop_duplicates() to clean it.
Can I drop duplicates based on a condition?
Not directly with drop_duplicates(). Instead, sort your DataFrame by the condition column first, then call drop_duplicates() with keep='first'. For example, to keep the row with the highest revenue for each customer: df.sort_values('revenue', ascending=False).drop_duplicates(subset=['customer_id'], keep='first').
How do I handle case-insensitive duplicates?
Create a temporary lowercase column, deduplicate on it, then drop the helper column: df['key'] = df['name'].str.lower() followed by df.drop_duplicates(subset=['key']).drop(columns=['key']). This preserves the original casing while correctly identifying duplicates.
Conclusion
Removing duplicate rows is a foundational step in any data cleaning workflow. The pandas drop_duplicates() method handles the vast majority of deduplication tasks with just a few parameters:
- Use subset to deduplicate on specific columns rather than all columns.
- Use keep='first' or keep='last' to control which occurrence survives; use keep=False to remove all copies.
- Use duplicated() to inspect and count duplicates before removing them.
- Sort before deduplicating when you need to keep a specific row (e.g., the most recent record).
- For case-insensitive deduplication, normalize to lowercase in a temporary column first.
- For large datasets, limit the subset columns and use appropriate dtypes to keep performance fast.
Once your data is clean, tools like PyGWalker let you visually explore the result without writing chart code, helping you verify that deduplication worked correctly and move straight into analysis.