Pandas iterrows(): How to Iterate Over DataFrame Rows (And When Not To)
Every data scientist hits the same wall. You have a pandas DataFrame, you need to process each row with some custom logic, and the first thing that comes to mind is a loop. A quick search leads you to iterrows() -- the built-in method that lets you iterate over DataFrame rows as (index, Series) pairs. It works. It reads well. And on a 100-row test dataset, it finishes instantly.
Then you run it on your actual dataset with 500,000 rows. Minutes pass. Your notebook cell is still spinning. What happened?
The problem is not that row iteration is inherently wrong. The problem is that iterrows() carries hidden overhead that makes it 100-1000x slower than the alternatives pandas was designed around. Understanding exactly what iterrows does under the hood, when it is appropriate, and what to use instead separates fast, production-ready code from notebooks that time out on real data.
This guide covers everything you need to know about iterrows(): how it works, why it is slow, and the concrete alternatives that solve the same problems in a fraction of the time.
What iterrows() Returns
The iterrows() method is a generator that yields pairs of (index, Series) for each row in a DataFrame. Each row is converted into a pandas Series object with the column names as the index.
import pandas as pd
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [28, 34, 22, 45],
    'salary': [72000, 85000, 55000, 120000],
    'department': ['Engineering', 'Marketing', 'Engineering', 'Executive']
})
for index, row in df.iterrows():
    print(f"Index: {index}, Name: {row['name']}, Age: {row['age']}")

Output:
Index: 0, Name: Alice, Age: 28
Index: 1, Name: Bob, Age: 34
Index: 2, Name: Charlie, Age: 22
Index: 3, Name: Diana, Age: 45

Each row is a pandas Series:
for index, row in df.iterrows():
    print(type(row))
    print(row)
    break  # Just show the first row

Output:
<class 'pandas.core.series.Series'>
name Alice
age 28
salary 72000
department Engineering
Name: 0, dtype: object

Notice the dtype: object. This is the first clue to why iterrows is slow -- but more on that shortly.
Basic iterrows() Usage Patterns
Accessing Column Values
You can access values in each row using dictionary-style bracket notation or dot notation:
import pandas as pd
df = pd.DataFrame({
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
    'price': [999.99, 29.99, 79.99, 349.99],
    'stock': [15, 200, 85, 42]
})
for index, row in df.iterrows():
    # Bracket notation (recommended)
    revenue_potential = row['price'] * row['stock']
    print(f"{row['product']}: ${revenue_potential:,.2f}")

Output:
Laptop: $14,999.85
Mouse: $5,998.00
Keyboard: $6,799.15
Monitor: $14,699.58

Building a List from Row Data
A common use case is constructing a new list or dictionary from row-level computations:
import pandas as pd
df = pd.DataFrame({
    'first_name': ['John', 'Jane', 'Bob'],
    'last_name': ['Smith', 'Doe', 'Johnson'],
    'email_domain': ['gmail.com', 'company.org', 'outlook.com']
})
emails = []
for index, row in df.iterrows():
    email = f"{row['first_name'].lower()}.{row['last_name'].lower()}@{row['email_domain']}"
    emails.append(email)
df['email'] = emails
print(df)

Output:
first_name last_name email_domain email
0 John Smith gmail.com john.smith@gmail.com
1 Jane Doe company.org jane.doe@company.org
2 Bob Johnson outlook.com bob.johnson@outlook.com

Conditional Logic Per Row
import pandas as pd
df = pd.DataFrame({
    'student': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'math_score': [92, 67, 85, 45],
    'english_score': [78, 88, 90, 72]
})
results = []
for index, row in df.iterrows():
    avg = (row['math_score'] + row['english_score']) / 2
    if avg >= 85:
        results.append('Honors')
    elif avg >= 70:
        results.append('Pass')
    else:
        results.append('Needs Improvement')
df['status'] = results
print(df)

Output:
student math_score english_score status
0 Alice 92 78 Honors
1 Bob 67 88 Pass
2 Charlie 85 90 Honors
3 Diana 45 72 Needs Improvement

Why iterrows() Is Slow
Understanding the performance problem requires knowing what happens internally on each iteration:
1. Series Object Creation Overhead
Every single iteration creates a brand-new pandas Series object. For a DataFrame with 1 million rows, that means 1 million Series objects are allocated and garbage collected. Series creation involves memory allocation, index construction, and metadata setup -- none of which are free.
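To make the allocation cost concrete, here is a small sketch (not part of the benchmark below) showing that each iteration yields a distinct, independent Series:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'name': ['x', 'y', 'z']})

# Collect the yielded rows: each one is a freshly allocated Series
rows = [row for _, row in df.iterrows()]
assert len({id(r) for r in rows}) == 3  # three distinct objects

# The copies are independent of the DataFrame
rows[0]['a'] = 999
assert df.loc[0, 'a'] == 1  # df is unchanged
```

For three rows this is invisible; for a million rows, a million short-lived Series keep the allocator and garbage collector busy.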
2. Type Casting to Object dtype
When iterrows converts a row into a Series, it must find a single dtype that accommodates all column types. If your DataFrame has integers, floats, and strings (which most do), the only common dtype is object. This forces numeric values to be boxed as Python objects, losing the performance benefits of NumPy's contiguous memory layout.
import pandas as pd
df = pd.DataFrame({
    'int_col': [1, 2, 3],
    'float_col': [1.5, 2.5, 3.5],
    'str_col': ['a', 'b', 'c']
})
print(f"DataFrame dtypes:\n{df.dtypes}\n")
for index, row in df.iterrows():
    print(f"Row dtype: {row.dtype}")
    print(f"int_col type: {type(row['int_col'])}")
    break

Output:
DataFrame dtypes:
int_col int64
float_col float64
str_col object
dtype: object
Row dtype: object
int_col type: <class 'int'>

The integer column that was stored as an efficient int64 NumPy array is now a boxed Python int object. This conversion happens for every row, every iteration.
3. Python-Level Loop Overhead
pandas is built on NumPy, which operates on entire arrays in compiled C code. When you use iterrows, you abandon this advantage and process data one element at a time in the Python interpreter. The Python interpreter adds overhead for each operation: function calls, dynamic type checking, attribute lookups -- all multiplied by the number of rows.
Performance Benchmark
Here is a concrete benchmark comparing iteration approaches:
import pandas as pd
import numpy as np
import timeit
# Create a benchmark DataFrame
n_rows = 100_000
df = pd.DataFrame({
    'a': np.random.randn(n_rows),
    'b': np.random.randn(n_rows),
    'c': np.random.randint(1, 100, n_rows)
})

# Operation: compute a * b + c for each row

# Method 1: iterrows
def method_iterrows():
    results = []
    for idx, row in df.iterrows():
        results.append(row['a'] * row['b'] + row['c'])
    return results

# Method 2: itertuples
def method_itertuples():
    results = []
    for row in df.itertuples():
        results.append(row.a * row.b + row.c)
    return results

# Method 3: apply
def method_apply():
    return df.apply(lambda row: row['a'] * row['b'] + row['c'], axis=1)

# Method 4: vectorized
def method_vectorized():
    return df['a'] * df['b'] + df['c']

# Benchmark each method (3 runs)
for name, func in [
    ('iterrows', method_iterrows),
    ('itertuples', method_itertuples),
    ('apply', method_apply),
    ('vectorized', method_vectorized),
]:
    elapsed = timeit.timeit(func, number=3) / 3
    print(f"{name:15s}: {elapsed:.4f} seconds")

Typical output on a modern machine (100,000 rows):
iterrows : 4.5200 seconds
itertuples : 0.1580 seconds
apply : 1.8900 seconds
vectorized : 0.0008 seconds

Performance Comparison Table
| Method | Speed (100K rows) | Memory Overhead | Type Safety | Readability |
|---|---|---|---|---|
| iterrows() | ~4.5s (1x) | High (Series per row) | Poor (casts to object) | High |
| itertuples() | ~0.16s (28x faster) | Low (namedtuples) | Good (preserves dtypes) | Medium |
| apply(axis=1) | ~1.9s (2.4x faster) | Medium | Poor (casts to object) | High |
| Vectorized ops | ~0.001s (5000x faster) | Minimal | Excellent | Medium |
| np.where() | ~0.001s (5000x faster) | Minimal | Excellent | Medium |
| np.vectorize() | ~0.08s (56x faster) | Low | Good | Medium |
The key takeaway: vectorized operations are not marginally faster -- they are orders of magnitude faster. For 1 million rows, the difference is between 0.01 seconds and 45 seconds.
iterrows() vs itertuples()
If you must iterate row by row, itertuples() is almost always the better choice. Here is why:
import pandas as pd
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [28, 34, 22],
    'salary': [72000, 85000, 55000]
})
# iterrows: returns (index, Series)
print("=== iterrows ===")
for index, row in df.iterrows():
    print(f"Type: {type(row)}, Age type: {type(row['age'])}")
    break
# itertuples: returns namedtuples
print("\n=== itertuples ===")
for row in df.itertuples():
    print(f"Type: {type(row)}, Age type: {type(row.age)}")
    break

Output:
=== iterrows ===
Type: <class 'pandas.core.series.Series'>, Age type: <class 'int'>
=== itertuples ===
Type: <class 'pandas.core.frame.Pandas'>, Age type: <class 'numpy.int64'>

Key differences:
| Feature | iterrows() | itertuples() |
|---|---|---|
| Returns | (index, Series) | Named tuples |
| Speed | Slow (Series creation overhead) | 20-30x faster |
| dtype preservation | Casts to object dtype | Preserves original dtypes |
| Access pattern | row['column_name'] | row.column_name |
| Index access | First element of tuple | row.Index |
| Column names with spaces | Works fine | Renamed to positional |
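One caveat from the table above: itertuples renames columns that are not valid Python identifiers to positional names. A quick sketch (the 'unit price' column is a made-up example):

```python
import pandas as pd

df = pd.DataFrame({'unit price': [1.5, 2.0], 'qty': [2, 3]})

rows = list(df.itertuples())
# 'unit price' is not a valid identifier, so it becomes _1
# (field 0 is the Index, field 1 is 'unit price')
print(rows[0]._1, rows[0].qty)
```

If you rely on such column names inside the loop, either rename the columns first or fall back to iterrows.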
When to choose itertuples over iterrows
import pandas as pd
df = pd.DataFrame({
    'product': ['Widget', 'Gadget', 'Tool'],
    'price': [9.99, 24.99, 14.99],
    'quantity': [100, 50, 200]
})
# itertuples is faster and preserves types
for row in df.itertuples():
    revenue = row.price * row.quantity
    print(f"{row.product}: ${revenue:.2f}")
# Use index=False to drop the Index field
for row in df.itertuples(index=False):
    print(row)

Output:
Widget: $999.00
Gadget: $1249.50
Tool: $2998.00
Pandas(product='Widget', price=9.99, quantity=100)
Pandas(product='Gadget', price=24.99, quantity=50)
Pandas(product='Tool', price=14.99, quantity=200)

iterrows() vs apply()
Both iterrows() and apply(axis=1) process data row by row, but they differ in API design and speed:
import pandas as pd
df = pd.DataFrame({
    'base_price': [100, 200, 150],
    'tax_rate': [0.08, 0.10, 0.06],
    'discount': [0.05, 0.15, 0.10]
})
# Using iterrows
results_iterrows = []
for idx, row in df.iterrows():
    final = row['base_price'] * (1 + row['tax_rate']) * (1 - row['discount'])
    results_iterrows.append(final)
df['final_iterrows'] = results_iterrows
# Using apply
df['final_apply'] = df.apply(
    lambda row: row['base_price'] * (1 + row['tax_rate']) * (1 - row['discount']),
    axis=1
)
print(df[['final_iterrows', 'final_apply']])

Output:
final_iterrows final_apply
0 102.600 102.600
1 187.000 187.000
2 143.100 143.100

apply() is typically 2-3x faster than iterrows() because its internal row iteration is implemented more efficiently, but it shares the same fundamental problem: each row is still materialized as a Series and processed one at a time in Python. For this specific operation, the vectorized version is 1000x faster:
# Vectorized -- the right way
df['final_vectorized'] = df['base_price'] * (1 + df['tax_rate']) * (1 - df['discount'])

Vectorized Alternatives: The Right Way
For most operations people use iterrows() for, a vectorized alternative exists that runs dramatically faster.
Arithmetic Operations
import pandas as pd
df = pd.DataFrame({
    'price': [10.0, 20.0, 30.0, 40.0],
    'quantity': [5, 3, 8, 2],
    'tax_rate': [0.08, 0.10, 0.08, 0.12]
})
# SLOW: iterrows
totals = []
for idx, row in df.iterrows():
    totals.append(row['price'] * row['quantity'] * (1 + row['tax_rate']))
# FAST: vectorized
df['total'] = df['price'] * df['quantity'] * (1 + df['tax_rate'])
print(df)

Conditional Logic with np.where and np.select
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'score': [92, 67, 85, 45, 73, 98]
})
# SLOW: iterrows for conditional
grades = []
for idx, row in df.iterrows():
    if row['score'] >= 90:
        grades.append('A')
    elif row['score'] >= 80:
        grades.append('B')
    elif row['score'] >= 70:
        grades.append('C')
    else:
        grades.append('F')
# FAST: np.select for multiple conditions
conditions = [
    df['score'] >= 90,
    df['score'] >= 80,
    df['score'] >= 70
]
choices = ['A', 'B', 'C']
df['grade'] = np.select(conditions, choices, default='F')
print(df)

Output:
score grade
0 92 A
1 67 F
2 85 B
3 45 F
4 73 C
5 98 A

String Operations
import pandas as pd
df = pd.DataFrame({
    'first': ['john', 'jane', 'bob'],
    'last': ['SMITH', 'DOE', 'JOHNSON']
})
# SLOW: iterrows
full_names = []
for idx, row in df.iterrows():
    full_names.append(f"{row['first'].title()} {row['last'].title()}")
# FAST: vectorized string operations
df['full_name'] = df['first'].str.title() + ' ' + df['last'].str.title()
print(df)

Output:
first last full_name
0 john SMITH John Smith
1 jane DOE Jane Doe
2 bob JOHNSON Bob Johnson

Window and Rolling Calculations
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'date': pd.date_range('2026-01-01', periods=10),
    'value': [10, 12, 15, 14, 18, 20, 19, 22, 25, 23]
})
# SLOW: iterrows for rolling average
rolling_avg = []
for idx, row in df.iterrows():
    if idx < 2:
        rolling_avg.append(np.nan)
    else:
        avg = df.loc[idx-2:idx, 'value'].mean()
        rolling_avg.append(avg)
# FAST: built-in rolling
df['rolling_avg'] = df['value'].rolling(window=3).mean()
print(df)

Lookup / Mapping Operations
import pandas as pd
df = pd.DataFrame({
    'department_code': ['ENG', 'MKT', 'ENG', 'EXE', 'MKT']
})
dept_names = {
    'ENG': 'Engineering',
    'MKT': 'Marketing',
    'EXE': 'Executive'
}
# SLOW: iterrows
names = []
for idx, row in df.iterrows():
    names.append(dept_names.get(row['department_code'], 'Unknown'))
# FAST: map
df['department'] = df['department_code'].map(dept_names).fillna('Unknown')
print(df)

When iterrows() IS Appropriate
Despite its performance drawbacks, iterrows has legitimate use cases:
1. Small DataFrames (Under ~1,000 Rows)
When the dataset is small, the performance difference is negligible. If iterrows makes your code clearer, use it:
import pandas as pd
# Configuration table with 5 rows -- iterrows is fine
config = pd.DataFrame({
    'setting': ['timeout', 'retries', 'batch_size', 'debug', 'log_level'],
    'value': ['30', '3', '100', 'true', 'INFO']
})
settings = {}
for idx, row in config.iterrows():
    settings[row['setting']] = row['value']

2. Complex Stateful Logic
When each row's processing depends on the results of previous rows, vectorization becomes difficult or impossible:
import pandas as pd
df = pd.DataFrame({
    'transaction': ['deposit', 'withdrawal', 'deposit', 'withdrawal', 'deposit'],
    'amount': [1000, 300, 500, 200, 800]
})
# Running balance that depends on previous state
balance = 0
balances = []
for idx, row in df.iterrows():
    if row['transaction'] == 'deposit':
        balance += row['amount']
    else:
        balance -= row['amount']
    balances.append(balance)
df['balance'] = balances
print(df)

Output:
transaction amount balance
0 deposit 1000 1000
1 withdrawal 300 700
2 deposit 500 1200
3 withdrawal 200 1000
4 deposit 800 1800

Note: even for this case, cumsum() with conditional signs would be faster:
import numpy as np
signs = np.where(df['transaction'] == 'deposit', 1, -1)
df['balance_fast'] = (df['amount'] * signs).cumsum()

3. Debugging and Exploration
When you need to inspect what is happening row by row, iterrows provides a natural debugging interface:
import pandas as pd
df = pd.DataFrame({
    'value': [10, -5, 'invalid', 30, None]
})
# Debug: find problematic rows
for idx, row in df.iterrows():
    try:
        result = float(row['value']) * 2
    except (ValueError, TypeError) as e:
        print(f"Row {idx}: Error processing '{row['value']}' -- {e}")

4. External API Calls or I/O Per Row
When each row triggers an API call, database query, or file operation, the I/O latency dwarfs the iteration overhead:
import pandas as pd
urls = pd.DataFrame({
    'endpoint': ['/api/users/1', '/api/users/2', '/api/users/3'],
    'method': ['GET', 'GET', 'GET']
})
# API calls dominate runtime -- iterrows overhead is irrelevant
# for idx, row in urls.iterrows():
#     response = requests.get(base_url + row['endpoint'])
#     # process response

Common Mistakes with iterrows()
Mistake 1: Modifying the DataFrame During Iteration
This is the most dangerous pitfall. Changes made to row do not propagate back to the original DataFrame:
import pandas as pd
df = pd.DataFrame({'value': [1, 2, 3]})
# WRONG: This does NOT modify df
for idx, row in df.iterrows():
    row['value'] = row['value'] * 10  # Modifies the copy, not df!
print(df)
# Output: unchanged!
# value
# 0 1
# 1 2
# 2 3

If you need to modify the DataFrame during iteration (which you usually should not), use df.at[] or df.loc[]:
import pandas as pd
df = pd.DataFrame({'value': [1, 2, 3]})
# Works but slow -- use vectorized ops instead
for idx, row in df.iterrows():
    df.at[idx, 'value'] = row['value'] * 10
print(df)
# Output:
# value
# 0 10
# 1 20
# 30

The correct approach:
# BEST: vectorized
df['value'] = df['value'] * 10

Mistake 2: Using iterrows When Column Types Matter
Because iterrows casts to object dtype, you can get unexpected type behavior:
import pandas as pd
df = pd.DataFrame({
    'int_col': [1, 2, 3],
    'float_col': [1.0, 2.0, 3.0]
})
for idx, row in df.iterrows():
    # int_col might be returned as float!
    print(f"int_col: {row['int_col']}, type: {type(row['int_col'])}")
    break

This can cause subtle bugs when type precision matters (e.g., comparing integer IDs).
Mistake 3: Appending to DataFrame Inside Loop
import pandas as pd
# TERRIBLE: Quadratic performance -- each append copies the entire DataFrame
df = pd.DataFrame(columns=['a', 'b'])
for i in range(1000):
    df = pd.concat([df, pd.DataFrame({'a': [i], 'b': [i*2]})], ignore_index=True)
# CORRECT: Build a list first, then create DataFrame once
rows = []
for i in range(1000):
    rows.append({'a': i, 'b': i * 2})
df = pd.DataFrame(rows)

Real-World Example: Cleaning and Transforming Survey Data
Here is a realistic scenario that combines multiple concepts:
import pandas as pd
import numpy as np
# Raw survey data with messy responses
survey = pd.DataFrame({
    'respondent': ['R001', 'R002', 'R003', 'R004', 'R005'],
    'age': ['25', 'thirty', '42', '19', '55+'],
    'satisfaction': [8, 9, -1, 7, 11],
    'comment': ['Great!', '', 'N/A', 'Good service', None]
})

# ====== APPROACH 1: iterrows (readable but slow) ======
cleaned_rows = []
for idx, row in survey.iterrows():
    clean = {}
    clean['respondent'] = row['respondent']
    # Parse age with error handling
    try:
        clean['age'] = int(row['age'])
    except ValueError:
        clean['age'] = np.nan
    # Clamp satisfaction to valid range
    sat = row['satisfaction']
    clean['satisfaction'] = sat if 1 <= sat <= 10 else np.nan
    # Normalize comments
    comment = row['comment']
    if pd.isna(comment) or comment.strip() in ('', 'N/A', 'n/a'):
        clean['has_comment'] = False
    else:
        clean['has_comment'] = True
    cleaned_rows.append(clean)
cleaned_df = pd.DataFrame(cleaned_rows)
print(cleaned_df)

# ====== APPROACH 2: vectorized (fast) ======
survey_v = survey.copy()
survey_v['age_clean'] = pd.to_numeric(survey_v['age'], errors='coerce')
survey_v['satisfaction_clean'] = survey_v['satisfaction'].where(
    survey_v['satisfaction'].between(1, 10)
)
survey_v['has_comment'] = (
    survey_v['comment'].notna() &
    ~survey_v['comment'].fillna('').str.strip().isin(['', 'N/A', 'n/a'])
)
print(survey_v[['respondent', 'age_clean', 'satisfaction_clean', 'has_comment']])

Output:
respondent age satisfaction has_comment
0 R001 25.0 8.0 True
1 R002 NaN 9.0 False
2 R003 42.0 NaN False
3 R004 19.0 7.0 True
4 R005 NaN NaN False

Both approaches produce identical results. On 100,000 rows, the vectorized version runs in milliseconds while iterrows takes seconds.
Visualize Your Data with PyGWalker
After cleaning and transforming your DataFrame -- whether through iterrows for small datasets or vectorized operations for large ones -- visualizing the results helps you validate transformations and discover patterns. PyGWalker turns any pandas DataFrame into an interactive, Tableau-style visual exploration interface directly inside Jupyter notebooks.
import pygwalker as pyg
# Explore your cleaned survey data interactively
walker = pyg.walk(cleaned_df)

With PyGWalker, you can drag and drop columns to build charts, filter by conditions, and explore distributions -- all without writing additional plotting code. This is especially useful when validating data cleaning pipelines, where iterrows or vectorized ops transform raw data into analysis-ready formats.
If you are working in Jupyter and want an AI-powered agent to help with data analysis tasks like these, check out RunCell -- an AI agent built for data scientists that runs directly in your notebook environment.
Quick Reference: Choosing the Right Iteration Method
Use this decision tree to pick the fastest approach for your situation:
- Can the operation be expressed as column arithmetic? Use vectorized operations (df['a'] + df['b'])
- Is it conditional assignment? Use np.where() or np.select()
- Is it a string operation? Use .str accessor methods
- Is it a mapping/lookup? Use .map() with a dictionary
- Is it a grouped aggregation? Use .groupby() with built-in aggregations
- Must you iterate, and types matter? Use itertuples()
- Must you iterate, and you need column access by name (including names with spaces)? Use iterrows()
- Debugging, or fewer than ~1,000 rows? iterrows() is fine
FAQ
What does pandas iterrows() return?
iterrows() returns a generator that yields (index, Series) pairs for each row in the DataFrame. The index is the row label, and the Series contains all column values for that row with column names as the Series index.
Is iterrows() slow in pandas?
Yes. iterrows() is one of the slowest ways to process DataFrame rows because it creates a new pandas Series object for each row, casts all values to Python objects, and operates in a Python-level loop instead of compiled C code. It is typically 100-5000x slower than vectorized operations.
What is the difference between iterrows() and itertuples()?
itertuples() returns lightweight namedtuples instead of Series objects, making it 20-30x faster than iterrows(). It also preserves column dtypes rather than casting everything to object. Use itertuples() whenever you need row-by-row iteration and performance matters.
How do I modify a DataFrame while using iterrows()?
You cannot modify the original DataFrame through the row variable returned by iterrows -- it is a copy. Use df.at[index, 'column'] = value inside the loop, or better yet, build a list and assign it after the loop. The fastest approach is to avoid iteration entirely and use vectorized operations.
When should I use iterrows() instead of vectorized operations?
Use iterrows when: (1) your DataFrame has fewer than ~1,000 rows and readability matters more than speed, (2) each row requires complex stateful logic that depends on previous rows, (3) you are debugging and need to inspect row-by-row processing, or (4) each row triggers an external API call or I/O operation where latency dominates runtime.
Can iterrows() change column data types?
Yes, and this is a common source of bugs. Because iterrows converts each row to a Series with a single dtype, mixed-type DataFrames (integers and strings) will have all values cast to object dtype. Integer columns may become floats. Use itertuples() if you need type-safe iteration.
Conclusion
The pandas iterrows() method provides a straightforward way to loop over DataFrame rows, and understanding its behavior is essential for any data scientist working with pandas. However, reaching for it by default is a performance anti-pattern that slows down data pipelines by orders of magnitude.
The hierarchy of approaches is clear: vectorized operations first, then itertuples() for necessary iteration, then apply() for complex row-level functions, and iterrows() only when debugging, working with tiny datasets, or handling stateful logic that defies vectorization. When you need to filter rows based on conditions, vectorized boolean indexing is always the right choice over iteration.
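For the filtering case mentioned above, a minimal sketch of boolean indexing:

```python
import pandas as pd

df = pd.DataFrame({'score': [92, 67, 85, 45]})

# Boolean mask selects matching rows with no explicit loop
passing = df[df['score'] >= 70]
print(list(passing['score']))  # [92, 85]
```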
Build the habit of writing vectorized code from the start. When you catch yourself writing for idx, row in df.iterrows():, pause and ask: can this be expressed as a column operation? Nine times out of ten, the answer is yes -- and the result will be cleaner, faster, and more idiomatic pandas.
Related Guides
- Pandas apply(): Row and Column Transformations
- Pandas GroupBy: Aggregation, Transform, Apply
- Pandas DataFrame loc: Label-Based Indexing
- Pandas Filter Rows: Select Data by Condition