Pandas Filter Rows: Select Data by Condition in Python
Filtering rows in pandas DataFrames is one of the most common operations in data analysis. Whether you're cleaning data, exploring patterns, or preparing datasets for machine learning, you need to select specific rows based on conditions. The challenge is choosing the right method among pandas' multiple filtering approaches—each with different syntax, performance characteristics, and use cases.
Many data scientists struggle with filtering efficiency, especially when working with large datasets or complex conditions. Using the wrong method can slow down your analysis by orders of magnitude. Understanding when to use boolean indexing versus .query() or .loc[] can mean the difference between a script that runs in seconds versus one that takes minutes.
This guide covers all pandas filtering methods with practical examples, performance comparisons, and best practices. You'll learn boolean indexing, the .query() method, .loc[] selection, the .where() function, and the .filter() method. By the end, you'll know exactly which approach to use for any filtering scenario.
Understanding Pandas Row Filtering Methods
Pandas provides five primary methods for filtering DataFrame rows, each suited to different scenarios:
| Method | Best For | Syntax Example | Performance |
|---|---|---|---|
| Boolean Indexing | Simple conditions, readability | df[df['age'] > 25] | Fast for small-medium data |
| .query() | Complex conditions, string-based | df.query('age > 25 and city == "NYC"') | Faster for large data |
| .loc[] | Label-based with conditions | df.loc[df['age'] > 25, ['name', 'age']] | Flexible column selection |
| .where() | Keep structure, replace values | df.where(df['age'] > 25, np.nan) | Preserves DataFrame shape |
| .filter() | Filter by column/index names | df.filter(like='total', axis=1) | Column/index name patterns |
Let's explore each method with detailed examples.
Boolean Indexing: The Most Common Approach
Boolean indexing filters rows by creating a boolean mask (True/False values) and applying it to the DataFrame. This is the most intuitive method for beginners.
import pandas as pd
import numpy as np
# Create sample DataFrame
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'age': [25, 30, 35, 28, 32],
'city': ['NYC', 'LA', 'NYC', 'Chicago', 'LA'],
'salary': [70000, 80000, 90000, 75000, 85000]
})
# Filter rows where age is greater than 30
filtered = df[df['age'] > 30]
print(filtered)
# name age city salary
# 2 Charlie 35 NYC 90000
# 4 Eve 32 LA 85000
# See the boolean mask
print(df['age'] > 30)
# 0 False
# 1 False
# 2 True
# 3 False
# 4 True
Boolean indexing works by evaluating the condition df['age'] > 30, which returns a Series of True/False values. When you pass this mask to the DataFrame with df[mask], pandas returns only the rows where the mask is True.
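Because the mask is an ordinary boolean Series, you can assign it to a variable, inspect it, and reuse it. A minimal sketch using the df defined above (mask is just an illustrative variable name):
# Store the mask once, then reuse it
mask = df['age'] > 30
filtered = df[mask]
print(mask.sum())  # counts rows satisfying the condition (2 here)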
Filtering with Multiple Conditions
Combine multiple conditions using logical operators. Important: Use & (and), | (or), ~ (not) instead of Python's and, or, not keywords. Always wrap each condition in parentheses.
# Multiple conditions with AND
filtered = df[(df['age'] > 25) & (df['city'] == 'NYC')]
print(filtered)
# name age city salary
# 2 Charlie 35 NYC 90000
# Multiple conditions with OR
filtered = df[(df['age'] > 30) | (df['salary'] > 80000)]
print(filtered)
# name age city salary
# 2 Charlie 35 NYC 90000
# 4 Eve 32 LA 85000
# NOT operator
filtered = df[~(df['city'] == 'NYC')]
print(filtered)
# name age city salary
# 1 Bob 30 LA 80000
# 3 David 28 Chicago 75000
# 4 Eve 32 LA 85000
Using .isin() for Multiple Values
Filter rows where a column matches any value in a list using .isin():
# Filter rows where city is NYC or LA
cities = ['NYC', 'LA']
filtered = df[df['city'].isin(cities)]
print(filtered)
# name age city salary
# 0 Alice 25 NYC 70000
# 1 Bob 30 LA 80000
# 2 Charlie 35 NYC 90000
# 4 Eve 32 LA 85000
# Inverse: cities NOT in list
filtered = df[~df['city'].isin(cities)]
print(filtered)
# name age city salary
# 3 David 28 Chicago 75000
Filtering with .between()
The .between() method filters values within a range (inclusive by default):
# Filter ages between 28 and 32
filtered = df[df['age'].between(28, 32)]
print(filtered)
# name age city salary
# 1 Bob 30 LA 80000
# 3 David 28 Chicago 75000
# 4 Eve 32 LA 85000
# Exclusive boundaries
filtered = df[df['age'].between(28, 32, inclusive='neither')]
print(filtered)
# name age city salary
# 1 Bob 30 LA 80000
The .query() Method: String-Based Filtering
The .query() method accepts a string expression, making it readable for complex conditions. It is particularly efficient for large DataFrames because pandas evaluates the expression with the optional numexpr library when it is installed.
# Simple query
filtered = df.query('age > 30')
print(filtered)
# name age city salary
# 2 Charlie 35 NYC 90000
# 4 Eve 32 LA 85000
# Multiple conditions
filtered = df.query('age > 25 and city == "NYC"')
print(filtered)
# name age city salary
# 2 Charlie 35 NYC 90000
# Using variables with @ symbol
min_age = 30
filtered = df.query('age > @min_age')
print(filtered)
# name age city salary
# 2 Charlie 35 NYC 90000
# 4 Eve 32 LA 85000
Advanced .query() Expressions
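One extra capability worth knowing before the advanced examples: in recent pandas versions, column names that are not valid Python identifiers (names containing spaces, for example) can be referenced inside .query() using backticks. A minimal sketch with a hypothetical column name:
# Backtick-quote column names containing spaces
df_spaces = pd.DataFrame({'annual salary': [70000, 90000]})
filtered = df_spaces.query('`annual salary` > 80000')
print(filtered)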
# Using .isin() in query
cities = ['NYC', 'LA']
filtered = df.query('city in @cities')
print(filtered)
# Range conditions
filtered = df.query('28 <= age <= 32')
print(filtered)
# String methods
filtered = df.query('city.str.contains("LA")', engine='python')
print(filtered)
.loc[] for Label-Based Filtering
The .loc[] indexer combines row filtering with column selection. Use it when you need specific columns from filtered rows.
# Filter rows and select columns
filtered = df.loc[df['age'] > 30, ['name', 'age']]
print(filtered)
# name age
# 2 Charlie 35
# 4 Eve 32
# Multiple conditions
filtered = df.loc[(df['age'] > 25) & (df['salary'] > 75000), ['name', 'salary']]
print(filtered)
# name salary
# 1 Bob 80000
# 2 Charlie 90000
# 4 Eve 85000
# All columns
filtered = df.loc[df['city'] == 'NYC', :]
print(filtered)
.where() Method: Conditional Value Replacement
Unlike other methods, .where() preserves DataFrame shape by replacing values that don't meet the condition with NaN (or a specified value).
# Keep values where age > 30, replace others with NaN
result = df.where(df['age'] > 30)
print(result)
# name age city salary
# 0 NaN NaN NaN NaN
# 1 NaN NaN NaN NaN
# 2 Charlie 35.0 NYC 90000.0
# 3 NaN NaN NaN NaN
# 4 Eve 32.0 LA 85000.0
# Replace with custom value
result = df.where(df['age'] > 30, 'FILTERED')
print(result)
# Drop rows with NaN after .where()
result = df.where(df['age'] > 30).dropna()
print(result)
.filter() Method: Filter by Column/Index Names
The .filter() method filters columns or rows by their labels (names), not by values. Use it for pattern-based column selection.
# Create DataFrame with multiple columns
df_wide = pd.DataFrame({
'total_sales': [100, 200, 300],
'total_profit': [20, 40, 60],
'monthly_sales': [10, 20, 30],
'yearly_sales': [120, 240, 360],
'region': ['East', 'West', 'North']
})
# Filter columns containing 'total'
filtered = df_wide.filter(like='total')
print(filtered)
# total_sales total_profit
# 0 100 20
# 1 200 40
# 2 300 60
# Filter columns using regex
filtered = df_wide.filter(regex=r'.*sales$')
print(filtered)
# total_sales monthly_sales yearly_sales
# 0 100 10 120
# 1 200 20 240
# 2 300 30 360
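# As mentioned above, .filter() can also select ROWS by index label
# with axis=0. A minimal sketch, assuming a hypothetical labeled index:
df_labeled = df_wide.set_index(pd.Index(['east', 'west', 'north']))
print(df_labeled.filter(like='st', axis=0))  # rows 'east' and 'west'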
# Filter columns by exact names
filtered = df_wide.filter(items=['total_sales', 'region'])
print(filtered)
Filtering String Data
Pandas provides string methods through the .str accessor for filtering text columns.
# Create DataFrame with text data
df_text = pd.DataFrame({
'name': ['Alice Smith', 'Bob Johnson', 'Charlie Brown', 'David Lee'],
'email': ['alice@gmail.com', 'bob@yahoo.com', 'charlie@gmail.com', 'david@outlook.com']
})
# Filter rows where name contains 'Smith'
filtered = df_text[df_text['name'].str.contains('Smith')]
print(filtered)
# name email
# 0 Alice Smith alice@gmail.com
# Case-insensitive search
filtered = df_text[df_text['name'].str.contains('smith', case=False)]
print(filtered)
# Filter by email domain
filtered = df_text[df_text['email'].str.endswith('gmail.com')]
print(filtered)
# name email
# 0 Alice Smith alice@gmail.com
# 2 Charlie Brown charlie@gmail.com
# Filter with regex
filtered = df_text[df_text['name'].str.match(r'^[A-C]')]
print(filtered)
Filtering Null and Non-Null Values
Use .isna() and .notna() (or their older aliases .isnull() and .notnull()) to filter based on missing data.
# Create DataFrame with missing values
df_missing = pd.DataFrame({
'A': [1, 2, np.nan, 4],
'B': [5, np.nan, 7, 8],
'C': [9, 10, 11, 12]
})
# Filter rows where column A is not null
filtered = df_missing[df_missing['A'].notna()]
print(filtered)
# A B C
# 0 1.0 5.0 9
# 1 2.0 NaN 10
# 3 4.0 8.0 12
# Filter rows where ANY column is null
filtered = df_missing[df_missing.isna().any(axis=1)]
print(filtered)
# A B C
# 1 2.0 NaN 10
# 2 NaN 7.0 11
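# Shortcut worth knowing: dropna() removes rows containing any NaN,
# equivalent to the notna().all(axis=1) mask shown next
print(df_missing.dropna())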
# Filter rows where ALL columns are not null
filtered = df_missing[df_missing.notna().all(axis=1)]
print(filtered)
# A B C
# 0 1.0 5.0 9
# 3 4.0 8.0 12
Filtering by Date Ranges
When working with datetime columns, you can filter by date ranges using standard comparison operators.
# Create DataFrame with dates
df_dates = pd.DataFrame({
'date': pd.date_range('2026-01-01', periods=10, freq='D'),
'value': range(10)
})
# Filter dates after a specific date
filtered = df_dates[df_dates['date'] > '2026-01-05']
print(filtered)
# Filter date range
start_date = '2026-01-03'
end_date = '2026-01-07'
filtered = df_dates[(df_dates['date'] >= start_date) & (df_dates['date'] <= end_date)]
print(filtered)
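# Date parts are also filterable via the .dt accessor (a small sketch)
filtered = df_dates[df_dates['date'].dt.day >= 5]
print(filtered)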
# Using .between() for dates
filtered = df_dates[df_dates['date'].between('2026-01-03', '2026-01-07')]
print(filtered)
Performance Comparison: Large DataFrames
Different filtering methods have different performance characteristics. Here's a comparison for a DataFrame with 1 million rows:
import time
# Create large DataFrame
np.random.seed(42)
df_large = pd.DataFrame({
'A': np.random.randint(0, 100, 1000000),
'B': np.random.randint(0, 100, 1000000),
'C': np.random.choice(['X', 'Y', 'Z'], 1000000)
})
# Boolean indexing
start = time.time()
result = df_large[(df_large['A'] > 50) & (df_large['B'] < 30)]
print(f"Boolean indexing: {time.time() - start:.4f} seconds")
# .query() method
start = time.time()
result = df_large.query('A > 50 and B < 30')
print(f".query() method: {time.time() - start:.4f} seconds")
# .loc[] method
start = time.time()
result = df_large.loc[(df_large['A'] > 50) & (df_large['B'] < 30)]
print(f".loc[] method: {time.time() - start:.4f} seconds")Performance insights:
- Boolean indexing: Fast for simple conditions, slower for complex multi-condition filters
- .query(): Fastest for large DataFrames with multiple conditions (uses numexpr optimization)
- .loc[]: Similar to boolean indexing but more flexible for column selection
- .where(): Slower due to full DataFrame traversal, use only when you need to preserve shape
For datasets over 100,000 rows with multiple conditions, .query() typically outperforms boolean indexing by 20-40%.
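To verify these numbers on your own data, the timeit module gives steadier measurements than single time.time() calls. A minimal sketch, assuming the df_large DataFrame from above:
from timeit import timeit
# Average over several runs; absolute numbers vary by machine
n = 10
t_bool = timeit(lambda: df_large[(df_large['A'] > 50) & (df_large['B'] < 30)], number=n) / n
t_query = timeit(lambda: df_large.query('A > 50 and B < 30'), number=n) / n
print(f"boolean: {t_bool:.4f}s, query: {t_query:.4f}s")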
Common Mistakes and How to Avoid Them
Mistake 1: Using 'and' Instead of '&'
# WRONG - raises ValueError
# filtered = df[df['age'] > 25 and df['city'] == 'NYC']
# CORRECT
filtered = df[(df['age'] > 25) & (df['city'] == 'NYC')]
Mistake 2: Forgetting Parentheses
# WRONG - operator precedence issues
# filtered = df[df['age'] > 25 & df['city'] == 'NYC']
# CORRECT - wrap each condition
filtered = df[(df['age'] > 25) & (df['city'] == 'NYC')]
Mistake 3: Modifying Original DataFrame
# Boolean filtering may return a view or a copy; pandas cannot guarantee which
filtered = df[df['age'] > 30]
# WRONG - SettingWithCopyWarning
# filtered['new_col'] = 100
# CORRECT - create explicit copy
filtered = df[df['age'] > 30].copy()
filtered['new_col'] = 100
Mistake 4: String Filtering Without Handling NaN
df_with_nan = pd.DataFrame({
'name': ['Alice', np.nan, 'Charlie']
})
# WRONG - raises error if NaN present
# filtered = df_with_nan[df_with_nan['name'].str.contains('li')]
# CORRECT - handle NaN with na parameter
filtered = df_with_nan[df_with_nan['name'].str.contains('li', na=False)]
Visualize Filtered Data with PyGWalker
After filtering your pandas DataFrame, visualizing the results helps uncover patterns and insights. PyGWalker transforms filtered DataFrames into an interactive Tableau-like interface directly in Python notebooks, with no need to export data or write complex plotting code.
PyGWalker is particularly useful when exploring filtered datasets because it allows you to:
- Drag and drop columns to create charts instantly
- Apply additional filters visually without writing code
- Switch between chart types (bar, line, scatter, heatmap) in seconds
- Export visualizations for reports or presentations
import pygwalker as pyg
# Filter DataFrame
filtered_df = df[(df['age'] > 25) & (df['salary'] > 75000)]
# Launch interactive visualization
pyg.walk(filtered_df)
This opens an interactive interface where you can create visualizations by dragging filtered columns into the chart builder. For data analysts working with multiple filter conditions, PyGWalker eliminates the iteration cycle of filter → plot → adjust → replot.
FAQ
How do I filter pandas DataFrame rows by condition?
Use boolean indexing with df[df['column'] > value] for simple conditions. For multiple conditions, use & (and), | (or), ~ (not) with parentheses: df[(df['A'] > 10) & (df['B'] < 20)]. Alternatively, use .query() for readable syntax: df.query('A > 10 and B < 20').
What's the difference between .loc[] and boolean indexing?
Boolean indexing (df[df['col'] > 5]) returns all columns for filtered rows. .loc[] allows column selection simultaneously: df.loc[df['col'] > 5, ['col1', 'col2']]. Both methods have similar performance, but .loc[] is more explicit and supports label-based row selection.
How to filter pandas DataFrame with multiple conditions?
Combine conditions using & (and), | (or), ~ (not) operators. Always wrap each condition in parentheses and group compound logic explicitly: df[((df['age'] > 25) & (df['city'] == 'NYC')) | (df['salary'] > 80000)]. For complex conditions, use .query(): df.query('(age > 25 and city == "NYC") or salary > 80000').
When should I use .query() instead of boolean indexing?
Use .query() for large DataFrames (>100k rows) with multiple conditions—it's 20-40% faster due to numexpr optimization. It's also more readable for complex conditions. Use boolean indexing for simple filters or when you need maximum compatibility (.query() requires string expressions which are harder to debug).
How do I filter rows where column value is in a list?
Use .isin() method: df[df['city'].isin(['NYC', 'LA', 'Chicago'])]. For the inverse (not in list), use ~ operator: df[~df['city'].isin(['NYC', 'LA'])]. This is more efficient than chaining multiple | conditions for large lists.
Conclusion
Filtering pandas DataFrame rows is a fundamental skill for data analysis in Python. You now understand five filtering methods: boolean indexing for simplicity, .query() for performance, .loc[] for flexible column selection, .where() for value replacement, and .filter() for column name patterns.
Key takeaways:
- Use boolean indexing for quick, readable filters on small to medium datasets
- Choose .query() for large datasets with complex conditions (20-40% faster)
- Apply .loc[] when you need both row filtering and column selection
- Remember to use &, |, ~ operators instead of and, or, not
- Always wrap conditions in parentheses to avoid operator precedence issues
Master these filtering techniques and you'll handle any data selection scenario efficiently. For interactive exploration of filtered data, consider using PyGWalker to visualize your results without writing additional plotting code.