Pandas Filter Rows: Select Data by Condition in Python

Filtering rows in pandas DataFrames is one of the most common operations in data analysis. Whether you're cleaning data, exploring patterns, or preparing datasets for machine learning, you need to select specific rows based on conditions. The challenge is choosing the right method among pandas' multiple filtering approaches—each with different syntax, performance characteristics, and use cases.

Many data scientists struggle with filtering efficiency, especially when working with large datasets or complex conditions. Using the wrong method can slow down your analysis by orders of magnitude. Understanding when to use boolean indexing versus .query() or .loc[] can mean the difference between a script that runs in seconds versus one that takes minutes.

This guide covers all pandas filtering methods with practical examples, performance comparisons, and best practices. You'll learn boolean indexing, the .query() method, .loc[] selection, the .where() function, and the .filter() method. By the end, you'll know exactly which approach to use for any filtering scenario.

Understanding Pandas Row Filtering Methods

Pandas provides five primary methods for filtering DataFrame rows, each suited to different scenarios:

| Method | Best For | Syntax Example | Performance |
| --- | --- | --- | --- |
| Boolean indexing | Simple conditions, readability | df[df['age'] > 25] | Fast for small-medium data |
| .query() | Complex conditions, string-based | df.query('age > 25 and city == "NYC"') | Faster for large data |
| .loc[] | Label-based with conditions | df.loc[df['age'] > 25, ['name', 'age']] | Flexible column selection |
| .where() | Keep structure, replace values | df.where(df['age'] > 25, np.nan) | Preserves DataFrame shape |
| .filter() | Filter by column/index names | df.filter(like='total', axis=1) | Column/index name patterns |

Let's explore each method with detailed examples.

Boolean Indexing: The Most Common Approach

Boolean indexing filters rows by creating a boolean mask (True/False values) and applying it to the DataFrame. This is the most intuitive method for beginners.

import pandas as pd
import numpy as np
 
# Create sample DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'age': [25, 30, 35, 28, 32],
    'city': ['NYC', 'LA', 'NYC', 'Chicago', 'LA'],
    'salary': [70000, 80000, 90000, 75000, 85000]
})
 
# Filter rows where age is greater than 30
filtered = df[df['age'] > 30]
print(filtered)
#       name  age city  salary
# 2  Charlie   35  NYC   90000
# 4      Eve   32   LA   85000
 
# See the boolean mask
print(df['age'] > 30)
# 0    False
# 1    False
# 2     True
# 3    False
# 4     True

Boolean indexing works by evaluating the condition df['age'] > 30, which returns a Series of True/False values. When you pass this mask to the DataFrame with df[mask], pandas returns only rows where the mask is True.
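
Because the mask is just a boolean Series, you can store it in a variable, inspect it, and reuse it. A small sketch building on the DataFrame above:

mask = df['age'] > 30

# df[mask] and df.loc[mask] select the same rows
print(df[mask].equals(df.loc[mask]))
# True

# Count matching rows without building the filtered DataFrame
print(mask.sum())
# 2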

Filtering with Multiple Conditions

Combine multiple conditions using logical operators. Important: Use & (and), | (or), ~ (not) instead of Python's and, or, not keywords. Always wrap each condition in parentheses.

# Multiple conditions with AND
filtered = df[(df['age'] > 25) & (df['city'] == 'NYC')]
print(filtered)
#       name  age city  salary
# 2  Charlie   35  NYC   90000
 
# Multiple conditions with OR
filtered = df[(df['age'] > 30) | (df['salary'] > 80000)]
print(filtered)
#       name  age     city  salary
# 2  Charlie   35      NYC   90000
# 4      Eve   32       LA   85000
 
# NOT operator
filtered = df[~(df['city'] == 'NYC')]
print(filtered)
#     name  age     city  salary
# 1    Bob   30       LA   80000
# 3  David   28  Chicago   75000
# 4    Eve   32       LA   85000

Using .isin() for Multiple Values

Filter rows where a column matches any value in a list using .isin():

# Filter rows where city is NYC or LA
cities = ['NYC', 'LA']
filtered = df[df['city'].isin(cities)]
print(filtered)
#       name  age city  salary
# 0    Alice   25  NYC   70000
# 1      Bob   30   LA   80000
# 2  Charlie   35  NYC   90000
# 4      Eve   32   LA   85000
 
# Inverse: cities NOT in list
filtered = df[~df['city'].isin(cities)]
print(filtered)
#     name  age     city  salary
# 3  David   28  Chicago   75000

Filtering with .between()

The .between() method filters values within a range (inclusive by default):

# Filter ages between 28 and 32
filtered = df[df['age'].between(28, 32)]
print(filtered)
#     name  age     city  salary
# 1    Bob   30       LA   80000
# 3  David   28  Chicago   75000
# 4    Eve   32       LA   85000
 
# Exclusive boundaries
filtered = df[df['age'].between(28, 32, inclusive='neither')]
print(filtered)
#   name  age city  salary
# 1  Bob   30   LA   80000
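
Under the hood, .between(a, b) is shorthand for a pair of comparisons; a quick equivalence check:

# .between(28, 32) matches (age >= 28) & (age <= 32)
manual = df[(df['age'] >= 28) & (df['age'] <= 32)]
print(df[df['age'].between(28, 32)].equals(manual))
# True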

The .query() Method: String-Based Filtering

The .query() method accepts a string expression, making it readable for complex conditions. It can also be faster on large DataFrames because pandas evaluates the expression with the optional numexpr library when it's installed (falling back to pure Python otherwise).

# Simple query
filtered = df.query('age > 30')
print(filtered)
#       name  age city  salary
# 2  Charlie   35  NYC   90000
# 4      Eve   32   LA   85000
 
# Multiple conditions
filtered = df.query('age > 25 and city == "NYC"')
print(filtered)
#       name  age city  salary
# 2  Charlie   35  NYC   90000
 
# Using variables with @ symbol
min_age = 30
filtered = df.query('age > @min_age')
print(filtered)
#       name  age city  salary
# 2  Charlie   35  NYC   90000
# 4      Eve   32   LA   85000

Advanced .query() Expressions

# Using .isin() in query
cities = ['NYC', 'LA']
filtered = df.query('city in @cities')
print(filtered)
 
# Range conditions
filtered = df.query('28 <= age <= 32')
print(filtered)
 
# String methods
filtered = df.query('city.str.contains("LA")', engine='python')
print(filtered)
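
One more .query() detail worth knowing: column names that aren't valid Python identifiers (for example, names containing spaces) must be wrapped in backticks. A minimal sketch with a hypothetical column name:

# Backticks let .query() reference non-identifier column names
df_spaces = pd.DataFrame({'annual salary': [70000, 90000]})
print(df_spaces.query('`annual salary` > 80000'))
#    annual salary
# 1          90000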

.loc[] for Label-Based Filtering

The .loc[] indexer combines row filtering with column selection. Use it when you need specific columns from filtered rows.

# Filter rows and select columns
filtered = df.loc[df['age'] > 30, ['name', 'age']]
print(filtered)
#       name  age
# 2  Charlie   35
# 4      Eve   32
 
# Multiple conditions
filtered = df.loc[(df['age'] > 25) & (df['salary'] > 75000), ['name', 'salary']]
print(filtered)
#       name  salary
# 1      Bob   80000
# 2  Charlie   90000
# 4      Eve   85000
 
# All columns
filtered = df.loc[df['city'] == 'NYC', :]
print(filtered)

.where() Method: Conditional Value Replacement

Unlike other methods, .where() preserves DataFrame shape by replacing values that don't meet the condition with NaN (or a specified value).

# Keep values where age > 30, replace others with NaN
result = df.where(df['age'] > 30)
print(result)
#       name   age city   salary
# 0      NaN   NaN  NaN      NaN
# 1      NaN   NaN  NaN      NaN
# 2  Charlie  35.0  NYC  90000.0
# 3      NaN   NaN  NaN      NaN
# 4      Eve  32.0   LA  85000.0
 
# Replace with custom value
result = df.where(df['age'] > 30, 'FILTERED')
print(result)
 
# Drop rows with NaN after .where()
result = df.where(df['age'] > 30).dropna()
print(result)
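
The companion method .mask() is the inverse of .where(): it replaces values where the condition is True rather than False. A quick sketch on the same DataFrame:

# .mask() NaN-fills rows where the condition holds (inverse of .where())
result = df.mask(df['age'] > 30)
print(result)
# Charlie's and Eve's rows become NaN; all other rows are kept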

.filter() Method: Filter by Column/Index Names

The .filter() method filters columns or rows by their labels (names), not by values. Use it for pattern-based column selection.

# Create DataFrame with multiple columns
df_wide = pd.DataFrame({
    'total_sales': [100, 200, 300],
    'total_profit': [20, 40, 60],
    'monthly_sales': [10, 20, 30],
    'yearly_sales': [120, 240, 360],
    'region': ['East', 'West', 'North']
})
 
# Filter columns containing 'total'
filtered = df_wide.filter(like='total')
print(filtered)
#    total_sales  total_profit
# 0          100            20
# 1          200            40
# 2          300            60
 
# Filter columns using regex
filtered = df_wide.filter(regex=r'.*sales$')
print(filtered)
#    total_sales  monthly_sales  yearly_sales
# 0          100             10           120
# 1          200             20           240
# 2          300             30           360
 
# Filter columns by exact names
filtered = df_wide.filter(items=['total_sales', 'region'])
print(filtered)
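
The same method filters row labels with axis=0 (or axis='index'). A small sketch, assuming we move region into the index first:

# Match row labels instead of column names
df_idx = df_wide.set_index('region')
print(df_idx.filter(like='st', axis=0))
# 'East' and 'West' contain 'st'; 'North' does not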

Filtering String Data

Pandas provides string methods through the .str accessor for filtering text columns.

# Create DataFrame with text data
df_text = pd.DataFrame({
    'name': ['Alice Smith', 'Bob Johnson', 'Charlie Brown', 'David Lee'],
    'email': ['alice@gmail.com', 'bob@yahoo.com', 'charlie@gmail.com', 'david@outlook.com']
})
 
# Filter rows where name contains 'Smith'
filtered = df_text[df_text['name'].str.contains('Smith')]
print(filtered)
#          name           email
# 0  Alice Smith  alice@gmail.com
 
# Case-insensitive search
filtered = df_text[df_text['name'].str.contains('smith', case=False)]
print(filtered)
 
# Filter by email domain
filtered = df_text[df_text['email'].str.endswith('gmail.com')]
print(filtered)
#            name              email
# 0   Alice Smith    alice@gmail.com
# 2  Charlie Brown  charlie@gmail.com
 
# Filter with regex
filtered = df_text[df_text['name'].str.match(r'^[A-C]')]
print(filtered)

Filtering Null and Non-Null Values

Use .isna(), .notna(), .isnull(), and .notnull() to filter based on missing data.

# Create DataFrame with missing values
df_missing = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, 7, 8],
    'C': [9, 10, 11, 12]
})
 
# Filter rows where column A is not null
filtered = df_missing[df_missing['A'].notna()]
print(filtered)
#      A    B   C
# 0  1.0  5.0   9
# 1  2.0  NaN  10
# 3  4.0  8.0  12
 
# Filter rows where ANY column is null
filtered = df_missing[df_missing.isna().any(axis=1)]
print(filtered)
#      A    B   C
# 1  2.0  NaN  10
# 2  NaN  7.0  11
 
# Filter rows where ALL columns are not null
filtered = df_missing[df_missing.notna().all(axis=1)]
print(filtered)
#      A    B   C
# 0  1.0  5.0   9
# 3  4.0  8.0  12
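
The last pattern, keeping only fully populated rows, is what .dropna() does by default; a quick equivalence check:

# notna().all(axis=1) filtering is equivalent to dropna() on rows
print(df_missing[df_missing.notna().all(axis=1)].equals(df_missing.dropna()))
# True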

Filtering by Date Ranges

When working with datetime columns, you can filter by date ranges using standard comparison operators.

# Create DataFrame with dates
df_dates = pd.DataFrame({
    'date': pd.date_range('2026-01-01', periods=10, freq='D'),
    'value': range(10)
})
 
# Filter dates after a specific date
filtered = df_dates[df_dates['date'] > '2026-01-05']
print(filtered)
 
# Filter date range
start_date = '2026-01-03'
end_date = '2026-01-07'
filtered = df_dates[(df_dates['date'] >= start_date) & (df_dates['date'] <= end_date)]
print(filtered)
 
# Using .between() for dates
filtered = df_dates[df_dates['date'].between('2026-01-03', '2026-01-07')]
print(filtered)
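
If the dates are moved into the index, .loc also supports slicing with date strings, which is often the most concise option. A small sketch:

# With a DatetimeIndex, .loc slices by date strings (both endpoints inclusive)
df_ts = df_dates.set_index('date')
print(df_ts.loc['2026-01-03':'2026-01-07'])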

Performance Comparison: Large DataFrames

Different filtering methods have different performance characteristics. Here's a comparison for a DataFrame with 1 million rows:

import time
 
# Create large DataFrame
np.random.seed(42)
df_large = pd.DataFrame({
    'A': np.random.randint(0, 100, 1000000),
    'B': np.random.randint(0, 100, 1000000),
    'C': np.random.choice(['X', 'Y', 'Z'], 1000000)
})
 
# Boolean indexing
start = time.time()
result = df_large[(df_large['A'] > 50) & (df_large['B'] < 30)]
print(f"Boolean indexing: {time.time() - start:.4f} seconds")
 
# .query() method
start = time.time()
result = df_large.query('A > 50 and B < 30')
print(f".query() method: {time.time() - start:.4f} seconds")
 
# .loc[] method
start = time.time()
result = df_large.loc[(df_large['A'] > 50) & (df_large['B'] < 30)]
print(f".loc[] method: {time.time() - start:.4f} seconds")

Performance insights:

  • Boolean indexing: Fast for simple conditions, slower for complex multi-condition filters
  • .query(): Fastest for large DataFrames with multiple conditions (uses numexpr optimization)
  • .loc[]: Similar to boolean indexing but more flexible for column selection
  • .where(): Slower due to full DataFrame traversal, use only when you need to preserve shape

For datasets over 100,000 rows with multiple conditions, .query() typically outperforms boolean indexing by 20-40%.
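
Note that single time.time() measurements are noisy; for steadier numbers you can repeat each filter with the standard-library timeit module. A minimal sketch:

import timeit

# Best of 3 rounds of 10 runs each, to reduce timing noise
t_bool = min(timeit.repeat(
    lambda: df_large[(df_large['A'] > 50) & (df_large['B'] < 30)],
    repeat=3, number=10))
t_query = min(timeit.repeat(
    lambda: df_large.query('A > 50 and B < 30'),
    repeat=3, number=10))
print(f"boolean: {t_bool:.4f}s  query: {t_query:.4f}s")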

Common Mistakes and How to Avoid Them

Mistake 1: Using 'and' Instead of '&'

# WRONG - raises ValueError
# filtered = df[df['age'] > 25 and df['city'] == 'NYC']
 
# CORRECT
filtered = df[(df['age'] > 25) & (df['city'] == 'NYC')]

Mistake 2: Forgetting Parentheses

# WRONG - operator precedence issues
# filtered = df[df['age'] > 25 & df['city'] == 'NYC']
 
# CORRECT - wrap each condition
filtered = df[(df['age'] > 25) & (df['city'] == 'NYC')]

Mistake 3: Modifying Original DataFrame

# A boolean filter may return a view or a copy; pandas can't always tell which
filtered = df[df['age'] > 30]
 
# WRONG - SettingWithCopyWarning
# filtered['new_col'] = 100
 
# CORRECT - create explicit copy
filtered = df[df['age'] > 30].copy()
filtered['new_col'] = 100

Mistake 4: String Filtering Without Handling NaN

df_with_nan = pd.DataFrame({
    'name': ['Alice', np.nan, 'Charlie']
})
 
# WRONG - raises error if NaN present
# filtered = df_with_nan[df_with_nan['name'].str.contains('li')]
 
# CORRECT - handle NaN with na parameter
filtered = df_with_nan[df_with_nan['name'].str.contains('li', na=False)]

Visualize Filtered Data with PyGWalker

After filtering your pandas DataFrame, visualizing the results helps uncover patterns and insights. PyGWalker transforms filtered DataFrames into an interactive Tableau-like interface directly in Python notebooks, with no need to export data or write complex plotting code.

PyGWalker is particularly useful when exploring filtered datasets because it allows you to:

  • Drag and drop columns to create charts instantly
  • Apply additional filters visually without writing code
  • Switch between chart types (bar, line, scatter, heatmap) in seconds
  • Export visualizations for reports or presentations

import pygwalker as pyg
 
# Filter DataFrame
filtered_df = df[(df['age'] > 25) & (df['salary'] > 75000)]
 
# Launch interactive visualization
pyg.walk(filtered_df)

This opens an interactive interface where you can create visualizations by dragging filtered columns into the chart builder. For data analysts working with multiple filter conditions, PyGWalker eliminates the iteration cycle of filter → plot → adjust → replot.

FAQ

How do I filter pandas DataFrame rows by condition?

Use boolean indexing with df[df['column'] > value] for simple conditions. For multiple conditions, use & (and), | (or), ~ (not) with parentheses: df[(df['A'] > 10) & (df['B'] < 20)]. Alternatively, use .query() for readable syntax: df.query('A > 10 and B < 20').

What's the difference between .loc[] and boolean indexing?

Boolean indexing (df[df['col'] > 5]) returns all columns for filtered rows. .loc[] allows column selection simultaneously: df.loc[df['col'] > 5, ['col1', 'col2']]. Both methods have similar performance, but .loc[] is more explicit and supports label-based row selection.

How to filter pandas DataFrame with multiple conditions?

Combine conditions using the & (and), | (or), ~ (not) operators. Always wrap each condition in parentheses, and group mixed & / | logic explicitly since & binds more tightly than |: df[(df['age'] > 25) & ((df['city'] == 'NYC') | (df['salary'] > 80000))]. The equivalent .query() form is often more readable: df.query('age > 25 and (city == "NYC" or salary > 80000)').

When should I use .query() instead of boolean indexing?

Use .query() for large DataFrames (>100k rows) with multiple conditions—it's 20-40% faster due to numexpr optimization. It's also more readable for complex conditions. Use boolean indexing for simple filters or when you need maximum compatibility (.query() requires string expressions which are harder to debug).

How do I filter rows where column value is in a list?

Use .isin() method: df[df['city'].isin(['NYC', 'LA', 'Chicago'])]. For the inverse (not in list), use ~ operator: df[~df['city'].isin(['NYC', 'LA'])]. This is more efficient than chaining multiple | conditions for large lists.

Conclusion

Filtering pandas DataFrame rows is a fundamental skill for data analysis in Python. You now understand five filtering methods: boolean indexing for simplicity, .query() for performance, .loc[] for flexible column selection, .where() for value replacement, and .filter() for column name patterns.

Key takeaways:

  • Use boolean indexing for quick, readable filters on small to medium datasets
  • Choose .query() for large datasets with complex conditions (20-40% faster)
  • Apply .loc[] when you need both row filtering and column selection
  • Remember to use &, |, ~ operators instead of and, or, not
  • Always wrap conditions in parentheses to avoid operator precedence issues

Master these filtering techniques and you'll handle any data selection scenario efficiently. For interactive exploration of filtered data, consider using PyGWalker to visualize your results without writing additional plotting code.
