Pandas: Find and Filter Values in a DataFrame Column
You have a DataFrame with 500,000 rows of sales records, and you need to find every transaction from a specific customer, above a certain dollar amount, in a particular region. Writing a for-loop would take minutes. Using the wrong pandas method could still be painfully slow or return the wrong subset of data entirely.
The problem compounds when filtering conditions get complex. Multiple columns, regex patterns on text fields, missing values scattered across the dataset, and combinations of AND/OR logic all make the task harder. A single misplaced parenthesis or a forgotten na=False flag can silently drop rows or raise cryptic errors.
Pandas provides a rich toolkit for searching and filtering DataFrame columns -- boolean indexing, query(), isin(), str.contains(), where(), loc[], and filter(). This guide covers each method with real-world examples, a comparison table, and performance benchmarks so you know exactly which approach to use and when.
Setup: Sample DataFrame
All examples in this guide use the following sample dataset. Copy this block to follow along:
import pandas as pd
import numpy as np
data = {
'order_id': [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008],
'customer': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob', 'Diana', np.nan, 'Alice'],
'product': ['Laptop', 'Phone', 'Tablet', 'Laptop', 'Headphones', 'Phone', 'Laptop', 'Monitor'],
'amount': [1200.00, 799.99, 450.00, 1350.50, 59.99, 899.00, 1100.00, 320.00],
'region': ['East', 'West', 'East', 'North', 'West', 'South', 'East', 'North'],
'status': ['completed', 'completed', 'pending', 'completed', 'returned', 'pending', 'completed', 'completed']
}
df = pd.DataFrame(data)
print(df)
Output:
order_id customer product amount region status
0 1001 Alice Laptop 1200.00 East completed
1 1002 Bob Phone 799.99 West completed
2 1003 Alice Tablet 450.00 East pending
3 1004 Charlie Laptop 1350.50 North completed
4 1005 Bob Headphones 59.99 West returned
5 1006 Diana Phone 899.00 South pending
6 1007 NaN Laptop 1100.00 East completed
7 1008 Alice Monitor 320.00 North completed
Boolean Indexing: Filter by Column Value
Boolean indexing is the most common way to filter a pandas DataFrame. You pass a condition inside square brackets, and pandas returns only the rows where that condition is True.
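To make the mechanics concrete, you can inspect the boolean mask on its own before using it to index. This is a minimal sketch using a cut-down frame rather than the full sample dataset:

```python
import pandas as pd

# Cut-down frame for illustration only
df = pd.DataFrame({"amount": [1200.0, 799.99, 450.0, 59.99]})

# The expression inside the brackets is itself a boolean Series (a "mask")
mask = df["amount"] > 500
print(mask.tolist())

# Indexing with the mask keeps only rows where it is True
print(df[mask])
```

The mask is an ordinary Series of True/False values aligned to the DataFrame's index, which is why it can be stored, combined, or reused across filters.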
Exact Match
# Find all orders from Alice
alice_orders = df[df['customer'] == 'Alice']
print(alice_orders)
Comparison Operators
You can use any standard comparison operator to filter by column value:
# Orders above $500
large_orders = df[df['amount'] > 500]
# Orders between $100 and $1000 (inclusive)
mid_range = df[(df['amount'] >= 100) & (df['amount'] <= 1000)]
print(mid_range)
| Operator | Meaning | Example |
|---|---|---|
| == | Equal to | df[df['region'] == 'East'] |
| != | Not equal to | df[df['status'] != 'returned'] |
| > / < | Greater / less than | df[df['amount'] > 1000] |
| >= / <= | Greater or equal / less or equal | df[df['amount'] >= 500] |
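For range checks like the one above, Series.between() is a tidier alternative to chaining >= and <=. A minimal sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"amount": [1200.0, 799.99, 450.0, 59.99]})

# between() is inclusive on both ends by default
mid_range = df[df["amount"].between(100, 1000)]
print(mid_range)
```

Pass inclusive="neither" (or "left"/"right") if you need strict bounds on either end.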
Negation with ~
# All orders NOT from the East region
non_east = df[~(df['region'] == 'East')]
print(non_east)
Filtering with isin() for Multiple Values
When you need to check whether a column value matches any item in a list, isin() is cleaner and faster than chaining multiple == conditions with |.
# Orders from East or North regions
target_regions = ['East', 'North']
regional = df[df['region'].isin(target_regions)]
print(regional)
To exclude specific values, combine isin() with the ~ operator:
# Exclude returned and pending orders
active = df[~df['status'].isin(['returned', 'pending'])]
print(active)
isin() also works with other iterables such as sets, NumPy arrays, and even another pandas Series. For large lists of values, passing a set instead of a list can be slightly faster because set lookups are O(1).
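To illustrate the point about other iterables, here is a small sketch (toy frame, not the sample dataset) passing both a Series and a set to isin():

```python
import pandas as pd

orders = pd.DataFrame({"customer": ["Alice", "Bob", "Diana", "Charlie"]})

# isin() accepts another Series...
vip = pd.Series(["Alice", "Diana"])
by_series = orders[orders["customer"].isin(vip)]
print(by_series)

# ...or a set
by_set = orders[orders["customer"].isin({"Alice", "Diana"})]
print(by_set)
```

Both forms return the same rows; pick whichever container you already have.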
The query() Method for SQL-Like Filtering
The query() method lets you write filter conditions as strings, similar to a SQL WHERE clause. This improves readability, especially when combining multiple conditions.
# Simple condition
df.query("amount > 500")
# Multiple conditions
df.query("amount > 500 and region == 'East'")
# Using variables with @
min_amount = 800
df.query("amount >= @min_amount and status == 'completed'")
The query() method uses the numexpr engine under the hood when available, which makes it faster than boolean indexing on large DataFrames. It also avoids the verbose df['column'] repetition.
# Compare readability:
# Boolean indexing
result = df[(df['amount'] > 500) & (df['region'] == 'East') & (df['status'] == 'completed')]
# query() - same result, much cleaner
result = df.query("amount > 500 and region == 'East' and status == 'completed'")
For more about the query method, see our pandas query guide.
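One query() detail worth knowing: column names that are not valid Python identifiers (spaces, leading digits) can be referenced with backticks. A short sketch on a hypothetical frame with a space in a column name:

```python
import pandas as pd

# Hypothetical frame whose column name contains a space
df = pd.DataFrame({"order total": [120, 880, 440],
                   "region": ["East", "East", "West"]})

# Backticks let query() reference the awkwardly named column
result = df.query("`order total` > 500 and region == 'East'")
print(result)
```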
String Search with str.contains() and Regex
Text columns often need partial matching, pattern matching, or case-insensitive search. The .str accessor provides a set of vectorized string methods for this.
Substring Search
# Find products containing "Phone" (case-insensitive)
phone_products = df[df['product'].str.contains('phone', case=False, na=False)]
print(phone_products)
Always pass na=False to avoid errors when the column contains missing values. Without it, str.contains() returns NaN for null entries, which breaks the boolean filter.
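To see the failure mode concretely, here is a small sketch (toy frame, assuming recent pandas behavior where a mask containing NaN cannot be used for boolean indexing):

```python
import pandas as pd
import numpy as np

names = pd.DataFrame({"customer": ["Alice", np.nan, "Alicia"]})

# Without na=False the mask contains NaN and boolean indexing fails
try:
    names[names["customer"].str.contains("Ali")]
    raised = False
except ValueError:
    raised = True
print("raised ValueError:", raised)

# na=False maps missing entries to False, so the filter works
safe = names[names["customer"].str.contains("Ali", na=False)]
print(safe)
```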
Regex Patterns
# Products that start with "L" or "M"
pattern = r'^[LM]'
filtered = df[df['product'].str.contains(pattern, regex=True, na=False)]
print(filtered)
# Find customers whose names end with a vowel
vowel_names = df[df['customer'].str.contains(r'[aeiou]$', case=False, regex=True, na=False)]
print(vowel_names)
Other String Methods
# Starts with
df[df['product'].str.startswith('L')]
# Ends with
df[df['region'].str.endswith('th')]
# Exact length
df[df['customer'].str.len() == 3]
# Match (full string regex match)
df[df['status'].str.match(r'comp.*')]
For more on pandas string operations, see our pandas string operations guide.
Pandas where() vs Boolean Indexing
The where() method works differently from boolean indexing. Instead of removing rows that don't match, it replaces non-matching values with NaN (or a value you specify), keeping the original DataFrame shape intact.
# Boolean indexing: returns only matching rows (shape changes)
filtered = df[df['amount'] > 500]
print(f"Boolean indexing shape: {filtered.shape}") # fewer rows
# where(): keeps all rows, replaces non-matching with NaN
masked = df.where(df['amount'] > 500)
print(f"where() shape: {masked.shape}") # same shape as original
print(masked)
You can also provide a replacement value:
# Replace non-matching amounts with 0 instead of NaN
df['amount'].where(df['amount'] > 500, 0)
Use where() when you need to preserve the DataFrame index and shape -- for example, when aligning data with other DataFrames or when building masks for visualization. Use boolean indexing when you want a clean subset with fewer rows.
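where() also has a logical inverse, mask(), which replaces values where the condition is True instead of False. A quick sketch on a toy Series:

```python
import pandas as pd

amounts = pd.Series([1200.0, 59.99, 899.0])

# where(): keep values where the condition is True, replace the rest
kept = amounts.where(amounts > 500, 0)
print(kept.tolist())

# mask(): replace values WHERE the condition is True
inverted = amounts.mask(amounts > 500, 0)
print(inverted.tolist())
```

Using the right one of the pair often saves a ~ negation in the condition.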
For a deeper dive, see our pandas where guide.
loc[] and iloc[] for Label and Position-Based Selection
loc[]: Label-Based Selection
loc[] filters rows by condition and selects specific columns at the same time. This is its main advantage over plain boolean indexing.
# Filter rows AND select specific columns
result = df.loc[df['amount'] > 500, ['customer', 'product', 'amount']]
print(result)
# Filter with multiple conditions, select two columns
result = df.loc[
(df['region'] == 'East') & (df['status'] == 'completed'),
['order_id', 'amount']
]
print(result)
iloc[]: Position-Based Selection
iloc[] selects by integer position. It does not support boolean conditions directly, but you can combine it with NumPy:
# Select the first 3 rows, columns 0 and 3
df.iloc[:3, [0, 3]]
# Select every other row
df.iloc[::2]
For a full guide on loc, see our pandas loc guide.
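The "combine it with NumPy" route mentioned above can look like this: convert a boolean mask to integer positions with np.flatnonzero, then hand those positions to iloc[]. A minimal sketch on a toy frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"amount": [1200.0, 59.99, 899.0, 450.0]})

# Turn a boolean mask into integer row positions
positions = np.flatnonzero(df["amount"].to_numpy() > 500)
print(positions)

# iloc[] accepts the integer positions directly
print(df.iloc[positions])
```

In practice plain boolean indexing is simpler; this pattern is mainly useful when you need the positions themselves (e.g. to index a parallel NumPy array).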
Select Columns with filter()
The filter() method selects columns (or rows) by name patterns. It does not filter rows by value -- instead, it filters column labels.
# Select columns that contain "er"
df.filter(like='er', axis=1)
# Returns: customer, order_id columns
# Select columns matching a regex
df.filter(regex=r'^(product|amount)$', axis=1)
# Select specific columns by name
df.filter(items=['customer', 'amount', 'region'])
This is useful when you want to select columns from a wide DataFrame without listing each column name individually.
Multi-Column Filtering with &, |, ~
Complex real-world filters usually involve multiple columns. Use & (AND), | (OR), and ~ (NOT) operators. Always wrap each condition in parentheses.
# Completed orders over $500 from East or North regions
result = df[
(df['status'] == 'completed') &
(df['amount'] > 500) &
(df['region'].isin(['East', 'North']))
]
print(result)
# Orders that are either high-value OR from a specific customer
result = df[
(df['amount'] > 1000) | (df['customer'] == 'Bob')
]
print(result)
A common mistake is using Python's and/or keywords instead of &/|. Python's and and or try to collapse an entire Series into a single boolean, which raises "ValueError: The truth value of a Series is ambiguous". Always use the bitwise operators with parentheses.
# WRONG - raises ValueError
# df[df['amount'] > 500 and df['region'] == 'East']
# CORRECT
df[(df['amount'] > 500) & (df['region'] == 'East')]
Filtering Missing Values with isna() and notna()
Missing data is common in real datasets. Pandas provides isna() (alias isnull()) and notna() (alias notnull()) to filter based on missing values.
# Rows where customer is missing
missing_customer = df[df['customer'].isna()]
print(missing_customer)
# Rows where customer is NOT missing
valid_customer = df[df['customer'].notna()]
print(valid_customer)
You can combine missing value checks with other conditions:
# Completed orders with a known customer
result = df[(df['customer'].notna()) & (df['status'] == 'completed')]
print(result)
For strategies on handling missing data beyond filtering, see our pandas missing values guide and pandas fillna guide.
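Missing value checks also extend across the whole frame: isna().any(axis=1) flags rows with a gap in any column, and notna().all(axis=1) keeps only fully populated rows. A short sketch on a toy frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "customer": ["Alice", np.nan, "Bob"],
    "amount": [100.0, 200.0, np.nan],
})

# Rows with a missing value in ANY column
any_missing = df[df.isna().any(axis=1)]
print(any_missing)

# Rows that are fully populated
complete = df[df.notna().all(axis=1)]
print(complete)
```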
Searching Across All Columns
Sometimes you need to find a value that could appear in any column of the DataFrame, not just one specific column.
# Find rows where any column equals "Laptop"
rows_with_laptop = df[df.eq('Laptop').any(axis=1)]
print(rows_with_laptop)
For string search across all string columns:
# Search for "East" in any column
mask = df.apply(lambda col: col.astype(str).str.contains('East', na=False))
df[mask.any(axis=1)]
This approach is slower than targeting a specific column, so use it only when you genuinely don't know which column contains the value.
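One way to limit the cost is to restrict the search to string columns with select_dtypes, so numeric columns are never cast to strings. A sketch on a small hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 2, 3],
    "region": ["East", "West", "North"],
    "status": ["ok", "East-hold", "ok"],
})

# Only search object (string) columns; leave numeric columns untouched
text_cols = df.select_dtypes(include="object")
mask = text_cols.apply(lambda col: col.str.contains("East", na=False))
hits = df[mask.any(axis=1)]
print(hits)
```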
Method Comparison Table
| Method | Best For | Modifies Shape | Supports Regex | Relative Speed |
|---|---|---|---|---|
| Boolean indexing | Simple conditions on any dtype | Yes (fewer rows) | No | Fast |
| isin() | Matching against a list of values | Yes | No | Fast |
| query() | Multi-condition filters, readability | Yes | No | Fastest (large data) |
| str.contains() | Substring/regex search in text | Yes | Yes | Medium |
| where() | Preserve shape, replace non-matches | No (same shape) | No | Medium |
| loc[] | Row filter + column selection | Yes | No | Fast |
| filter() | Select columns by name pattern | Yes (fewer columns) | Yes | Fast |
| isna() / notna() | Missing value detection | Yes | No | Fast |
Choose query() for large DataFrames with complex numeric conditions. Use str.contains() when working with text patterns. Use loc[] when you need to filter rows and select columns simultaneously.
Performance Benchmarks
Different methods have different performance characteristics, especially on large datasets. Here is a benchmark on a DataFrame with 2 million rows:
import time
# Create a large DataFrame
np.random.seed(42)
n = 2_000_000
df_large = pd.DataFrame({
'id': range(n),
'value': np.random.randint(0, 10000, n),
'category': np.random.choice(['A', 'B', 'C', 'D', 'E'], n),
'text': np.random.choice(['alpha', 'beta', 'gamma', 'delta', 'epsilon'], n),
'score': np.random.random(n)
})
# Boolean indexing
start = time.time()
_ = df_large[df_large['value'] > 5000]
t_bool = time.time() - start
# query()
start = time.time()
_ = df_large.query('value > 5000')
t_query = time.time() - start
# isin()
start = time.time()
_ = df_large[df_large['category'].isin(['A', 'B'])]
t_isin = time.time() - start
# str.contains()
start = time.time()
_ = df_large[df_large['text'].str.contains('alp', na=False)]
t_str = time.time() - start
# loc[]
start = time.time()
_ = df_large.loc[df_large['value'] > 5000, ['id', 'value']]
t_loc = time.time() - start
print(f"Boolean indexing: {t_bool:.4f}s")
print(f"query(): {t_query:.4f}s")
print(f"isin(): {t_isin:.4f}s")
print(f"str.contains(): {t_str:.4f}s")
print(f"loc[]: {t_loc:.4f}s")
Typical results on a modern machine (2M rows):
| Method | Time (approx.) | Notes |
|---|---|---|
| Boolean indexing | ~15ms | Baseline for numeric comparison |
| query() | ~10ms | 30-40% faster with numexpr |
| isin() | ~20ms | Depends on list size |
| str.contains() | ~300ms | String ops are inherently slower |
| loc[] | ~16ms | Similar to boolean indexing |
Key takeaways for performance:
- For numeric filters on large data, prefer query() when numexpr is installed.
- str.contains() is 10-20x slower than numeric comparisons. If you need to filter text columns repeatedly, convert to a categorical type or build an index.
- Converting a column to category dtype before filtering with isin() can speed up lookups by 2-5x on repeated queries.
- Avoid apply() with lambda functions for filtering -- vectorized methods are always faster.
# Speed up repeated text filtering with category dtype
df_large['category'] = df_large['category'].astype('category')
start = time.time()
_ = df_large[df_large['category'].isin(['A', 'B'])]
print(f"isin() with category dtype: {time.time() - start:.4f}s")
Real-World Examples
Example 1: Filtering Sales Data for a Report
# Load sales data (imagine 500K rows)
sales = pd.DataFrame({
'date': pd.date_range('2025-01-01', periods=1000, freq='h'),
'rep': np.random.choice(['Kim', 'Jordan', 'Alex', 'Sam'], 1000),
'product': np.random.choice(['Pro Plan', 'Basic Plan', 'Enterprise'], 1000),
'revenue': np.random.uniform(50, 5000, 1000).round(2),
'region': np.random.choice(['US-East', 'US-West', 'EU', 'APAC'], 1000)
})
# Q1 2025 Enterprise deals over $1000 in US regions
q1_enterprise = sales.query(
"date >= '2025-01-01' and date < '2025-04-01' "
"and product == 'Enterprise' "
"and revenue > 1000 "
"and region.str.startswith('US')",
engine='python'  # .str methods inside query() require the Python engine
)
print(f"Found {len(q1_enterprise)} matching deals")
print(q1_enterprise.head())
Example 2: Log File Analysis
# Parse server logs into a DataFrame
logs = pd.DataFrame({
'timestamp': pd.date_range('2025-01-01', periods=5000, freq='min'),
'level': np.random.choice(['INFO', 'WARNING', 'ERROR', 'DEBUG'], 5000, p=[0.6, 0.2, 0.1, 0.1]),
'message': np.random.choice([
'Request completed successfully',
'Connection timeout to database',
'NullPointerException in UserService',
'Cache miss for key user_123',
'Disk usage above 90%'
], 5000),
'response_time_ms': np.random.exponential(200, 5000).round(0)
})
# Find errors with slow response times
critical = logs[
(logs['level'] == 'ERROR') &
(logs['response_time_ms'] > 500)
]
# Find all log entries mentioning "timeout" or "exception" (case-insensitive)
issues = logs[logs['message'].str.contains(r'timeout|exception', case=False, regex=True, na=False)]
print(f"Critical errors: {len(critical)}")
print(f"Timeout/exception entries: {len(issues)}")
Example 3: Cleaning Survey Data
survey = pd.DataFrame({
'respondent_id': range(1, 8),
'age': [25, np.nan, 34, 150, 29, -5, 42],
'income': [50000, 62000, np.nan, 85000, 0, 47000, 91000],
'satisfaction': ['High', 'Medium', 'High', '', 'Low', np.nan, 'Medium']
})
# Remove invalid ages (must be between 0 and 120)
valid_age = survey[(survey['age'].notna()) & (survey['age'].between(0, 120))]
# Remove rows where satisfaction is empty or missing
valid_survey = valid_age[
(valid_age['satisfaction'].notna()) &
(valid_age['satisfaction'] != '')
]
print(valid_survey)
Visualize Filtered Data with PyGWalker
After filtering a DataFrame, the next step is usually to explore the results visually. PyGWalker (opens in a new tab) turns any pandas DataFrame into an interactive, Tableau-like visual interface inside Jupyter Notebook. Instead of writing matplotlib or seaborn code for every chart, you drag and drop columns to build visualizations instantly.
import pygwalker as pyg
# Filter then visualize
high_value = df[df['amount'] > 500]
pyg.walk(high_value)
This is particularly useful after complex multi-condition filtering, where you want to quickly check distributions, spot outliers, or compare groups in the filtered data without writing additional plotting code.
If you work in Jupyter Notebooks and want an AI-powered coding assistant alongside your data analysis, RunCell (opens in a new tab) provides an AI agent that integrates directly into the notebook environment, helping you write and debug filter expressions interactively.
FAQ
How do I search for a specific value in a pandas DataFrame column?
Use boolean indexing: df[df['column'] == value]. This returns all rows where the column equals the given value. For partial text matches, use df[df['column'].str.contains('substring', na=False)]. For checking membership in a list, use df[df['column'].isin([val1, val2])].
What is the fastest way to filter a large pandas DataFrame?
For numeric conditions, use query() with the numexpr engine -- it is 30-40% faster than boolean indexing on DataFrames with over 100,000 rows. For repeated categorical filtering, convert the column to category dtype first. Avoid apply() with lambda functions; vectorized operations are always faster.
What is the difference between where() and boolean indexing in pandas?
Boolean indexing (df[df['col'] > 5]) returns only the rows that satisfy the condition, reducing the number of rows in the result. where() keeps the original DataFrame shape but replaces non-matching values with NaN. Use where() when you need to preserve index alignment or DataFrame dimensions.
How do I filter pandas DataFrame rows with multiple conditions?
Combine conditions using & (AND), | (OR), and ~ (NOT) bitwise operators. Wrap each condition in parentheses: df[(df['A'] > 10) & (df['B'] == 'X')]. For cleaner syntax, use query(): df.query("A > 10 and B == 'X'"). Do not use Python's and/or keywords -- they cause a ValueError on Series.
How do I find and filter rows with missing values in pandas?
Use df[df['column'].isna()] to find rows where a column is NaN, and df[df['column'].notna()] to keep only non-null rows. To check for missing values across the entire DataFrame, use df[df.isna().any(axis=1)]. For filling missing values instead of filtering, see fillna() or interpolate().
Conclusion
Searching and filtering values in a pandas DataFrame is one of the most fundamental operations in data analysis. This guide covered the full range of methods available:
- Boolean indexing for direct comparisons and the most common use case
- isin() for matching against a list of values
- query() for readable, fast multi-condition filtering on large datasets
- str.contains() for substring and regex search in text columns
- where() for preserving DataFrame shape while masking values
- loc[] for combined row filtering and column selection
- filter() for selecting columns by name patterns
- isna() / notna() for handling missing data
For small to medium DataFrames, boolean indexing is the simplest and most readable approach. For large DataFrames with complex conditions, query() gives the best performance. For text search, str.contains() with regex support covers everything from simple substring matching to advanced pattern extraction.
Related Pandas Guides
- Pandas Filter Rows by Condition
- Pandas query() Method
- Pandas where() Guide
- Pandas loc[] Selection
- Pandas String Operations
- Pandas Missing Values
- Pandas fillna()
- Pandas Drop Duplicates
- Pandas Rename Column
- Pandas Sort Values