
Python Regex: The Complete Guide to Regular Expressions in Python


You have a 50,000-line log file and need to extract every IP address. Or a spreadsheet column of phone numbers in six different formats that all need to be normalized to one. Or user input that must match a strict email pattern before it reaches your database. Solving any of these with str.find() and str.replace() chains turns into fragile, unreadable code within minutes. Python regex -- regular expressions through the built-in re module -- gives you a single, expressive language for matching, searching, extracting, and transforming text patterns of any complexity.


This guide walks through every major feature of Python's re module, from basic pattern matching to advanced constructs like lookahead assertions and compiled patterns. Each section includes working code you can copy directly into a Python script or Jupyter notebook. By the end, you will have a solid reference for solving real-world text processing tasks with regular expressions.

What Is a Regular Expression?

A regular expression (regex) is a sequence of characters that defines a search pattern. Python implements regex through the re module in the standard library -- no installation required. You write a pattern string, pass it to a function like re.search() or re.findall(), and the regex engine scans your target text for matches.

import re
 
text = "Order #12345 was placed on 2026-02-11"
match = re.search(r'\d+', text)
print(match.group())  # Output: 12345

The r prefix creates a raw string, which prevents Python from interpreting backslashes as escape characters. Always use raw strings for regex patterns.

Core Functions in the re Module

The re module provides several functions for different use cases. Here is a quick reference before the detailed walkthrough.

| Function | Purpose | Returns |
| --- | --- | --- |
| re.search(pattern, string) | Find the first match anywhere in the string | Match object or None |
| re.match(pattern, string) | Match only at the beginning of the string | Match object or None |
| re.fullmatch(pattern, string) | Match the entire string against the pattern | Match object or None |
| re.findall(pattern, string) | Find all non-overlapping matches | List of strings or tuples |
| re.finditer(pattern, string) | Find all matches as an iterator | Iterator of Match objects |
| re.sub(pattern, repl, string) | Replace matches with a replacement string | Modified string |
| re.split(pattern, string) | Split the string at each match | List of strings |
| re.compile(pattern) | Compile a pattern for repeated use | Pattern object |

re.search() -- Find the First Match

re.search() scans the entire string and returns a Match object for the first occurrence. It returns None if no match is found.

import re
 
log_line = "ERROR 2026-02-11 14:32:01 - Connection timeout at 192.168.1.50"
 
# Find the first IP address
match = re.search(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', log_line)
if match:
    print(f"IP found: {match.group()}")    # IP found: 192.168.1.50
    print(f"Position: {match.start()}-{match.end()}")  # Position: 50-62

The Match object provides .group() for the matched text, .start() and .end() for positions, and .span() for both as a tuple.
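A quick sketch of those accessors on a fresh match:

```python
import re

# Inspect a Match object's text and position accessors
match = re.search(r'\d+', "Order #12345 shipped")
if match:
    print(match.group())  # 12345
    print(match.start())  # 7
    print(match.end())    # 12
    print(match.span())   # (7, 12)
```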

re.match() vs re.search()

A common source of confusion: re.match() only checks at the beginning of the string.

import re
 
text = "The price is $49.99"
 
# re.match() fails -- pattern is not at the start
result_match = re.match(r'\$\d+\.\d+', text)
print(result_match)  # None
 
# re.search() succeeds -- scans the whole string
result_search = re.search(r'\$\d+\.\d+', text)
print(result_search.group())  # $49.99

Use re.match() when you know the pattern must appear at position 0. Use re.search() for everything else.
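The function table earlier also lists re.fullmatch(), which succeeds only if the pattern consumes the entire string -- handy for strict validation:

```python
import re

# fullmatch requires the whole string to fit the pattern
print(re.fullmatch(r'\d{4}', "2026"))        # <re.Match object; span=(0, 4), match='2026'>
print(re.fullmatch(r'\d{4}', "2026-02-11"))  # None -- extra characters remain

# search happily matches a substring instead
print(re.search(r'\d{4}', "2026-02-11").group())  # 2026
```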

re.findall() -- Extract All Matches

re.findall() returns a list of all non-overlapping matches. This is often the function you reach for first during data extraction tasks.

import re
 
text = "Contact: alice@example.com, bob@company.org, support@service.net"
 
emails = re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', text)
print(emails)
# ['alice@example.com', 'bob@company.org', 'support@service.net']

When the pattern contains capturing groups, findall() returns the group contents instead of the full match:

import re
 
text = "Width: 1920px, Height: 1080px"
 
# Without groups -- returns full matches
values = re.findall(r'\d+px', text)
print(values)  # ['1920px', '1080px']
 
# With a capturing group -- returns only the group
values = re.findall(r'(\d+)px', text)
print(values)  # ['1920', '1080']

re.finditer() -- Memory-Efficient Match Iteration

For large texts, re.finditer() returns an iterator of Match objects instead of building a list in memory.

import re
 
log = """
ERROR 2026-02-11 10:00:01 Disk full
WARN  2026-02-11 10:01:15 High memory
ERROR 2026-02-11 10:02:30 Connection lost
INFO  2026-02-11 10:03:45 Service restarted
"""
 
for match in re.finditer(r'ERROR\s+(\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2})\s+(.*)', log):
    timestamp, message = match.group(1), match.group(2)
    print(f"[{timestamp}] {message}")
 
# [2026-02-11 10:00:01] Disk full
# [2026-02-11 10:02:30] Connection lost

re.sub() -- Search and Replace

re.sub() replaces every match of the pattern with a replacement string. It supports backreferences and callable replacements.

import re
 
# Basic replacement
text = "Published: 02/11/2026"
result = re.sub(r'(\d{2})/(\d{2})/(\d{4})', r'\3-\1-\2', text)
print(result)  # Published: 2026-02-11
 
# Using a function for dynamic replacement
def censor_email(match):
    user, domain = match.group(1), match.group(2)
    return f"{user[0]}***@{domain}"
 
text = "Send to alice@example.com or bob@company.org"
result = re.sub(r'([\w.+-]+)@([\w.-]+)', censor_email, text)
print(result)  # Send to a***@example.com or b***@company.org
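re.sub() also accepts a count argument to cap the number of replacements, and its sibling re.subn() additionally reports how many replacements were made:

```python
import re

text = "a-b-c-d"

# Stop after two replacements
print(re.sub(r'-', ':', text, count=2))  # a:b:c-d

# subn returns (new_string, number_of_replacements)
print(re.subn(r'-', ':', text))          # ('a:b:c:d', 3)
```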

re.split() -- Split on a Pattern

re.split() splits a string wherever the pattern matches, which is more powerful than str.split().

import re
 
# Split on multiple delimiters
data = "apple, banana;cherry  date|elderberry"
items = re.split(r'[,;\s|]+', data)
print(items)  # ['apple', 'banana', 'cherry', 'date', 'elderberry']
 
# Limit the number of splits
result = re.split(r'\s+', "one two three four five", maxsplit=2)
print(result)  # ['one', 'two', 'three four five']
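If the split pattern contains a capturing group, re.split() also keeps the delimiter text in the result list, which is useful when the separators carry meaning (for example, operators in an expression):

```python
import re

# A capturing group keeps the delimiters in the output
parts = re.split(r'([+-])', "10+20-5")
print(parts)  # ['10', '+', '20', '-', '5']
```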

Regex Syntax: Metacharacters, Character Classes, and Quantifiers

Understanding the building blocks of regex patterns is essential for writing effective expressions.

Metacharacters Reference

| Metacharacter | Meaning | Example | Matches |
| --- | --- | --- | --- |
| . | Any character except newline | a.c | abc, a1c, a-c |
| ^ | Start of string (or line with MULTILINE) | ^Hello | Hello world |
| $ | End of string (or line with MULTILINE) | world$ | Hello world |
| \d | Any digit [0-9] | \d{3} | 123, 456 |
| \D | Any non-digit | \D+ | abc, hello |
| \w | Word character [a-zA-Z0-9_] | \w+ | hello_world |
| \W | Non-word character | \W | @, #, space |
| \s | Whitespace [ \t\n\r\f\v] | \s+ | spaces, tabs |
| \S | Non-whitespace | \S+ | hello |
| \b | Word boundary | \bcat\b | cat but not catch |
| \| | Alternation (OR) | cat\|dog | cat or dog |
| \\ | Escape a metacharacter | \\. | literal . |

Character Classes

Square brackets define a set of characters to match.

import re
 
text = "My phone: (555) 867-5309, ext. 42"
 
# Match digits only
digits = re.findall(r'[0-9]+', text)
print(digits)  # ['555', '867', '5309', '42']
 
# Match lowercase vowels
vowels = re.findall(r'[aeiou]', "Hello World")
print(vowels)  # ['e', 'o', 'o']
 
# Negated class -- match anything NOT a digit
non_digits = re.findall(r'[^0-9]+', "abc123def456")
print(non_digits)  # ['abc', 'def']
 
# Range -- match runs of hex digits
hex_chars = re.findall(r'[0-9a-fA-F]+', "Color: #FF5733")
print(hex_chars)  # ['C', 'FF5733'] -- 'C' also falls in the A-F range
# More precise: match hex color codes
hex_colors = re.findall(r'#[0-9a-fA-F]{6}', "Color: #FF5733 and #00AAFF")
print(hex_colors)  # ['#FF5733', '#00AAFF']

Quantifiers

Quantifiers control how many times a preceding element must appear.

| Quantifier | Meaning | Example |
| --- | --- | --- |
| * | 0 or more (greedy) | \d* matches "", "1", "123" |
| + | 1 or more (greedy) | \d+ matches "1", "123" but not "" |
| ? | 0 or 1 (optional) | colou?r matches color and colour |
| {n} | Exactly n times | \d{4} matches 2026 |
| {n,} | n or more times | \d{2,} matches 12, 123, 1234 |
| {n,m} | Between n and m times | \d{2,4} matches 12, 123, 1234 |

Greedy vs Lazy Matching

By default, quantifiers are greedy -- they match as much text as possible. Adding ? after a quantifier makes it lazy (matches as little as possible).

import re
 
html = "<b>bold</b> and <i>italic</i>"
 
# Greedy -- matches from first < to last >
greedy = re.findall(r'<.*>', html)
print(greedy)  # ['<b>bold</b> and <i>italic</i>']
 
# Lazy -- matches the smallest possible chunk
lazy = re.findall(r'<.*?>', html)
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']

This distinction is critical when parsing HTML, XML, or any text with nested delimiters.

Groups and Capturing

Parentheses create capturing groups that let you extract specific parts of a match.

Basic Groups

import re
 
text = "Date: 2026-02-11"
match = re.search(r'(\d{4})-(\d{2})-(\d{2})', text)
 
if match:
    print(match.group(0))  # Full match: 2026-02-11
    print(match.group(1))  # First group: 2026
    print(match.group(2))  # Second group: 02
    print(match.group(3))  # Third group: 11
    print(match.groups())  # All groups: ('2026', '02', '11')

Named Groups

Named groups improve readability, especially in complex patterns.

import re
 
pattern = r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'
match = re.search(pattern, "Event date: 2026-02-11")
 
if match:
    print(match.group('year'))   # 2026
    print(match.group('month'))  # 02
    print(match.group('day'))    # 11
    print(match.groupdict())     # {'year': '2026', 'month': '02', 'day': '11'}

Named groups are particularly useful when parsing structured text like log files or CSV data, where referring to groups by index makes code hard to maintain.
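Named groups can also be referenced in re.sub() replacement strings with \g<name>, which keeps substitutions readable:

```python
import re

# Reformat a date by referring to named groups in the replacement
text = "Deadline: 2026-02-11"
result = re.sub(
    r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})',
    r'\g<month>/\g<day>/\g<year>',
    text,
)
print(result)  # Deadline: 02/11/2026
```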

Non-Capturing Groups

When you need grouping for alternation or quantifiers but do not need to capture the content, use (?:...).

import re
 
# Capturing group -- appears in findall results
result = re.findall(r'(https?|ftp)://\S+', "Visit https://example.com")
print(result)  # ['https']
 
# Non-capturing group -- full match returned
result = re.findall(r'(?:https?|ftp)://\S+', "Visit https://example.com")
print(result)  # ['https://example.com']

Backreferences

Backreferences match the same text that was previously captured by a group.

import re
 
# Find repeated words
text = "This is is a test test of repeated words"
duplicates = re.findall(r'\b(\w+)\s+\1\b', text)
print(duplicates)  # ['is', 'test']
 
# Using named backreference
duplicates = re.findall(r'\b(?P<word>\w+)\s+(?P=word)\b', text)
print(duplicates)  # ['is', 'test']

Lookahead and Lookbehind Assertions

Lookahead and lookbehind are zero-width assertions -- they check what comes before or after the current position without consuming characters.

| Syntax | Name | Meaning |
| --- | --- | --- |
| (?=...) | Positive lookahead | Followed by ... |
| (?!...) | Negative lookahead | NOT followed by ... |
| (?<=...) | Positive lookbehind | Preceded by ... |
| (?<!...) | Negative lookbehind | NOT preceded by ... |

import re
 
# Positive lookahead: find numbers followed by "px"
text = "width: 200px; height: 150px; margin: 10em"
pixels = re.findall(r'\d+(?=px)', text)
print(pixels)  # ['200', '150']
 
# Negative lookahead: find numbers NOT followed by "px"
# (a bare (?!px) would backtrack and match "20" inside "200px",
#  so the lookahead must also rule out remaining digits)
others = re.findall(r'\d+(?!\d*px)', text)
print(others)  # ['10']
 
# Positive lookbehind: extract prices after "$"
prices = "Items: $19.99, $5.50, 100 points"
amounts = re.findall(r'(?<=\$)\d+\.\d{2}', prices)
print(amounts)  # ['19.99', '5.50']
 
# Negative lookbehind: match "test" not preceded by "unit"
text = "unittest, integration test, system test"
matches = re.findall(r'(?<!unit)test', text)
print(matches)  # ['test', 'test']

Password Validation with Lookahead

A classic use case for lookahead is enforcing multiple conditions on a single string.

import re
 
def validate_password(password):
    """
    Requires:
    - At least 8 characters
    - At least one uppercase letter
    - At least one lowercase letter
    - At least one digit
    - At least one special character
    """
    pattern = r'^(?=.*[A-Z])(?=.*[a-z])(?=.*\d)(?=.*[!@#$%^&*]).{8,}$'
    return bool(re.match(pattern, password))
 
print(validate_password("Weak"))          # False
print(validate_password("Str0ng!Pass"))   # True
print(validate_password("nouppercase1!")) # False

Compiling Patterns with re.compile()

When you use the same pattern repeatedly, re.compile() creates a reusable Pattern object. This avoids recompiling the pattern on every call and makes the code cleaner.

import re
 
# Compile once, use many times
email_pattern = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')
 
texts = [
    "Contact alice@example.com for details",
    "No email here",
    "Send to bob@company.org and carol@service.net",
]
 
for text in texts:
    matches = email_pattern.findall(text)
    if matches:
        print(f"Found: {matches}")
 
# Found: ['alice@example.com']
# Found: ['bob@company.org', 'carol@service.net']

Performance Benefit

Python caches recently used patterns internally (up to 512 entries), so the performance gain from re.compile() is modest for simple scripts. However, re.compile() becomes important when:

  • You use the same pattern in a tight loop processing millions of records
  • You want to attach flags to the pattern once rather than passing them every time
  • You want to store patterns as module-level constants for readability
import re
 
# Compile with flags
LOG_PATTERN = re.compile(
    r'^(?P<level>ERROR|WARN|INFO)\s+'
    r'(?P<timestamp>\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2})\s+'
    r'(?P<message>.+)$',
    re.MULTILINE
)
 
log_data = """ERROR 2026-02-11 08:15:00 Disk usage at 95%
INFO 2026-02-11 08:16:00 Cleanup started
WARN 2026-02-11 08:17:00 Slow query detected"""
 
for match in LOG_PATTERN.finditer(log_data):
    print(match.groupdict())
 
# {'level': 'ERROR', 'timestamp': '2026-02-11 08:15:00', 'message': 'Disk usage at 95%'}
# {'level': 'INFO', 'timestamp': '2026-02-11 08:16:00', 'message': 'Cleanup started'}
# {'level': 'WARN', 'timestamp': '2026-02-11 08:17:00', 'message': 'Slow query detected'}

Regex Flags

Flags modify how the regex engine interprets a pattern. You can pass them to any re function or combine them with the bitwise OR operator |.

| Flag | Short Form | Effect |
| --- | --- | --- |
| re.IGNORECASE | re.I | Case-insensitive matching |
| re.MULTILINE | re.M | ^ and $ match at line boundaries |
| re.DOTALL | re.S | . matches newline characters too |
| re.VERBOSE | re.X | Allow comments and whitespace in patterns |
| re.ASCII | re.A | \w, \d, \s match ASCII only |
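Flags can also be embedded at the start of the pattern itself with inline syntax, for example (?i) for case-insensitive matching:

```python
import re

# Inline flag -- equivalent to passing re.IGNORECASE
print(re.findall(r'(?i)python', "Python, PYTHON, python"))
# ['Python', 'PYTHON', 'python']
```

Note that since Python 3.11, global inline flags like (?i) must appear at the very start of the pattern.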

re.VERBOSE for Readable Patterns

Complex patterns become much easier to understand with re.VERBOSE.

import re
 
# Without VERBOSE -- hard to read
url_pattern = r'https?://(?:www\.)?[\w.-]+\.\w{2,}(?:/[\w./-]*)?(?:\?[\w=&]*)?'
 
# With VERBOSE -- same pattern, much clearer
url_pattern = re.compile(r"""
    https?://           # Protocol (http or https)
    (?:www\.)?          # Optional www prefix
    [\w.-]+             # Domain name
    \.\w{2,}            # Top-level domain (.com, .org, etc.)
    (?:/[\w./-]*)?      # Optional path
    (?:\?[\w=&]*)?      # Optional query string
""", re.VERBOSE)
 
text = "Visit https://www.example.com/path?key=value or http://test.org"
print(url_pattern.findall(text))
# ['https://www.example.com/path?key=value', 'http://test.org']

Combining Flags

import re
 
text = """First line
SECOND LINE
third line"""
 
# Case-insensitive + multiline
matches = re.findall(r'^.*line$', text, re.IGNORECASE | re.MULTILINE)
print(matches)  # ['First line', 'SECOND LINE', 'third line']
 
# Add DOTALL to make . match newlines too
match = re.search(r'First.*third', text, re.DOTALL)
print(match.group())  # 'First line\nSECOND LINE\nthird'

Common Regex Patterns Reference

Here is a collection of battle-tested patterns for common validation and extraction tasks.

| Use Case | Pattern | Notes |
| --- | --- | --- |
| Email (basic) | [\w.+-]+@[\w-]+\.[\w.-]+ | Covers most real-world emails |
| Phone (US) | (?:\+1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4} | Handles multiple formats |
| URL | https?://[\w.-]+(?:\.[\w]{2,})(?:/[\w./?#&=-]*)? | HTTP and HTTPS |
| IPv4 address | \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3} | Does not validate octet range |
| Date (YYYY-MM-DD) | \d{4}-(?:0[1-9]\|1[0-2])-(?:0[1-9]\|[12]\d\|3[01]) | Validates month and day ranges |
| Time (HH:MM:SS) | (?:[01]\d\|2[0-3]):[0-5]\d:[0-5]\d | 24-hour format |
| Hex color | #[0-9a-fA-F]{6}(?:[0-9a-fA-F]{2})? | 6 or 8 digit hex |
| Integer | [-+]?\d+ | With optional sign |
| Float | [-+]?\d*\.?\d+(?:[eE][-+]?\d+)? | Scientific notation supported |
| HTML tag | <([a-zA-Z][\w]*)\b[^>]*>.*?</\1> | Basic matching only |

Real-World Examples

Parsing Log Files

Log analysis is one of the most common regex tasks in production environments.

import re
from collections import Counter
 
log_data = """
192.168.1.10 - - [11/Feb/2026:10:00:01 +0000] "GET /api/users HTTP/1.1" 200 1234
10.0.0.5 - - [11/Feb/2026:10:00:02 +0000] "POST /api/login HTTP/1.1" 401 89
192.168.1.10 - - [11/Feb/2026:10:00:03 +0000] "GET /api/data HTTP/1.1" 200 5678
172.16.0.1 - - [11/Feb/2026:10:00:04 +0000] "GET /api/users HTTP/1.1" 500 45
10.0.0.5 - - [11/Feb/2026:10:00:05 +0000] "POST /api/login HTTP/1.1" 401 89
"""
 
# Extract all IP addresses and status codes
pattern = re.compile(r'(\d+\.\d+\.\d+\.\d+).*?"(\w+)\s+(\S+).*?"\s+(\d{3})')
 
for match in pattern.finditer(log_data):
    ip, method, path, status = match.groups()
    print(f"{ip} | {method} {path} | Status: {status}")
 
# Count failed login attempts per IP
failed_logins = re.findall(
    r'(\d+\.\d+\.\d+\.\d+).*?POST /api/login.*?\s401\s',
    log_data
)
print(f"\nFailed login attempts by IP: {Counter(failed_logins)}")
# Counter({'10.0.0.5': 2})

Data Cleaning with re.sub()

When working with messy datasets, regex-based cleaning is faster and more reliable than chaining string methods.

import re
 
# Clean up messy phone numbers into a standard format
phones = [
    "(555) 867-5309",
    "555.867.5309",
    "555 867 5309",
    "+1-555-867-5309",
    "5558675309",
]
 
def standardize_phone(phone):
    digits = re.sub(r'\D', '', phone)       # Remove all non-digits
    if len(digits) == 11 and digits[0] == '1':
        digits = digits[1:]                  # Strip country code
    if len(digits) == 10:
        return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
    return phone  # Return original if it doesn't fit
 
for p in phones:
    print(f"{p:>20s}  ->  {standardize_phone(p)}")
 
#       (555) 867-5309  ->  (555) 867-5309
#         555.867.5309  ->  (555) 867-5309
#         555 867 5309  ->  (555) 867-5309
#      +1-555-867-5309  ->  (555) 867-5309
#           5558675309  ->  (555) 867-5309

Input Validation

import re
 
def validate_email(email):
    """Validate an email address with a practical regex."""
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    return bool(re.match(pattern, email))
 
def validate_date(date_str):
    """Validate YYYY-MM-DD format with basic range checking.

    Note: the day check is not month-aware, so 2026-02-30 still passes.
    """
    pattern = r'^(\d{4})-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$'
    match = re.match(pattern, date_str)
    if not match:
        return False
    return 1900 <= int(match.group(1)) <= 2100
 
test_emails = ["user@example.com", "bad@", "test.user+tag@domain.co.uk", "@missing.com"]
for email in test_emails:
    print(f"{email:30s} -> {validate_email(email)}")
 
test_dates = ["2026-02-11", "2026-13-01", "2026-02-30", "abcd-ef-gh"]
for date in test_dates:
    print(f"{date:15s} -> {validate_date(date)}")

Extracting Structured Data

import re
 
# Parse a markdown table into a list of dictionaries
markdown_table = """
| Name    | Age | City       |
|---------|-----|------------|
| Alice   | 30  | New York   |
| Bob     | 25  | London     |
| Charlie | 35  | Tokyo      |
"""
 
# Skip the header separator row, extract data rows
rows = re.findall(r'^\|\s*(\w+)\s*\|\s*(\d+)\s*\|\s*([\w\s]+?)\s*\|$',
                  markdown_table, re.MULTILINE)
 
data = [{'name': name, 'age': int(age), 'city': city} for name, age, city in rows]
print(data)
# [{'name': 'Alice', 'age': 30, 'city': 'New York'},
#  {'name': 'Bob', 'age': 25, 'city': 'London'},
#  {'name': 'Charlie', 'age': 35, 'city': 'Tokyo'}]

If you regularly parse and explore structured data in notebooks, tools like RunCell can streamline this workflow. RunCell is an AI agent that runs inside Jupyter, helping you write, debug, and iterate on regex patterns and data extraction code interactively.

Common Mistakes and Pitfalls

1. Forgetting Raw Strings

Without the r prefix, Python interprets backslashes as escape characters before the regex engine sees them.

import re
 
# Wrong -- \b is interpreted as a backspace character
match = re.search('\bword\b', 'a word here')
print(match)  # None
 
# Correct -- raw string preserves \b for the regex engine
match = re.search(r'\bword\b', 'a word here')
print(match.group())  # word

2. Using re.match() When You Mean re.search()

re.match() only checks the beginning of the string. This catches many beginners off guard.

import re
 
# This returns None because "error" is not at position 0
result = re.match(r'error', "System error detected")
print(result)  # None
 
# Use re.search() or anchor with ^
result = re.search(r'error', "System error detected")
print(result.group())  # error

3. Greedy Matching Capturing Too Much

import re
 
# Greedy .* grabs everything between the FIRST { and LAST }
text = "{first} and {second}"
result = re.search(r'\{.*\}', text)
print(result.group())  # {first} and {second}
 
# Fix with lazy quantifier
result = re.search(r'\{.*?\}', text)
print(result.group())  # {first}

4. Not Escaping Special Characters

Characters like ., *, +, ?, (, ), [, ], {, }, ^, $, |, and \ have special meanings. If you want to match them literally, escape with \ or use re.escape().

import re
 
# Wrong -- . matches any character
result = re.findall(r'3.14', "3.14 and 3x14")
print(result)  # ['3.14', '3x14']
 
# Correct -- \. matches a literal dot
result = re.findall(r'3\.14', "3.14 and 3x14")
print(result)  # ['3.14']
 
# Use re.escape() for user-provided strings
user_input = "price is $5.00 (USD)"
safe_pattern = re.escape(user_input)
print(safe_pattern)  # price\ is\ \$5\.00\ \(USD\)
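The escaped pattern can then be used directly for a literal search:

```python
import re

# Search for user-provided text literally via re.escape()
user_input = "price is $5.00 (USD)"
pattern = re.escape(user_input)
match = re.search(pattern, "Note: price is $5.00 (USD) as listed")
print(match.group())  # price is $5.00 (USD)
```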

5. Catastrophic Backtracking

Certain patterns can cause the regex engine to take exponential time on specific inputs. Avoid nested quantifiers on overlapping character classes.

import re
import time
 
# Dangerous pattern -- nested quantifiers over the same character
# pattern = r'(a+)+b'  # DO NOT use on untrusted input: on a near-miss
# string like "a" * 30 with no trailing "b", the engine tries
# exponentially many ways to split the run of a's before failing
 
# Safe alternative -- flatten the quantifier
pattern = r'a+b'
 
text = "a" * 30 + "b"
start = time.time()
re.search(pattern, text)
print(f"Safe pattern took: {time.time() - start:.4f}s")

Regex in Data Science Workflows

Regular expressions are especially useful in data science for cleaning and transforming text columns in pandas DataFrames.

import re
import pandas as pd
 
df = pd.DataFrame({
    'raw_phone': ['(555) 123-4567', '555.123.4567', '555 123 4567', '+1-555-123-4567'],
    'raw_text': ['Price: $12.99', 'Cost is $8.50!', 'Only $199.00 left', 'Free (was $25)']
})
 
# Strip non-digit characters from phone numbers
df['clean_phone'] = df['raw_phone'].str.replace(r'\D', '', regex=True)
 
# Extract prices from text
df['price'] = df['raw_text'].str.extract(r'\$(\d+\.\d{2})')
 
print(df)

For interactive data exploration and visualization after cleaning, PyGWalker turns any pandas DataFrame into a drag-and-drop visual interface -- no chart code needed.

Summary

Python's re module provides a complete regex implementation that handles text matching, extraction, replacement, and splitting. The key points to remember:

  • Use raw strings (r'...') for all regex patterns
  • re.search() finds the first match anywhere; re.match() only checks the start
  • re.findall() extracts all matches; re.finditer() is memory-efficient for large texts
  • Capturing groups () extract sub-patterns; named groups (?P<name>...) improve readability
  • Lookahead and lookbehind assert context without consuming characters
  • re.compile() improves readability and performance for reused patterns
  • re.VERBOSE makes complex patterns maintainable with comments
  • Always test patterns against edge cases, and watch out for greedy matching and catastrophic backtracking

Regular expressions are a foundational skill for anyone working with text data in Python. Whether you are parsing logs, validating input, or cleaning datasets, mastering the re module will save you significant time and effort.
