Python Regex: The Complete Guide to Regular Expressions in Python
You have a 50,000-line log file and need to extract every IP address. Or a spreadsheet column full of phone numbers in six different formats that all need to become one. Or user input that must match a strict email pattern before it reaches your database. Solving any of these with str.find() and str.replace() chains turns into fragile, unreadable code within minutes. Python regex -- regular expressions through the built-in re module -- gives you a single, expressive language for matching, searching, extracting, and transforming text patterns of any complexity.
This guide walks through every major feature of Python's re module, from basic pattern matching to advanced constructs like lookahead assertions and compiled patterns. Each section includes working code you can copy directly into a Python script or Jupyter notebook. By the end, you will have a solid reference for solving real-world text processing tasks with regular expressions.
What Is a Regular Expression?
A regular expression (regex) is a sequence of characters that defines a search pattern. Python implements regex through the re module in the standard library -- no installation required. You write a pattern string, pass it to a function like re.search() or re.findall(), and the regex engine scans your target text for matches.
import re
text = "Order #12345 was placed on 2026-02-11"
match = re.search(r'\d+', text)
print(match.group())  # Output: 12345
The r prefix creates a raw string, which prevents Python from interpreting backslashes as escape characters. Always use raw strings for regex patterns.
Core Functions in the re Module
The re module provides several functions for different use cases. Here is a quick reference before the detailed walkthrough.
| Function | Purpose | Returns |
|---|---|---|
| re.search(pattern, string) | Find the first match anywhere in the string | Match object or None |
| re.match(pattern, string) | Match only at the beginning of the string | Match object or None |
| re.fullmatch(pattern, string) | Match the entire string against the pattern | Match object or None |
| re.findall(pattern, string) | Find all non-overlapping matches | List of strings or tuples |
| re.finditer(pattern, string) | Find all matches as an iterator | Iterator of Match objects |
| re.sub(pattern, repl, string) | Replace matches with a replacement string | Modified string |
| re.split(pattern, string) | Split the string at each match | List of strings |
| re.compile(pattern) | Compile a pattern for repeated use | Pattern object |
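Most of these functions get a dedicated section below; re.fullmatch() is the exception, so here is a quick sketch of how it differs from search and match:

```python
import re

# fullmatch succeeds only if the pattern consumes the ENTIRE string
print(re.fullmatch(r'\d{4}', "2026"))        # Match object
print(re.fullmatch(r'\d{4}', "2026-02-11"))  # None -- trailing text remains
# Handy for strict validation without explicit ^...$ anchors
print(bool(re.fullmatch(r'[A-Z]{2}\d{4}', "AB1234")))  # True
```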
re.search() -- Find the First Match
re.search() scans the entire string and returns a Match object for the first occurrence. It returns None if no match is found.
import re
log_line = "ERROR 2026-02-11 14:32:01 - Connection timeout at 192.168.1.50"
# Find the first IP address
match = re.search(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', log_line)
if match:
print(f"IP found: {match.group()}") # IP found: 192.168.1.50
print(f"Position: {match.start()}-{match.end()}")  # Position: 50-62
The Match object provides .group() for the matched text, .start() and .end() for positions, and .span() for both as a tuple.
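The .span() method returns the start and end positions together, which is convenient for slicing the original string:

```python
import re

line = "ERROR 2026-02-11 14:32:01 - Connection timeout at 192.168.1.50"
m = re.search(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', line)
start, end = m.span()
print(line[start:end])  # 192.168.1.50 -- slice the string with the span
```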
re.match() vs re.search()
A common source of confusion: re.match() only checks at the beginning of the string.
import re
text = "The price is $49.99"
# re.match() fails -- pattern is not at the start
result_match = re.match(r'\$\d+\.\d+', text)
print(result_match) # None
# re.search() succeeds -- scans the whole string
result_search = re.search(r'\$\d+\.\d+', text)
print(result_search.group())  # $49.99
Use re.match() when you know the pattern must appear at position 0. Use re.search() for everything else.
re.findall() -- Extract All Matches
re.findall() returns a list of all non-overlapping matches. This is often the function you reach for first during data extraction tasks.
import re
text = "Contact: alice@example.com, bob@company.org, support@service.net"
emails = re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', text)
print(emails)
# ['alice@example.com', 'bob@company.org', 'support@service.net']
When the pattern contains capturing groups, findall() returns the group contents instead of the full match:
import re
text = "Width: 1920px, Height: 1080px"
# Without groups -- returns full matches
values = re.findall(r'\d+px', text)
print(values) # ['1920px', '1080px']
# With a capturing group -- returns only the group
values = re.findall(r'(\d+)px', text)
print(values)  # ['1920', '1080']
re.finditer() -- Memory-Efficient Match Iteration
For large texts, re.finditer() returns an iterator of Match objects instead of building a list in memory.
import re
log = """
ERROR 2026-02-11 10:00:01 Disk full
WARN 2026-02-11 10:01:15 High memory
ERROR 2026-02-11 10:02:30 Connection lost
INFO 2026-02-11 10:03:45 Service restarted
"""
for match in re.finditer(r'ERROR\s+(\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2})\s+(.*)', log):
timestamp, message = match.group(1), match.group(2)
print(f"[{timestamp}] {message}")
# [2026-02-11 10:00:01] Disk full
# [2026-02-11 10:02:30] Connection lost
re.sub() -- Search and Replace
re.sub() replaces every match of the pattern with a replacement string. It supports backreferences and callable replacements.
import re
# Basic replacement
text = "Published: 02/11/2026"
result = re.sub(r'(\d{2})/(\d{2})/(\d{4})', r'\3-\1-\2', text)
print(result) # Published: 2026-02-11
# Using a function for dynamic replacement
def censor_email(match):
user, domain = match.group(1), match.group(2)
return f"{user[0]}***@{domain}"
text = "Send to alice@example.com or bob@company.org"
result = re.sub(r'([\w.+-]+)@([\w.-]+)', censor_email, text)
print(result)  # Send to a***@example.com or b***@company.org
re.split() -- Split on a Pattern
re.split() splits a string wherever the pattern matches, which is more powerful than str.split().
import re
# Split on multiple delimiters
data = "apple, banana;cherry date|elderberry"
items = re.split(r'[,;\s|]+', data)
print(items) # ['apple', 'banana', 'cherry', 'date', 'elderberry']
# Limit the number of splits
result = re.split(r'\s+', "one two three four five", maxsplit=2)
print(result)  # ['one', 'two', 'three four five']
Regex Syntax: Metacharacters, Character Classes, and Quantifiers
Understanding the building blocks of regex patterns is essential for writing effective expressions.
Metacharacters Reference
| Metacharacter | Meaning | Example | Matches |
|---|---|---|---|
| . | Any character except newline | a.c | abc, a1c, a-c |
| ^ | Start of string (or line with MULTILINE) | ^Hello | Hello world |
| $ | End of string (or line with MULTILINE) | world$ | Hello world |
| \d | Any digit [0-9] | \d{3} | 123, 456 |
| \D | Any non-digit | \D+ | abc, hello |
| \w | Word character [a-zA-Z0-9_] | \w+ | hello_world |
| \W | Non-word character | \W | @, #, space |
| \s | Whitespace [ \t\n\r\f\v] | \s+ | spaces, tabs |
| \S | Non-whitespace | \S+ | hello |
| \b | Word boundary | \bcat\b | cat but not catch |
| \| | Alternation (OR) | cat\|dog | cat or dog |
| \ | Escape a metacharacter | \. | literal . |
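A few of the metacharacters above in action, showing how word boundaries and alternation interact:

```python
import re

text = "The cat chased the catalog; a dog and a dogged hound followed"
# \b keeps 'cat' from matching inside 'catalog'
print(re.findall(r'\bcat\b', text))          # ['cat']
# Alternation with boundaries matches either whole word
print(re.findall(r'\b(?:cat|dog)\b', text))  # ['cat', 'dog']
```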
Character Classes
Square brackets define a set of characters to match.
import re
text = "My phone: (555) 867-5309, ext. 42"
# Match digits only
digits = re.findall(r'[0-9]+', text)
print(digits) # ['555', '867', '5309', '42']
# Match lowercase vowels
vowels = re.findall(r'[aeiou]', "Hello World")
print(vowels) # ['e', 'o', 'o']
# Negated class -- match anything NOT a digit
non_digits = re.findall(r'[^0-9]+', "abc123def456")
print(non_digits) # ['abc', 'def']
# Range -- match runs of hex digits
hex_chars = re.findall(r'[0-9a-fA-F]+', "Color: #FF5733")
print(hex_chars)  # ['C', 'FF5733'] -- the 'C' in "Color" falls in the A-F range
# More precise: anchor on '#' to match hex color codes
hex_colors = re.findall(r'#[0-9a-fA-F]{6}', "Color: #FF5733 and #00AAFF")
print(hex_colors)  # ['#FF5733', '#00AAFF']
Quantifiers
Quantifiers control how many times a preceding element must appear.
| Quantifier | Meaning | Example |
|---|---|---|
| * | 0 or more (greedy) | \d* matches "", "1", "123" |
| + | 1 or more (greedy) | \d+ matches "1", "123" but not "" |
| ? | 0 or 1 (optional) | colou?r matches color and colour |
| {n} | Exactly n times | \d{4} matches 2026 |
| {n,} | n or more times | \d{2,} matches 12, 123, 1234 |
| {n,m} | Between n and m times | \d{2,4} matches 12, 123, 1234 |
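A quick demonstration of the quantifiers from the table:

```python
import re

# ? makes the preceding element optional
print(re.findall(r'colou?r', "color colour"))  # ['color', 'colour']
# {n,m} bounds repetition; matching is greedy within the bound
print(re.findall(r'\d{2,4}', "7 42 2026"))     # ['42', '2026']
# * permits zero occurrences of the preceding element
print(re.search(r'ab*c', "ac").group())        # ac
```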
Greedy vs Lazy Matching
By default, quantifiers are greedy -- they match as much text as possible. Adding ? after a quantifier makes it lazy (matches as little as possible).
import re
html = "<b>bold</b> and <i>italic</i>"
# Greedy -- matches from first < to last >
greedy = re.findall(r'<.*>', html)
print(greedy) # ['<b>bold</b> and <i>italic</i>']
# Lazy -- matches the smallest possible chunk
lazy = re.findall(r'<.*?>', html)
print(lazy)  # ['<b>', '</b>', '<i>', '</i>']
This distinction is critical when parsing HTML, XML, or any text with nested delimiters.
Groups and Capturing
Parentheses create capturing groups that let you extract specific parts of a match.
Basic Groups
import re
text = "Date: 2026-02-11"
match = re.search(r'(\d{4})-(\d{2})-(\d{2})', text)
if match:
print(match.group(0)) # Full match: 2026-02-11
print(match.group(1)) # First group: 2026
print(match.group(2)) # Second group: 02
print(match.group(3)) # Third group: 11
print(match.groups())  # All groups: ('2026', '02', '11')
Named Groups
Named groups improve readability, especially in complex patterns.
import re
pattern = r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'
match = re.search(pattern, "Event date: 2026-02-11")
if match:
print(match.group('year')) # 2026
print(match.group('month')) # 02
print(match.group('day')) # 11
print(match.groupdict())  # {'year': '2026', 'month': '02', 'day': '11'}
Named groups are particularly useful when parsing structured text like log files or CSV data, where referring to groups by index makes code hard to maintain.
Non-Capturing Groups
When you need grouping for alternation or quantifiers but do not need to capture the content, use (?:...).
import re
# Capturing group -- appears in findall results
result = re.findall(r'(https?|ftp)://\S+', "Visit https://example.com")
print(result) # ['https']
# Non-capturing group -- full match returned
result = re.findall(r'(?:https?|ftp)://\S+', "Visit https://example.com")
print(result)  # ['https://example.com']
Backreferences
Backreferences match the same text that was previously captured by a group.
import re
# Find repeated words
text = "This is is a test test of repeated words"
duplicates = re.findall(r'\b(\w+)\s+\1\b', text)
print(duplicates) # ['is', 'test']
# Using named backreference
duplicates = re.findall(r'\b(?P<word>\w+)\s+(?P=word)\b', text)
print(duplicates)  # ['is', 'test']
Lookahead and Lookbehind Assertions
Lookahead and lookbehind are zero-width assertions -- they check what comes before or after the current position without consuming characters.
| Syntax | Name | Meaning |
|---|---|---|
| (?=...) | Positive lookahead | Followed by ... |
| (?!...) | Negative lookahead | NOT followed by ... |
| (?<=...) | Positive lookbehind | Preceded by ... |
| (?<!...) | Negative lookbehind | NOT preceded by ... |
import re
# Positive lookahead: find numbers followed by "px"
text = "width: 200px; height: 150px; margin: 10em"
pixels = re.findall(r'\d+(?=px)', text)
print(pixels) # ['200', '150']
# Negative lookahead: find whole numbers NOT followed by "px"
# (?!\d) forces the match to the end of the number, so backtracking
# cannot sneak out a shorter match like "20" inside "200px"
others = re.findall(r'\d+(?!\d)(?!px)', text)
print(others)  # ['10']
# Positive lookbehind: extract prices after "$"
prices = "Items: $19.99, $5.50, 100 points"
amounts = re.findall(r'(?<=\$)\d+\.\d{2}', prices)
print(amounts) # ['19.99', '5.50']
# Negative lookbehind: match "test" not preceded by "unit"
text = "unittest, integration test, system test"
matches = re.findall(r'(?<!unit)test', text)
print(matches)  # ['test', 'test']
Password Validation with Lookahead
A classic use case for lookahead is enforcing multiple conditions on a single string.
import re
def validate_password(password):
"""
Requires:
- At least 8 characters
- At least one uppercase letter
- At least one lowercase letter
- At least one digit
- At least one special character
"""
pattern = r'^(?=.*[A-Z])(?=.*[a-z])(?=.*\d)(?=.*[!@#$%^&*]).{8,}$'
return bool(re.match(pattern, password))
print(validate_password("Weak")) # False
print(validate_password("Str0ng!Pass")) # True
print(validate_password("nouppercase1!"))  # False
Compiling Patterns with re.compile()
When you use the same pattern repeatedly, re.compile() creates a reusable Pattern object. This avoids recompiling the pattern on every call and makes the code cleaner.
import re
# Compile once, use many times
email_pattern = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')
texts = [
"Contact alice@example.com for details",
"No email here",
"Send to bob@company.org and carol@service.net",
]
for text in texts:
matches = email_pattern.findall(text)
if matches:
print(f"Found: {matches}")
# Found: ['alice@example.com']
# Found: ['bob@company.org', 'carol@service.net']
Performance Benefit
Python caches recently used patterns internally (up to 512 entries), so the performance gain from re.compile() is modest for simple scripts. However, re.compile() becomes important when:
- You use the same pattern in a tight loop processing millions of records
- You want to attach flags to the pattern once rather than passing them every time
- You want to store patterns as module-level constants for readability
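To see the (usually small) difference yourself, a rough timeit sketch follows; absolute numbers depend entirely on your machine, so no expected output is shown:

```python
import re
import timeit

pattern = re.compile(r'\d+')
text = "Order 12345 shipped in 3 days"

# Module-level call: looks the pattern up in re's internal cache each time
t_module = timeit.timeit(lambda: re.findall(r'\d+', text), number=100_000)
# Precompiled pattern: skips the cache lookup entirely
t_compiled = timeit.timeit(lambda: pattern.findall(text), number=100_000)

print(f"re.findall:       {t_module:.3f}s")
print(f"compiled.findall: {t_compiled:.3f}s")
```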
import re
# Compile with flags
LOG_PATTERN = re.compile(
r'^(?P<level>ERROR|WARN|INFO)\s+'
r'(?P<timestamp>\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2})\s+'
r'(?P<message>.+)$',
re.MULTILINE
)
log_data = """ERROR 2026-02-11 08:15:00 Disk usage at 95%
INFO 2026-02-11 08:16:00 Cleanup started
WARN 2026-02-11 08:17:00 Slow query detected"""
for match in LOG_PATTERN.finditer(log_data):
print(match.groupdict())
# {'level': 'ERROR', 'timestamp': '2026-02-11 08:15:00', 'message': 'Disk usage at 95%'}
# {'level': 'INFO', 'timestamp': '2026-02-11 08:16:00', 'message': 'Cleanup started'}
# {'level': 'WARN', 'timestamp': '2026-02-11 08:17:00', 'message': 'Slow query detected'}
Regex Flags
Flags modify how the regex engine interprets a pattern. You can pass them to any re function or combine them with the bitwise OR operator |.
| Flag | Short Form | Effect |
|---|---|---|
| re.IGNORECASE | re.I | Case-insensitive matching |
| re.MULTILINE | re.M | ^ and $ match at line boundaries |
| re.DOTALL | re.S | . matches newline characters too |
| re.VERBOSE | re.X | Allow comments and whitespace in patterns |
| re.ASCII | re.A | \w, \d, \s match ASCII only |
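In Python 3, \w matches Unicode word characters by default; re.ASCII restricts it to [a-zA-Z0-9_], which matters when handling accented text:

```python
import re

text = "café naïve cafe"
# Default: \w+ includes accented (Unicode) letters
print(re.findall(r'\w+', text))            # ['café', 'naïve', 'cafe']
# re.ASCII: accented characters now split the words
print(re.findall(r'\w+', text, re.ASCII))  # ['caf', 'na', 've', 'cafe']
```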
re.VERBOSE for Readable Patterns
Complex patterns become much easier to understand with re.VERBOSE.
import re
# Without VERBOSE -- hard to read
url_pattern = r'https?://(?:www\.)?[\w.-]+\.\w{2,}(?:/[\w./-]*)?(?:\?[\w=&]*)?'
# With VERBOSE -- same pattern, much clearer
url_pattern = re.compile(r"""
https?:// # Protocol (http or https)
(?:www\.)? # Optional www prefix
[\w.-]+ # Domain name
\.\w{2,} # Top-level domain (.com, .org, etc.)
(?:/[\w./-]*)? # Optional path
(?:\?[\w=&]*)? # Optional query string
""", re.VERBOSE)
text = "Visit https://www.example.com/path?key=value or http://test.org"
print(url_pattern.findall(text))
# ['https://www.example.com/path?key=value', 'http://test.org']
Combining Flags
import re
text = """First line
SECOND LINE
third line"""
# Case-insensitive + multiline -- "LINE" matches too
matches = re.findall(r'^.*line$', text, re.IGNORECASE | re.MULTILINE)
print(matches)  # ['First line', 'SECOND LINE', 'third line']
# Add DOTALL to make . match newlines too
match = re.search(r'First.*third', text, re.DOTALL)
print(match.group())  # 'First line\nSECOND LINE\nthird'
Common Regex Patterns Reference
Here is a collection of battle-tested patterns for common validation and extraction tasks.
| Use Case | Pattern | Notes |
|---|---|---|
| Email (basic) | [\w.+-]+@[\w-]+\.[\w.-]+ | Covers most real-world emails |
| Phone (US) | (?:\+1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4} | Handles multiple formats |
| URL | https?://[\w.-]+(?:\.[\w]{2,})(?:/[\w./?#&=-]*)? | HTTP and HTTPS |
| IPv4 address | \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3} | Does not validate range |
| Date (YYYY-MM-DD) | \d{4}-(?:0[1-9]\|1[0-2])-(?:0[1-9]\|[12]\d\|3[01]) | Validates month and day ranges |
| Time (HH:MM:SS) | (?:[01]\d\|2[0-3]):[0-5]\d:[0-5]\d | 24-hour format |
| Hex color | #[0-9a-fA-F]{6}(?:[0-9a-fA-F]{2})? | 6 or 8 digit hex |
| Integer | [-+]?\d+ | With optional sign |
| Float | [-+]?\d*\.?\d+(?:[eE][-+]?\d+)? | Scientific notation supported |
| HTML tag | <([a-zA-Z][\w]*)\b[^>]*>.*?</\1> | Basic matching only |
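Two of the patterns from the table applied to sample text, including a reminder of the IPv4 caveat:

```python
import re

# Float pattern from the table, including scientific notation
floats = re.findall(r'[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?', "Readings: 3.14, -2.5e3, 42")
print(floats)  # ['3.14', '-2.5e3', '42']

# IPv4 pattern from the table -- it does NOT validate octet ranges
ips = re.findall(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', "Hosts: 192.168.1.1 and 999.0.0.1")
print(ips)  # ['192.168.1.1', '999.0.0.1'] -- 999 accepted despite being invalid
```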
Real-World Examples
Parsing Log Files
Log analysis is one of the most common regex tasks in production environments.
import re
from collections import Counter
log_data = """
192.168.1.10 - - [11/Feb/2026:10:00:01 +0000] "GET /api/users HTTP/1.1" 200 1234
10.0.0.5 - - [11/Feb/2026:10:00:02 +0000] "POST /api/login HTTP/1.1" 401 89
192.168.1.10 - - [11/Feb/2026:10:00:03 +0000] "GET /api/data HTTP/1.1" 200 5678
172.16.0.1 - - [11/Feb/2026:10:00:04 +0000] "GET /api/users HTTP/1.1" 500 45
10.0.0.5 - - [11/Feb/2026:10:00:05 +0000] "POST /api/login HTTP/1.1" 401 89
"""
# Extract all IP addresses and status codes
pattern = re.compile(r'(\d+\.\d+\.\d+\.\d+).*?"(\w+)\s+(\S+).*?"\s+(\d{3})')
for match in pattern.finditer(log_data):
ip, method, path, status = match.groups()
print(f"{ip} | {method} {path} | Status: {status}")
# Count failed login attempts per IP
failed_logins = re.findall(
r'(\d+\.\d+\.\d+\.\d+).*?POST /api/login.*?\s401\s',
log_data
)
print(f"\nFailed login attempts by IP: {Counter(failed_logins)}")
# Counter({'10.0.0.5': 2})
Data Cleaning with re.sub()
When working with messy datasets, regex-based cleaning is faster and more reliable than chaining string methods.
import re
# Clean up messy phone numbers into a standard format
phones = [
"(555) 867-5309",
"555.867.5309",
"555 867 5309",
"+1-555-867-5309",
"5558675309",
]
def standardize_phone(phone):
digits = re.sub(r'\D', '', phone) # Remove all non-digits
if len(digits) == 11 and digits[0] == '1':
digits = digits[1:] # Strip country code
if len(digits) == 10:
return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
return phone # Return original if it doesn't fit
for p in phones:
print(f"{p:>20s} -> {standardize_phone(p)}")
# (555) 867-5309 -> (555) 867-5309
# 555.867.5309 -> (555) 867-5309
# 555 867 5309 -> (555) 867-5309
# +1-555-867-5309 -> (555) 867-5309
# 5558675309 -> (555) 867-5309
Input Validation
import re
def validate_email(email):
"""Validate an email address with a practical regex."""
pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
return bool(re.match(pattern, email))
def validate_date(date_str):
"""Validate YYYY-MM-DD format with basic range checking."""
pattern = r'^(\d{4})-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$'
match = re.match(pattern, date_str)
if not match:
return False
year, month, day = int(match.group(1)), int(match.group(2)), int(match.group(3))
return 1900 <= year <= 2100
test_emails = ["user@example.com", "bad@", "test.user+tag@domain.co.uk", "@missing.com"]
for email in test_emails:
print(f"{email:30s} -> {validate_email(email)}")
test_dates = ["2026-02-11", "2026-13-01", "2026-02-30", "abcd-ef-gh"]
for date in test_dates:
print(f"{date:15s} -> {validate_date(date)}")
Extracting Structured Data
import re
# Parse a markdown table into a list of dictionaries
markdown_table = """
| Name | Age | City |
|---------|-----|------------|
| Alice | 30 | New York |
| Bob | 25 | London |
| Charlie | 35 | Tokyo |
"""
# Skip the header separator row, extract data rows
rows = re.findall(r'^\|\s*(\w+)\s*\|\s*(\d+)\s*\|\s*([\w\s]+?)\s*\|$',
markdown_table, re.MULTILINE)
data = [{'name': name, 'age': int(age), 'city': city} for name, age, city in rows]
print(data)
# [{'name': 'Alice', 'age': 30, 'city': 'New York'},
# {'name': 'Bob', 'age': 25, 'city': 'London'},
# {'name': 'Charlie', 'age': 35, 'city': 'Tokyo'}]If you regularly parse and explore structured data in notebooks, tools like RunCell (opens in a new tab) can streamline this workflow. RunCell is an AI agent that runs inside Jupyter, helping you write, debug, and iterate on regex patterns and data extraction code interactively.
Common Mistakes and Pitfalls
1. Forgetting Raw Strings
Without the r prefix, Python interprets backslashes as escape characters before the regex engine sees them.
import re
# Wrong -- \b is interpreted as a backspace character
match = re.search('\bword\b', 'a word here')
print(match) # None
# Correct -- raw string preserves \b for the regex engine
match = re.search(r'\bword\b', 'a word here')
print(match.group())  # word
2. Using re.match() When You Mean re.search()
re.match() only checks the beginning of the string. This catches many beginners off guard.
import re
# This returns None because "error" is not at position 0
result = re.match(r'error', "System error detected")
print(result) # None
# Use re.search() or anchor with ^
result = re.search(r'error', "System error detected")
print(result.group())  # error
3. Greedy Matching Capturing Too Much
import re
# Greedy .* grabs everything between the FIRST { and LAST }
text = "{first} and {second}"
result = re.search(r'\{.*\}', text)
print(result.group()) # {first} and {second}
# Fix with lazy quantifier
result = re.search(r'\{.*?\}', text)
print(result.group())  # {first}
4. Not Escaping Special Characters
Characters like ., *, +, ?, (, ), [, ], {, }, ^, $, |, and \ have special meanings. If you want to match them literally, escape with \ or use re.escape().
import re
# Wrong -- . matches any character
result = re.findall(r'3.14', "3.14 and 3x14")
print(result) # ['3.14', '3x14']
# Correct -- \. matches a literal dot
result = re.findall(r'3\.14', "3.14 and 3x14")
print(result) # ['3.14']
# Use re.escape() for user-provided strings
user_input = "price is $5.00 (USD)"
safe_pattern = re.escape(user_input)
print(safe_pattern)  # price\ is\ \$5\.00\ \(USD\)
5. Catastrophic Backtracking
Certain patterns can cause the regex engine to take exponential time on specific inputs. Avoid nested quantifiers on overlapping character classes.
import re
import time
# Dangerous pattern -- nested quantifiers
# pattern = r'(a+)+b' # DO NOT use on untrusted input
# Safe alternative -- flatten the quantifier
pattern = r'a+b'
text = "a" * 30 + "b"
start = time.time()
re.search(pattern, text)
print(f"Safe pattern took: {time.time() - start:.4f}s")
Regex in Data Science Workflows
Regular expressions are especially useful in data science for cleaning and transforming text columns in pandas DataFrames.
import re
import pandas as pd
df = pd.DataFrame({
'raw_phone': ['(555) 123-4567', '555.123.4567', '555 123 4567', '+1-555-123-4567'],
'raw_text': ['Price: $12.99', 'Cost is $8.50!', 'Only $199.00 left', 'Free (was $25)']
})
# Extract digits from phone numbers
df['clean_phone'] = df['raw_phone'].str.replace(r'\D', '', regex=True)
# Extract prices from text
df['price'] = df['raw_text'].str.extract(r'\$(\d+\.\d{2})')
print(df)
For interactive data exploration and visualization after cleaning, PyGWalker turns any pandas DataFrame into a drag-and-drop visual interface -- no chart code needed.
Summary
Python's re module provides a complete regex implementation that handles text matching, extraction, replacement, and splitting. The key points to remember:
- Use raw strings (r'...') for all regex patterns
- re.search() finds the first match anywhere; re.match() only checks the start
- re.findall() extracts all matches; re.finditer() is memory-efficient for large texts
- Capturing groups () extract sub-patterns; named groups (?P<name>...) improve readability
- Lookahead and lookbehind assert context without consuming characters
- re.compile() improves readability and performance for reused patterns
- re.VERBOSE makes complex patterns maintainable with comments
- Always test patterns against edge cases, and watch out for greedy matching and catastrophic backtracking
Regular expressions are a foundational skill for anyone working with text data in Python. Whether you are parsing logs, validating input, or cleaning datasets, mastering the re module will save you significant time and effort.