
Python Regex: The Complete Guide to Regular Expressions in Python


You have a 50,000-line log file and need to extract every IP address. Or a spreadsheet column of phone numbers in six different formats that all need to be normalized to one. Or user input that must match a strict email pattern before it reaches your database. Solving any of these with str.find() and str.replace() chains turns into fragile, unreadable code within minutes. Python regex -- regular expressions through the built-in re module -- gives you a single, expressive language for matching, searching, extracting, and transforming text patterns of any complexity.


This guide walks through every major feature of Python's re module, from basic pattern matching to advanced constructs like lookahead assertions and compiled patterns. Each section includes working code you can copy directly into a Python script or Jupyter notebook. By the end, you will have a solid reference for solving real-world text processing tasks with regular expressions.

What Is a Regular Expression?

A regular expression (regex) is a sequence of characters that defines a search pattern. Python implements regex through the re module in the standard library -- no installation required. You write a pattern string, pass it to a function like re.search() or re.findall(), and the regex engine scans your target text for matches.

import re
 
text = "Order #12345 was placed on 2026-02-11"
match = re.search(r'\d+', text)
print(match.group())  # Output: 12345

The r prefix creates a raw string, which prevents Python from interpreting backslashes as escape characters. Always use raw strings for regex patterns.

Core Functions in the re Module

The re module provides several functions for different use cases. Here is a quick reference before the detailed walkthrough.

| Function | Purpose | Returns |
| --- | --- | --- |
| re.search(pattern, string) | Find the first match anywhere in the string | Match object or None |
| re.match(pattern, string) | Match only at the beginning of the string | Match object or None |
| re.fullmatch(pattern, string) | Match the entire string against the pattern | Match object or None |
| re.findall(pattern, string) | Find all non-overlapping matches | List of strings or tuples |
| re.finditer(pattern, string) | Find all matches as an iterator | Iterator of Match objects |
| re.sub(pattern, repl, string) | Replace matches with a replacement string | Modified string |
| re.split(pattern, string) | Split the string at each match | List of strings |
| re.compile(pattern) | Compile a pattern for repeated use | Pattern object |

re.search() -- Find the First Match

re.search() scans the entire string and returns a Match object for the first occurrence. It returns None if no match is found.

import re
 
log_line = "ERROR 2026-02-11 14:32:01 - Connection timeout at 192.168.1.50"
 
# Find the first IP address
match = re.search(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', log_line)
if match:
    print(f"IP found: {match.group()}")    # IP found: 192.168.1.50
    print(f"Position: {match.start()}-{match.end()}")  # Position: 50-62

The Match object provides .group() for the matched text, .start() and .end() for positions, and .span() for both as a tuple.
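A quick sketch of those accessors on a fresh match:

```python
import re

# Inspect a Match object's text and position accessors
match = re.search(r'\d+', "Order #12345 shipped")
if match:
    print(match.group())  # 12345
    print(match.start())  # 7
    print(match.end())    # 12
    print(match.span())   # (7, 12)
```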

re.match() vs re.search()

A common source of confusion: re.match() only checks at the beginning of the string.

import re
 
text = "The price is $49.99"
 
# re.match() fails -- pattern is not at the start
result_match = re.match(r'\$\d+\.\d+', text)
print(result_match)  # None
 
# re.search() succeeds -- scans the whole string
result_search = re.search(r'\$\d+\.\d+', text)
print(result_search.group())  # $49.99

Use re.match() when you know the pattern must appear at position 0. Use re.search() for everything else.
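The function table earlier also lists re.fullmatch(), which succeeds only if the pattern consumes the entire string -- handy for strict validation:

```python
import re

# fullmatch requires the whole string to fit the pattern
print(re.fullmatch(r'\d{4}', "2026"))        # <re.Match object; span=(0, 4), match='2026'>
print(re.fullmatch(r'\d{4}', "2026-02-11"))  # None -- extra characters remain

# search happily matches a substring instead
print(re.search(r'\d{4}', "2026-02-11").group())  # 2026
```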

re.findall() -- Extract All Matches

re.findall() returns a list of all non-overlapping matches. This is often the function you reach for first during data extraction tasks.

import re
 
text = "Contact: alice@example.com, bob@company.org, support@service.net"
 
emails = re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', text)
print(emails)
# ['alice@example.com', 'bob@company.org', 'support@service.net']

When the pattern contains capturing groups, findall() returns the group contents instead of the full match:

import re
 
text = "Width: 1920px, Height: 1080px"
 
# Without groups -- returns full matches
values = re.findall(r'\d+px', text)
print(values)  # ['1920px', '1080px']
 
# With a capturing group -- returns only the group
values = re.findall(r'(\d+)px', text)
print(values)  # ['1920', '1080']

re.finditer() -- Memory-Efficient Match Iteration

For large texts, re.finditer() returns an iterator of Match objects instead of building a list in memory.

import re
 
log = """
ERROR 2026-02-11 10:00:01 Disk full
WARN  2026-02-11 10:01:15 High memory
ERROR 2026-02-11 10:02:30 Connection lost
INFO  2026-02-11 10:03:45 Service restarted
"""
 
for match in re.finditer(r'ERROR\s+(\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2})\s+(.*)', log):
    timestamp, message = match.group(1), match.group(2)
    print(f"[{timestamp}] {message}")
 
# [2026-02-11 10:00:01] Disk full
# [2026-02-11 10:02:30] Connection lost

re.sub() -- Search and Replace

re.sub() replaces every match of the pattern with a replacement string. It supports backreferences and callable replacements.

import re
 
# Basic replacement
text = "Published: 02/11/2026"
result = re.sub(r'(\d{2})/(\d{2})/(\d{4})', r'\3-\1-\2', text)
print(result)  # Published: 2026-02-11
 
# Using a function for dynamic replacement
def censor_email(match):
    user, domain = match.group(1), match.group(2)
    return f"{user[0]}***@{domain}"
 
text = "Send to alice@example.com or bob@company.org"
result = re.sub(r'([\w.+-]+)@([\w.-]+)', censor_email, text)
print(result)  # Send to a***@example.com or b***@company.org
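re.sub() also accepts a count argument to cap the number of replacements, and its sibling re.subn() additionally reports how many replacements were made:

```python
import re

text = "a-b-c-d"

# Stop after two replacements
print(re.sub(r'-', ':', text, count=2))  # a:b:c-d

# subn returns (new_string, number_of_replacements)
print(re.subn(r'-', ':', text))          # ('a:b:c:d', 3)
```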

re.split() -- Split on a Pattern

re.split() splits a string wherever the pattern matches, which is more powerful than str.split().

import re
 
# Split on multiple delimiters
data = "apple, banana;cherry  date|elderberry"
items = re.split(r'[,;\s|]+', data)
print(items)  # ['apple', 'banana', 'cherry', 'date', 'elderberry']
 
# Limit the number of splits
result = re.split(r'\s+', "one two three four five", maxsplit=2)
print(result)  # ['one', 'two', 'three four five']
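If the split pattern contains a capturing group, re.split() also keeps the delimiter text in the result list, which is useful when the separators carry meaning (for example, operators in an expression):

```python
import re

# A capturing group keeps the delimiters in the output
parts = re.split(r'([+-])', "10+20-5")
print(parts)  # ['10', '+', '20', '-', '5']
```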

Regex Syntax: Metacharacters, Character Classes, and Quantifiers

Understanding the building blocks of regex patterns is essential for writing effective expressions.

Metacharacters Reference

| Metacharacter | Meaning | Example | Matches |
| --- | --- | --- | --- |
| . | Any character except newline | a.c | abc, a1c, a-c |
| ^ | Start of string (or line with MULTILINE) | ^Hello | Hello world |
| $ | End of string (or line with MULTILINE) | world$ | Hello world |
| \d | Any digit [0-9] | \d{3} | 123, 456 |
| \D | Any non-digit | \D+ | abc, hello |
| \w | Word character [a-zA-Z0-9_] | \w+ | hello_world |
| \W | Non-word character | \W | @, #, space |
| \s | Whitespace [ \t\n\r\f\v] | \s+ | spaces, tabs |
| \S | Non-whitespace | \S+ | hello |
| \b | Word boundary | \bcat\b | cat but not catch |
| \| | Alternation (OR) | cat\|dog | cat or dog |
| \\ | Escape a metacharacter | \\. | literal . |

Character Classes

Square brackets define a set of characters to match.

import re
 
text = "My phone: (555) 867-5309, ext. 42"
 
# Match digits only
digits = re.findall(r'[0-9]+', text)
print(digits)  # ['555', '867', '5309', '42']
 
# Match lowercase vowels
vowels = re.findall(r'[aeiou]', "Hello World")
print(vowels)  # ['e', 'o', 'o']
 
# Negated class -- match anything NOT a digit
non_digits = re.findall(r'[^0-9]+', "abc123def456")
print(non_digits)  # ['abc', 'def']
 
# Range -- match runs of hex digits
hex_chars = re.findall(r'[0-9a-fA-F]+', "Color: #FF5733")
print(hex_chars)  # ['C', 'FF5733'] -- 'C' also falls in the A-F range
# More precise: match hex color codes
hex_colors = re.findall(r'#[0-9a-fA-F]{6}', "Color: #FF5733 and #00AAFF")
print(hex_colors)  # ['#FF5733', '#00AAFF']

Quantifiers

Quantifiers control how many times a preceding element must appear.

| Quantifier | Meaning | Example |
| --- | --- | --- |
| * | 0 or more (greedy) | \d* matches "", "1", "123" |
| + | 1 or more (greedy) | \d+ matches "1", "123" but not "" |
| ? | 0 or 1 (optional) | colou?r matches color and colour |
| {n} | Exactly n times | \d{4} matches 2026 |
| {n,} | n or more times | \d{2,} matches 12, 123, 1234 |
| {n,m} | Between n and m times | \d{2,4} matches 12, 123, 1234 |

Greedy vs Lazy Matching

By default, quantifiers are greedy -- they match as much text as possible. Adding ? after a quantifier makes it lazy (matches as little as possible).

import re
 
html = "<b>bold</b> and <i>italic</i>"
 
# Greedy -- matches from first < to last >
greedy = re.findall(r'<.*>', html)
print(greedy)  # ['<b>bold</b> and <i>italic</i>']
 
# Lazy -- matches the smallest possible chunk
lazy = re.findall(r'<.*?>', html)
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']

This distinction is critical when parsing HTML, XML, or any text with nested delimiters.

Groups and Capturing

Parentheses create capturing groups that let you extract specific parts of a match.

Basic Groups

import re
 
text = "Date: 2026-02-11"
match = re.search(r'(\d{4})-(\d{2})-(\d{2})', text)
 
if match:
    print(match.group(0))  # Full match: 2026-02-11
    print(match.group(1))  # First group: 2026
    print(match.group(2))  # Second group: 02
    print(match.group(3))  # Third group: 11
    print(match.groups())  # All groups: ('2026', '02', '11')

Named Groups

Named groups improve readability, especially in complex patterns.

import re
 
pattern = r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'
match = re.search(pattern, "Event date: 2026-02-11")
 
if match:
    print(match.group('year'))   # 2026
    print(match.group('month'))  # 02
    print(match.group('day'))    # 11
    print(match.groupdict())     # {'year': '2026', 'month': '02', 'day': '11'}

Named groups are particularly useful when parsing structured text like log files or CSV data, where referring to groups by index makes code hard to maintain.
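Named groups can also be referenced in re.sub() replacement strings with \g<name>, which keeps substitutions readable:

```python
import re

# Reformat a date by referring to named groups in the replacement
text = "Deadline: 2026-02-11"
result = re.sub(
    r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})',
    r'\g<month>/\g<day>/\g<year>',
    text,
)
print(result)  # Deadline: 02/11/2026
```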

Non-Capturing Groups

When you need grouping for alternation or quantifiers but do not need to capture the content, use (?:...).

import re
 
# Capturing group -- appears in findall results
result = re.findall(r'(https?|ftp)://\S+', "Visit https://example.com")
print(result)  # ['https']
 
# Non-capturing group -- full match returned
result = re.findall(r'(?:https?|ftp)://\S+', "Visit https://example.com")
print(result)  # ['https://example.com']

Backreferences

Backreferences match the same text that was previously captured by a group.

import re
 
# Find repeated words
text = "This is is a test test of repeated words"
duplicates = re.findall(r'\b(\w+)\s+\1\b', text)
print(duplicates)  # ['is', 'test']
 
# Using named backreference
duplicates = re.findall(r'\b(?P<word>\w+)\s+(?P=word)\b', text)
print(duplicates)  # ['is', 'test']

Lookahead and Lookbehind Assertions

Lookahead and lookbehind are zero-width assertions -- they check what comes before or after the current position without consuming characters.

| Syntax | Name | Meaning |
| --- | --- | --- |
| (?=...) | Positive lookahead | Followed by ... |
| (?!...) | Negative lookahead | NOT followed by ... |
| (?<=...) | Positive lookbehind | Preceded by ... |
| (?<!...) | Negative lookbehind | NOT preceded by ... |

import re
 
# Positive lookahead: find numbers followed by "px"
text = "width: 200px; height: 150px; margin: 10em"
pixels = re.findall(r'\d+(?=px)', text)
print(pixels)  # ['200', '150']
 
# Negative lookahead: find numbers NOT followed by "px"
# (a bare (?!px) would backtrack and match "20" inside "200px",
#  so the lookahead must also rule out remaining digits)
others = re.findall(r'\d+(?!\d*px)', text)
print(others)  # ['10']
 
# Positive lookbehind: extract prices after "$"
prices = "Items: $19.99, $5.50, 100 points"
amounts = re.findall(r'(?<=\$)\d+\.\d{2}', prices)
print(amounts)  # ['19.99', '5.50']
 
# Negative lookbehind: match "test" not preceded by "unit"
text = "unittest, integration test, system test"
matches = re.findall(r'(?<!unit)test', text)
print(matches)  # ['test', 'test']

Password Validation with Lookahead

A classic use case for lookahead is enforcing multiple conditions on a single string.

import re
 
def validate_password(password):
    """
    Requires:
    - At least 8 characters
    - At least one uppercase letter
    - At least one lowercase letter
    - At least one digit
    - At least one special character
    """
    pattern = r'^(?=.*[A-Z])(?=.*[a-z])(?=.*\d)(?=.*[!@#$%^&*]).{8,}$'
    return bool(re.match(pattern, password))
 
print(validate_password("Weak"))          # False
print(validate_password("Str0ng!Pass"))   # True
print(validate_password("nouppercase1!")) # False

Compiling Patterns with re.compile()

When you use the same pattern repeatedly, re.compile() creates a reusable Pattern object. This avoids recompiling the pattern on every call and makes the code cleaner.

import re
 
# Compile once, use many times
email_pattern = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')
 
texts = [
    "Contact alice@example.com for details",
    "No email here",
    "Send to bob@company.org and carol@service.net",
]
 
for text in texts:
    matches = email_pattern.findall(text)
    if matches:
        print(f"Found: {matches}")
 
# Found: ['alice@example.com']
# Found: ['bob@company.org', 'carol@service.net']

Performance Benefit

Python caches recently used patterns internally (up to 512 entries), so the performance gain from re.compile() is modest for simple scripts. However, re.compile() becomes important when:

  • You use the same pattern in a tight loop processing millions of records
  • You want to attach flags to the pattern once rather than passing them every time
  • You want to store patterns as module-level constants for readability
import re
 
# Compile with flags
LOG_PATTERN = re.compile(
    r'^(?P<level>ERROR|WARN|INFO)\s+'
    r'(?P<timestamp>\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2})\s+'
    r'(?P<message>.+)$',
    re.MULTILINE
)
 
log_data = """ERROR 2026-02-11 08:15:00 Disk usage at 95%
INFO 2026-02-11 08:16:00 Cleanup started
WARN 2026-02-11 08:17:00 Slow query detected"""
 
for match in LOG_PATTERN.finditer(log_data):
    print(match.groupdict())
 
# {'level': 'ERROR', 'timestamp': '2026-02-11 08:15:00', 'message': 'Disk usage at 95%'}
# {'level': 'INFO', 'timestamp': '2026-02-11 08:16:00', 'message': 'Cleanup started'}
# {'level': 'WARN', 'timestamp': '2026-02-11 08:17:00', 'message': 'Slow query detected'}

Regex Flags

Flags modify how the regex engine interprets a pattern. You can pass them to any re function or combine them with the bitwise OR operator |.

| Flag | Short Form | Effect |
| --- | --- | --- |
| re.IGNORECASE | re.I | Case-insensitive matching |
| re.MULTILINE | re.M | ^ and $ match at line boundaries |
| re.DOTALL | re.S | . matches newline characters too |
| re.VERBOSE | re.X | Allow comments and whitespace in patterns |
| re.ASCII | re.A | \w, \d, \s match ASCII only |
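Flags can also be embedded at the start of the pattern itself with inline syntax, for example (?i) for case-insensitive matching:

```python
import re

# Inline flag -- equivalent to passing re.IGNORECASE
print(re.findall(r'(?i)python', "Python, PYTHON, python"))
# ['Python', 'PYTHON', 'python']
```

Note that since Python 3.11, global inline flags like (?i) must appear at the very start of the pattern.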

re.VERBOSE for Readable Patterns

Complex patterns become much easier to understand with re.VERBOSE.

import re
 
# Without VERBOSE -- hard to read
url_pattern = r'https?://(?:www\.)?[\w.-]+\.\w{2,}(?:/[\w./-]*)?(?:\?[\w=&]*)?'
 
# With VERBOSE -- same pattern, much clearer
url_pattern = re.compile(r"""
    https?://           # Protocol (http or https)
    (?:www\.)?          # Optional www prefix
    [\w.-]+             # Domain name
    \.\w{2,}            # Top-level domain (.com, .org, etc.)
    (?:/[\w./-]*)?      # Optional path
    (?:\?[\w=&]*)?      # Optional query string
""", re.VERBOSE)
 
text = "Visit https://www.example.com/path?key=value or http://test.org"
print(url_pattern.findall(text))
# ['https://www.example.com/path?key=value', 'http://test.org']

Combining Flags

import re
 
text = """First line
SECOND LINE
third line"""
 
# Case-insensitive + multiline
matches = re.findall(r'^.*line$', text, re.IGNORECASE | re.MULTILINE)
print(matches)  # ['First line', 'SECOND LINE', 'third line']
 
# Add DOTALL to make . match newlines too
match = re.search(r'First.*third', text, re.DOTALL)
print(match.group())  # 'First line\nSECOND LINE\nthird'

Common Regex Patterns Reference

Here is a collection of battle-tested patterns for common validation and extraction tasks.

| Use Case | Pattern | Notes |
| --- | --- | --- |
| Email (basic) | [\w.+-]+@[\w-]+\.[\w.-]+ | Covers most real-world emails |
| Phone (US) | (?:\+1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4} | Handles multiple formats |
| URL | https?://[\w.-]+(?:\.[\w]{2,})(?:/[\w./?#&=-]*)? | HTTP and HTTPS |
| IPv4 address | \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3} | Does not validate octet range |
| Date (YYYY-MM-DD) | \d{4}-(?:0[1-9]\|1[0-2])-(?:0[1-9]\|[12]\d\|3[01]) | Validates month and day ranges |
| Time (HH:MM:SS) | (?:[01]\d\|2[0-3]):[0-5]\d:[0-5]\d | 24-hour format |
| Hex color | #[0-9a-fA-F]{6}(?:[0-9a-fA-F]{2})? | 6 or 8 digit hex |
| Integer | [-+]?\d+ | With optional sign |
| Float | [-+]?\d*\.?\d+(?:[eE][-+]?\d+)? | Scientific notation supported |
| HTML tag | <([a-zA-Z][\w]*)\b[^>]*>.*?</\1> | Basic matching only |

Real-World Examples

Parsing Log Files

Log analysis is one of the most common regex tasks in production environments.

import re
from collections import Counter
 
log_data = """
192.168.1.10 - - [11/Feb/2026:10:00:01 +0000] "GET /api/users HTTP/1.1" 200 1234
10.0.0.5 - - [11/Feb/2026:10:00:02 +0000] "POST /api/login HTTP/1.1" 401 89
192.168.1.10 - - [11/Feb/2026:10:00:03 +0000] "GET /api/data HTTP/1.1" 200 5678
172.16.0.1 - - [11/Feb/2026:10:00:04 +0000] "GET /api/users HTTP/1.1" 500 45
10.0.0.5 - - [11/Feb/2026:10:00:05 +0000] "POST /api/login HTTP/1.1" 401 89
"""
 
# Extract all IP addresses and status codes
pattern = re.compile(r'(\d+\.\d+\.\d+\.\d+).*?"(\w+)\s+(\S+).*?"\s+(\d{3})')
 
for match in pattern.finditer(log_data):
    ip, method, path, status = match.groups()
    print(f"{ip} | {method} {path} | Status: {status}")
 
# Count failed login attempts per IP
failed_logins = re.findall(
    r'(\d+\.\d+\.\d+\.\d+).*?POST /api/login.*?\s401\s',
    log_data
)
print(f"\nFailed login attempts by IP: {Counter(failed_logins)}")
# Counter({'10.0.0.5': 2})

Data Cleaning with re.sub()

When working with messy datasets, regex-based cleaning is faster and more reliable than chaining string methods.

import re
 
# Clean up messy phone numbers into a standard format
phones = [
    "(555) 867-5309",
    "555.867.5309",
    "555 867 5309",
    "+1-555-867-5309",
    "5558675309",
]
 
def standardize_phone(phone):
    digits = re.sub(r'\D', '', phone)       # Remove all non-digits
    if len(digits) == 11 and digits[0] == '1':
        digits = digits[1:]                  # Strip country code
    if len(digits) == 10:
        return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
    return phone  # Return original if it doesn't fit
 
for p in phones:
    print(f"{p:>20s}  ->  {standardize_phone(p)}")
 
#       (555) 867-5309  ->  (555) 867-5309
#         555.867.5309  ->  (555) 867-5309
#         555 867 5309  ->  (555) 867-5309
#      +1-555-867-5309  ->  (555) 867-5309
#           5558675309  ->  (555) 867-5309

Input Validation

import re
 
def validate_email(email):
    """Validate an email address with a practical regex."""
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    return bool(re.match(pattern, email))
 
def validate_date(date_str):
    """Validate YYYY-MM-DD format with basic range checking.

    Note: the day check is not month-aware, so 2026-02-30 still passes.
    """
    pattern = r'^(\d{4})-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$'
    match = re.match(pattern, date_str)
    if not match:
        return False
    return 1900 <= int(match.group(1)) <= 2100
 
test_emails = ["user@example.com", "bad@", "test.user+tag@domain.co.uk", "@missing.com"]
for email in test_emails:
    print(f"{email:30s} -> {validate_email(email)}")
 
test_dates = ["2026-02-11", "2026-13-01", "2026-02-30", "abcd-ef-gh"]
for date in test_dates:
    print(f"{date:15s} -> {validate_date(date)}")

Extracting Structured Data

import re
 
# Parse a markdown table into a list of dictionaries
markdown_table = """
| Name    | Age | City       |
|---------|-----|------------|
| Alice   | 30  | New York   |
| Bob     | 25  | London     |
| Charlie | 35  | Tokyo      |
"""
 
# Skip the header separator row, extract data rows
rows = re.findall(r'^\|\s*(\w+)\s*\|\s*(\d+)\s*\|\s*([\w\s]+?)\s*\|$',
                  markdown_table, re.MULTILINE)
 
data = [{'name': name, 'age': int(age), 'city': city} for name, age, city in rows]
print(data)
# [{'name': 'Alice', 'age': 30, 'city': 'New York'},
#  {'name': 'Bob', 'age': 25, 'city': 'London'},
#  {'name': 'Charlie', 'age': 35, 'city': 'Tokyo'}]

If you regularly parse and explore structured data in notebooks, tools like RunCell can streamline this workflow. RunCell is an AI agent that runs inside Jupyter, helping you write, debug, and iterate on regex patterns and data extraction code interactively.

Common Mistakes and Pitfalls

1. Forgetting Raw Strings

Without the r prefix, Python interprets backslashes as escape characters before the regex engine sees them.

import re
 
# Wrong -- \b is interpreted as a backspace character
match = re.search('\bword\b', 'a word here')
print(match)  # None
 
# Correct -- raw string preserves \b for the regex engine
match = re.search(r'\bword\b', 'a word here')
print(match.group())  # word

2. Using re.match() When You Mean re.search()

re.match() only checks the beginning of the string. This catches many beginners off guard.

import re
 
# This returns None because "error" is not at position 0
result = re.match(r'error', "System error detected")
print(result)  # None
 
# Use re.search() or anchor with ^
result = re.search(r'error', "System error detected")
print(result.group())  # error

3. Greedy Matching Capturing Too Much

import re
 
# Greedy .* grabs everything between the FIRST { and LAST }
text = "{first} and {second}"
result = re.search(r'\{.*\}', text)
print(result.group())  # {first} and {second}
 
# Fix with lazy quantifier
result = re.search(r'\{.*?\}', text)
print(result.group())  # {first}

4. Not Escaping Special Characters

Characters like ., *, +, ?, (, ), [, ], {, }, ^, $, |, and \ have special meanings. If you want to match them literally, escape with \ or use re.escape().

import re
 
# Wrong -- . matches any character
result = re.findall(r'3.14', "3.14 and 3x14")
print(result)  # ['3.14', '3x14']
 
# Correct -- \. matches a literal dot
result = re.findall(r'3\.14', "3.14 and 3x14")
print(result)  # ['3.14']
 
# Use re.escape() for user-provided strings
user_input = "price is $5.00 (USD)"
safe_pattern = re.escape(user_input)
print(safe_pattern)  # price\ is\ \$5\.00\ \(USD\)
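The escaped pattern can then be used directly for a literal search:

```python
import re

# Search for user-provided text literally via re.escape()
user_input = "price is $5.00 (USD)"
pattern = re.escape(user_input)
match = re.search(pattern, "Note: price is $5.00 (USD) as listed")
print(match.group())  # price is $5.00 (USD)
```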

5. Catastrophic Backtracking

Certain patterns can cause the regex engine to take exponential time on specific inputs. Avoid nested quantifiers on overlapping character classes.

import re
import time
 
# Dangerous pattern -- nested quantifiers over the same character
# pattern = r'(a+)+b'  # DO NOT use on untrusted input: on a near-miss
# string like "a" * 30 with no trailing "b", the engine tries
# exponentially many ways to split the run of a's before failing
 
# Safe alternative -- flatten the quantifier
pattern = r'a+b'
 
text = "a" * 30 + "b"
start = time.time()
re.search(pattern, text)
print(f"Safe pattern took: {time.time() - start:.4f}s")

Regex in Data Science Workflows

Regular expressions are especially useful in data science for cleaning and transforming text columns in pandas DataFrames.

import re
import pandas as pd
 
df = pd.DataFrame({
    'raw_phone': ['(555) 123-4567', '555.123.4567', '555 123 4567', '+1-555-123-4567'],
    'raw_text': ['Price: $12.99', 'Cost is $8.50!', 'Only $199.00 left', 'Free (was $25)']
})
 
# Strip non-digit characters from phone numbers
df['clean_phone'] = df['raw_phone'].str.replace(r'\D', '', regex=True)
 
# Extract prices from text
df['price'] = df['raw_text'].str.extract(r'\$(\d+\.\d{2})')
 
print(df)

For interactive data exploration and visualization after cleaning, PyGWalker turns any pandas DataFrame into a drag-and-drop visual interface -- no chart code needed.

Summary

Python's re module provides a complete regex implementation that handles text matching, extraction, replacement, and splitting. The key points to remember:

  • Use raw strings (r'...') for all regex patterns
  • re.search() finds the first match anywhere; re.match() only checks the start
  • re.findall() extracts all matches; re.finditer() is memory-efficient for large texts
  • Capturing groups () extract sub-patterns; named groups (?P<name>...) improve readability
  • Lookahead and lookbehind assert context without consuming characters
  • re.compile() improves readability and performance for reused patterns
  • re.VERBOSE makes complex patterns maintainable with comments
  • Always test patterns against edge cases, and watch out for greedy matching and catastrophic backtracking

Regular expressions are a foundational skill for anyone working with text data in Python. Whether you are parsing logs, validating input, or cleaning datasets, mastering the re module will save you significant time and effort.
