Python Regular Expressions (re Module) – A Practical Guide
Discover how to harness the power of regular expressions in Python with the re module. Step‑by‑step examples illustrate common patterns, matching techniques, and advanced functions.
What Is a Regular Expression?
A regular expression (RegEx) is a sequence of characters that defines a search pattern. For example, the pattern ^a...s$ matches any five‑letter string that starts with a and ends with s.
^a...s$
Patterns can be used to match against strings. The following table demonstrates how the pattern behaves with different inputs:
| Expression | String | Matched? |
|---|---|---|
| ^a...s$ | abs | No match |
| | alias | Match |
| | abyss | Match |
| | Alias | No match |
| | An abacus | No match |
Python’s re module provides the tools you need to work with RegEx. Here’s a quick example that uses re.match() to check a pattern against a string:
import re
pattern = '^a...s$'
test_string = 'abyss'
result = re.match(pattern, test_string)
if result:
    print("Search successful.")
else:
    print("Search unsuccessful.")
When the pattern matches at the beginning of the string, re.match() returns a match object; otherwise it returns None.
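Because re.match() only looks at the start of the string, a pattern without anchors can still fail on text that contains a match further in. A minimal sketch, using sample strings invented for illustration:

```python
import re

pattern = r'a...s'  # like the pattern above, but without the ^ and $ anchors

# re.match() only tries the pattern at the beginning of the string.
at_start = re.match(pattern, 'abyss and more')
not_at_start = re.match(pattern, 'an abyss')

print(at_start.group())  # abyss
print(not_at_start)      # None
```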
Specifying Patterns with Metacharacters
Metacharacters are special symbols that the regex engine interprets in a unique way. Below is a quick reference for the most common metacharacters:
[] . ^ $ * + ? {} () \ |
Square Brackets: []
Square brackets define a set of characters to match. For instance, [abc] matches any single occurrence of a, b, or c.
| Expression | String | Matched? |
|---|---|---|
| [abc] | a | 1 match |
| | ac | 2 matches |
| | Hey Jude | No match |
| | abc de ca | 5 matches |
Ranges can be expressed with a hyphen, e.g., [a-e] equals [abcde]. To invert a set, place a caret ^ after the opening bracket, e.g., [^abc] matches any character except a, b, or c.
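The set, range, and negated forms can be compared side by side with re.findall(), which returns every match as a list. This is a small sketch on the sample string from the table:

```python
import re

text = 'abc de ca'

in_set = re.findall(r'[abc]', text)       # any one of a, b, c
in_range = re.findall(r'[a-e]', text)     # any one of a through e
not_in_set = re.findall(r'[^abc]', text)  # anything except a, b, c

print(in_set)      # ['a', 'b', 'c', 'c', 'a']
print(in_range)    # ['a', 'b', 'c', 'd', 'e', 'c', 'a']
print(not_in_set)  # [' ', 'd', 'e', ' ']
```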
Period: .
The dot matches any single character except a newline.
| Expression | String | Matched? |
|---|---|---|
| .. | a | No match |
| | ac | 1 match |
| | acd | 1 match |
| | acde | 2 matches (contains 4 characters) |
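The rows above can be reproduced with re.findall(). The second half of this sketch shows the newline exception, together with the re.DOTALL flag, which is one way to lift it:

```python
import re

# Each match consumes two characters, so matches never overlap.
pairs = {s: re.findall(r'..', s) for s in ['a', 'ac', 'acd', 'acde']}
print(pairs)
# {'a': [], 'ac': ['ac'], 'acd': ['ac'], 'acde': ['ac', 'de']}

# By default the dot skips newlines; re.DOTALL lets it match them too.
without_flag = re.findall(r'.', 'a\nb')
with_flag = re.findall(r'.', 'a\nb', flags=re.DOTALL)
print(without_flag)  # ['a', 'b']
print(with_flag)     # ['a', '\n', 'b']
```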
Caret: ^ and Dollar: $
The caret asserts the start of a string; the dollar sign asserts the end. For example, ^a matches any string beginning with a, while a$ matches any string ending with a.
| Expression | String | Matched? |
|---|---|---|
| ^a | a | 1 match |
| | abc | 1 match |
| | bac | No match |
| ^ab | abc | 1 match |
| | acb | No match (starts with a but not followed by b) |

| Expression | String | Matched? |
|---|---|---|
| a$ | a | 1 match |
| | formula | 1 match |
| | cab | No match |
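As a quick check of both anchors, the sketch below filters the sample strings from the tables with re.search(), which scans the whole string and leaves positioning to the anchors:

```python
import re

# ^ pins the match to the start of the string.
starts_with_a = [s for s in ['a', 'abc', 'bac'] if re.search(r'^a', s)]

# $ pins the match to the end of the string.
ends_with_a = [s for s in ['a', 'formula', 'cab'] if re.search(r'a$', s)]

print(starts_with_a)  # ['a', 'abc']
print(ends_with_a)    # ['a', 'formula']
```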
Quantifiers: *, +, ?, {n,m}
These symbols control how many times the preceding element should appear.
- *: zero or more times.
- +: one or more times.
- ?: zero or one time.
- {n,m}: at least n and at most m times.
Examples:
| Expression | String | Matched? |
|---|---|---|
| ma*n | mn | 1 match |
| | man | 1 match |
| | maaan | 1 match |
| | main | No match (a is not followed by n) |
| | woman | 1 match |

| Expression | String | Matched? |
|---|---|---|
| ma+n | mn | No match (no a) |
| | man | 1 match |
| | maaan | 1 match |
| | main | No match (a is not followed by n) |
| | woman | 1 match |

| Expression | String | Matched? |
|---|---|---|
| ma?n | mn | 1 match |
| | man | 1 match |
| | maaan | No match (more than one a) |
| | main | No match (a is not followed by n) |
| | woman | 1 match |

| Expression | String | Matched? |
|---|---|---|
| a{2,3} | abc dat | No match |
| | abc daat | 1 match (aa in daat) |
| | aabc daaat | 2 matches (aa in aabc, aaa in daaat) |
| | aabc daaaat | 2 matches (aa in aabc, aaa in daaaat) |
Commonly used pattern: [0-9]{2,4} matches two to four consecutive digits.
| Expression | String | Matched? |
|---|---|---|
| [0-9]{2,4} | ab123csde | 1 match (123) |
| | 12 and 345673 | 3 matches (12, 3456, 73) |
| | 1 and 2 | No match |
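The quantifier tables can be verified in a few lines. This sketch runs each pattern over the same word list, then shows how the greedy {n,m} form splits a run of digits:

```python
import re

words = ['mn', 'man', 'maaan', 'main', 'woman']

# For each quantifier, keep the words in which the pattern is found.
results = {}
for pattern in [r'ma*n', r'ma+n', r'ma?n']:
    results[pattern] = [w for w in words if re.search(pattern, w)]
    print(pattern, results[pattern])
# ma*n ['mn', 'man', 'maaan', 'woman']
# ma+n ['man', 'maaan', 'woman']
# ma?n ['mn', 'man', 'woman']

# {n,m} is greedy: it takes as many repetitions as it can, up to m.
a_runs = re.findall(r'a{2,3}', 'aabc daaaat')
digit_runs = re.findall(r'[0-9]{2,4}', '12 and 345673')
print(a_runs)      # ['aa', 'aaa']
print(digit_runs)  # ['12', '3456', '73']
```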
Alternation: |
The vertical bar implements logical OR. For example, a|b matches any string containing a or b.
| Expression | String | Matched? |
|---|---|---|
| a\|b | cde | No match |
| | ade | 1 match (a) |
| | acdbea | 3 matches (a, b, a) |
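A short sketch of alternation in practice. The first three calls use the strings from the table; the second example string is made up to show that each alternative can be a whole word, not just a single character:

```python
import re

no_match = re.findall(r'a|b', 'cde')
one_match = re.findall(r'a|b', 'ade')
three_matches = re.findall(r'a|b', 'acdbea')
print(no_match)       # []
print(one_match)      # ['a']
print(three_matches)  # ['a', 'b', 'a']

# Each side of | is a complete alternative, so whole words work as well.
pets = re.findall(r'cat|dog', 'cat dog bird')
print(pets)  # ['cat', 'dog']
```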
Grouping: ()
Parentheses group sub‑patterns and capture matched substrings. Example: (a|b|c)xz matches any string that contains a, b, or c followed by xz.
| Expression | String | Matched? |
|---|---|---|
| (a\|b\|c)xz | ab xz | No match |
| | abxz | 1 match (bxz) |
| | axz cabxz | 2 matches (axz, bxz) |
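One detail worth knowing when a pattern contains a group: re.findall() returns the captured group rather than the full match. The sketch below contrasts that with re.finditer(), which yields match objects whose group() gives the full matched text:

```python
import re

pattern = r'(a|b|c)xz'
text = 'axz cabxz'

# With one capturing group, findall() returns only what the group captured.
captured = re.findall(pattern, text)
print(captured)  # ['a', 'b']

# finditer() yields match objects, so the full match is still available.
full_matches = [m.group() for m in re.finditer(pattern, text)]
print(full_matches)  # ['axz', 'bxz']

# The space separates "ab" from "xz", so nothing matches here.
print(re.search(pattern, 'ab xz'))  # None
```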
Escaping: \
Use a backslash to treat a metacharacter as a literal. For instance, \$a matches the literal sequence $a rather than interpreting $ as the end‑of‑string anchor.
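A brief sketch of escaping, using made-up sample strings. When the literal text comes from a variable, re.escape() from the standard library can add the backslashes for you:

```python
import re

# \$ matches a literal dollar sign instead of anchoring to the end.
literal_dollar = re.findall(r'\$a', 'cost: $a per unit')
print(literal_dollar)  # ['$a']

# Unescaped, + is a quantifier: "1+1" means one or more 1s followed by a 1.
unescaped = re.findall(r'1+1', '1+1=2, 11')
print(unescaped)  # ['11']

# re.escape() escapes the metacharacters so the text is matched literally.
escaped = re.findall(re.escape('1+1'), '1+1=2, 11')
print(escaped)  # ['1+1']
```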
Special Sequences
Special sequences simplify common patterns. Below is a reference list with illustrative examples.
Start of String: \A
| Expression | String | Matched? |
|---|---|---|
| \Athe | the sun | Match |
| | In the sun | No match |
Word Boundary: \b
| Expression | String | Matched? |
|---|---|---|
| \bfoo | football | Match |
| | a football | Match |
| | afootball | No match |
| foo\b | the foo | Match |
| | the afoo test | Match |
| | the afootest | No match |
Non‑Word Boundary: \B
| Expression | String | Matched? |
|---|---|---|
| \Bfoo | football | No match |
| | a football | No match |
| | afootball | Match |
| foo\B | the foo | No match |
| | the afoo test | No match |
| | the afootest | Match |
Digit: \d / Non‑Digit: \D
| Expression | String | Matched? |
|---|---|---|
| \d | 12abc3 | 3 matches (1, 2, 3) |
| | Python | No match |
| \D | 1ab34"50 | 3 matches (a, b, ") |
| | 1345 | No match |
Whitespace: \s / Non‑Whitespace: \S
| Expression | String | Matched? |
|---|---|---|
| \s | Python RegEx | 1 match |
| | PythonRegEx | No match |
| \S | a b | 2 matches (a, b) |
| | (whitespace only) | No match |
Word Character: \w / Non‑Word: \W
| Expression | String | Matched? |
|---|---|---|
| \w | 12&": ;c | 3 matches (1, 2, c) |
| | "%> ! | No match |
| \W | 1a2%c | 1 match (%) |
| | Python | No match |
End of String: \Z
| Expression | String | Matched? |
|---|---|---|
| Python\Z | I like Python | 1 match |
| | I like Python Programming | No match |
| | Python is fun. | No match |
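Several of these sequences can be tried together. This is a minimal sketch; the sample sentence is invented for illustration, and raw strings are used so the backslashes reach the regex engine unchanged:

```python
import re

text = 'a football fan scored 12 goals in 2024'

numbers = re.findall(r'\d+', text)      # runs of digits
word_start = re.findall(r'\bfoo', text)  # foo at the start of a word
print(numbers)     # ['12', '2024']
print(word_start)  # ['foo']

# \B is the opposite of \b: foo must NOT sit at a word boundary.
inside_word = re.findall(r'\Bfoo', 'afootball')
print(inside_word)  # ['foo']

# \w covers letters, digits, and the underscore.
words = re.findall(r'\w+', 'a_1 b-2')
print(words)  # ['a_1', 'b', '2']

# \Z only matches at the very end of the string.
ends_with_python = re.search(r'Python\Z', 'I like Python') is not None
print(ends_with_python)  # True
```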
Tip: Use an online regex tester like regex101 to craft and debug patterns quickly.
Using RegEx in Python
Python’s re module offers a rich set of functions and constants for regex operations. Import it with:
import re
Below are the most common utilities and how to use them.
re.findall()
Returns a list of all non‑overlapping matches in a string.
# Extract all numbers from a string
import re
string = 'hello 12 hi 89. Howdy 34'
pattern = r'\d+'
result = re.findall(pattern, string)
print(result)
# Output: ['12', '89', '34']
When no match is found, an empty list is returned.
re.split()
Splits a string at each point where the pattern matches, returning a list of substrings.
import re
string = 'Twelve:12 Eighty nine:89.'
pattern = r'\d+'
result = re.split(pattern, string)
print(result)
# Output: ['Twelve:', ' Eighty nine:', '.']
Use the optional maxsplit argument to limit the number of splits. A value of 0 (the default) splits at every match.
import re
string = 'Twelve:12 Eighty nine:89 Nine:9.'
pattern = r'\d+'
# Split only at the first occurrence
result = re.split(pattern, string, maxsplit=1)
print(result)
# Output: ['Twelve:', ' Eighty nine:89 Nine:9.']
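A related behavior of re.split(): if the pattern is wrapped in a capturing group, the matched separators are kept in the result instead of being dropped. A small sketch using the same sample string:

```python
import re

string = 'Twelve:12 Eighty nine:89.'

# Without a group, the digits are removed from the result.
without_group = re.split(r'\d+', string)
print(without_group)  # ['Twelve:', ' Eighty nine:', '.']

# With a capturing group, the digits appear as separate list items.
with_group = re.split(r'(\d+)', string)
print(with_group)  # ['Twelve:', '12', ' Eighty nine:', '89', '.']
```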
re.sub()
Replaces all occurrences of a pattern with a replacement string.
# Remove all whitespace characters
import re
string = 'abc 12de 23 \n f45 6'
pattern = r'\s+'
replace = ''
new_string = re.sub(pattern, replace, string)
print(new_string)
# Output: abc12de23f456
To limit replacements, provide a count argument. A value of 0 replaces every match.
new_string = re.sub(r'\s+', replace, string, count=1)
print(new_string)
# Output:
# abc12de 23
# f45 6
re.subn()
Like re.sub(), but returns a tuple containing the new string and the number of substitutions performed.
new_string = re.subn(pattern, replace, string)
print(new_string)
# Output: ('abc12de23f456', 4)
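The replacement string can also refer back to captured groups with \1, \2, and so on. The sketch below assumes a made-up string of dates and reorders each one, then uses re.subn() to count how many were found:

```python
import re

dates = 'Due 2024-03-15, shipped 2024-04-01'

# \1, \2, \3 in the replacement refer to the three captured groups.
swapped = re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\3/\2/\1', dates)
print(swapped)  # Due 15/03/2024, shipped 01/04/2024

# subn() returns the new string together with the number of replacements.
masked, count = re.subn(r'\d{4}-\d{2}-\d{2}', '<date>', dates)
print(masked, count)  # Due <date>, shipped <date> 2
```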
re.search()
Finds the first location where the pattern matches.
string = "Python is fun"
match = re.search('\\APython', string)
if match:
    print("pattern found inside the string")
else:
    print("pattern not found")
# Output: pattern found inside the string
Match Object
When a match is found, re.search() returns a Match object. Common methods and attributes include:
- group(): the matched substring.
- start() / end(): indices of the match.
- span(): a tuple of (start, end).
- groups(): all captured groups.
string = '39801 356, 2102 1111'
pattern = '(\\d{3}) (\\d{2})'
match = re.search(pattern, string)
if match:
    print(match.group())
    print(match.group(1))
    print(match.group(2))
    print(match.groups())
    print(match.start(), match.end(), match.span())
else:
    print("pattern not found")
# Output:
# 801 35
# 801
# 35
# ('801', '35')
# 2 8 (2, 8)
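re.search() stops at the first match. To inspect every match with the same Match methods, one option is re.finditer(), which yields a match object per occurrence. A sketch on the same string:

```python
import re

string = '39801 356, 2102 1111'
pattern = r'(\d{3}) (\d{2})'

# Collect the text, captured groups, and position of every match.
found = [(m.group(), m.groups(), m.span()) for m in re.finditer(pattern, string)]
for text, groups, span in found:
    print(text, groups, span)
# 801 35 ('801', '35') (2, 8)
# 102 11 ('102', '11') (12, 18)
```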
Using Raw Strings (prefix r)
Prefixing a string literal with r treats backslashes as literal characters, preventing accidental escape sequences. This is especially useful in regex patterns.
string = '\n and \r are escape sequences.'
result = re.findall(r'[\n\r]', string)
print(result)
# Output: ['\n', '\r']
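The difference between a regular and a raw string literal is easiest to see by comparing lengths. A minimal sketch:

```python
import re

# In a regular literal, \n is a single newline character.
regular_len = len('\n')
# In a raw literal, it stays as two characters: a backslash and an n.
raw_len = len(r'\n')
print(regular_len, raw_len)  # 1 2

# A raw pattern is equivalent to doubling every backslash by hand.
same_pattern = r'\d+' == '\\d+'
print(same_pattern)  # True

digits = re.findall(r'\d+', 'a1b22')
print(digits)  # ['1', '22']
```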
For deeper exploration of the re module, refer to the official Python documentation.