Python: fastest way to check if a string matches a simple pattern
1. Context
In a python program, I want to see if it is possible to speed up checking for a list of string given a specific format.
In particular, I want to check a string such that:
-
the first character should be an ascii letter (lowercase or uppercase) or a digit;
-
any other character after that should be an ascii letter, a digit or any of
_. ():/,@[]
;
The constrain being to only use python standard library.
1.1. Timing it
First, I need a small decorator to be able to time a given function:
TIMINGS = {}
def timing(f):
"""Decorator to time functions"""
@functools.wraps(f)
def wrap(*args, **kw):
start = time.perf_counter()
try:
return f(*args, **kw)
finally:
end = time.perf_counter()
name = f.__name__
delta = end - start
delta_milliseconds = round(delta * 1000)
values = TIMINGS.get(name, [])
values.append(delta_milliseconds)
TIMINGS[name] = values
return wrap
def print_timings():
import statistics
for k, v in TIMINGS.items():
print(f"{k} took {round(statistics.mean(v), 3)}±{round(statistics.stdev(v), 3)} ms")
1.2. Generate data
Second, a way to generate some random data:
def random_word(length: int) -> str:
letters = string.ascii_lowercase + string.ascii_uppercase
return "".join(random.choice(letters) for i in range(length))
@timing
def generate(length: int) -> list[str]:
output = []
for _ in range(0, length):
output.append(random_word(80))
return output
1.3. First idea: use a regular expression
At first glance, the problem really looks like a regexp:
@timing
def with_regex(data: list[str]):
regexp = r"^[a-z0-9A-Z][a-z0-9A-Z\-_. ():/,@\[\]]*$"
re_name = re.compile(regexp)
for item in data:
if re.fullmatch(re_name, item):
continue
1.4. Second idea: set of characters + loop on each character
Another way to do that, would be by creating a set of characters that are valid, and then loop over each character of the string, and check if the character belongs to that set:
@timing
def with_sets(data: list[str]):
first_char = set(string.ascii_lowercase
+ string.ascii_uppercase
+ string.digits)
second_char = set(string.ascii_lowercase
+ string.ascii_uppercase
+ string.digits +
"\\-_. ():/,@[]")
for item in data:
if item[0] not in first_char:
continue
for i in item[1:]:
if i not in second_char:
break
1.5. Third idea: compare the numeric value of the char in the ascii table
ASCII has a peculiarity: letters and numbers are part of a range of numbers, and if we convert a character to its numerical representation, we can then compare it against the range that matches our character classes:
@timing
def with_ord(data: list[str]):
for item in data:
first_char = ord(item[0])
if not (
(48 <= first_char <= 57)
or (65 <= first_char <= 90)
or (97 <= first_char <= 122)
):
continue
for c in item[1:]:
i = ord(c)
if not (
(44 <= i <= 58)
or (64 <= i <= 93)
or (97 <= i <= 122)
or (i == 32)
or (i == 40)
or (i == 41)
):
break
1.6. Fourth idea: use strings as bytes
Instead of transforming a character to a numerical value, we can make use of all the operations on bytes in the standard library.
In particular, .translate
which allows us to remove all characters not matching
our classes, and checking if any characters remain:
@timing
def with_bytes(data: list[str]):
first_char = set(
bytes(string.ascii_lowercase, "ascii")
+ bytes(string.ascii_uppercase, "ascii")
+ bytes(string.digits, "ascii")
)
first_char = bytes(first_char)
second_char = set(
bytes(string.ascii_lowercase, "ascii")
+ bytes(string.ascii_uppercase, "ascii")
+ bytes(string.digits, "ascii")
+ "\\-_. ():/,@[]".encode("ascii", "ignore")
)
second_char = bytes(second_char)
for item in data:
if item[0].encode("ascii", "ignore").translate(None, first_char) != b"":
continue
if item.encode("ascii", "ignore").translate(None, second_char) != b"":
continue
1.7. A way to run everything
I run those multiple times in a row, and print the results:
if __name__ == "__main__":
for _ in range(0, 10):
data = generate(10_000_000)
with_regex(data)
with_sets(data)
with_ord(data)
with_bytes(data)
print_timings()
2. Performance
Results in milliseconds (ms) with standard deviation:
Solution |
Python 3.10 (ms±sd) |
Python 3.11 (ms±sd) |
with_regex |
9 362±202 |
9 535±128 |
with_sets |
19 505±333 |
14 460±190 |
with_ord |
87 463±1 850 |
43 629±544 |
with_bytes |
4 208±81 |
3 571±61 |
You can run this yourself with bench.py.