1. Context

In a python program, I want to see if it is possible to speed up checking for a list of string given a specific format.

In particular, I want to check a string such that:

  • the first character should be an ascii letter (lowercase or uppercase) or a digit;

  • any other character after that should be an ascii letter, a digit or any of _. ():/,@[];

The constrain being to only use python standard library.

1.1. Timing it

First, I need a small decorator to be able to time a given function:

TIMINGS = {}

def timing(f):
    """Decorator to time functions"""
    @functools.wraps(f)
    def wrap(*args, **kw):
        start = time.perf_counter()
        try:
            return f(*args, **kw)
        finally:
            end = time.perf_counter()
            name = f.__name__
            delta = end - start
            delta_milliseconds = round(delta * 1000)
            values = TIMINGS.get(name, [])
            values.append(delta_milliseconds)
            TIMINGS[name] = values
    return wrap

def print_timings():
    import statistics
    for k, v in TIMINGS.items():
        print(f"{k} took {round(statistics.mean(v), 3)}±{round(statistics.stdev(v), 3)} ms")

1.2. Generate data

Second, a way to generate some random data:

def random_word(length: int) -> str:
    letters = string.ascii_lowercase + string.ascii_uppercase
    return "".join(random.choice(letters) for i in range(length))


@timing
def generate(length: int) -> list[str]:
    output = []
    for _ in range(0, length):
        output.append(random_word(80))
    return output

1.3. First idea: use a regular expression

At first glance, the problem really looks like a regexp:

@timing
def with_regex(data: list[str]):
    regexp = r"^[a-z0-9A-Z][a-z0-9A-Z\-_. ():/,@\[\]]*$"
    re_name = re.compile(regexp)
    for item in data:
        if re.fullmatch(re_name, item):
            continue

1.4. Second idea: set of characters + loop on each character

Another way to do that, would be by creating a set of characters that are valid, and then loop over each character of the string, and check if the character belongs to that set:

@timing
def with_sets(data: list[str]):
    first_char = set(string.ascii_lowercase
        + string.ascii_uppercase
        + string.digits)
    second_char = set(string.ascii_lowercase
        + string.ascii_uppercase
        + string.digits +
         "\\-_. ():/,@[]")
    for item in data:
        if item[0] not in first_char:
            continue
        for i in item[1:]:
            if i not in second_char:
                break

1.5. Third idea: compare the numeric value of the char in the ascii table

ASCII has a peculiarity: letters and numbers are part of a range of numbers, and if we convert a character to its numerical representation, we can then compare it against the range that matches our character classes:

@timing
def with_ord(data: list[str]):
    for item in data:
        first_char = ord(item[0])
        if not (
            (48 <= first_char <= 57)
            or (65 <= first_char <= 90)
            or (97 <= first_char <= 122)
            ):
            continue
        for c in item[1:]:
            i = ord(c)
            if not (
                (44 <= i <= 58)
                or (64 <= i <= 93)
                or (97 <= i <= 122)
                or (i == 32)
                or (i == 40)
                or (i == 41)
            ):
                break

1.6. Fourth idea: use strings as bytes

Instead of transforming a character to a numerical value, we can make use of all the operations on bytes in the standard library.

In particular, .translate which allows us to remove all characters not matching our classes, and checking if any characters remain:

@timing
def with_bytes(data: list[str]):
    first_char = set(
        bytes(string.ascii_lowercase, "ascii")
        + bytes(string.ascii_uppercase, "ascii")
        + bytes(string.digits, "ascii")
    )
    first_char = bytes(first_char)
    second_char = set(
        bytes(string.ascii_lowercase, "ascii")
        + bytes(string.ascii_uppercase, "ascii")
        + bytes(string.digits, "ascii")
        + "\\-_. ():/,@[]".encode("ascii", "ignore")
    )
    second_char = bytes(second_char)
    for item in data:
        if item[0].encode("ascii", "ignore").translate(None, first_char) != b"":
            continue
        if item.encode("ascii", "ignore").translate(None, second_char) != b"":
            continue

1.7. A way to run everything

I run those multiple times in a row, and print the results:

if __name__ == "__main__":
    for _ in range(0, 10):
        data = generate(10_000_000)

        with_regex(data)
        with_sets(data)
        with_ord(data)
        with_bytes(data)
    print_timings()

2. Performance

Results in milliseconds (ms) with standard deviation:

Solution

Python 3.10 (ms±sd)

Python 3.11 (ms±sd)

with_regex

9 362±202

9 535±128

with_sets

19 505±333

14 460±190

with_ord

87 463±1 850

43 629±544

with_bytes

4 208±81

3 571±61

You can run this yourself with bench.py.