Stop abusing regexps, use parser combinators instead

Regular expressions are very convenient when used interactively. Yet they suffer from too many problems when used inside a program to check or validate input: they are “write-only”, hard to debug, hard to fix, and they give some of the worst error messages ever imagined.

Parser combinators, on the other hand, offer more power, better error messages, ease of reading and debugging, and ease of composition.

Let’s see how to use them in Python with an example.

1. Activity

Let’s imagine that you have a filename with some constraints to check:

it should start with APP-;
followed by the year, month and day: 2020-12-16;
followed by -Free_text;
followed by the file extension .app or .app2.

Furthermore, the following conditions must be met:

the date must be a valid one;
the free text section must contain at least 3 characters, and at most 42;
the free text section must not contain any spaces.
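
The "valid date" condition goes beyond counting digits: 55 is made of two digits, but it is not a month. As a minimal sketch, and assuming the standard library's datetime constructor is an acceptable definition of "valid", the check could look like this (is_valid_date is a hypothetical helper, not part of the original code):

```python
from datetime import datetime


def is_valid_date(year: int, month: int, day: int) -> bool:
    """Hypothetical helper: True when the numbers form a real calendar date."""
    try:
        # datetime() raises ValueError for out-of-range months or days
        datetime(year, month, day)
        return True
    except ValueError:
        return False


print(is_valid_date(1999, 12, 31))  # a real date
print(is_valid_date(1999, 55, 55))  # month and day out of range
print(is_valid_date(1900, 2, 29))   # 1900 was not a leap year
```

Letting the datetime constructor decide avoids re-implementing month lengths and leap years by hand.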

Here are some valid examples:

APP-1900-01-01-fre.app
APP-1999-12-31-école_sur_mesure.app2
APP-2015-01-01-this_sentence_is_about_四十二_character_long.app

Here are some invalid examples:

APP-1900-01-1-fre.app
APP-1999-55-55-What is that.app
APP-2015-10-10_this_is_not_allowed.app2
APP-2000-10-14-this_is_too_long_to_possibly_be_allowed_in_this_program_am_i_right.app

This activity is adapted from a real piece of code I saw.

2. Contract

The contract is a function that validates a given filename: it returns an object with the parsed data if successful, and raises an exception otherwise.

Validate function

from dataclasses import dataclass
from datetime import datetime

@dataclass
class ValidFilename:
    date: datetime
    free_text: str

def validate(filename: str) -> ValidFilename:
    # Placeholder: the real implementations follow in the next sections
    return ValidFilename(datetime.now(), "")

3. Unit-tests

Let’s convert our previous examples into actual unit-tests:

Python 3.8+ unit-tests

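import unittest

# Assumption: `parse` is the date parser from the third-party dateutil package
from dateutil.parser import parse


class TestExample(unittest.TestCase):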
    def test_valid_filenames(self):
        cases = [("1900-01-01", "fre", "app"),
                 ("1999-12-31", "école_sur_mesure", "app2"),
                 ("2015-01-01", "this_sentence_is_about_四十二_character_long", "app")]
        for date, free_text, extension in cases:
            filename = f"APP-{date}-{free_text}.{extension}"

            output = validate(filename)

            self.assertEqual(parse(date), output.date)
            self.assertEqual(free_text, output.free_text)

    def test_invalid_filenames(self):
        cases = [("1900-01-1", "fre", "app"),
                 ("1999-55-55", "What is that", "app"),
                 ("2000-10-14", "this_is_too_long_to_possibly_be_allowed_in_this_program_am_i_right", "app")]
        for date, free_text, extension in cases:
            filename = f"APP-{date}-{free_text}.{extension}"

            with self.assertRaises(Exception) as context:
                validate(filename)
            print(context.exception)

4. With regular expressions

Regexps aren’t that easy to write, so I recommend using an online tool to help with this.

Regexps based solution

def validate_regexps(filename: str) -> ValidFilename:
    import re
    # Raw string, so that \d and \. reach the regexp engine untouched
    pattern = r"^APP-(\d{4}-\d{2}-\d{2})-(.{3,42})\.(app|app2)$"  # a bug is hiding there
    p = re.compile(pattern)
    matches = p.match(filename)

    if matches is None:
        raise Exception(f"filename '{filename}' does not match '{pattern}'")

    return ValidFilename(parse(matches.group(1)), matches.group(2))


def validate(filename: str) -> ValidFilename:
    return validate_regexps(filename)

As you can see, each time the regexp does not match, we cannot really know why. The error message is always the same, and it’s up to the user to find the issue.
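
To make this concrete, here is a small sketch using only the standard re module and the same pattern as above. The three filenames are taken from the invalid examples of the activity and fail for three different reasons, yet the regexp engine reports exactly the same thing for each of them: no match.

```python
import re

PATTERN = re.compile(r"^APP-(\d{4}-\d{2}-\d{2})-(.{3,42})\.(app|app2)$")

invalid = [
    # the day has a single digit
    "APP-1900-01-1-fre.app",
    # "_" instead of "-" after the date
    "APP-2015-10-10_this_is_not_allowed.app2",
    # the free text is longer than 42 characters
    "APP-2000-10-14-this_is_too_long_to_possibly_be_allowed_in_this_program_am_i_right.app",
]

# match() returns None on failure, with no hint about the position or the cause
results = [PATTERN.match(name) for name in invalid]
for name, result in zip(invalid, results):
    print(f"{name!r}: {result}")
```

Whatever the cause, the caller only gets None back, so the error message cannot say more than "does not match".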

5. With parser combinators

Just like we used a regexp library (re), we will use a parser combinator library: parsy.

Parser combinators based solution

def validate_parsy(filename: str) -> ValidFilename:
    from parsy import string, seq, test_char, alt, ParseError
    digit = test_char(lambda c: c.isdigit(), 'a digit')
    year = digit.times(4).map(lambda l: int(''.join(l))).desc("4 digits year")
    month = digit.times(2).map(lambda l: int(''.join(l))).desc("2 digits month")
    day = digit.times(2).map(lambda l: int(''.join(l))).desc("2 digits day")
    dash = string("-")

    fulldate = seq(year, dash >> month, dash >> day).combine(datetime)

    valid_char = test_char(lambda c: not c.isspace() and c != '.', 'any char but space or .')
    free_textp = valid_char.times(3, 42).desc("Free text")

    extension = alt(string(".app2"), string(".app")).desc(".app2 or .app")

    # `>>` keeps the result on its right, `<<` keeps the result on its left
    p = seq(string("APP-") >> fulldate,
            string("-") >> free_textp << extension).combine(lambda a, b: [a, ''.join(b)])

    try:
        r = p.parse(filename)
        return ValidFilename(r[0], r[1])
    except ParseError as e:
        raise Exception(e)

This function is slightly longer, but it is made of small sub-parsers that can be reused elsewhere in the program.

6. Error messages

Error messages are a lot better. For example, for the filename APP-1900-01-1-fre.app, the regexp-based validator displays the following:

filename 'APP-1900-01-1-fre.app' does not match '^APP-(\d{4}-\d{2}-\d{2})-(.{3,42})\.(app|app2)$'

while the combinator-based parser displays the following:

expected '2 digits day' at 0:12

This tells you more accurately what the issue is and where it is: 0:12 is the zero-based line and column, which points at the lone 1 where a two-digit day was expected. You no longer have to parse the regexp yourself to find out which part is wrong.

7. What is a parser combinator?

A parser combinator is a special kind of composable function: it passes the data parsed so far, along with the remaining string to parse, from one function to the next. Parsers can therefore be chained one after the other (they form what is called an applicative functor).
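
To give an idea of the mechanics, here is a deliberately tiny sketch in plain Python. It is not how parsy is implemented; the names literal, digits, sequence and MiniParseError are invented for the example. Each parser takes the text and a start index, and returns the parsed value together with the index where the next parser should resume.

```python
from typing import Any, Callable, List, Tuple

# A parser maps (text, start index) to (parsed value, next index)
Parser = Callable[[str, int], Tuple[Any, int]]


class MiniParseError(Exception):
    """Raised with the position at which a parser failed."""


def literal(expected: str) -> Parser:
    """Parser that accepts exactly the given string."""
    def run(text: str, index: int) -> Tuple[str, int]:
        if text.startswith(expected, index):
            return expected, index + len(expected)
        raise MiniParseError(f"expected {expected!r} at index {index}")
    return run


def digits(count: int) -> Parser:
    """Parser that accepts exactly `count` decimal digits and returns an int."""
    def run(text: str, index: int) -> Tuple[int, int]:
        chunk = text[index:index + count]
        if len(chunk) == count and chunk.isdecimal():
            return int(chunk), index + count
        raise MiniParseError(f"expected {count} digits at index {index}")
    return run


def sequence(*parsers: Parser) -> Parser:
    """Combinator: run the parsers one after the other, threading the index."""
    def run(text: str, index: int) -> Tuple[List[Any], int]:
        values = []
        for parser in parsers:
            value, index = parser(text, index)
            values.append(value)
        return values, index
    return run


date = sequence(digits(4), literal("-"), digits(2), literal("-"), digits(2))

values, rest = date("2020-12-16-Free_text", 0)
print(values, rest)  # → [2020, '-', 12, '-', 16] 10
```

Because every parser knows the index it started from, a failure can report where it happened, which is where the precise error messages of the previous section come from.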

I strongly recommend Understanding Parser Combinators, where Scott Wlaschin implements a parser combinator library while explaining what this means.

Alternatively, his video presentation covers the same topic if you prefer that format.

8. Conclusion

Next time you reach for regexps to parse something in a program, remember to give parser combinators a try.