Stop abusing regexps, use parser combinators instead
Regular expressions are very convenient when used interactively. Yet, they suffer from too many problems when used as part of a program to check or validate inputs: they are “write-only”, very hard to debug, hard to fix, and they give some of the worst error messages ever imagined.
Parser combinators, on the other hand, offer more power, better error messages, ease of reading and debugging, and ease of composition.
Let’s see how to use them in Python with an example.
1. Activity
Let’s imagine that you have a filename with some constraints to check:
- it should start with APP-;
- followed by the year, month and day: 2020-12-16;
- followed by -Free_text;
- followed by the file extension .app or .app2.
Furthermore, the following conditions must be met:
- the date must be a valid one;
- the free text section must contain at least 3 characters, and at most 42;
- the free text section must not contain any spaces.
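The first condition can be checked with the standard library alone. As a minimal sketch, assuming a hypothetical helper named is_real_date (it is not part of the original code), the datetime constructor already rejects impossible dates:

```python
from datetime import datetime


def is_real_date(year: int, month: int, day: int) -> bool:
    """Hypothetical helper: True if the three numbers form a real calendar date."""
    try:
        # datetime() raises ValueError for out-of-range months or days.
        datetime(year, month, day)
    except ValueError:
        return False
    return True


print(is_real_date(1999, 12, 31))  # True
print(is_real_date(1999, 55, 55))  # False: there is no month 55
print(is_real_date(2021, 2, 29))   # False: 2021 is not a leap year
```

This only covers the calendar check; the shape of the date (four digits, two digits, two digits) still has to be verified separately.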
Here are some valid examples:
- APP-1900-01-01-fre.app
- APP-1999-12-31-école_sur_mesure.app2
- APP-2015-01-01-this_sentence_is_about_四十二_character_long.app
Here are some invalid examples:
- APP-1900-01-1-fre.app
- APP-1999-55-55-What is that.app
- APP-2015-10-10_this_is_not_allowed.app2
- APP-2000-10-14-this_is_too_long_to_possibly_be_allowed_in_this_program_am_i_right.app
This activity is adapted from a real piece of code I saw.
2. Contract
We need a function that validates a given filename: it returns an object with the parsed data if successful, and raises an exception otherwise.

from dataclasses import dataclass
from datetime import datetime


@dataclass
class ValidFilename:
    date: datetime
    free_text: str


def validate(filename: str) -> ValidFilename:
    # Placeholder body: it only pins down the signature for now.
    return ValidFilename(datetime.now(), "")
3. Unit-tests
Let’s convert our previous examples into actual unit-tests:
import unittest

from dateutil.parser import parse  # assumption: parse() is python-dateutil's parser


class TestExample(unittest.TestCase):
    def test_valid_filenames(self):
        cases = [("1900-01-01", "fre", "app"),
                 ("1999-12-31", "école_sur_mesure", "app2"),
                 ("2015-01-01", "this_sentence_is_about_四十二_character_long", "app")]
        for case in cases:
            date = case[0]
            free_text = case[1]
            extension = case[2]
            filename = f"APP-{date}-{free_text}.{extension}"
            output = validate(filename)
            self.assertEqual(parse(date), output.date)
            self.assertEqual(free_text, output.free_text)

    def test_invalid_filenames(self):
        cases = [("1900-01-1", "fre", "app"),
                 ("1999-55-55", "What is that", "app"),
                 ("2000-10-14", "this_is_too_long_to_possibly_be_allowed_in_this_program_am_i_right", "app")]
        for case in cases:
            date = case[0]
            free_text = case[1]
            extension = case[2]
            filename = f"APP-{date}-{free_text}.{extension}"
            with self.assertRaises(Exception) as context:
                validate(filename)
            # Show the error message each implementation produces.
            print(context.exception)
4. With regular expressions
Regexps aren’t that easy to write, so I recommend using an online tool to help with this.
def validate_regexps(filename: str) -> ValidFilename:
    import re
    from dateutil.parser import parse  # assumption: parse() is python-dateutil's parser

    # Raw string, so that \d and \. reach the regexp engine untouched.
    pattern = r"^APP-(\d{4}-\d{2}-\d{2})-(.{3,42})\.(app|app2)$"  # a bug is hiding there
    p = re.compile(pattern)
    matches = p.match(filename)
    if matches is None:
        raise Exception(f"filename '{filename}' does not match '{pattern}'")
    return ValidFilename(parse(matches.group(1)), matches.group(2))


def validate(filename: str) -> ValidFilename:
    return validate_regexps(filename)
As you can see, whenever the regexp does not match, we cannot really tell why. The error message is always the same, and it’s up to the user to find the issue.
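To make this concrete, here is a small sketch using only the standard re module. The helper name regexp_accepts and the sample filenames are assumptions chosen for illustration; the pattern is the one from the function above, written here as a raw string.

```python
import re

PATTERN = r"^APP-(\d{4}-\d{2}-\d{2})-(.{3,42})\.(app|app2)$"


def regexp_accepts(filename: str) -> bool:
    """Hypothetical helper: True if the pattern matches the whole filename."""
    return re.match(PATTERN, filename) is not None


# A malformed day is rejected, but the only feedback is "no match".
print(regexp_accepts("APP-1900-01-1-fre.app"))            # False

# "." matches any character, so spaces in the free text slip through.
print(regexp_accepts("APP-1999-12-31-What is that.app"))  # True

# The pattern only checks the shape of the date, not whether it exists.
print(regexp_accepts("APP-1999-55-55-fre.app"))           # True
```

In the function above, the impossible date is still caught afterwards by the date parsing step, but the space in the free text is not, and nothing in a bare "does not match" message would point at either problem.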
5. With parser combinators
Just like we used a regexp library (re), we will use a parser combinator library: parsy.
def validate_parsy(filename: str) -> ValidFilename:
    from parsy import string, seq, test_char, alt, ParseError

    # Small building blocks: each one parses a single piece of the filename.
    digit = test_char(lambda c: c.isdigit(), 'a digit')
    year = digit.times(4).map(lambda l: int(''.join(l))).desc("4 digits year")
    month = digit.times(2).map(lambda l: int(''.join(l))).desc("2 digits month")
    day = digit.times(2).map(lambda l: int(''.join(l))).desc("2 digits day")
    dash = string("-")
    # datetime(year, month, day) raises ValueError for impossible dates such as 1999-55-55.
    fulldate = seq(year, dash >> month, dash >> day).combine(datetime)
    valid_char = test_char(lambda c: not c.isspace() and c != '.', 'any char but space or .')
    free_textp = valid_char.times(3, 42).desc("Free text")
    extension = alt(string(".app2"), string(".app")).desc(".app2 or .app")
    # Whole filename: prefix, date, free text, then the extension (checked but discarded).
    p = seq(string("APP-") >> fulldate,
            string("-") >> free_textp << extension).combine(lambda a, b: [a, ''.join(b)])
    try:
        r = p.parse(filename)
        return ValidFilename(r[0], r[1])
    except ParseError as e:
        raise Exception(e)
This function is slightly longer, but it is made of multiple sub-parsers that can be reused elsewhere in the program.
6. Error messages
Error messages are a lot better. For example, for the filename APP-1900-01-1-fre.app, the regexp version will display the following:
filename 'APP-1900-01-1-fre.app' does not match '^APP-(\d{4}-\d{2}-\d{2})-(.{3,42})\.(app|app2)$'
while the parser combinator version will display the following:
expected '2 digits day' at 0:12
This tells you more accurately what the issue is and where it is, so you no longer have to parse the regexp yourself to work out which part is wrong.
7. What is a parser combinator?
A parser combinator is a special kind of function that is composable, and that aggregates the parsed data along with the remaining string to parse from one function to another. Parsers built this way can be chained one after the other (they form what is called an applicative functor).
I strongly recommend “Understanding Parser Combinators”, where Scott Wlaschin implements a parser combinator library while explaining what this means.
Alternatively, his video presentation covers the same topic if you prefer that format.
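To give a rough idea of the mechanics, here is a deliberately tiny, hand-rolled sketch in plain Python. It is not how parsy is implemented, and the names literal, digits and sequence are hypothetical. It only illustrates the idea described above: each parser returns the parsed value together with the position of the remaining input, and a combinator chains parsers by threading that position through.

```python
from typing import Any, Callable, Tuple

# A parser takes the full text and a start index, and returns
# (parsed value, index where the remaining text starts).
# It raises ValueError when the input does not match.
Parser = Callable[[str, int], Tuple[Any, int]]


def literal(expected: str) -> Parser:
    """Parser that matches one exact string."""
    def run(text: str, index: int) -> Tuple[Any, int]:
        if text.startswith(expected, index):
            return expected, index + len(expected)
        raise ValueError(f"expected {expected!r} at index {index}")
    return run


def digits(count: int) -> Parser:
    """Parser that matches exactly `count` decimal digits and returns an int."""
    def run(text: str, index: int) -> Tuple[Any, int]:
        chunk = text[index:index + count]
        if len(chunk) == count and chunk.isdecimal():
            return int(chunk), index + count
        raise ValueError(f"expected {count} digits at index {index}")
    return run


def sequence(*parsers: Parser) -> Parser:
    """Combinator: run the parsers one after the other, threading the index."""
    def run(text: str, index: int) -> Tuple[Any, int]:
        values = []
        for parser in parsers:
            value, index = parser(text, index)
            values.append(value)
        return values, index
    return run


# Small parsers composed into a bigger one for "YYYY-MM-DD".
date_parser = sequence(digits(4), literal("-"), digits(2), literal("-"), digits(2))

print(date_parser("2020-12-16", 0))  # ([2020, '-', 12, '-', 16], 10)

try:
    date_parser("1900-01-1-fre", 0)
except ValueError as error:
    print(error)  # expected 2 digits at index 8
```

Because every small parser knows its own position and what it expected, the precise error message comes for free once the parsers are composed.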
8. Conclusion
Next time you reach for regexps to parse something in a program, remember to give parser combinators a try.