Textbook regex algorithms assume that you're trying to determine whether a string matches an expression using a standard set of operators. Perl regular expressions have to bind text to subexpressions (capture groups) rather than just return a yes/no answer, and they have a greatly expanded set of operators, some of which are completely incompatible with the usual NFA and DFA algorithms. Perl even lets you drop arbitrary Perl code into the middle of a regex. You can easily write a Perl regex whose matching problem is NP-hard. You can't do that with the simplified regex language that textbooks talk about.
The article offers a simple solution here: if the regex fits the model (no backreferences, no arbitrary code, etc.), use the linear-scaling algorithm described; otherwise, fall back to the slow, exponential one. Perl (according to the article) is not doing this.
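If you want to see the failure mode for yourself, the article's pathological case is easy to reproduce. A quick sketch (my code, not the article's, and how badly it blows up depends on your Perl version's backtracking optimizations):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Time::HiRes qw(time);

    # The article's pathological case: match a^n against (a?)^n a^n.
    # A backtracking engine tries exponentially many ways to assign
    # the optional a?'s before admitting defeat on each one.
    my $n = 20;                      # push this toward 25-30 and watch the runtime explode
    my $pattern = ('a?' x $n) . ('a' x $n);
    my $subject = 'a' x $n;

    my $start = time;
    my $matched = $subject =~ /^$pattern$/;
    printf "n=%d matched=%d in %.3fs\n", $n, $matched ? 1 : 0, time - $start;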
I have to say, the reason I spent so much time using Perl is its ease in handling text and its strength in regular expressions, which are fascinating. I now use regular expressions whenever I can, and in most cases I think it saves a lot of time.
I still remember when I was young and starting to work with Perl, I found a zip file with the whole Bible represented as (I think) one HTML file, with chapter/verse tags. I posed a question about it to an online friend who was then studying in Germany; he had built many CGI sites in Perl (he made our clan page for Earth2025, if anybody still remembers that game) and was very good at it. About a day later he ICQed me a one-page Perl script that recorded file index positions for all the chapters and verses. It was an ingenious solution that made seeking a particular chapter/verse range extremely fast, all done with plain text files and Perl.
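I obviously don't know what his script actually looked like, but the trick is a one-pass scan with tell() to record byte offsets, then seek() straight to any verse. A rough sketch, with a made-up tag format:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Hypothetical index builder: assumes verse markers like <verse id="John 3:16">
    # (the actual tag format in that file is anyone's guess).
    my %offset;
    open my $fh, '<', 'bible.html' or die "bible.html: $!";
    while (1) {
        my $pos  = tell $fh;            # byte offset of the line we're about to read
        my $line = <$fh>;
        last unless defined $line;
        $offset{$1} = $pos if $line =~ /<verse id="([^"]+)">/;
    }

    # Later lookups seek directly to the recorded offset instead of rescanning.
    sub fetch_verse {
        my ($key) = @_;
        return unless exists $offset{$key};
        seek $fh, $offset{$key}, 0;
        return scalar <$fh>;
    }

    print fetch_verse('John 3:16') // "not found\n";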
Sounds about right. Wikipedia has more information on regular languages, which computer scientists know a lot about, and of which the languages matched by Perl regexes are a strict superset:
"some of which are completely incompatible with the usual NFA and DFA algorithms" - It would be nice if you pointed to some article/resource giving more info on this.
Well, here's a starter for you. Using enhanced regex syntax you can match the language of strings whose length is prime. On the other hand, using the pumping lemma [1] you can prove that this set is not regular, and hence not matchable by a standard textbook regex.
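The usual demonstration (due to Abigail, if I remember right) works in unary and uses a backreference to look for a nontrivial factorization of the length:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # A unary number is composite iff it splits into two or more repetitions
    # of some block of length >= 2. The backreference \1 expresses exactly
    # that "repetition of whatever group 1 matched", which no DFA/NFA regex can.
    sub is_prime {
        my ($n) = @_;
        return ('1' x $n) !~ /^1?$|^(11+?)\1+$/;
    }

    printf "%2d %s\n", $_, is_prime($_) ? 'prime' : 'composite' for 2 .. 20;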
It would appear that Perl and the others were based on a from-scratch reimplementation of the Eighth Edition regex library. To quote:
While writing the text editor sam [6] in the early 1980s, Rob Pike wrote a new regular expression implementation, which Dave Presotto extracted into a library that appeared in the Eighth Edition. Pike's implementation incorporated submatch tracking into an efficient NFA simulation but, like the rest of the Eighth Edition source, was not widely distributed. Pike himself did not realize that his technique was anything new. Henry Spencer reimplemented the Eighth Edition library interface from scratch, but using backtracking, and released his implementation into the public domain. It became very widely used, eventually serving as the basis for the slow regular expression implementations mentioned earlier: Perl, PCRE, Python, and so on.
Same question here. I didn't realize that they didn't. Regular-expression NFA and DFA algorithms are in every compiler book out there. I find it hard to believe that Guido et al. didn't know about them.
Yeah, jimrandomh's comment basically explains it. Perl regexps have features that can't be implemented by a DFA, like backreferences and arbitrary code evaluation. (I believe you can implement capturing subgroups; you'd just need to annotate the states with markers for the current position in the string.) Those features are used often enough that you can't just leave them out.
I see no reason why popular languages couldn't implement both algorithms, though, and select the backtracking one only if the regexp contains backreferences or embedded code. It's easy enough to tell whether a regexp uses these features by checking for them when the regexp is compiled. Then you could use the fast algorithm when possible and the featureful one when not.
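A toy version of that compile-time check might look like this (the detection patterns are my rough approximation; a real implementation would tokenize the pattern rather than regex the regex):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Does this pattern use features that force the backtracking engine?
    sub needs_backtracking {
        my ($pat) = @_;
        return 1 if $pat =~ /(?<!\\)\\[1-9]/;    # numeric backreference \1..\9
        return 1 if $pat =~ /\\[gk][{<']/;       # \g{...}, \k<...> style backrefs
        return 1 if $pat =~ /\(\?\??\{/;         # (?{...}) / (??{...}) code blocks
        return 0;
    }

    for my $pat ('^a*b$', '^(a+)\1$', 'x(?{ print "hi" })y') {
        printf "%-22s => %s engine\n", $pat,
            needs_backtracking($pat) ? 'backtracking' : 'linear-time';
    }

If I remember right, the re::engine::RE2 wrapper on CPAN does essentially this in practice, falling back to the core Perl engine for patterns RE2 can't handle.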
I'm curious why they don't do some sort of cost estimation at regex-compile time. Oftentimes you use a regex only a few times in the life of the application, and it would be nice if the engine built the regex automaton under some time budget (say, a cap of N states created) and fell back to PCRE until it's fully built.
(Backseat nitpicker here. And just thinking about it seems deliciously hairy.)