Textbook regex algorithms assume that you're trying to determine whether a string matches an expression using a standard set of operators. Perl regular expressions have to bind text to subexpressions (capture groups) rather than just return a yes/no answer, and they have a greatly expanded set of operators, some of which are completely incompatible with the usual NFA and DFA algorithms. Perl even lets you drop arbitrary Perl code into the middle of a regex. You can easily write a Perl regex whose matching problem is NP-hard. You can't do that with the simplified regex language that textbooks talk about.
The article offers a simple solution here: if the regex fits the model (no backreferences, no arbitrary code, etc.), use the linear-scaling algorithm described; otherwise, fall back to the slow, exponential one. Perl (according to the article) is not doing this.
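If you want to see the failure mode for yourself, the article's pathological case is easy to reproduce. A quick sketch (my code, not the article's, and how badly it blows up depends on your Perl version's backtracking optimizations):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Time::HiRes qw(time);

    # The article's pathological case: match a^n against (a?)^n a^n.
    # A backtracking engine tries exponentially many ways to assign
    # the optional a?'s before admitting defeat on each one.
    my $n = 20;                      # push this toward 25-30 and watch the runtime explode
    my $pattern = ('a?' x $n) . ('a' x $n);
    my $subject = 'a' x $n;

    my $start = time;
    my $matched = $subject =~ /^$pattern$/;
    printf "n=%d matched=%d in %.3fs\n", $n, $matched ? 1 : 0, time - $start;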
I have to say, the reason I spent so much time using Perl is its ease in handling text and its strength in regular expressions, which are fascinating. I now use regular expressions whenever I can, and in most cases I think it saves a lot of time.
I still remember when I was young and starting to work with Perl, I found a zip file with the whole Bible represented as (I think) one HTML file, with chapter/verse tags. I posed a question about it to an online friend who was then studying in Germany; he had built many CGI sites in Perl (he made our clan page for Earth2025, if anybody still remembers that game) and was very good at it. About a day later he ICQed me a one-page Perl script that recorded file index positions for all the chapters and verses. It was an ingenious solution that made seeking a particular chapter/verse range extremely fast, all done with plain text files and Perl.
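I obviously don't know what his script actually looked like, but the trick is a one-pass scan with tell() to record byte offsets, then seek() straight to any verse. A rough sketch, with a made-up tag format:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Hypothetical index builder: assumes verse markers like <verse id="John 3:16">
    # (the actual tag format in that file is anyone's guess).
    my %offset;
    open my $fh, '<', 'bible.html' or die "bible.html: $!";
    while (1) {
        my $pos  = tell $fh;            # byte offset of the line we're about to read
        my $line = <$fh>;
        last unless defined $line;
        $offset{$1} = $pos if $line =~ /<verse id="([^"]+)">/;
    }

    # Later lookups seek directly to the recorded offset instead of rescanning.
    sub fetch_verse {
        my ($key) = @_;
        return unless exists $offset{$key};
        seek $fh, $offset{$key}, 0;
        return scalar <$fh>;
    }

    print fetch_verse('John 3:16') // "not found\n";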
Sounds about right. Wikipedia has more information on regular languages, which computer scientists know a lot about, and of which the languages matched by Perl regexes are a strict superset:
"some of which are completely incompatible with the usual NFA and DFA algorithms" - It would be nice if you pointed to some article/resource giving more info on this.
Well, here's a starter for you. Using enhanced regex syntax you can match the language of strings whose length is prime. On the other hand, using the pumping lemma [1] you can prove that this set is not regular, and hence not matchable by a standard textbook regex.
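The usual demonstration (due to Abigail, if I remember right) works in unary and uses a backreference to look for a nontrivial factorization of the length:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # A unary number is composite iff it splits into two or more repetitions
    # of some block of length >= 2. The backreference \1 expresses exactly
    # that "repetition of whatever group 1 matched", which no DFA/NFA regex can.
    sub is_prime {
        my ($n) = @_;
        return ('1' x $n) !~ /^1?$|^(11+?)\1+$/;
    }

    printf "%2d %s\n", $_, is_prime($_) ? 'prime' : 'composite' for 2 .. 20;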
It would appear that Perl and the others were based on a from-scratch reimplementation of the Eighth Edition regex library. To quote:
While writing the text editor sam [6] in the early 1980s, Rob Pike wrote a new regular expression implementation, which Dave Presotto extracted into a library that appeared in the Eighth Edition. Pike's implementation incorporated submatch tracking into an efficient NFA simulation but, like the rest of the Eighth Edition source, was not widely distributed. Pike himself did not realize that his technique was anything new. Henry Spencer reimplemented the Eighth Edition library interface from scratch, but using backtracking, and released his implementation into the public domain. It became very widely used, eventually serving as the basis for the slow regular expression implementations mentioned earlier: Perl, PCRE, Python, and so on.
Same question here. I didn't realize that they didn't. Regular-expression NFA and DFA algorithms are in every compiler book out there. I find it hard to believe that Guido et al. didn't know about them.
Yeah, jimrandomh's comment basically explains it. Perl regexps have features that can't be implemented by a DFA, like backreferences and arbitrary code evaluation. (I believe you can implement capturing subgroups; you'd just need to annotate the states with markers for the current position in the string.) Those features are used often enough that you can't just leave them out.
I see no reason why popular languages couldn't implement both algorithms, though, and select the backtracking one only if the regexp contains backreferences or embedded code. It's easy enough to tell whether a regexp uses these features by checking for them when the regexp is compiled. Then you could use the fast algorithm when possible and the featureful one when not.
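A toy version of that compile-time check might look like this (the detection patterns are my rough approximation; a real implementation would tokenize the pattern rather than regex the regex):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Does this pattern use features that force the backtracking engine?
    sub needs_backtracking {
        my ($pat) = @_;
        return 1 if $pat =~ /(?<!\\)\\[1-9]/;    # numeric backreference \1..\9
        return 1 if $pat =~ /\\[gk][{<']/;       # \g{...}, \k<...> style backrefs
        return 1 if $pat =~ /\(\?\??\{/;         # (?{...}) / (??{...}) code blocks
        return 0;
    }

    for my $pat ('^a*b$', '^(a+)\1$', 'x(?{ print "hi" })y') {
        printf "%-22s => %s engine\n", $pat,
            needs_backtracking($pat) ? 'backtracking' : 'linear-time';
    }

If I remember right, the re::engine::RE2 wrapper on CPAN does essentially this in practice, falling back to the core Perl engine for patterns RE2 can't handle.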
I'm curious why they don't do some sort of cost estimation at regex-compile time. Oftentimes you use a regex only a few times in the life of the application, and it would be nice if the engine built the regex automaton under some time budget (say, a cap of N states created) and fell back to PCRE until it's fully built.
(Backseat nitpicker here. And just thinking about it seems deliciously hairy.)