
The key quote here is:

"Regular expressions are one of computer science's shining examples of how using good theory leads to good programs ..."

"Today, regular expressions have also become a shining example of how ignoring good theory leads to bad programs. The regular expression implementations used by today's popular tools are significantly slower than the ones used in many of those thirty-year-old Unix tools."

The alarming thing is that regexes are supposed to be compiled before use, and the class of expressions that genuinely needs backtracking (and can blow up to quadratic or exponential time) is easy to tell apart from the class that can be matched in linear time. Why don't these popular, widely used regex implementations perform that optimization?
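
To see the pathology concretely, here's a quick sketch using Python's re module (a backtracking engine); exact timings are machine-dependent, but the exponential growth is hard to miss:

    import re
    import time

    # Nested quantifiers force a backtracking engine to try exponentially many
    # ways of splitting the a's between the two loops before it can report failure.
    pattern = re.compile(r"^(a+)+b$")

    for n in range(16, 25, 2):
        text = "a" * n                    # no trailing "b", so the match must fail
        start = time.perf_counter()
        pattern.match(text)               # returns None, but only after ~2**n attempts
        print(f"n={n:2d}  {time.perf_counter() - start:.3f}s")

An engine built on the automata approach Cox describes answers the same question in time proportional to n.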




So you have the classic NFA and DFA engines, and these more modern NSH (non-deterministic space heater) engines.

At least they keep you warm. Sometimes.

How can it not be a feature that you can heat more space the more spaces you feed it?


> How can it not be a feature that you can heat more space the more spaces you feed it?

You just made my day. Thank you.


Unfortunately, there is a hint as to how this has happened:

"This strategy is no longer practical: users have come to rely on backreferences for at least occasional use, and backreferences are part of the POSIX standard for regular expressions."

What better excuse is there for a poor implementation than standards compliance? In many ways, using regexes with backtracking by default is like programming in Lisp without tail-call optimization. If the POSIX standard is going to require backreferences, then it should also require a non-backtracking implementation for regular expressions without backreferences, just like the Scheme specification requires implementations to support proper tail calls.

The comparison is valid because both can turn what should be a linear-time algorithm into something quadratic or worse.


> If the POSIX standard is going to require backreferences, then it should also require a non-backtracking implementation for regular expressions without backreferences

At least in some regexp engines, this is possible. There is a concept of possessive (as opposed to reluctant, or as it's often called, "non-greedy") quantifiers. Given the string "abb", consider the following regexes:

/[ab]+b/ -- matches "abb"

/[ab]+?b/ -- matches "ab"

/[ab]++b/ -- DOES NOT MATCH!! (Because there is no back-tracking.)

More info: http://www.regular-expressions.info/possessive.html

One regexp engine which implements this (see section 4): https://raw.githubusercontent.com/k-takata/Onigmo/master/doc... -- this is the engine used by modern Ruby versions.
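
For what it's worth, Python's standard re module also gained possessive quantifiers in 3.11, so the same three cases can be reproduced there (a small sketch, assuming Python 3.11 or newer):

    import re   # possessive quantifiers need Python 3.11+ in the stdlib re module

    s = "abb"

    print(re.search(r"[ab]+b", s).group())    # greedy: backtracks one 'b', matches "abb"
    print(re.search(r"[ab]+?b", s).group())   # reluctant: stops early, matches "ab"
    print(re.search(r"[ab]++b", s))           # possessive: None -- [ab]++ swallows all of
                                              # "abb" and never gives the final 'b' back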


Yes, I recall that he talks about the multi-engine approach with respect to both PCRE and RE2 somewhere in these articles:

https://swtch.com/~rsc/regexp/

Sorry I don't have the exact reference handy. Apparently the author of PCRE tried to implement an NFA/DFA engine for some regexes as well. And I think RE2 might also have different strategies for different regexes, but I forget the details (it might be related to capturing).


FWIW, the conversion from NFA (non-deterministic, i.e. backtracking) to DFA (deterministic, linear time) can take exponential space. So there's another avenue for DDOS; it's a lot harder to exploit, though, because it requires the attacker to control the input (i.e. regular expression) to the NFA->DFA transformation, rather than merely provide a string that takes a long time for an NFA to recognize.


I'm actually not aware of any popular DFA based engines that suffer from this vulnerability. grep and RE2, for example, build the DFA lazily and cap the size of the DFA such that matching is still linear time even if generating the full DFA would take exponential space. (This is because at most one DFA state is generated for each byte in the input, so technically, you only ever need space for one or two states. In practice, you give a little more room to avoid constantly recomputing states, but it's still bounded.)
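
Roughly, the trick looks like this (a toy sketch of the idea, not RE2's or grep's actual code, using a hand-built NFA for the regex (a|b)*abb):

    # Toy NFA for (a|b)*abb: states 0..3, start 0, accept 3, no epsilon transitions.
    NFA = {
        (0, "a"): {0, 1},
        (0, "b"): {0},
        (1, "b"): {2},
        (2, "b"): {3},
    }
    START, ACCEPT = frozenset({0}), {3}
    MAX_CACHED_TRANSITIONS = 64        # cap on the lazily built DFA

    def matches(text):
        cache = {}                     # (frozenset of NFA states, char) -> frozenset
        current = START
        for ch in text:
            nxt = cache.get((current, ch))
            if nxt is None:
                # Build this DFA transition lazily from the NFA.
                nxt = frozenset(t for s in current for t in NFA.get((s, ch), ()))
                if len(cache) >= MAX_CACHED_TRANSITIONS:
                    cache.clear()      # bound memory; worst case we just recompute states
                cache[(current, ch)] = nxt
            current = nxt
        return bool(current & ACCEPT)

    print(matches("ababb"))   # True
    print(matches("abba"))    # False

Each input character either hits the cache or creates at most one new transition, so the scan stays linear time no matter how big the full DFA would be.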


Random side note... Recognized your name, we use your TOML Go package in a bunch of our tools. Thanks and keep up the great work!


There is a misunderstanding here -- NFA simulation is NOT backtracking. An NFA can be simulated in time LINEAR in the input length (holding the regex length as a constant). A DFA state corresponds to a set of NFA states. Converting the NFA to a DFA is an additional optimization, but it just avoids repeated calculation -- it doesn't affect the computational complexity. Cox's articles address this and are quite lucid.

The problem is that the Friedl O'Reilly book uses the terms NFA and DFA in a completely wrong and boneheaded way. I was confused by that book for a while.


Only if you model it by giving every combination of NFA states its own DFA state up front. If you don't mind describing the current state as a set of potential NFA states, it only takes linear space.


Usually there's no need to convert the NFA to a DFA. The NFA can be simulated directly, e.g. by using a bit vector instead of a single integer for the state accounting. Also, no backtracking is necessary to simulate an NFA.
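
A minimal sketch of that kind of direct simulation, with the set of live NFA states packed into one integer bitmask (hypothetical toy NFA for (a|b)*abb, whole-string matching only):

    # One bit per NFA state; a whole *set* of states fits in a single integer.
    N_STATES = 4
    START_MASK = 1 << 0                 # only state 0 is live at the start
    ACCEPT_MASK = 1 << 3                # state 3 is the accepting state

    # TRANS[ch][i] is the bitmask of states reachable from state i on character ch.
    TRANS = {
        "a": [0b0011, 0b0000, 0b0000, 0b0000],   # 0 -a-> {0,1}
        "b": [0b0001, 0b0100, 0b1000, 0b0000],   # 0 -b-> {0}, 1 -b-> {2}, 2 -b-> {3}
    }

    def matches(text):
        live = START_MASK
        for ch in text:
            row = TRANS.get(ch)
            if row is None:
                return False            # character outside the alphabet: nothing stays live
            nxt = 0
            for i in range(N_STATES):
                if live & (1 << i):
                    nxt |= row[i]       # union of successor state sets
            live = nxt
        return bool(live & ACCEPT_MASK)

    print(matches("ababb"))   # True
    print(matches("aabab"))   # False

Every character costs O(number of NFA states) work, never more, so there's no input that can make it blow up.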


The name regular expression seems meant to evoke regular languages, whose significance is that they can be recognized in a very straightforward way (by finite state automata) without pathological gotcha cases like this.

Perhaps we ought to call them irregular expressions - this kind of behavior is the literal antithesis of what regular means in automata/parsing theory.


It's a little unfair to complain that they're slower than 30-year-old regex engines when the old regex engines were so feature-limited that they were nearly useless.


They're only missing one feature: back references. That feature isn't needed for the majority of regular expressions, so the 30-year-old engines are actually remarkably useful and don't have these pathologies. And it was actually the introduction of back references (in Perl) that forces you to implement backtracking regular expression engines.


In addition to back references, Perl also has forward and back assertions (lookahead/lookbehind), IF/ELSE conditionals, and recursion IIRC. I'm not sure if anyone actually USES those features, but they are there.

Python adopted back references as well as forward and back assertions, but not the other stuff.

Perl really turned regular expressions into something else entirely...
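
For anyone who hasn't bumped into them, here's roughly what back references and the forward/back (lookahead/lookbehind) assertions look like in Python's re -- just a quick illustration:

    import re

    # Back reference: \1 must repeat exactly what group 1 captured.
    print(re.search(r"\b(\w+) \1\b", "it was was fine").group())   # "was was"

    # Lookahead assertion: match "foo" only when "bar" follows, without consuming it.
    print(re.search(r"foo(?=bar)", "foobar foobaz").span())        # (0, 3)

    # Lookbehind assertion: match digits only when preceded by a dollar sign.
    print(re.findall(r"(?<=\$)\d+", "$5 or 7 euros or $12"))       # ['5', '12']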


Yes, I forgot about assertions. I've never _really_ used them, but I remember they were useful once or twice when wrangling with a particularly ugly format problem.



