It doesn't actually take that many lines of code to implement a linear time NFA engine. Most of the code is actually in the regex compiler. That is, there are only a few actual "instructions" or node types in a regex engine (alternation, concatenation, etc.). The rest is just compiling the bizarre syntax to a those nodes/instructions. (And dealing with Unicode if you need that.)
The whole awk implementation is 958 lines, so it can't be that bad. It supports a fairly featureful regex language (though no capturing AFAICT).
Hm it seems to be a copy of the GNU libc regex engine. And coreutils, grep, and awk also package the same engine in user space! That's really annoying.
But yes it looks like it uses a struct re_dfa_t for matching, and some special flags for backreferences? That seems to indicate that they are using linear time maybe with a fallback? Hm.
I think the general turn of events was that libc supported BRE at first, which was implemented using the DFA algorithm. Then Perl implemented a backtracking engine and Perl syntax for REs, and Python/Ruby/etc. wanted to be compatible with Perl, so they also implemented backtracking as well. Because Perl has other features like forward and back assertions that require backtracking.
And then perhaps libc got EREs with backreferences which they bolted onto the existing infrastructure?
Anyway, thanks for the pointer... I may look into these more in the future.
musl libc also uses the TRE regex engine, which apparently uses the DFA algorithm.
The whole awk implementation is 958 lines, so it can't be that bad. It supports a fairly featureful regex language (though no capturing AFAICT).