
I was going to suggest a faster approach than regex, but then I saw you were not actually using regex to parse in the way I first thought.

I work in a different ___domain, and I find custom state machines to be the preferred way to do matching over large sets of data. Of course, something like compressed bitmap indices is also helpful.
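
For what it's worth, a minimal sketch (Go, using the RoaringBitmap library) of how a compressed bitmap index can narrow candidates before the exact matcher runs. The k-mers and record IDs are invented for illustration:

    package main

    import (
        "fmt"

        "github.com/RoaringBitmap/roaring"
    )

    func main() {
        // Which records contain each 3-mer (built once, at index time).
        // These k-mers and IDs are made up for the example.
        index := map[string]*roaring.Bitmap{
            "GAT": roaring.BitmapOf(1, 2, 5),
            "ATT": roaring.BitmapOf(2, 5, 9),
            "TTA": roaring.BitmapOf(2, 3, 5),
        }

        // A true match must contain every k-mer of the query,
        // so intersecting the bitmaps gives a small candidate set.
        candidates := index["GAT"].Clone()
        candidates.And(index["ATT"])
        candidates.And(index["TTA"])

        // Only these IDs need the expensive state-machine check.
        fmt.Println(candidates.ToArray()) // [2 5]
    }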




Yeah, I'm not too happy that we're using regex inside a Groovy script for the final matching. When we scale this to support larger matches we'll likely hit memory issues parsing the entire string at once, so I've been playing around with the idea of streaming the bases into a state machine. Is that similar to what you were referring to?
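
A hedged sketch of what that streaming could look like (Go; the pattern is hard-coded and all names are made up for illustration): bases are fed one byte at a time through a KMP automaton, so memory stays constant regardless of sequence length.

    package main

    import (
        "bufio"
        "fmt"
        "io"
        "strings"
    )

    // streamMatch pushes bytes from r one at a time through a KMP
    // automaton for pattern, so memory use is constant no matter how
    // long the input stream is.
    func streamMatch(r io.Reader, pattern string) (bool, error) {
        // fail[i] = length of the longest proper prefix of
        // pattern[:i+1] that is also a suffix of it.
        fail := make([]int, len(pattern))
        for i := 1; i < len(pattern); i++ {
            j := fail[i-1]
            for j > 0 && pattern[i] != pattern[j] {
                j = fail[j-1]
            }
            if pattern[i] == pattern[j] {
                j++
            }
            fail[i] = j
        }

        br := bufio.NewReader(r)
        state := 0 // number of pattern bytes matched so far
        for {
            b, err := br.ReadByte()
            if err == io.EOF {
                return false, nil
            }
            if err != nil {
                return false, err
            }
            // Fall back along failure links until b can extend a match.
            for state > 0 && b != pattern[state] {
                state = fail[state-1]
            }
            if b == pattern[state] {
                state++
            }
            if state == len(pattern) {
                return true, nil
            }
        }
    }

    func main() {
        found, _ := streamMatch(strings.NewReader("ACGTGATTACAGG"), "GATTACA")
        fmt.Println(found) // true
    }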

Do you have a way of generating state machines, or are they hand-coded? Is it a perf optimization, or does it help with integrating ___domain-specific logic?


When I first looked into the issue of regex performance, I started with this article:

https://swtch.com/~rsc/regexp/regexp1.html

To generate state machines:

For C there is re2c: http://re2c.org/

For Go there is re2dfa: https://github.com/opennota/re2dfa

My patterns do not change that often, so I end up hand-coding them.
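
For a sense of the shape such code ends up taking, a hedged hand-written sketch in Go: a table-driven DFA over {A,C,G,T} that accepts any string containing "ACG". The pattern is invented, and re2dfa's real output is structured differently.

    package main

    import "fmt"

    // Transition table. States: 0 = nothing matched, 1 = saw "A",
    // 2 = saw "AC", 3 = saw "ACG" (accepting, and sticky).
    var next = [4]map[byte]int{
        0: {'A': 1, 'C': 0, 'G': 0, 'T': 0},
        1: {'A': 1, 'C': 2, 'G': 0, 'T': 0},
        2: {'A': 1, 'C': 0, 'G': 3, 'T': 0},
        3: {'A': 3, 'C': 3, 'G': 3, 'T': 3},
    }

    // matches walks the input once, one table lookup per byte.
    func matches(s string) bool {
        state := 0
        for i := 0; i < len(s); i++ {
            state = next[state][s[i]]
        }
        return state == 3
    }

    func main() {
        fmt.Println(matches("TTACGA")) // true
        fmt.Println(matches("ACCGT"))  // false
    }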

If you are looking to continue using regex for matching, I would look at using the TCL regex library as it has some optimizations that other regex libraries do not have.


Apache Groovy was really just intended for scripting anyway, and not meant to be scaled. Someone in their development group added static compilation and Android executables in Groovy 2.x, but neither is widely used. Best just to use Groovy for its original purpose, scripting Java classes or as a DSL in Gradle, and convert to Java (or Scala, Kotlin, etc.) if you need to scale anything.


Can't speak for tmaly, but you might look at things like the Xerox Finite State Tools, Helsinki Finite State Tools, or KLEENE. They were developed with linguistics in mind, but are in fact general-purpose tools for generating and editing string-matching state machines.

It seems like every time I learn something new about bioinformatics, it just has more and more overlap with computational linguistics!


What were some of the query features that led you to choose Elasticsearch over pg_trgm? pg_trgm supports regular expressions out of the box.
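
For reference, a minimal sketch of the pg_trgm route (Go; the table, column, and connection string are hypothetical). Once a trigram GIN index exists, Postgres can use it to accelerate POSIX regex matches via the ~ operator:

    package main

    // One-time setup (names invented for the example):
    //   CREATE EXTENSION IF NOT EXISTS pg_trgm;
    //   CREATE INDEX seq_trgm_idx ON sequences USING gin (bases gin_trgm_ops);

    import (
        "database/sql"
        "fmt"

        _ "github.com/lib/pq"
    )

    func main() {
        db, err := sql.Open("postgres", "postgres://localhost/seqdb?sslmode=disable")
        if err != nil {
            panic(err)
        }
        defer db.Close()

        // The trigram index can be used to speed up this regex scan.
        rows, err := db.Query(`SELECT id FROM sequences WHERE bases ~ $1`, "GATT[AC]CA")
        if err != nil {
            panic(err)
        }
        defer rows.Close()
        for rows.Next() {
            var id int
            if err := rows.Scan(&id); err != nil {
                panic(err)
            }
            fmt.Println(id)
        }
    }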


We already use ES for the rest of Benchling search, so that was the main reason we stuck with ES (ngram indexing is pretty standard there).

As for why we were using ES for the rest of search: we were using it for things like language stemming, matching terms with "slop", things like {'minimum_should_match': '3<67%'} (require exact matches for 3 or fewer terms, 67% if more than that), searching _all to match anything in the document, etc. I think you can do a bunch of these (maybe even all) in PG, but it was way easier to get going with ES.
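
To make that concrete, a hedged sketch of the kind of query body meant here (the index and field names are invented, not Benchling's). It is plain Elasticsearch query DSL, held in a Go string and POSTed to _search:

    package main

    import (
        "fmt"
        "net/http"
        "strings"
    )

    // "slop" lets phrase terms sit a few positions apart; the
    // minimum_should_match rule "3<67%" means: with 3 or fewer
    // optional clauses all must match, otherwise 67% of them must.
    const query = `{
      "query": {
        "bool": {
          "should": [
            {"match": {"_all": "plasmid sequence editor"}},
            {"match_phrase": {"title": {"query": "sequence editor", "slop": 2}}}
          ],
          "minimum_should_match": "3<67%"
        }
      }
    }`

    func main() {
        resp, err := http.Post("http://localhost:9200/docs/_search",
            "application/json", strings.NewReader(query))
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()
        fmt.Println("status:", resp.Status)
    }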

ES is also distributed, which makes scaling up and doing maintenance a lot easier; a bunch of times we just threw new nodes at it and removed old nodes that had gone bad.


I don't quite understand this comment. Isn't a regular expression just a description of a custom finite state machine? Any fast regex implementation will first compile the pattern into a custom DFA before matching it.


In practice, most "regex" engines implement features more powerful than regular expressions, and I'd say the majority use backtracking rather than a DFA. Backtracking does tend to be slower in my experience, but PCRE2's JIT is definitely competitive with even the fastest DFA engines.
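
One easy way to see the difference: Go's regexp package uses the RE2 approach and never backtracks, so a classic backtracking killer still finishes in linear time.

    package main

    import (
        "fmt"
        "regexp"
        "strings"
        "time"
    )

    func main() {
        // In a naive backtracking engine this pattern can take
        // exponential time on a near-miss input; here it stays linear.
        re := regexp.MustCompile(`^(a+)+$`)
        input := strings.Repeat("a", 1000000) + "b" // forces a failed match
        start := time.Now()
        fmt.Println(re.MatchString(input), time.Since(start))
    }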



