GNU grep is 10x faster than Mac grep (jlebar.com)
122 points by jlebar on Nov 28, 2012 | hide | past | favorite | 65 comments



I'm not trying to start a theological war about grep/ack here, I'm just mentioning it in case someone hasn't heard about 'ack' before and they (like me) might find it extremely useful: http://betterthangrep.com

It's grep, just better. It highlights the matched text; it shows the file and line where each match was found (in vivid colors so you can distinguish them easily); it ignores .git and .hg directories (among others that shouldn't be searched) by default; you can tell it to search only, say, `--cpp`, `--objc`, `--ruby`, or `--text` files (with a flag, not a filename pattern); and it has many, many other neat features that I'm sure grep has too, but you'd have to memorize them. ack has sensible defaults.
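
For instance (a quick sketch using ack's documented type flags):

    ack --ruby 'def initialize'    # search only Ruby files
    ack -i --cpp mbr               # case-insensitive, C++ files only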

Why ack? http://betterthangrep.com/why-ack/

manpage: http://betterthangrep.com/documentation/

Oh, and ack is written in perl and doesn't require admin privileges to install.


Do you know of any C ports of ack? Ack is beautiful and productive, but nowhere near as fast as grep (orders of magnitude slower, in fact).

    gfind . -type f -exec grep -i mbr {} \; >| /dev/null  
    1.10s user 0.81s system 90% cpu 2.113 total

    gfind . -type f -exec ack -i mbr {} \; >| /dev/null  
    24.34s user 4.17s system 96% cpu 29.678 total
(Yes, I know about the flag to search recursively. This is the fairest comparison.)


I wrote a mostly-clone of Ack in C: https://github.com/ggreer/the_silver_searcher . Output format and most flags are the same. Besides the speed, most users won't notice a difference.

I spared no effort in optimizing. Pthreads, mmap(), Boyer-Moore-Horspool strstr(): it's all there. Searching my ~/code (5.2GB of stuff), I get this:

    ag blahblahblah  1.93s user 3.54s system 313% cpu 1.749 total

    ack blahblahblah  9.75s user 2.79s system 98% cpu 12.690 total
Both programs ignore a lot of extraneous files by default (hidden files, binary files, stuff in .gitignore, etc). The real amount of data searched is closer to 500MB.


Looks good, but from the docs I can't tell if it supports the second most useful feature of ack, that is, scoped search:

    ack --ruby --js foo_bar
will search only ruby and javascript files, which means .rb+.erb+.rhtml+.js+...

Also exclusion with --no-* is very useful (especially --no-sql).

This is markedly different from 'simply' ignoring irrelevant files; it also doesn't need a 'project' to work (ack --ruby foo_func $(bundle show bar_gem)).

The best part is that it's extensible, so I can create a --stylesheets type covering css+sass+scss+less, or add, say, .builder to --ruby.
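
For instance, a minimal ~/.ackrc sketch (assuming ack 1.x's --type-set/--type-add syntax; check your version's docs):

    --type-set=stylesheets=.css,.sass,.scss,.less
    --type-add=ruby=.builder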

(BTW, love the name/command)


One thing, though: Ack's core strength will always evolve with, and depend on, Perl's regular-expression and text-processing powers.

So rewriting it in C fundamentally means endlessly growing a C program that looks more and more like the Perl implementation, or a Perl DSL.

Not that it's a bad thing; I find it interesting, though. I'd say you're better off starting with a specification.


Ag supports the same regexes as Ack. I use the PCRE library. I only call pcre_study once, and I use the new PCRE-JIT[1] on systems where it's available. These tweaks add up to a 3-5x speedup over Ack when regex-matching.

1. http://sljit.sourceforge.net/pcre.html


If you use PCRE, you do NOT support the same regexes as Ack.

"Perl Compatible" isn't really Perl compatible, see http://en.wikipedia.org/wiki/PCRE for details.


Yes, there are a few edge cases, but hardly anyone uses those features. In fact, 90% of the time, people seem to use literal string matching.


This looks fantastic. Could you by any chance update your PPA for Quantal in the future?

EDIT: The last Precise build works just fine, though.


Thanks so much. I was staying with grep precisely because of its performance and Ack's Perl dependency. Does the silver searcher compile on win32 as well?


Per the README on the github page, instructions for building ag for Windows are here:

https://github.com/ggreer/the_silver_searcher/wiki/Windows

The author forewarns that "[i]t's complicated".


Since I added pthreads, there's no chance that it builds on Windows anymore. I don't have a Windows machine or VM to test stuff out on. Patches welcome, though!


Did you benchmark read() vs mmap()? Most tools seem to go with read() for grep-like I/O patterns.

In fact, it looks like GNU grep has a --mmap switch, and it's a little faster than the default in the simple case on my Ubuntu system. But -i makes mmap slower. Maybe GNU grep avoids mmap by default because of error handling (you get a segfault/bus error instead of an I/O error return when things go wrong).
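
For anyone who wants to reproduce this, something like the following works (note that later GNU grep releases reportedly turned --mmap into a no-op, so results depend on version):

    time grep -r pattern /big/tree > /dev/null          # default read() path
    time grep -r --mmap pattern /big/tree > /dev/null    # mmap() path
    time grep -ri --mmap pattern /big/tree > /dev/null   # the -i case that got slower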


Feel free to correct me, as I'm too lazy to test this, but I don't think that's the fairest comparison. I would do

  ack -i mbr > /dev/null
I think that starts up Perl once, not once per file. If so, the timing should be much better.

ack searches recursively by default; I don't think it can search non-recursively (why would you want to? That is what grep is for)

Also: try comparing grep and ack in a directory tree that has 'garbage' such as .svn or .git directories or .o files.
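
Something like this makes the difference obvious (a sketch):

    time grep -r pattern .    # wades through .git/, .svn/ and *.o files
    time ack pattern          # skips them by default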


No, it's not a fair comparison.

For every single file found, you start ack again. You're comparing startup times here. Ack is so slow here because it's a Perl script: for every single file, you start the Perl interpreter, and the interpreter compiles and interprets ack every time.


It is totally not fair that perl has a slow startup time and doesn't run as a daemon. Grep is cheating by not pulling in a huge runtime.


I think the point stands.

First, without knowing the makeup of his files, you can't tell how much of a corner case this is. It could be 100K small files or 10 large ones. Few care about runtimes for small files, but many care about runtimes for large ones.

Also, and probably more importantly, you'd use ack differently in a recursive-find situation. You just "ack" from the top of the tree. The perl interpreter starts only once.

I don't think this is a useful benchmark for typical uses of ack.


I'm not on a machine I can freely test on, but maybe you shouldn't use -exec? Use xargs instead. I don't remember ack being that much slower than grep...
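
Something like this batches many files per ack invocation, so Perl starts only a handful of times (a sketch, assuming GNU findutils):

    gfind . -type f -print0 | xargs -0 ack -i mbr > /dev/null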


I usually use the function below; it's handy if you grep code a lot.

    function g! { grep -nr --colour --exclude-dir=.git --exclude-dir=.hg --exclude-dir=.svn --include="*.$1" "$2" "${3:-.}"; }
And then:

    g! py "some python code"
https://coderwall.com/p/uhzc0a


You can tweak some git grep config settings to get the ack UI, and because it's git grep, you get most of the code conveniences as well as the speed.

http://travisjeffery.com/b/2012/02/search-a-git-repo-like-a-...
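
The gist, if you don't want to click through (a sketch; --break and --heading need a reasonably recent git):

    git config --global alias.g 'grep --break --heading --line-number'
    git g foo_bar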


Hey! Glad you found my post useful!


So did I. It's excellent.


Now, if we need something 3x-5x faster than Ack... https://github.com/ggreer/the_silver_searcher


This comes highly recommended.


> It highlights the selected text,

    grep --color
> it shows which files, and in what line the text was found (and uses vivid colors so you can distinguish them easily),

    grep -rn --color pattern ./files/
    files/foo.sh:123:    echo "Look at the floral pattern on this dress!"
> ignores .git and .hg directories (among others, that shouldn't be searched) by default,

    grep -r --exclude-dir=.git --exclude-dir=.hg --exclude-dir=.svn
> you can tell it to search, for example for only `--cpp` or `--objc` or `--ruby` or `--text` files (with a flag, not a filename pattern),

You would use `find` in conjunction with `grep`. "Art of Unix Programming", modularity, and all that jazz. Presumably you would just modify your own grep alias or define a function to avoid retyping. The end result pretty much looks like my grep alias:

    alias grep='grep -Ein --color --exclude-dir=.git --exclude-dir=.hg --exclude-dir=.svn'
I still fail to see a reason to use ack, especially when I can assume grep is always available for portability.
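
As for the find-in-conjunction-with-grep point above, ack's scoped search is roughly (a sketch, assuming GNU find/xargs):

    find . \( -name '*.rb' -o -name '*.erb' \) -print0 | xargs -0 grep -n foo_bar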


> ...and many many other neat features that I'm sure grep has, but you have to remember and memorize them. ack has sensible defaults.

That's why.


For a similar but much faster tool than ack, which simply wraps `find` and `grep` in the UNIX tradition, see:

http://www.pixelbeat.org/scripts/findrepo


> doesn't require admin privileges to install.

To be fair, neither does GNU Grep - just do `make' (without `make install') and you're good to go.
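
A sketch of that, assuming the usual autotools tarball layout:

    tar xf grep-2.14.tar.xz && cd grep-2.14
    ./configure && make
    src/grep --version    # runs straight from the build tree, no install needed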


Not trying to start a war here either, but is it faster?


It's faster than BSD grep but (in my experience) slightly slower than GNU grep. The productivity boost, though, is enormous.


'why GNU grep is fast' from the FreeBSD mailing list: http://lists.freebsd.org/pipermail/freebsd-current/2010-Augu...



Which was discussed previously on HN, here: http://news.ycombinator.com/item?id=1626305


> The key to making programs fast is to make them do practically nothing.


I just replicated the test and I can confirm the FreeBSD grep compiled on Darwin is about 30x slower.

    % /usr/local/bin/grep --version         
    /usr/local/bin/grep (GNU grep) 2.14
    <snip>

    % time find . -type f | xargs /usr/local/bin/grep 83ba
    find . -type f  0.01s user 0.06s system 8% cpu 0.870 total
    xargs /usr/local/bin/grep 83ba  0.66s user 0.31s system 95% cpu 1.017 total


    % /usr/bin/grep --version                 
    grep (BSD grep) 2.5.1-FreeBSD

    % time find . -type f | xargs /usr/bin/grep 83ba 
    find . -type f  0.01s user 0.06s system 0% cpu 28.434 total
    xargs /usr/bin/grep 83ba  31.65s user 0.40s system 99% cpu 32.113 total


There was also some discussion about this on one of the Apple mailing lists a few months ago, and it turns out there are major differences in how the two grep implementations on OS X interact with the buffer cache. In particular, empirical evidence suggests 10.6's GNU grep build caches its input, while 10.7+ BSD grep does not.

Incidentally, on OS X, you can commonly get another order of magnitude improvement over even GNU grep with Spotlight's index: use xargs to grep only through files that pass a looser mdfind "pre-screen".
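
A sketch of the idea (the exact metadata query is an assumption; see `man mdfind`):

    mdfind -0 'kMDItemTextContent == "*pattern*"c' | xargs -0 grep -n pattern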


I notice these Mac tools becoming a bit stale. sort is derived from GNU sort, but from some ancient version. I guess this might be due in part to these tools now being GPLv3?


Almost certainly. Apple stopped updating their tools past the GPLv2 versions, with the most noticeable example being gcc, which was frozen at 4.2 until they removed it.



This may also be because the default grep, i.e. BSD grep, actually pays attention to the LANG environment variable. The default on OS X is en_US.UTF-8.

If the author were to set LANG to C, he would find that BSD grep suddenly speeds up tremendously.
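
Easy to check without touching your shell config (a sketch):

    time /usr/bin/grep -r pattern .               # LANG=en_US.UTF-8, slow
    time env LANG=C /usr/bin/grep -r pattern .    # byte-oriented locale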


GNU grep certainly honors locale settings, and recent versions are fast even when you're using UTF-8 (since release 2.7 or so).


Hmm, interesting. Work is being done to make BSD grep faster, so hopefully the two will be on par in the near future.


For those using homebrew:

    brew install https://raw.github.com/Homebrew/homebrew-dupes/master/grep.rb


If you tap the `homebrew-dupes` repository, you will get updates in the future:

    brew tap homebrew/dupes
    brew install grep


Are there any other utils worth installing from that tap? Awk? OpenSSH?


The repository with all of the formulas is here: https://github.com/homebrew/homebrew-dupes

The two I end up using semi-frequently are gcc and apple-gcc, for those projects that Clang just won't compile.


You should tack an "LC_CTYPE=C" in front of grep to get comparable results. A multibyte CTYPE can slow grep down by up to a factor of 30.


Is speed really that much of a concern with grep? I typically use :vimgrep inside of vim, not because it's faster (it's orders of magnitude slower due to being interpreted vimscript), but because I hate remembering the differences between pcre/vim/gnu/posix regex syntax.


I regularly search my whole Firefox clone for keywords. If this takes 2s, that's plenty fast; if it takes 20s, I'd have to come up with some other way of doing it.


Ctags?


Firefox is quite complicated; we have code written in C, C++, JS, Python, Make, m4, plus at least three custom IDL formats. grep handles these with ease.


I use grep in some pipelines to bulk-process data, because if you have a fast grep, using it to pre-filter input files to remove definitely-not-matching lines is one of the quickest ways to speed up some kinds of scripts without rewriting the whole thing. And in that case, sometimes processing gigabytes+ of data, it's nice if it's fast.

One common case: I have a Perl script processing a giant file, but it only processes certain lines that match a test. You can move that test to grep, to remove nonmatching lines before Perl even hits them, which will typically be much faster than making Perl loop through them.

Say your script.pl is doing something like:

    next unless /relevant/;
You can replace that with:

    grep "relevant" filename | perl ./script.pl


At the scale you're talking about (10GB+ files), it's far more efficient to put primitive filtering in the application generating the lines in the first place. You pay two penalties for using grep: having another process touch the data, and having to generate the superfluous lines at all.


This doesn't work if you're processing logs. You might need those other lines in other places.


In this case, alas, I'm processing third-party data I didn't generate, so one way or another I have to scan through it at least once.


    :vimgrep
is slower because it loads each file into memory, with all the filetype-specific stuff running each time, before doing the actual searching.
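
Vim's :grep, by contrast, shells out to the external 'grepprg' program (plain grep by default), so it skips the per-file buffer overhead. A sketch:

    :set grepprg=grep\ -rn\ $*\ .
    :grep pattern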


Obviously this means Linux is 10x faster than Mac, ha!

Seriously though, it's really amazing what performance they squeezed out of that tool. It's always amazing to grep through gigabytes of files in a few seconds.


I once ran a sed script over a couple million text files (60 GB in total). They were web pages downloaded in some format (WARC? I don't remember what it was called), and I needed to change the formatting slightly (to feed them to Nutch). Mac's default sed was literally 50 times slower than gsed (on the same machine). If I remember correctly, gsed finished the task in under two hours.


Just tried on Snow Leopard: not quite 10x, but nearly 2x faster, certainly. (Admittedly, my Firefox checkout is Mercurial, and hg locate seems to pass something invalid to xargs half way through, but I guess the first chunk of files is the same.)

Someone commented on the article that this might be caused by omitting the -F flag; I tried this, and -F makes both versions slightly faster again.


Does "git grep" use a system grep or does it implement grep on its own?



That seems to be the "command infrastructure" for the grep builtin. The actual grep engine is in https://github.com/git/git/blob/master/grep.c.


The article has an answer to that. I'm not sure if you're challenging the article's answer or if you just missed it...


Oops, missed that part of the article, thanks!


Or...

  brew install ack



