pcre_exec()'s length and offset parameters are ints, so there's not much I can do about files over 2GB. I really don't want to split the file into chunks and deal with matches across boundaries. That's just asking for bugs. I guess I could make literal string searches work, at least on 64-bit platforms.
Honestly though, I don't think ag is the right tool for that job. For a single huge file, grep is going to be the same speed. Possibly faster, since grep's strstr() has been optimized for longer than I've been alive.
I gave some thought to the right tool for the job of searching DNA.
DNA files don't change very often, which makes building an index worthwhile. Apparently, sequencing isn't perfect and neither are cells, so you'd want fuzzy matching. But repeats in DNA are also common, so that means fuzzy regex matching. There is already a fuzzy regex library[1], but I have no idea how fast it is. If the application requires performance above everything, an n-gram index sounds like the right tool for the job.
After writing the paragraph above, I searched for "DNA n-gram search." The original n-gram paper from 2006 used DNA sequences in its test corpus.[2] I don't know much about DNA or the applications built around it, so I'm glad I managed to recommend a tool that was designed for the job.
I built ag for myself, both as a tool and as a way to improve my skills at profiling, benchmarking, and optimizing. Had I known how popular it would become, I would definitely have held myself to a higher standard, or any standard. Most importantly, I'd have written tests. These days, I'm busy with a startup, so progress on those fronts has been slow.
ag is incredible, especially paired with Ack.vim and a mapping. I use <leader>as to search for the current word under the cursor. The results are instantaneous. With ag and YouCompleteMe, I never fall back to cscope/ctags in C++ projects anymore.
One thing, though: it skips certain source files seemingly arbitrarily without the -t param, and I haven't figured out why... It doesn't seem related to any .gitignore entries I've been able to identify.
The Silver Searcher is pretty good, but it has a couple of big problems. It does not parse .gitignore correctly [0], so it frequently searches files that are not committed to your repo. This, combined with the decision to print 10,000-character-long lines, means a lot of search results are useless.
I noticed the issue you mentioned, but as the last comment notes, I believe this has already been fixed. My specific case, at least, was resolved by updating from master.
One thing I miss a little is that ack has the super convenient:
ack --java "foo"
while with ag you write:
ag -G"\.java$" "foo"
But yes, ack and ag feel pretty identical except for the speed. Most of the time the speed improvement is irrelevant to me, except sometimes now I'll use ag in my home folder, and it's still fairly snappy.
That was too much typing anyway. When you mostly work with one language, something like this is nice (in my case C/C++):
alias ack-cpp='ack-grep --type=cpp --type=cc'
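With ag you can get most of the way there with a tiny wrapper function; a rough sketch (the name agt is made up, and it only handles a single extension):

agt() {
  # hypothetical helper: `agt java foo` runs `ag -G '\.java$' foo`
  ext="$1"; shift
  ag -G "\.${ext}$" "$@"
}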
Hm, I've recently begun using zsh primarily and this trick doesn't work there: zsh lets you know what the alias is... bash will happily find `rack` in your `$PATH` and then run it.
(Presumably because in zsh, `which which` says it's a shell built-in, whereas in bash it finds `/usr/bin/which`, so bash doesn't seem to care about your aliases.)
I normally tell people to use ack because it's like grep but faster (owing to its sensible defaults) ... if I use this I'm worried I might go too fast and travel backwards in time or something.
In my benchmarking, mmap() was about 20% faster than read() on OS X, but the same speed on Ubuntu. Pretty much everything else in the list (pthreads, JIT regex compiler, Boyer-Moore-Horspool strstr(), etc) improves performance more than mmap().
Also, mmap() has the disadvantage that it can segfault your process if something else makes the underlying file smaller. In fact, there have been kernel bugs related to separate processes mmapping and truncating the same file.[1] I mostly use mmap() because my primary computer is a Mac.
Now I'm burning with curiosity. I have to know why! My plan:
- replicate the experiment, confirm --mmap shaves off a non-negligible amount of time (a rough sketch of this is after the list). It could be that his computer happened to be running something in the background that was hitting his hard drive, for example, which would skew the results.
- look at the code, figure out the exact difference between what --mmap is doing and what it does by default. Confirm that the problem isn't in grep itself (it's probably not, but it's important to check).
- dig into the kernel source to figure out the difference under the hood and why it might be faster.
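For step one, something like this is roughly what I have in mind (it assumes a grep build where --mmap is still honored -- newer GNU greps quietly ignore the flag -- and a Linux box for the cache drop):

# build a big test file, then time cold-cache runs with and without --mmap
yes 'some line that will not match' | head -c 1G > big.txt
sync; echo 3 | sudo tee /proc/sys/vm/drop_caches
time grep -c needle big.txt
sync; echo 3 | sudo tee /proc/sys/vm/drop_caches
time grep --mmap -c needle big.txt

Repeating the two timings without dropping caches gives the warm-cache comparison, which is probably the more interesting one for mmap() vs read().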
I wonder if it has to do with not having to copy data back and forth between kernel and userspace. My mildly uneducated thought is that you could do this with splice() or whatever, but mmap is an easy drop-in replacement.
edit: I've been reading your posts for a while and I like them, but I keep wondering, why do you have sillysaurus1-2-3?
That's what has me so curious, because it doesn't seem like copying between kernel/userspace should account for a 20% speed drop. Once data is in the L3 CPU cache, it should be inexpensive to move it around.
Regarding my ancestry, I'm sillysaurus3 because I've (rightfully) been in trouble twice with the mods for getting too personal on HN. I apologized and changed my behavior accordingly, and additionally created a new account both times to serve as a constant reminder to be objective and emotionless. There's rarely a reason to argue with a person rather than with an idea. Debating ideas, not people, has a bunch of nice benefits: it's easier to learn from your mistakes, it makes for better reading, etc. It's pretty important, because forgetting that principle leads to exchanges like https://news.ycombinator.com/item?id=7700145
Another nice benefit of creating a new account is that you lose your downvoting privilege for a time, which made me more thoughtful about whether a downvote is actually justified.
Possibly the OS is doing interesting things with file access and caching and opting out of that has benefits for this particular workload?
...
I just skimmed the BSD mailing list email on why grep is fast that was linked up-thread, and it seems that's somewhat the case. It sounds like, since they do advanced search techniques on what matches or can match, they use mmap to avoid requiring the kernel to copy every byte into memory when they know they only need to look at specific ranges of bytes in some instances. At least that was the case at some point in the past.
> Finally, when I was last the maintainer of GNU grep (15+ years ago...), GNU grep also tried very hard to set things up so that the _kernel_ could ALSO avoid handling every byte of the input, by using mmap() instead of read() for file input. At the time, using read() caused most Unix versions to do extra copying.
P.S. Nice attitude, it earned an upvote from me. Which is probably one reason why your third account has more karma than my first.
Right, I think the point of Boyer-Moore is that it lets you eliminate / skip large chunks of the text during the search.
So the assumption is that those pages never even get paged in, but I think that'd only be the case when the pattern size is at least as large as the page size (usually 4KB!), which is not the case in the example in the mailing list. So the mystery continues!
The last time I had to do fast, large sequential disk reads on Linux, it was surprisingly complex to get all the buffering/caching/locking to not do the wrong thing and slow me down a lot. I wouldn't be surprised if non-optimized mmap() is a whole lot faster than non-optimized use of high-level file I/O libraries.
If anything, that post is evidence of how tricky optimization is, and how easy it is to fool yourself about what matters. It's probably best to be skeptical about mmap() as a performance optimization over reading into a buffer unless evidence demonstrates otherwise. Most OS's do a pretty good job of caching at the filesystem level, and under the hood paging is essentially reading into a buffer anyway. mmap() might make the code simpler, but it's hard to imagine it makes it faster. If it does, I'd like to understand why.
So are we talking about constant-time optimization, then? I.e. it shaves off a few milliseconds regardless of how complex the search is, or how many files it's reading, or how large each file is. I'll happily concede that mmap() might do that. But a performance boost linear w.r.t. search complexity/number of files/filesize? Hard to believe, and I should go measure it myself to prove the point or learn why I'm mistaken.
Constant-time improvements are still improvements, especially if they're in an inner loop. Otherwise we would all be using Python and just writing great algorithms.
I use ag[2], which is pretty much the same as ack, but even faster. The other day I was using it to find all instances in all projects of a list of problematic method names[1], in case anyone wants to see a real world use case.
The only annoyance with ag is that it does not have ack's quick filters, e.g. ack --py versus ag -G '\.py$' (and ack's type flags can include multiple file extensions).
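The multi-extension case can still be expressed through -G's regex, just more verbosely, e.g.:

ag -G '\.(js|jsx)$' foo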
For Java programmers who use Silver Searcher or ack, this lets you search all jars in a directory tree for a given string. Requires GNU Parallel:
function ffjar() {
  jars=(./**/*.jar)
  print "Searching ${#jars[*]} jars for '${*}'..."
  # parallel's --tag prefixes each output line with the jar name,
  # so $1 is the jar and $5 is the entry inside it
  parallel --no-notice --tag unzip -l ::: "${jars[@]}" | ag "${*}" | awk '{print $1, ":", $5}'
}
Because it uses parallel it spreads the workload across CPUs. I use this frequently when I have to update/rewrite/create build scripts, and I know a class exists but not which jar file it lives in.
`xargs` also has a `-P` flag which will instruct it to spread work over multiple processes. Given that you already have `-n1`, adding `-P 0` will have it run as many invocations in parallel as it can.
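For example, a rough xargs-only version of the jar search above (GNU findutils assumed; note you lose parallel's --tag, so matches aren't labeled with the jar they came from):

# SomeClassName is just a placeholder pattern
find . -name '*.jar' -print0 | xargs -0 -n1 -P 0 unzip -l | ag SomeClassName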
Yeah, I think it's important not to throw grep away. Ack is really for when you don't know or can't be bothered to explicitly mention the (several) specific files to search.
For some reason, it took me one or two minutes of rereading to realize it was ack, not awk. I thought this website was going to be some ironic trash-talking about grep. Then I saw "written in Perl" and I got so confused my head almost exploded.
> I thought this website was going to be some ironic trash-talking about grep.
Andy Lester, the primary author of ack, is one of the nicest guys I know of. You wouldn't see any trash-talking on that site. He even changed the name of the site from "better than grep" to "beyond grep" [1].
In fact, he gives props to similar tools like ag and others [2].
In my case, my IDE is the command line. ack is one of its plugins. The built-in plugins are also OK (find, ls, mv, etc.). I can create my own plugins for my IDE, and there are even package managers to install new plugins (yum, apt-get).
I've often wondered about that. For me, the development environment is extremely minimalistic by some standards: Linux itself, including tools like ack + vim. I use various vim tricks, though not to the point of it being my de facto OS (as is possible to do!).
From what I can observe, I am generally faster than my co-workers. But it's possible that with a great IDE I could be faster yet. I don't feel any tug to leave, but that could just mean I'm ignorant of a truly better way.
It's better if you don't already have the IDE for that project open. Or if you're searching a project that doesn't come with project files for your IDE of choice. Or if you want to pipe the results. I work a lot with IDEs but still use ack-grep regularly.
I seem to find things faster than my coworkers. The ability to quickly filter out irrelevant files and do nested searches on the results is the strong point. Unix as an IDE and all.
These are just a few examples I do pretty frequently:
Nested Search:
ag functionName | ag moreSpecificContextLikeArgs
Find variable changed yesterday:
git log -p --since yesterday | ag varName
Find controllers changed yesterday:
git log --oneline --name-only --since yesterday | ag controllers
What files did I work on last week:
git log --name-only --oneline --author me --since 1.weeks
How many JS file changes did I make last month?
git log --since 1.months --author me --name-only | ag -i '\.js$' | wc -l
How many changes did I make to each JS file last month?
git log --since 1.months --author me --name-only | ag -i '\.js$' | awk '{arr[$1]++} END {for(i in arr) print arr[i]," - ",i}' | sort -r -n
Change a "classname" from MyClass to BetterName:
ag MyClass # verify it only finds what you think it will
ag MyClass | awk -F':' '{print $1}' | sort | uniq | while read -r line
do
  # GNU sed shown; BSD/macOS sed wants -i '' instead of -i
  sed -i 's/MyClass/BetterName/g' "$line"
done
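If your ag has -l (list matching files, which recent versions do), a shorter variant of the same idea is possible; a rough sketch assuming GNU sed and no spaces in the file paths:

ag -l MyClass | xargs sed -i 's/MyClass/BetterName/g'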
My ex-colleague introduced this to me and I thank him every time I use ack. It is really so much better than grep.
I have set up a bunch of aliases to search by file type and it makes me so productive.
I used Ack a lot when I was coding Perl (it's been a while). After I switched to Ruby, I used rak [1], which seemed easier to use most of the time, and nearly identical.
However, when you just want to find stuff fast, it's annoying to have to deal with Perl/CPAN or RVM/Rubygems, especially when the dependencies are not installed on your server/workstation.
That's why I've switched to silver searcher (ag) [2], as it can be installed with any OS package manager (brew, apt, yum).
ag is not available as a package on Debian stable. ack is, though, as ack-grep. So if you don't want to mess with CPAN, that's fine. The non-CPAN instructions are right on the website.
The problem with such tools is often their lack of ubiquity. I don't want to start using ack, forget a lot of my grep knowledge, only to ssh into a server and need grep.
The benefit of grep's ubiquity outweighs any small advantage ack has in usability.
ack is a tiny Perl script that you can simply wget and add to your path. I hear what you are saying, and I think it applies to a lot of utilities, but not ack. IMHO ack is so much better than grep that it is worth the hassle of having to install it every now and then.
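The single-file install amounts to something like this (the version number in the URL is only an example -- check beyondgrep.com for the current one):

# example URL/version; see beyondgrep.com for the real link
wget -O ~/bin/ack https://beyondgrep.com/ack-2.14-single-file
chmod 0755 ~/bin/ack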
There's a list of other tools for searching source code besides ack at http://beyondgrep.com/more-tools/, including other grepalikes and indexing tools like ctags and cscope.
I suggest that you need not limit yourself to only one tool for your code searching. Toolboxes FTW.
Grep is just as good, and with the recent order-of-magnitude speed improvement for non-C locales -- see https://lwn.net/Articles/586899/ -- which may not have made its way into distros yet, it's easily the best option.
I have a simple wrapper over egrep (see https://github.com/sitaramc/ew ) that adds those little extras (ignoring binary files, ignoring VCS directories...).
I'm sure it's improved since the days I tried it, but I tend to be permanently prejudiced against tools where the author can't/won't document the file selection logic and says "there's really no English that explains how it works" when someone asks.
Ack is great, but watch out if you have any source files with unusual file name extensions. Ack will only search file types it knows about. Also if you have your whole source tree in your editor or IDE, then you may as well search there instead.
Addressed in ack's FAQ [0], and in its own section of the manual [1].
The manual explains: "This is done with command line options that are best put into an .ackrc file - then you do not have to define your types over and over again." Then comprehensively describes options for both command line and .ackrc.
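For example, a couple of lines like these in an .ackrc do it (ack 2.x syntax; the type names and extensions here are just illustrative):

# define a brand-new type, and add extra extensions to an existing one
--type-set=proto:ext:proto
--type-add=cpp:ext:cxx,hxx

After that, `ack --proto foo` works like the built-in type flags.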
cscope and ctags are language-syntax-aware search tools for C-like programming languages. They let you search specifically for all instances of a function named 'foo', for example. ack is instead just a normal pattern matcher like grep, except that it has some cleverness by which it knows not to search certain file types and directories. It will return all lines that match a string rather than just variable names or functions.
cscope lets you search for arbitrary text strings and egrep patterns as well.
"The fuzzy parser supports C, but is flexible enough to be useful for C++ and Java, and for use as a generalized 'grep database' (use it to browse large text documents!)"[0]
They solve different problems. Ctags and cscope index a corpus of source code, usually tied into another tool, like Ctrl-] in vim. ack searches the files every time.
It depends what you mean... as others have mentioned[1], neither ack nor ag is particularly fast compared to grep; they just give you a lot of specialized context (searching the right files). As such, what would be to find what ack is to grep? A find that automatically filters out files that are not source code files?
[1] Things might have changed since the last time I personally tried this; at the time grep was significantly faster, especially for fixed-string searches -- but then again, I never tried to cobble together a command line that gave the same kind of output that ack/ag does (which could probably be hammered out with the help of awk). So don't take my comment to suggest that these tools aren't valuable, just maybe not for the reason some people (notably not the authors of said tools) claim.
> find that automatically filters out files that are not source code
Not just that, but an extensible set of file-type filters that are simple to invoke is what I had in mind. E.g., the tool would let you perform searches like
find++ --Python projects/archive/200?
or
find++ --video trailer
where in the latter case the hypothetical find++ would refer to my config to get a list of video file extensions and then print a list of all files in the current directory and its subdirectories with the word "trailer" in their name. For better effect it would ship with useful filters like "--video" by default.
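A minimal sketch of that hypothetical find++ (the name, option handling, and extension list are all made up, and it only covers the name-search case; GNU find assumed for -regextype):

findpp() {
  local exts
  case "$1" in
    --video)  exts='mp4|mkv|avi|mov' ;;   # a real tool would read this from config
    --Python) exts='py' ;;
    *) echo "unknown type: $1" >&2; return 1 ;;
  esac
  find . -regextype posix-extended -type f -iregex ".*${2}.*\.(${exts})"
}

So `findpp --video trailer` lists files with a video extension and "trailer" somewhere in the path.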
Right. It's not entirely straightforward to link up the MIME database (via e.g. file) and generate filters for use by find. Basing filters off of filenames isn't a very good idea -- and actually a little regressive in my opinion -- after all project/bin/foo (executable) might be a python or perl or whatever script -- not just a binary file.
But first getting all files via find, then testing with file, and finally matching against MIME type doesn't sound like something that's going to be as fast as possible...
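For reference, that slow path is roughly this (GNU find/file assumed; "trailer" is just the example term from above):

find . -type f -exec file --mime-type {} + |
  awk -F': *' '$2 ~ /^video\//{print $1}' |
  grep -i trailer

It works, but it runs file over everything and chokes on paths containing ": ", which is part of why it doesn't feel like the right long-term answer.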
I tried to see if maybe gvfs (gio - GNOME I/O) could help, but couldn't really find anything directly applicable (although there is a set of gvfs command-line tools, like gvfs-ls, gvfs-info, gvfs-mime).
> after all project/bin/foo (executable) might be a python or perl or whatever script -- not just a binary file.
That's one of the big features of ack that the find/grep combo can't replicate: checking the shebang of the file to detect its type. In ack's case, Perl and shell programs are detected both by extension and by shebang line.
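For illustration only (this isn't ack's actual code), the shebang half of that check amounts to something like:

find . -type f ! -name '*.*' | while read -r f; do
  head -n1 "$f" | grep -q '^#!.*perl' && echo "$f"
done

i.e. peek at the first line of extensionless files and see which interpreter they ask for.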
I'd prefer checking the magic numbers in general (or resource forks) -- and listing based on MIME types -- rather than just shebang/extension. I'm sure there are frameworks ready for doing this -- both GNOME and KDE (among others) have been working on this for a while. You need it to be able to display (correct) file icons, for example. And once one goes down that route, it might be beneficial to leverage one of the frameworks for file search (from the locate db to something based on xapian or what-not) -- rather than find-style traversal.
I suppose this might be too late, but it might be worth having a look at tracker[1] and tracker-search[2]. Alternatives include recoll and Beagle (now defunct?).
I have a fairly simple alias that does a find but excludes directories like .svn, .git, etc., and a separate one that excludes common binary extensions as well (.o .fas .fasl etc.).
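For anyone who wants something similar, roughly this works (GNU find syntax; the alias names and extension list are just examples):

# plain find, minus VCS directories
alias f='find . \( -name .git -o -name .svn \) -prune -o -type f -print'
# same, also skipping some common binary extensions
alias fb='find . \( -name .git -o -name .svn \) -prune -o -type f ! \( -name "*.o" -o -name "*.fas" -o -name "*.fasl" \) -print'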