If you use GNU grep on text files, use the -a (--text) option (utcc.utoronto.ca)
203 points by rurban on April 21, 2020 | 98 comments



I just looked through the GNU grep history to see when it suddenly started being able to decide halfway through a file that it is binary after all; the change dates from 16 September 2014, so it's fairly recent. Before that, it just checked the first few kilobytes to decide, and didn't change its opinion afterwards. To me, this is a very nonintuitive change.


Oh yes. I was bitten by that when I piped around some tool's output that eventually went through grep. The tool's output was actually text but for some reason there was a null byte at the end which nobody noticed before.

The fun part was that the way the data got chunked through the pipes was not deterministic, so sometimes you got the desired output from grep, but other times just "binary file matches", even when the raw output from the first tool was identical in both runs. That took quite some head scratching to figure out.
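
You can reproduce the basic effect with a stray NUL byte (the exact message depends on your grep version and locale):

    $ printf 'some text\n\0' | grep text
    Binary file (standard input) matches
    $ printf 'some text\n\0' | grep -a text
    some text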


Why is it that in recent years, with this and the more recent ls quoting fiasco, maintainers of longstanding UNIX utilities suddenly got the urge to fix what isn't broken?


What's the 'ls quoting fiasco'?

Actually I recently found that coreutils ls behaves fairly well with funny filenames:

Here is an invalid UTF-8 byte followed by a valid UTF-8 sequence:

    $ x=$'\xce\xce\xbc'
    $ touch "$x"
You can list it:

    $ ls
    ?μ
And here 'ls' does better than other tools that display filenames. It shows the invalid byte and then keeps decoding with error recovery:

    $ ls --escape
    \316μ
However GNU stat (which I think is also in coreutils) does something similar, but weirdly messed up:

    $ stat *
    File: ''$'\316''μ'
(it looks like it's outputting a valid shell string, except with extra quotes)

-----

Most command line tools are not aware of stuff like this. For example you can touch "x$ANSI_TERMINAL_CODES" and if you do "bash x??" or "python x??", then your terminal will change color because of the escape codes printed back to the terminal.
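
A quick demo, assuming an ANSI-capable terminal and an otherwise empty directory:

    $ touch "x$(printf '\033[31m')"
    $ echo x*

The shell expands the glob and writes the raw escape byte straight to your terminal, which may now render everything in red until you run printf '\033[0m'.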

I just changed Oil to use a well-defined format I called QSN (quoted string notation):

http://www.oilshell.org/blog/2020/04/release-0.8.pre4.html#t...

It adapts Rust's string literal syntax to express arbitrary byte strings precisely and losslessly. (JSON can't express arbitrary byte strings.)

The QSN encoder does UTF-8 decoding with a specific error recovery mechanism. So it's basically like what ls and stat do, but it's more precise.

(If anyone is interested in QSN, please contact me. I think it's more generally useful in a lot of places. It's something we already do but it's precise like JSON.)


They broke it in 2016.

https://www.gnu.org/software/coreutils/quotes.html

At least with GNU, you can recompile your own, non-broken version, which is the only saving grace of these stupid, trendy changes.


Not broken at all. I really don’t understand the hate for this change.

I deal with a lot of filenames with spaces and think this change is a great improvement for listing such files. With this change it’s much easier to see where one filename ends and the other begins. Before this change, I had to use the `-1` option to ensure that each filename was listed on a line by itself. Now the filename listings are much more readable and it takes less cognitive effort to take it all in.

The way it handles filenames with ASCII apostrophes/single quotes works particularly well (wraps the filename in double quotes instead of single quotes) and makes it very easy to copy and paste filenames to and from the terminal.

Best of all, this change only applies when standard output is a TTY device so this does not break any shell scripts (even though parsing `ls` is a bad idea in any case) and is still compliant with the POSIX specification[1] which states that “If the output is to a terminal, the format is implementation-defined”.

1. https://pubs.opengroup.org/onlinepubs/9699919799/utilities/l...
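
You can see the difference directly (with coreutils 8.25 or later, if I remember correctly):

    $ touch 'bar baz'
    $ ls
    'bar baz'
    $ ls | cat
    bar baz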


The old behavior was the bad one; the new behavior is good. And that link explains why very well.

I guess I should have figured that oblique references to "ls quoting fiasco" are shorthand for "I don't understand what's wrong but I'm angry about it..."

(On the other hand I would say the grep -a issue is bad both before and after because either way it relies on autodetection. The fundamental issue there is that there is too much variance in encodings, which isn't easy to fix. Luckily UTF-8 is growing in popularity, and it doesn't have this issue because it doesn't require metadata for extremely common operations like "find ascii substring".)


> The old behavior was the bad one; the new behavior is good

If you’re a human, yes. If you’re a script, it breaks you in half. If you’re a script that has to run on various versions, then maybe it’s time to fix yourself and use find. You’re a sophisticated script after all, not one of those that require a human with a debugger.

Modern culture may not appreciate that little ‘compat’ thing, but it is essential if you want something to continue to work and not just stop and wait for someone’s educated guesses. Good software doesn’t point fingers at you; it just works. I remember recently wanting to check the network interfaces on some machine, so I typed ‘ifconfig’. Now it’s called ‘ip a’, and there is no ifconfig. I can guess the reason: ifconfig was bad and ip is good.

There is also an eternal “FAT” label issue in the unetbootin app, which resurfaces every time Apple changes its fdisk output format (in every release, it seems). The workaround is to run it with a CLI option, the very thing unetbootin was created to let you skip. This is what makes systems so much fun. Without all these cool things, we would just sit there and cry over our uselessness.

ed: I read below that ls does that in interactive mode only, maybe it’s not that bad then.


You can't just expect compatibility about random things - that's why we have formal contracts: standards, specifications, documentation. A human-readable output of any app, in particular, should never be assumed to be stable, or to have a specific format (even if observations imply it), unless its docs specifically say otherwise.


A script will see the old-style output. The new-style output is TTY-only.


Hats off if your script used ls and handled weird characters in filenames just fine.


> but I'm angry about it.

Take it to Reddit. I know perfectly well what's going on, and also know better than to discuss it with people who act like that.


It would seem that calling ls with the -N switch would disable the quoting. Might be easier to just have ls aliased to "ls -N" rather than recompiling. Unless the -N switch also does something else you don't like.
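
That is, something like this in your shell startup file:

    alias ls='ls -N'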


again, aliasing it is fine for interactive use.

the point is that this breaks an unbelievable amount of already deployed scripts. The new functionality should be optional, and accessed via switches and aliases if you like it.

this was very much a change only a few people liked, that they decided to force down literally everyone else's throats. it's very poor stewardship.


Okay. The GNU page says the quoting is only done when the output is a terminal, so it shouldn't generally affect scripts. Although considering that the world of unix shells is fairly complex, I wouldn't be surprised if there was some kind of a weird but explainable situation where some kind of a script would still break, so I suppose it could be a problem somewhere.


GNU ls prints the filenames already-quoted as of late:

    % touch foo
    % touch 'bar baz'
    % ls
    'bar baz'   foo


A Google search for 'ls quoting fiasco' shows this thread as the top result. Could you explain what it is?


There was a change in ls which caused it to quote file names with spaces or special characters instead of (I think?) escaping them. Some people were upset because they considered this a breaking change. I'll just add that ls only does this if it's running interactively so I wouldn't expect it to break scripts.


I'm a little surprised people are using ls in scripts. I'm normally using find, with the mill separator to deal with spaces in filenames.


... null separator ...

I thought I proof-read that sentence when I wrote it...


Because of the prevalence of UTF-8 and its consequences/pitfalls?


That's the best explanation I can think of too.


"GNU's Not Unix!"


> this is since 16 September 2014

Can you be more specific? I don't see any commit or release on that date.


I see 10. https://git.savannah.gnu.org/cgit/grep.git/log/?ofs=550 (over time, this link will start to point at the wrong date)


My bad, I looked only at author dates. (But then I missed some commits anyway?!)

I guess you had this commit in mind:

https://git.savannah.gnu.org/cgit/grep.git/commit/?id=cd36ab...


I mostly run into this on searching my environment: "set | grep whatever", now needs an "-a", possibly because of escape codes added to the environment a decade ago.

Maybe the fix would be to only activate the "detect binary files" code if stdout isatty?

Because it is a nice feature when I do a big grep to find something among my home directory or the entire filesystem. It is certainly annoying to get binary garbage in my terminal. Or maybe the binary detection could get smarter, maybe making the determination on a match-by-match basis ("This line I'm about to output is a kilobyte and half of it is non-printable", say).
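
In the meantime, a shell wrapper can approximate the isatty idea (a rough sketch; it blindly adds -a whenever the output is piped, which also changes the behavior for genuinely binary files):

    grep() {
        if [ -t 1 ]; then
            command grep "$@"       # interactive: keep the binary-file heuristic
        else
            command grep -a "$@"    # piped/redirected: always treat input as text
        fi
    }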

Though, ack-grep doesn't seem to avoid putting binary garbage on my terminal, so maybe it's reasonable to switch to something that isn't so clever? Most of my terminal grepping is done with ack these days, so I'd probably be happy with GNU grep disabling this cleverness.


Usually you want this feature no matter where the output is going. Adding "-a" sucks, but it's not obvious how else this could work (and still be backward-compatible).

IIRC the grep heuristic only considers a short prefix of the file. If the garbage comes later, you lose. Unfortunately, this makes things seem a bit unpredictable.


> IIRC the grep heuristic only considers a short prefix of the file. If the garbage comes later, you lose. Unfortunately, this makes things seem a bit unpredictable.

This was changed about five years ago to just keep looking. Which makes things a bit unpredictable in a different way.


I probably do not want this feature in the case of a script, as this post talks about. Which is why I was suggesting if isatty(stdout). Most people do not want their shell script's functionality to depend on whether there is a "high character" somewhere in it.


Does anyone know if using grep on a binary file is somehow defined by POSIX?

At a glance, I couldn't find a reference on the grep page:

https://pubs.opengroup.org/onlinepubs/9699919799/utilities/g...


It isn't. The page says:

> The input files shall be text files.

"Text file" is defined in https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1... :

> A file that contains characters organized into zero or more lines. The lines do not contain NUL characters and none can exceed {LINE_MAX} bytes in length, including the <newline> character.
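
You can query the limit on your system; POSIX only guarantees that it is at least 2048:

    $ getconf LINE_MAX
    2048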


Thank you :-)

Interesting, especially the part about the LINE_MAX. Even though it kinda makes sense, I would never have thought that having a very long line makes a file a non-text file when all characters are 'normal' characters.


I'd kind of like that. Grepping in a directory that also contains minified JS (with ten-thousand-character lines) is a pain.


So, what does trigger grep to regard a file as binary/non-text? For me, it seems it's not only NUL characters.

And I also think that this would not be a discussion if it were just about NUL. There is something else that grep does not like (escape characters? Invalid UTF-8? Wrong codepage? I don't know), and this makes the binary detection annoying.


I haven't looked at the GNU grep source, but yes, invalid UTF-8 is one of the reasons for treating a file as binary:

    $ printf 'a\222b' | LC_ALL=en_US.UTF-8 grep a
    Binary file (standard input) matches
This is POSIX-compliant, because "character" is defined as:

> A sequence of one or more bytes representing a single graphic symbol or control code.

https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1...
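
Conversely, forcing the C locale makes every byte a valid single-byte character, so the same input is treated as text again (behavior may differ across grep versions):

    $ printf 'a\222b\n' | LC_ALL=C grep a
    a?b

(The \222 byte is written out as-is; it's shown as ? above.)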


Actually, I never considered log files untrusted input, but as this example shows, it would be wise to do so.


FWIW, attacks like Javascript or SQL injection via logfiles are hardly unknown. Log files are plenty scary. ;-)


Someone did a POC of a CSRF against the admin interface for Cisco routers. They sent garbage packets that sufficiently confused the 'hex editor' display view in the admin web pages in such a way that it made a request to another page, changing permissions.


I’ve seen ansi sequences embedded in (malformed) http requests in my server logs. Presumably they expect people to view the log in a terminal or editor with some sort of vulnerability.


Oh yes. And a target for PII redaction.

Logs need a lot of attention.


Also beware of CWE-117


https://cwe.mitre.org/data/definitions/117.html

'CWE-117: Improper Output Neutralization for Logs'

That is something probably often forgotten when simply dumping some requests into a log, but at least it should be obvious that the source of the content is untrusted. On the other hand, a log file is a file on your server, so you would probably think of it as nothing dangerous, as everybody has cared about CWE-117, right? ;-)


As a developer, if I catch an exception handling some piece of data, I'm logging that data. Nothing irks me more than a file-not-found exception without knowing WHICH file. For example.


I see this and suddenly it clicks! That is exactly why I couldn't import an SQL dump I had been trying to import for days, one that was filtered through grep. Wow.

And I was wondering the whole time why mysql reported this strange error "SQL error in Binary file" when the .sql file was clearly a text file...


I started using ripgrep a few years ago and haven't looked back. It's way faster, automatically excludes .gitignored files, and just has a bunch of common sense functionality.

https://blog.burntsushi.net/ripgrep/


This 'feature' is especially irritating when one uses grep on text files with a legacy (non-UTF-8) encoding but has a locale with UTF-8 encoding. grep decides that a regular text file is binary just because it contains byte sequences that are not valid UTF-8.


How... How can grep possibly work on a strange non-UTF-8 encoding if you don't tell it?


If you're grepping for ASCII strings, then the UTF-8 pattern will match the Latin-1 file.


how can you expect to find ASCII strings in the parent's "text files with legacy (non UTF-8) encoding"?


Because some of them, like Latin-1, which I mentioned in the comment to which you replied, are supersets of ASCII.
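
For example (messages vary by grep version; this assumes a UTF-8 locale):

    $ printf 'caf\xe9 au lait\n' > latin1.txt
    $ grep lait latin1.txt
    Binary file latin1.txt matches
    $ grep -a lait latin1.txt
    caf? au lait

(The raw \xe9 byte is printed as-is; shown as ? above.)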


Also precede the file list by `--`. A very confusing thing can happen if a file happens to begin with a dash... (You can intersperse options like `-e pattern` among file names, if for some reason you wanted to do that.)


But that's not grep specific and generally a good idea, especially in scripts that get the file names from their command line, some input file or god knows where.


If the command line utility doesn't support -- or equivalent then you can prepend "./" for untrusted filenames.
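
A quick illustration of the failure mode and both escapes (GNU-style option parsing assumed):

    $ touch -- '--version'
    $ grep foo --version      # parsed as an option: prints grep's version info
    $ grep foo -- --version   # searched as a file
    $ grep foo ./--version    # also searched as a file, no -- needed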


Have you thought about the environment variable GREP_OPTIONS? https://www.gnu.org/software/grep/manual/grep.html#Environme... You can define it at the beginning of the script.


> As this causes problems when writing portable scripts, this feature will be removed in a future release of grep, and grep warns if it is used. Please use an alias or script instead.


Love ack for source code and text docs: https://beyondgrep.com


I used ack for a while, switched to ag (I don’t remember why? FOTM maybe?), and finally ended up with ripgrep. If you haven’t tried ripgrep you definitely should. It has almost completely replaced gnu grep for me.


I did the same but stayed with ag. For me, like with its contemporary find replacement fd, it's the UX that provides the most benefit.

The speed benefit isn't really a huge factor; it's the UX: working how I'd expect, omitting ignored files, being able to specify file extensions to search, plus simple editor integration. Amazing.


I keep meaning to replace find, but never get around to it. I will check out fd. Ty.


Also works well on non-text files, similar to `strings`! I don't think it works as well, but can still be useful for quick checks.


I normally do:

    strings file | grep search_pattern



It's been the default since 2014:

https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=c...

edit: I would say that if you are doing forensics on an untrusted binary and you are not using a dedicated VM for it then you are not careful enough. objdump, nm are still attack vectors, not to mention debuggers and disassemblers.


Oh man, I've had this issue before and I just chose to nuke the logs and try again, thinking they were corrupted!


I normally use -I to skip binary files


Why is this even a thing?

This is a serious anti-feature as far as I can see.. can someone clarify otherwise for me?


Try `cat /usr/bin/*` in your terminal


> 'LC_ALL=C'

Wuff, reminds me of the completely incompatible difference between BSD sed and GNU sed


shoutout to [RipGrep](https://github.com/BurntSushi/ripgrep), which is generally faster, has more intelligent defaults (searches cwd by default, ignores files matching .gitignore), and can search through only certain text files (like your .java and .py files, say). Not affiliated, just found it worth the effort to learn some slightly different flags, though many are the same as normal grep.
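
For example, limiting a search to Java and Python sources (using ripgrep's built-in type definitions):

    $ rg TODO -t java -t py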


Also a ripgrep fan. I rolled my own search tool[1] due to dissatisfaction with the available options at the time, but I gave ripgrep an evaluation when it came out and it was really good. I ultimately still use my own tool because I'm very attached to the way grep colors/formats stuff (and I implemented the same scheme in findref) but I use ripgrep on large code bases (like the Linux kernel) since it performs a lot better due to various optimizations.

[1] https://github.com/FreedomBen/findref


Small note: if you want ripgrep's output to match grep's, then you can do

    alias rg="rg --no-heading"
It will still show file paths and line numbers. Those can be disabled too, with `rg --no-heading --no-line-number --no-filename`.


oh neat, thank you!


> ignores files matching .gitignore

Maybe it's just me, but that sounds like a bad default. I can definitely imagine people being confused by that.


They are. That's why it's always mentioned in the first few sentences of docs (man page, --help, README). With that said, this default is one of ripgrep's defining features and is something that users consistently report as one of their favorite things about ripgrep.

You can disable all smart filtering (gitignore, hidden, binary) with `rg -uuu foo`. That will search the same stuff that `grep -r foo ./` will.


It is. At first. But you get used to it quite fast. In my experience, when I have a .gitignore I either want to grep the ignored stuff or the rest, never both. So I notice quite early that something is odd, when rg reports nothing at all.


A problem-scenario that comes to mind involves "set your own" config files, like when a codebase has a config.xml.dist and you're supposed to copy and customize it to config.xml which should never get checked in.


It's less confusing if you think of it as a code search tool, rather than a text search tool. Though now I realise how useful I would find this feature.


IIRC VS Code also uses ripgrep internally for searching.


Shoutout to git grep. I use `git grep -n -E --untracked` (via a small wrapper) a lot more than I use ripgrep. It's even faster than ripgrep for me.


I'm always looking for new benchmarks. :) Do you have any easily accessible examples where git grep is materially faster than ripgrep? If not that's cool. Thanks!


I've not looked too far into it but my guess is that ripgrep being faster is just due to GNU grep using a slower algorithm (and supporting unnecessary extensions). The Rust regex library excludes look-arounds and backreferences and is openly inspired by RE2. Russ Cox, one of the guys behind RE2, wrote something[0] on the topic.

[0] https://swtch.com/~rsc/regexp/regexp1.html


No, that's not why. I wrote about why: https://blog.burntsushi.net/ripgrep/

GNU grep doesn't have look-arounds either. It does have back-references, but that doesn't impact searches that don't use back-references.

I don't think there is really a concise way to describe why ripgrep is faster when comparing apples-to-apples. It depends on the queries and the corpus. The primary reasons are that it makes more efficient use of the hardware with algorithms that utilize SIMD.

If you do an apples-to-oranges comparison (i.e., "why is `rg foo ./` so much faster than `grep -r foo ./`), then the answer is pretty easily "because ripgrep uses parallelism and employs smart filtering by default."


Somewhat germane to the ripgrep topic, what's the reasoning behind failing to support simple wildcard searches? I've been playing with rg.exe under Windows for a few minutes now, encouraged by this thread, but have run into a common problem: https://www.grailbox.com/2018/08/restricting-ripgrep-to-cert...

Why don't most grep utilities just do the obvious right thing and support searches on .c* or .h or .txt or whatever? (HN is mangling the text, but you can read the above as "star dot c star," "star dot h," or "star dot txt.")


The reason is that someone hasn't worked on it yet. The reality is that Windows leaves glob expansion up to the CLI tool itself, whereas the shell in a Unix environment does glob expansion before invoking the executable. On top of that, there appears to be no consensus on whether ripgrep should use standard Windows APIs to expand globs (which are much, much less expressive than Unix-style globs), or to just implement its own globbing. There is a ticket tracking it: https://github.com/BurntSushi/ripgrep/issues/234

I've done a lot of work to make ripgrep work well on Windows. But haven't done this. You can usually work around it pretty easily with the -g flag. e.g.,

    rg foo -g '*.{c,h}'
or even shorter

    rg foo -tc


Thanks, some interesting inside-baseball lore there. I should probably just suck it up and get used to the -g syntax, I guess. It does seem like a superb grep implementation.


FWIW, your opinion about which glob syntax to use would be much appreciated. :)


I'm a Windows user, so I consider myself lucky that it works at all. :) As a class, we have been conditioned not to expect or demand too much from command-line utilities from the Unix family tree.

I'm basically looking to maintain the same functionality that was available under DOS with Borland's Turbo Grep in the late 1980s. Those old DOS function calls are still emulated in Win32 as far as I know, but a current implementation would normally use FindFirstFile / FindNextFile like so:

   #include <windows.h>   // for FindFirstFile / FindNextFile, MAX_PATH
   #include <string.h>    // for strcpy / strcat

   char name_buffer[MAX_PATH];

   strcpy(name_buffer,dir_buffer); // directory to scan, if not CWD
   strcat(name_buffer,filespec);   // MS-DOS style filespec with optional * and/or ? wildcards (obviously use snprintf, etc. for this, not strcpy/strcat)

   HANDLE          search_handle;
   WIN32_FIND_DATA found;

   search_handle = FindFirstFile(name_buffer, &found);

   if (search_handle != INVALID_HANDLE_VALUE)
      {
      do
         {
         if (found.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY)
            {
            continue;   // skip subdirectories; we only want plain files
            }

         strcpy(name_buffer, dir_buffer);
         strcat(name_buffer, found.cFileName);

         // name_buffer now indicates a single file
         // which can be opened with fopen or 
         // CreateFile or whatever

         }
      while (FindNextFile(search_handle, &found));

      FindClose(search_handle);
      }
There's probably already some code like this in the area of the program that's used to implement the -g mechanism for types. The latter seems a bit overengineered when I'm just looking for wildcard expansion, but I can see it being useful in a lot of cases. It just shouldn't be the only way to constrain the file set, IMO. A new syntax isn't expected or desired, at least not by me, as long as the old one works.


Weird question for you: how much of ripgrep's awesomeness is you/product focus vs. the power of rust?

To put it another way, do you think you could've written ripgrep in another language? In C? In C++? In Go?


I think Rust is probably a force multiplier here. The same program, IMO, would be harder to maintain in a language like C or C++. Although of course there are alternatives to ripgrep written in C or C++, so I'm not sure how compelling of an argument that is. But the force multiplier doesn't just need to make me more productive. Rust makes it pretty natural to factor things into libraries, and developing ripgrep over the years has caused me to produce many of them. Those in turn have been reused by other tooling, including rustc and Cargo themselves. That is something you don't often see in the C or C++ world.

As for Go, I don't know. The garbage collector seems likely to be a problem. Ben Boyter wrote about working on a similarish tool in Go and problems with the GC: https://boyter.org/posts/sloc-cloc-code-performance/

At the very least, such a tool in Go would probably require writing your own regex engine if you want to get comparable performance. (You can see how similar tools in Go, such as pt and sift, fall off a performance cliff as soon as you lean too heavily on the regex engine.) Or at the very least, contribute back to Go's standard library and make `regexp` faster. Whether it can match the speed of a Rust/C/C++ regex engine, I don't know. It's an open question I think.


Thanks for the thoughtful response!

My own experience w/ Rust is limited to just kicking the tires implementing my favorite toy problem, but it was incredibly positive -- my solution ended up faster than my previous best (in C++) while also having extremely-straightforward, clean code.

(Something that particularly blew my mind was crossbeam_channel being (arguably) nicer than Go's built-in channels, and being able to spread work over threads w/ zero possibility of races -- that's some fuckin' cool shit.)

That experience, combined w/ observation of tools like fd and ripgrep, has convinced me that Rust is actually unlocking a higher level of quality for software, at least in practice, if not in theory.

Your answer has not disabused me of the perception! :-)

> Although of course there are alternatives to ripgrep written in C or C++, so I'm not sure how compelling of an argument that is.

But they're not as good as ripgrep ;-)


> But they're not as good as ripgrep ;-)

I would naturally agree, but my users make that argument far better than I can. :-)


Genuinely curious how you noticed that this thread had referenced your work. And, ironically enough, a thread about text search!


For this thread, I just saw a thread that mentioned grep. At that point, I naturally gravitate toward it since I'm obviously very interested in the topic!


Pretty cool coincidence nonetheless! The reason I asked was because I noticed that your comment was within minutes of the parent comment which was a response to another comment about _your_ text-search project and how fast it is.

I said out loud, "Wow" and browsed through ripgrep. Nice work!

Was torn on whether you were tailing the HN api with ripgrep to alert you or if it was just the right place and time. I had a good time going through how that would work, so, thank you haha!

Occam's razor, right?


Hah. Nah nothing like that. Just right place right time. When you combine that with the fact that I probably check HN a little too frequently, it makes sense. :P

I do occasionally manually search HN for mentions of ripgrep. But that wasn't how I found this thread.


Thanks! As I said, I hadn't looked too far into it.


Drum banging time: another good use for the UTF-8 BOM.



