Only if you count all natural looking data of which there is way more than you could possibly imagine.
It is of course likely that humanity hasn't yet created more than 2^100 (~10^30) files, so in theory, given a registry of every file in existence, you could identify a file by its hash. However, while this is simple, it's definitely not easy.
But this is technically an invalid answer to the problem of file compression. An easy (!) way for data to be losslessly compressed would just be to have a registry of all files in existence, and to refer to each file by its index in the registry. (Like the prisoners who had told each other all their jokes so many times that now they could just quote them by number and everyone would laugh.) Data compression is supposed to be able to deal with novel but likely documents without needing to add them to a global registry.
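Just to make the registry joke concrete, here's a toy sketch (the `REGISTRY` contents are entirely made up; this is purely illustrative, not a real scheme):

```python
import hashlib

# Hypothetical global registry of every file ever created (contents invented here).
REGISTRY = [b"the complete works of Shakespeare", b"a tax return", b"cat.jpg bytes"]
INDEX = {hashlib.sha1(blob).digest(): i for i, blob in enumerate(REGISTRY)}

def compress(data: bytes) -> int:
    """'Compress' a file to its registry index; raises KeyError for anything novel."""
    return INDEX[hashlib.sha1(data).digest()]

def decompress(index: int) -> bytes:
    """Recover the original file by looking it up."""
    return REGISTRY[index]
```

The catch is exactly the one above: it only works for files already in the registry, so a novel but perfectly plausible document can't be represented at all.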
I can imagine the output of all of program space. That's pretty big: all possible universes, in fact.
It's an issue of probability and bins. While natural data is infinite, there is vastly more unnatural data. At some point you have enough metadata (e.g. natural vs. random, produced on Earth rather than elsewhere) to pick out the right needle from the needle stack.
Unless the data is from a completely alien source, we are close enough to that source for our heuristics to be of a manageable size.
I understand your confusion, because it took me a minute to think about it and confirm that this doesn't work. It might be surprising that even a machine with complete information and infinite compute power couldn't pick out the needle; after all, SHA-1 is a 160-bit hash. Are there really more than 2^160 ≈ 10^48 different texts in English that could reasonably have been written by someone living on Earth (add as much additional metadata as you like)? The answer is yes: there are simply too many needles.
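For anyone who wants to sanity-check those numbers, the arithmetic is trivial (nothing here beyond the standard library):

```python
import hashlib

digest_bits = hashlib.sha1(b"any text at all").digest_size * 8
print(digest_bits)              # 160
print(f"{2**digest_bits:.3e}")  # ~1.462e+48 possible SHA-1 digests
print(f"{2**100:.3e}")          # ~1.268e+30, the rough file-count bound mentioned above
```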
Take your favorite book with more than 160 named characters. (Probably a large book, but such a book could certainly exist.) Now, for each character, create a version of the book with their name swapped out for exactly one other name that does not already appear in the book. So if a character is named Alex, in some other edition of the book that character is named Jason.
Now, consider how many editions of the book there are. Each character's name is either kept or swapped, so for a single character there are 2 editions. For two characters there are 2×2 = 4 editions, for three there are 2×2×2 = 8, and in total there are 2^C editions, where C is the number of characters whose names can each be swapped for exactly one other name.
In other words, with C > 160 there are more than 2^160 editions, so by the pigeonhole principle some of them must share a hash: you are guaranteed to be able to generate SHA-1 collisions using one book alone, using only entirely reasonable alternative names for its characters.
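Here's a small sketch of that counting argument using an invented three-character snippet instead of a real book (the text and the name pairs are made up for illustration):

```python
import hashlib
from itertools import product

text = "Alex met Morgan at the cafe, and later Sam joined them."
# Each original name paired with exactly one plausible alternative.
swaps = {"Alex": "Jason", "Morgan": "Riley", "Sam": "Dana"}

editions = set()
for choices in product([False, True], repeat=len(swaps)):
    edition = text
    for (old, new), do_swap in zip(swaps.items(), choices):
        if do_swap:
            edition = edition.replace(old, new)
    editions.add(edition)

print(len(editions))  # 2**3 = 8 distinct, equally plausible editions
hashes = {hashlib.sha1(e.encode()).hexdigest() for e in editions}
print(len(hashes))    # 8 distinct hashes here, but with C > 160 swappable names
                      # the 2**C editions must collide somewhere (pigeonhole)
```

With C swappable names the same loop produces 2^C editions, which is why a single long novel already outgrows the space of 160-bit hashes.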
Taking a book and randomly swapping words means you are introducing noise. This is high-entropy and really it's outside the scope of what I'm comfortable speculating on.
Image data is much simpler to think about in this paradigm, but the obvious AI applications of natural language are of course fascinating.
It's somewhat irrelevant how much unnatural data there is. The point is that there's way more plausible data (or sentences, if we're restricting ourselves to natural language) than there are possible 160-, 256-, or 512-bit hashes.
Unless of course you enumerate all of human communication, but like I said that doesn't count as 'easy'.