SHA-1 'fully and practically broken' by new collision (2020) (duo.com)
279 points by ofou on Oct 12, 2021 | 197 comments



Reminder that GitHub has blocked Git commit collisions since 2017, and as far as anybody is aware hasn't seen one in the wild.

https://github.blog/2017-03-20-sha-1-collision-detection-on-...


Random collisions in 160-bit space are incredibly unlikely. This is talking about intentional collision, and means that it's entirely feasible for someone with significant compute power to create a git commit that has the exact same hash as another git commit. This could allow someone to silently modify a git commit history to e.g. inject malware or a known "bug" into a piece of software. The modified repository would be indistinguishable if you're only using git hashes.

Git uses SHA-1 for unique identifiers, which is technically okay as long as they are not considered secure. If git were designed today it would probably use SHA2 or SHA3 but it's probably not going to change due to the massive install base.

Edit: anyone know if git's PGP signing feature creates a larger hash of the data in the repo? If not maybe git should add a feature where signing is done after the computation of a larger hash such as SHA-512 over all commits since the previous signature.


The defence used by GitHub specifically defends against these intentional collisions, not some mirage of random collisions.

Basically you collide a hash like SHA-1 or MD5 by getting it into a state where transitions don't twiddle as many bits, and then smashing the remaining bits by brute force trial. But, such states are weird so from inside the hash algorithm you can notice "Huh, this is that weird state I care about" and flag that at a cost of making the algorithm a little slower. The tweaked SHA1 code is publicly available.

If you're thinking "Oh! I should rip out our safe SHA256 code and use this unsafe but then retro-actively safer SHA1" No. Don't do that. SHA-256 is safer and faster. This is an emergency patch for people for whom apparently 20 years notice wasn't enough warning.

In theory the known way to do this isn't the only way, but we have reassuring evidence from MD5: independent actors (probably the NSA), who have every reason to choose a different way to attack the hash to avoid detection, still trigger the same weird states, even though they're spending eye-watering sums of money to break the hashes themselves rather than just copy-pasting a result from a published paper.


So, if I understand correctly: the patched SHA-1 code generates the same hash, but it has checks on the internal state so that it will flag inputs which are likely to be intentionally colliding?



Git != GitHub


It is not yet possible "to create a git commit that has the exact same hash as another git commit" in the sense that, given a commit someone else has already made, you can make another commit with the same hash (that would be a second-preimage attack, which remains infeasible).

What is possible now is something that is much easier: if you have enough money and time, you can create 2 commits with the same hash. The two commits start with different parts, which may be chosen arbitrarily, and then have to include some parts that must be computed to make the hashes match; those computed parts will be gibberish, which might be disguised in some constants or a binary blob, if possible.

Then you can commit one of them and presumably you can later substitute the initial commit with the other one without anybody being able to detect the substitution.


It doesn't take any money or time. Google's break of SHA-1 was fully reusable. So long as committing a PDF to the repo counts, there's a script that will trivially concat two PDFs in such a way that they each render to their original (different) contents, but both have the same SHA-1 hash. Put in a repo and `git add foo.pdf` and you're done.


Nope. Google's break of SHA-1 was reusable in the sense that you can add an arbitrary (identical) suffix to both PDFs and keep the collision. But Git does not use raw SHA-1 hashes; it adds a prefix before computing object hashes. Therefore, Google's break of SHA-1 cannot be reused to break Git.
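
To make the prefix point concrete, here's a minimal Python sketch (not Git's own code) of how a blob's object ID is derived: Git hashes a "blob <size>\0" header followed by the content, rather than the raw file bytes, which is why the published PDF pair doesn't also collide as Git objects.

    import hashlib

    def git_blob_sha1(data: bytes) -> str:
        # Git hashes "blob <size>\0" + content, not the raw file bytes.
        header = b"blob %d\x00" % len(data)
        return hashlib.sha1(header + data).hexdigest()

    # The two SHAttered PDFs share a raw SHA-1 digest, but with this header
    # prepended their Git blob IDs differ.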


> presumably you can later substitute the initial commit with the other one without anybody being able to detect the substitution.

How? Which operation would be involved? Will it not show anywhere else(reflog)?


The reflog shows changes to references. References are hashes. Git is CAS, and uses the CAS assumption: H(m1) = H(m2) <=> m1 = m2.


What do you mean by "CAS"?


I think it means Content Addressable Storage - https://en.wikipedia.org/wiki/Content-addressable_storage

In short, it's a method of storage where object's identity is derived from object's content (usually via hashing it). So the assumption is: same hash => same content => same object.
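
A toy sketch of the idea (illustrative only, nothing like Git's actual object store):

    import hashlib

    class ToyCAS:
        """Minimal content-addressable store: objects are keyed by their own hash."""

        def __init__(self):
            self._objects = {}

        def put(self, content: bytes) -> str:
            key = hashlib.sha1(content).hexdigest()
            self._objects[key] = content
            return key

        def get(self, key: str) -> bytes:
            # Relies on the CAS assumption: H(m1) == H(m2) implies m1 == m2,
            # so whatever sits under a key is taken to be "the" content.
            return self._objects[key]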


By replacing the file on disk (you need to have the capability of acquiring access to the server)


Because such an operation is unlikely to succeed even if the hashes match, replacing SHA-1 was not a high priority for Git.


`git push --force` I assume. In GitHub, this will leave a trail of events though - Webhook events and auditing.


I would think it's not that simple because `git push --force` doesn't do anything if it thinks the histories are the same, and in this case we've created a history that appears to be the same. You'd likely need a custom git client (which isn't a problem) but I don't know enough about the internals of a git push to know if the server would even accept objects if they match hashes of objects it already has (it may just go "I don't need that" and ignore it, because what's the point in doing anything with it?).

Presumably it would depend on the exact server implementation whether you could get it to replace an existing object with a "new" one which is actually the same hash, but frankly I think that's unlikely because it would be pointless work from the view of the server. If it does happen I'm not sure what auditing you would actually be able to see; webhooks and such might not be triggered because the history didn't actually "change".

What you could do however is just host it yourself somewhere else, say put it on a fork. Or if you have access to the actual repository hosting the original version, you could just manually replace it yourself. git clients aren't going to just automatically pull the "new" version though so you'll have some combo of people with the old and people with the new, and it gets a little messier from there.


If you can force push, why would you then not push a different commit before pushing your updated commit with the same, original hash, or does that also not work?


I wouldn't expect that to work reliably because git doesn't actively remove unused objects from the store, hence why you can do `git reflog` to go find stuff that's not actually referenced anywhere anymore. `git gc` is necessary to make them actually go away, and whether that ever happens and how often is up to the server. I know for example that Github practically never does it, and even if it did happen it would be hard to reliably ensure no references to the relevant object still exist anywhere on the server. For example, you would have to force push every branch on the server that references the object, and if the git server creates branches for internal use you might not be able to touch those or convince the server the object is unused. And even if all the references are gone if the object is never actually deleted from the server then it should just get used again if you try to do a `git push`.


By default Git keeps inaccessible/orphaned objects for two weeks before they are garbage-collected.


force push only updates a remote ref (the "force" just allows non-fast-forward updates). it doesn't do any forced pushing of objects.

you need to modify the object store manually, so access to the filesystem


I believe it lets you substitute an entire chain of commits based on some starting commit.


Looks like there's a migration plan for git to SHA-256:

https://git-scm.com/docs/hash-function-transition/


Last updated in 2017. What is the current status of this?


As of January 2021, SHA256 repositories are supported, but experimental. They can be created with `git init --object-format sha256`. If I understand correctly, they don't mix at all with SHA1 repositories (i.e. you can't pull/push between SHA1 and SHA256 repos).

See https://stackoverflow.com/questions/65870508/git-and-sha-256


Here is a discussion [1] of the issue on the Git mailing list at the time, with some useful context WRT git. Git 2.29 had experimental support for SHA-256, but that was a year ago; I'm not sure what its current status is.

[1] https://marc.info/?t=148786884600001&r=1&w=2


I was surprised that no one suggested truncating SHA-256 to 160 bits (same as for SHA2-256/224, or SHA2-512/256). The attacks on SHA-1 are not directly based on the length of the hash; they are based on weaknesses in the algorithm.

Even attacking SHA2-256/128 would be quite difficult as I understand it, even though it's the same length as MD5.

Truncated hashes also of course have the great property that they mitigate the length extension in Merkle-Damgard
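
For illustration, truncation just means keeping the leading bytes of the digest; a minimal sketch (note that the standardized truncated variants such as SHA-512/256 also use distinct initial values, which this sketch does not bother with):

    import hashlib

    def sha256_truncated_160(data: bytes) -> bytes:
        # Keep the first 160 bits (20 bytes) of a SHA-256 digest: same length
        # as SHA-1, but backed by SHA-256's compression function, and immune
        # to length extension because the full state is never revealed.
        return hashlib.sha256(data).digest()[:20]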


> Truncated hashes also of course have the great property that they mitigate the length extension in Merkle-Damgard

To be fair, this is totally irrelevant to git, since the attacker knows the whole message and can just recompute the extra bits themselves. That said:

> I was surprised that no one suggested truncating SHA-256 to 160 bits (same as for SHA2-256/224, or SHA2-512/256). The attacks on SHA-1 are not directly based on the length of the hash; they are based on weaknesses in the algorithm.

Very seconded. You could even shove the extra 96 bits in an optional metadata field and have new versions of git throw up a giant air-raid-siren-level error if they don't match (since that will never happen by accident[0]) and still have the full 256-bit-hash worth of security for most purposes. Git already allows (arguably encourages) people to truncate hashes to 28 bits or so at the UI level, so there's precedent for that already.

0: You do not have anywhere near 2^80 commits in the world, much less in the same repo.


> I was surprised that no one suggested truncating SHA-256 to 160 bits...

To do that, you have to stop generating commit hashes with SHA-1, breaking compatibility with existing git clients.

And if you're going to do that, you might as well just use the whole SHA-256 hash, since compatibility is out the window already.


Yep. GitHub is saying they would block an object that looked like it was crafted to produce a collision using SHAttered, and hasn't seen one, intentional or otherwise, in the wild.


> This could allow someone to silently modify a git commit history to e.g. inject malware or a known "bug" into a piece of software.

You need a collision. You also need it to be syntactically correct. You need it to not raise any red flags if you are contributing a patch. And ultimately you need it to do what you want.

That's a pretty tall order.


All you need is one variable sufficiently large that you can change until you find a collision, such as a comment (which about all languages allow) or a meaningless number.

You could even vary whitespace until it fits, like spaces at the end of lines.
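
At toy scale, the "vary a comment until the hashes line up" idea looks like the sketch below, matching only a 16-bit prefix; a full 160-bit match is hopeless by brute force and is exactly what the cryptanalytic attacks shortcut.

    import hashlib
    import itertools

    # Toy demo: tweak a trailing comment until a 16-bit hash prefix matches.
    target = hashlib.sha1(b"int answer = 42;\n").hexdigest()[:4]

    for nonce in itertools.count():
        candidate = b"int answer = 43;\n// %d\n" % nonce
        if hashlib.sha1(candidate).hexdigest()[:4] == target:
            print(nonce, candidate)
            break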


You'd also need the actual patch to survive future commits, especially without introducing any merge conflicts


Commits aren't patches. They contain the whole tree. Retroactively changing a commit can't possibly introduce conflicts with other commits on top of it, the worst it can do is introduce big funny-looking diffs.


It's true that semantically git commits store the whole tree, but doing that naively would be inefficient. Instead, packfiles will store some objects as deltas, which could either result in inconsistencies or noticeable knock-on changes if the original object contents are changed.


While that's true, I'd be very surprised if git delta-compressed the commit objects themselves. Changing a commit to point at a different tree wouldn't impact the delta-compressed packings of any file blobs, it would just change the actual file the commit points to.

For example, suppose you started with a commit graph that looked like this:

    C1 --- C2 --- C3
     \      \      \
      T1     T2     T3
       \      \      \
        F1 - - F2 - - F3
Where C1, C2 and C3 are commits; T1, T2 and T3 are the trees they reference; and F1, F2 and F3 are three versions of a file blob stored delta-compressed in your packfile. Then if you had a malicious version of C2 with the same hash you could replace C2 with a new commit C2' pointing at a new tree T2' with a new file object F2', and nothing would break. The resulting commit graph would look like this, and F1, F2 and F3 would all still be in your packfile delta-compressed and accessible, just with nothing referencing T2/F2:

    C1 --- C2'--- C3
     \      \      \
      T1     T2'    T3
       \      \      \
        F1 - - \  - - F3
                \
                 F2'
Regardless, this is all moot to some extent. The attack most everyone talks about is that if you were in control of a central git repository (for example if you were hosting a mirror of an open source repository), you could give two different versions of that repository to different people without them being able to tell, even if they were checking PGP signatures or referencing specific git hashes.

For example you could serve the non-malicious files to human developers, and when a user-agent that looks like a CI/CD pipeline such as Jenkins or the Ubuntu/Debian/RedHat packager's build machine or something clones the repository to build a specific hash requested by the user, give it a malicious version of the source tree that builds a backdoor into the binaries it creates. In this sort of attack you never have to "change" a git object on someone's machine, which is something the git protocol isn't designed to do because it never happens naturally.


> Commits aren't patches.

Indeed. I'd add that this is, in my experience, one of the biggest sources of misunderstanding for people new to git. It isn't helped by the fact that a lot of git introductions (well-meaningly) emphasize diffs between commits.

Darcs (http://darcs.net/) is an example of a truly patch-centric DVCS. While I think git is great and that its ubiquitousness has made the world better, I'm always a bit sad when reminded of what could have been with Darcs (for all its problems).


Well, the contents of the commit is a patch plus metadata. They point to a parent commit, and layer themselves in the tree.

The problem would be if a clone doesn't fetch the new version of the patch and generates a new commit that would conflict with the modified commit. You're changing the base all the future diffs are based off of. It might just jumble the source and essentially corrupt the file, but I'm not sure.


The contents of a commit are not a patch; it's the whole tree. The git UI presents it as a patch, but that's generated at runtime by the "git diff" command. It does internally use delta compression to save storage space, but it's not necessarily a straight delta between a commit and its direct ancestor (and that's just an internal optimisation).


> If git were designed today it would probably use SHA2 or SHA3 but it's probably not going to change due to the massive install base.

There is some work going on to change this, but it's not an easy task:

https://lwn.net/Articles/811068/


https://stackoverflow.com/questions/23584990/what-data-is-be... says git signing just signs the commit hash (plus metadata: committer, commit message, etc).


> Random collisions in 160-bit space are incredibly unlikely

Just for fun: to get a 5% chance of a hash collision between ANY two numbers in a 160 bit space, you'd have to generate 3.9e23 hashes.

So you'd have to generate 1000 hashes per second for *12 trillion years.*

Formula:

n = sqrt(2 * 2^160 * ln(1/(1 - 0.05)))

https://en.wikipedia.org/wiki/Birthday_problem#Probability_o...
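
A quick check of those figures with the same formula (the constants here are just the 5% target and the 1000-hashes-per-second rate):

    import math

    # Birthday-bound approximation: hashes needed for a 5% chance of any
    # collision in a 160-bit space.
    n = math.sqrt(2 * 2**160 * math.log(1 / (1 - 0.05)))
    print(f"{n:.2e} hashes")                                    # ~3.9e23
    print(f"{n / 1000 / (3600 * 24 * 365.25):.2e} years at 1000 hashes/s")  # ~1.2e13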


From a practical point of view, how would injecting malware happen? If you're trying to insert a malicious diff somewhere deep in the git history, you would need to recompute all the other commits after the injected commit - which would most certainly change their commit ids too if they are touching the same file. When other commit ids change, the malicious change becomes detectable.

There's also the case for auditing: force pushing into an existing repo triggers an event in GitHub and is logged. While the logging event can be missed, it leaves a paper trail.

With things like reproducible builds, this also becomes harder. Distributing source code (through a fork, or by putting it up on a website like mytotallylegitgitrepos.com) which builds into a binary that doesn't match the upstream hash is suspicious.


Presumably the attacker would modify the most recent commit which edits the file that they are targeting. It is true that the attack becomes more difficult if you try to target an older commit.

Auditing helps if they try to force push the original repo, but doesn't protect vs someone redistributing malicious clones of the repo.

Reproducible builds do help, but only for projects that can take advantage of them...


Every commit references every file. If you change the content of an old commit you would only affect people who check out the old commit. So this is utterly pointless and not what someone would do.

Instead what you would do is attempt to make a file-object that has a certain SHA1 hash identifying it, and a colliding file-object that has the same SHA1 hash. Then you are free to give people who clone the repository different file contents depending on when/who/how someone requests it (if the file content is hosted on github, how to change the file object identified by a given SHA1 hash is an additional hurdle since it's assumed to be immutable and indefinitely cacheable; if you control the host yourself you can just change it whenever you like).


As far as I know it only signs individual commits.


Can you create an intentional collision, and also have the counterfeit have compilable code? (Or, in other circumstances, intelligible English text?)


Yes, just like you can create an intentional collision, and also have the counterfeit be a PDF. [https://shattered.io/]
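
You can check that pair yourself; a small sketch, assuming the PDF URLs published at shattered.io are still live:

    import hashlib
    import urllib.request

    URLS = [
        "https://shattered.io/static/shattered-1.pdf",
        "https://shattered.io/static/shattered-2.pdf",
    ]

    for url in URLS:
        data = urllib.request.urlopen(url).read()  # needs network access
        # Identical SHA-1 digests for both files, different SHA-256 digests.
        print(hashlib.sha1(data).hexdigest(), hashlib.sha256(data).hexdigest())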


> A higher probability exists that every member of your programming team will be attacked and killed by wolves in unrelated incidents on the same night.

- Scott Chacon


This turns out to be wrong; for a 6-member programming team, that probability is about 2⁻²⁴⁵, which is about 2⁸⁵·³ times less likely than an accidental 160-bit SHA-1 collision: http://canonical.org/~kragen/sw/dev3/rpn-edit#3_8_0_1_0_0_0_...

Aside from being bullshit, it's also irrelevant, since we're discussing a collision being generated on purpose, not by accident.


Just to nitpick, I don't think that formula is valid. We're primarily interested in "unrelated" wolf attacks, but it counts the total fatalities, not the total number of fatal incidents. If we count each fatal attack as only one incident, regardless of the casualties, we get 2^-258 instead.

But of course we also need to take into account where the 6-member team lives. If they all live in West Bengal, India, the consideration is much different than if our developers live in Atlanta. Atlanta doesn't have any wild wolves. There is a Wolf's guenon in the zoo, but that probably doesn't count as a risk because they mostly eat small animals and also are monkeys.


This is why I always get mad when people say something like "you are more likely to be struck by lightning than eaten by a shark!"... well, that REALLY depends on where you are.


> that REALLY depends on where you are

There are only about 10 fatal shark attacks per year. What's your calculation for the sharkiest area to live? It has to be something like 100+ times sharkier than average for your claim to be true. And keep in mind that half the US population can easily day trip to the ocean.

Edit: Actually, that's using a number of 2000 lightning fatalities which might be 10x too low. And lightning injuries are another 10x higher than that. So you'd need somewhere that shark attacks are a thousand or ten thousand times more likely than average. That's also without interpreting "eaten" literally...


You focused on the shark side and not on the lightning side. You will never get struck by lightning in a city, so there the probability is 0. Similarly if you go out swimming in the ocean on a sunny day the probability of getting struck by lightning is also 0, but the probability of getting eaten by a shark is not 0.


You're going too granular. This is about populations that live in certain areas, not specific people on specific days. "depends on where you are" when talking about an over-time risk ratio. The goal isn't to find some dude with weird habits.

And while a population that mostly stays inside a city has a reduced lightning chance, it also has a reduced shark chance. I don't think that's anywhere near the point of equalizing the rate.

If you have an idea for a factor that dramatically reduces lightning risk but doesn't reduce shark risk much, I'm interested.


0 is not a probability; trying to use it as one leads to division-by-zero errors.


How do you represent the probability of an impossible thing, then?


You could always be wrong about it being impossible. People do make mistakes sometimes.


It also depends on what you are doing. If you are spear fishing in shark filled waters, you have a much higher chance of being bitten by a shark.

And if I live in a place where there is very rarely lightning, that chance gets really low as well.


> If you are spear fishing in shark filled waters

Like I said in my other comment, it's not about finding one specific guy. Picking a place where people live is a reasonable starting point for making a rebuttal to a general statement like that. I don't think pointing at Fisherman Sam is.

> And if I live in a place where there is very rarely lightning, that chance gets really low as well.

How low can that number go? The numbers I picked made consideration for some amount of variation in lightning. Does lightning have a huge variance?

It's important to have real numbers when you're making the claim that there are places where the lightning:shark ratio is multiple orders of magnitude lower than the average. That's not a claim you can justify by merely pointing out that the risks will vary by ___location.


There was an organization devoted to reintroducing wolves to the northeastern US, but I don't believe they found it politically feasible.

However, I have seen (and photographed) something I called a coyote, but others insisted was a wolf.

It has been asserted that coyotes tend to evolve to resemble wolves, when there are none in an ecosystem.


Wolves are slowly coming back, they were recently taken off the endangered species list in Idaho, Wyoming, and Montana.[1]

[1] https://www.nps.gov/yell/learn/nature/wolves.htm


For whatever reason, American wolves attack people much, much less frequently than Eurasian wolves.


That's a good point! I didn't think about either of those issues. The second one raises the average team's chances of such a spate of unrelated attacks significantly, if anything related to this formula could be said to be significant, because the increase in risk to teams in West Bengal and Ukraine is enormously larger than the decrease in risk to teams in Atlanta.


i laughed so hard i cried, thank you for this

(to everyone involved)


Plot twist: two developers of your remote team live in Rome and have been raised by a female wolf. How to account for that?


This is problematic because it must be 2774 years ago, when wolf attacks were substantially more common. I'm also not thrilled that a good chunk of my dev team lives near the Lupercal. The one good piece of news is that at least one of my headcount is fated to die by the hand of man, which might bring the odds of simultaneous wolf attack deaths down to zero unless he quits beforehand.


I concur.


> Aside from being bullshit, it's also irrelevant, since we're discussing a collision being generated on purpose, not by accident.

I think you just pointed out the error in your own reasoning. This is defending against a deliberate attack. Therefore, your proper odds would be that your programming team is deliberately set upon by 6 different wolves. So, have they offended people who have access to 6 wolves, and the time and inclination to train them (or hire others to) in an effort to pull off a murder spree?

Edit: Actually, the hash attack already assumes motivation and skill. So, I don't know how the odds would have to be computed. That at least one programmer on your team could fight off a trained attack wolf (to whatever level of "training" is the current state of the art for attack wolves)?


This is stupid. The probability of any attack happening at random is close to zero. E.g. the probability of a buffer overflowing just right to give you a shell through random chance is less than being eaten by wolves.

The question is about the risk of someone intentionally performing the attack, not the probability it will accidentally happen at random.


That's even true if you make your developers commute into Yellowstone National Park by ski every day


For those interested, you can use SHA-256 today. Warning: still experimental.

    # in the repository's .git/config (set when the repository is created)
    [extensions]
        objectFormat = sha256


Does GitHub support this?


Not yet [1]. "However, conversations around its implementation are ongoing and looks like it will be on the horizon."

[1]: https://github.community/t/support-for-sha-256-hashes/157493...


> The technique that the researchers developed is quite complex and required two months of computations on 900 individual GPUs.


I wonder how much money in Bitcoin they would have made if they had used those hashes to mine instead.

Speaking of which, it would be funny if they were pretending to do security research but added a secret backdoor to mine bitcoin instead, somehow exporting those hashes or using the partial results of SHA-1 calculations (BTC isn't SHA-1).

I'm just joking, but I wonder if that's possible. If anyone is Machiavellian enough for that, it's security researchers.


Well, they made 2.4994722 BTC directly by exploiting a smart contract on bitcoin that pays out to anyone demonstrating a SHA-1 collision.


mining bitcoin on one GPU is uneconomical, so mining bitcoin on a large number of them is... also.


Unless your research grant pays for the electricity


Ding ding ding


[flagged]


> When disagreeing, please reply to the argument instead of calling names. "That is idiotic; 1 + 1 is 2, not 3" can be shortened to "1 + 1 is 2, not 3."

> Be kind. Don't be snarky.

https://news.ycombinator.com/newsguidelines.html


Or roughly 1.3M GPU hours. Less as GPUs get upgraded. Not sure how many AWS F1 (FPGA) hours it would take.


So somewhere between $300,000 (running consumer GPUs cheaply) and $30 million (paying AWS on-demand prices for Tesla V100).

Even the upper bound for that seems well worth it for well-funded attackers if the target is juicy enough.


Consumer GPUs could be run from a botnet, actually... Yikes.


I only half joke when I ask: when you break a cryptographic hash, does it mean that now it's just a really amazing compression algorithm, but with a very heavy compute requirement? The non-joking half is speculating what data compression in a post-quantum compute world looks like.


The compression joke evokes a misconception (as STEM jokes often do) which, in this case, is that breaking a hash means finding a way to recover the full text of a multi-megabyte document from a 100-something bit hash.

Whereas all it means is that it's become feasible for someone to produce a fake document with the same hash as the genuine one; the attack depends on the existence of collisions, which exist because the hash is many-to-one, such that an infinite set of documents could be "recovered" from the hash.

Most of the documents you could "decompress" from a hash are going to be gibberish, but some of them are going to be legit. If a hash is "broken", that means it's somehow unduly easy to search that ocean of gibberish collisions to find a chosen text which goes to a certain desired hash.

Nothing quantum can redefine a function that isn't one-to-one and onto into one that is. The hashing function remains ever the mathematical object that it is: a one-way function. But calculations of or searches through that function can be faster.

In a post-quantum world, pre-quantum crypto and hashing could well turn out to be "for shit" as such, but not due to ruses like suddenly reversing irreversible functions.

Not saying that the joke isn't funny! We can imagine some hard-core computer science or math sitcom in which the fall characters say things like this, and only the very narrow target audience gets it.


These hashes, regardless of their actual cryptographic strength, are intended to be indistinguishable from a random generator keyed with the input, kind of. Assuming that they succeed in that, hashes should be quite well distributed. That means that for each n-bit hash output, there will be about two (n+1)-bit input strings that produce that hash, about four (n+2)-bit input strings, eight (n+3)-bit input strings, etc. So while there are probably few ASCII input strings that map to one particular hash, for example, there will be loads and loads of arbitrary bitstrings that map to that hash.

Hashes are very bad compressors. :)
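
You can see that distribution directly at toy sizes; a sketch using the first byte of SHA-256 as an 8-bit "hash":

    import hashlib
    from collections import Counter

    def toy_hash(data: bytes) -> int:
        # 8-bit "hash": first byte of SHA-256, just to make counting feasible.
        return hashlib.sha256(data).digest()[0]

    # Hash every 16-bit input: 65,536 inputs fall into 256 buckets.
    counts = Counter(toy_hash(bytes([a, b])) for a in range(256) for b in range(256))
    print(min(counts.values()), max(counts.values()), sum(counts.values()) / 256)
    # Average is exactly 256 preimages per output, as the counting argument predicts.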


Compression algorithms must encode data in a reversible way, which hash functions can't do in general.

For example, SHA-1 outputs a 160 bit hash, which means some input of 161 or more bits will definitely have the same hash as some other input of the same size. Even smaller inputs may have collisions too.


No, because reversing the hash has infinite possible answers (well, "very many" for bounded input size).

You can't decompress 32 bytes into 1GB because you don't know which of the 1GB-sized answers is the intended one.


Not so simple. A hash corresponds to infinitely many messages, but how many of them are in ASCII charset under 1KB long? It may happen that each hash has a unique message within these constraints.


Ignoring punctuation, capitalization and spaces, you have

(26 ^ 1024) / (2 ^ 160) ascii texts, which is too large for python to do the division.

Thinking harder, that's:

>>> (13 ** 160) * (26 ** (1024 - 160)) 586015382205826960727672544401463871212215837535709025412643135464856100047507808667927549205890681326798481562717865671371914861707033271010401105757243098374422354323280335989456977123883814519788789676409601028611595593846201622073508573113400508407626532918150795408521721900638322896979964584478287370459866751451622413984067802115006844002721198098126119299044037305285971873899685570443731713584345786693414215895940310733413089663439828136700053588516077484775302179979357918243214847406310224703621807862576801557249657421436718532619781248154845297450962161296162654285951879462181582579086975207170674079727507724939279192338763731562331681240184746490930657625088527012496974579965412172847315257089070105214466289197177909463176672954181773739345690793807307314187284671422149249630372138364687234076394633405015375177710973043617082884089261346079718819448519032937091364439564690325014717871376537803759701695837030040255807318892621674241958398030180530294671920834183657715967728968579955163053146309746677182369005157209730333605691785063279407829717681098919238400049142795696195708090541066855775773950462556018168019660329830112434789524857886670766484791538540247307849949305926757172442880517278313785940979709167536140659937329357336156000509320395656047604973761134700992358740522602445520136787846575269569119611476145342798153931437530464566322510167663230985347176517861376

If you restrict it to words, assume that there's only 1000 words in the English language, averaging 5 letters each, you get:

(1000 ^ 200) / (2 ^ 160) which becomes:

>>> (500 ** 160) * (1000 ** 40) 684227765783602085411977335590779360976690401306892466678255997993062052092705371819647552911192178726196289062500000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

There will be massive numbers of realistic texts that match any given hash.


Good point. I've realized that 2^160 can be reached with just 160 words, where each word is a choice between two synonyms.


The pigeonhole principle forbids it, since the hash itself is an ASCII message under 1KB long. Even within your message constraints, the message can contain any possible hash, and more besides. Therefore there are more valid messages than hashes.


It's very unlikely for a 160-bit hash to have a unique message under the constraint of it being "in ASCII charset under 1KB long". 1 KiB of printable ASCII including \n and space but not \t or \177 is 96¹⁰²⁴, which is about 2⁶⁷⁴³. Of these, if the hash function is any good at all, almost exactly one out of every 2¹⁶⁰ will have the right hash, leaving you with an expected 2⁶⁵⁸³ distinct ASCII-charset messages under 1KB long. If I calculated it correctly, instead of one unique message with the right hash consisting of under 1 KiB of printable ASCII (possibly with some spaces appended to pad it out to 1 KiB), you have about 47 967 084 939 617 240 088 231 752 005 306 325 884 757 564 806 051 499 378 420 486 072 786 141 026 711 644 389 624 345 361 644 245 593 155 700 554 795 067 612 226 300 255 186 367 434 267 198 328 891 304 541 335 537 513 836 547 400 045 409 787 642 617 075 359 474 632 262 281 099 933 405 444 693 275 797 636 981 359 166 142 551 935 433 456 776 135 242 574 665 937 013 670 423 995 769 980 893 248 768 962 950 653 929 460 806 907 564 741 845 059 869 794 472 249 820 375 550 104 552 506 227 318 738 353 617 584 549 423 076 513 640 935 785 892 067 652 844 586 446 661 520 922 590 520 435 212 971 778 420 191 665 135 704 785 297 868 496 132 194 344 092 532 767 753 539 023 500 656 559 945 585 667 295 477 775 543 847 979 594 651 751 709 542 229 125 925 151 633 946 693 073 932 350 683 231 283 773 613 582 005 763 154 838 864 288 879 888 332 953 553 058 126 791 192 014 388 230 316 497 633 811 926 064 371 458 658 795 966 440 990 101 901 032 836 632 802 434 803 173 156 709 301 006 193 682 249 500 596 835 881 601 627 616 029 997 922 141 423 971 635 916 546 397 289 367 953 405 102 342 661 998 428 622 018 161 995 405 622 814 061 925 210 174 108 063 042 331 122 962 491 864 135 145 051 181 277 545 303 147 381 365 523 778 185 116 460 815 262 263 500 402 481 608 602 600 186 833 670 768 942 448 591 646 439 127 301 400 601 336 130 050 539 973 301 764 212 409 825 855 458 107 341 401 691 305 152 002 782 886 099 703 975 355 852 347 383 236 360 931 712 288 039 396 729 525 475 540 725 954 319 379 177 826 924 167 807 020 037 630 409 522 341 596 693 199 836 065 989 968 868 685 185 086 087 022 273 317 809 678 407 361 940 533 280 300 173 079 635 303 309 555 449 410 556 692 415 681 998 421 397 202 738 917 214 037 435 078 427 359 778 923 359 900 840 055 529 921 692 702 321 542 997 864 543 418 915 511 989 871 945 334 709 916 341 623 655 699 199 160 379 312 900 066 054 592 963 465 517 593 966 290 755 564 533 465 052 063 440 394 099 369 011 964 754 135 849 017 025 315 982 367 652 117 408 473 872 192 766 757 909 754 922 584 465 489 973 138 378 312 812 178 366 379 361 777 540 778 799 973 694 388 412 249 147 549 245 795 058 816 022 744 426 339 281 128 551 995 761 983 244 680 295 344 226 171 135 266 747 723 635 275 591 506 459 047 751 124 514 571 326 309 610 529 999 309 895 628 777 141 756 979 076 770 232 849 272 425 879 893 429 411 541 942 495 076 207 596 350 607 674 491 546 842 819 722 368 514 723 670 390 388 258 639 701 760 252 716 026 232 912 354 822 929 844 920 997 401 287 086 389 559 396 695 747 881 419 688 945 588 175 366 178 512 747 036 422 014 697 873 269 878 782 396 373 076 554 780 366 468 559 467 108 175 337 875 528 158 025 875 456 of them. (Sorry for the number format with spaces; it got chopped when I used commas.)

Most of them will be gibberish; English text has only about 2.3 bits of entropy per letter. So there are only on the order of 2²³⁵⁵ 1KiB messages that look pretty much like English text, of which one in 2¹⁶⁰ will have any particular SHA-1 hash, leaving about 2²¹⁹⁵ more or less English messages of 1 KiB, which is a much more manageable 5 765 546 543 805 391 543 300 385 245 067 897 407 469 008 380 207 694 434 170 981 314 369 415 226 086 245 896 005 497 410 349 176 911 651 361 357 544 908 126 864 379 940 776 407 262 468 025 247 520 821 365 392 566 254 691 849 336 550 399 984 742 144 883 696 325 495 839 942 505 506 308 529 294 485 245 435 346 088 288 415 306 782 152 045 986 880 430 505 821 218 111 120 701 594 573 419 855 327 199 586 861 839 630 511 065 600 663 692 968 681 473 384 074 002 850 142 261 291 497 547 545 795 867 600 142 345 188 353 358 006 378 705 229 284 788 565 040 964 509 510 302 568 387 814 225 873 737 552 804 109 763 080 706 434 267 888 314 149 674 523 819 024 312 546 837 031 915 917 556 591 511 424 773 862 591 940 658 144 814 461 877 029 111 670 089 356 835 845 931 924 493 084 507 666 309 424 365 148 038 224 615 440 025 478 945 269 023 101 615 392 514 882 287 817 384 451 162 838 663 168 messages.

It is unlikely that it would be apparent upon inspection which one was correct, since most of them differ from other messages by only a couple dozen subtle changes.


Natural data is easily identifiable.


No it's not. You can't distinguish the complete works of Shakespeare from the complete works of Shakespeare with the words changed a bit, or switching to Harry Potter for a bit in the middle, or a completely dazzling and original work of fiction. Any hash might generate all three of these, and essentially boundless other works. It's a library of Babel.


Your examples really fall outside the scope of the premise.


Forgive me if I've misunderstood - the premise was that, out of the infinite set of data that hashes to a particular hash/checksum, there is a unique data set that is "obviously" the real one? Or at least, that the set is meaningfully bounded? My reply is that this is not the case. There will be infinitely many "plausible" data sets as well.

You could collide every hash in existence merely by making undetectably tiny alterations to Shakespeare.


Not obviously, no. - Just a very high probability. Perhaps that provides noise immunity to some degree. If it does, it is a form of AI.

This admittedly open question presumes a very large fuzzy 'code book' with which it can re-assemble the data. The length of the input in cleartext is valuable metadata that speeds up the search.


You still seem to be missing the point. The problem is that the SHA-1 hash is one of 2^160 possible values, while a 1GB plaintext is one of 2^8000000000 possible values. That means that the hash disambiguates your message down to one of 2^7999999840 values.

Let's go smaller. Let's say our plaintext is a single kilobyte. About half an 80x25 terminal's worth of ASCII. Knowing the SHA-1 hash narrows our search space, yes, but the search space is so absurdly large that it only solves 0.0000... (insert over two thousand zeros here) ..0001% of the problem.

It tells you basically nothing.


No, they're 100% correct.


Nope, not on these scales.

Say you have a 32 byte hash, that's 2^256 possible values, right? Now say your input is 128 bytes, that means that (on average) each 32 byte hash maps to 2^768 values.

"Natural looking" data will occur an astronomical number of times in a search space that large - which is, again, only 96 bytes larger than your hash.


Only if you count all natural looking data of which there is way more than you could possibly imagine.

It is of course likely that humanity hasn't yet created more than 2^100 (~10^30) files, so in theory given a registry of all files in existence you might be able to identify it by its hash. However while this is simple it's definitely not easy.


But this is technically an invalid answer to the problem of file compression. An easy (!) way for data to be losslessly compressed would just be to have a registry of all files in existence, and to refer to each file by its index in the registry. (Like the prisoners who had told each other all their jokes so many times that now they could just quote them by number and everyone would laugh.) Data compression is supposed to be able to deal with novel but likely documents without needing to add them to a global registry.
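
A toy version of that registry "compressor", purely to show where it breaks down (everything here is hypothetical):

    # Known files shrink to an index, but a novel document has no encoding at
    # all until the registry (i.e. the shared dictionary itself) is updated on
    # both sides, which is why this isn't really compression.
    registry = [b"joke #1", b"joke #2", b"complete works of Shakespeare"]

    def compress(data: bytes) -> int:
        return registry.index(data)  # raises ValueError for novel data

    def decompress(token: int) -> bytes:
        return registry[token]

    assert decompress(compress(b"joke #2")) == b"joke #2"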


I can imagine the output of program space. That's pretty big. All possible universes in fact.

It's an issue of probability and bins. While natural data is infinite, there is vastly more unnatural data. At some point you have enough metadata (e.g. natural vs. random) to know that the original data came from Earth to pick out the right needle from the needle stack.

Unless the data is from a completely alien source, we are close enough to that source for our heuristics to be of a manageable size.


I understand your confusion, because it took me a minute to think about it and confirm that this doesn't work. It might be surprising that even a machine with complete information and infinite compute power couldn't pick out the needle; after all, SHA-1 is a 160 bit hash. Are there really 2^160 ≈ 10^48 different texts in English that could reasonably have been written by someone living on Earth (add as much additional metadata as you like)? The answer is yes: there are simply too many needles.

Take your favorite book with 160 or more characters. (Probably a large book, but such a book could certainly exist.) Now, for each character, create a version of the book with their name swapped out for exactly one other name which does not already appear in the book. So if a character is named Alex, in some other edition of the book that character is named Jason.

Now, consider how many editions of the book there are. There are two combinations (for each character) that are exactly alike other than that character's name being swapped out. So there are 2×2 = 4 combinations where two characters' names are swapped. And 2×2×2 = 8 combinations where 3 characters' names are swapped. If you think about how many versions of the book there are in total, it's 2^C, where C is the number of characters whose names can be swapped for only one other name.

In other words, it's guaranteed that you can generate SHA-1 collisions using one book alone, using only entirely reasonable alternative names for its characters.

Too many needles.


Taking a book and randomly swapping words means you are introducing noise. This is high-entropy and really it's outside the scope of what I'm comfortable speculating on.

Image data is much simpler to think about in this paradigm, but the obvious AI applications of natural language are of course fascinating.


It's somewhat irrelevant how much unnatural data there is. The point is that there's way more plausible data (or sentences, if we're restricting ourselves to natural language) than there are possible 160, 256, or 512 bit hashes.

Unless of course you enumerate all of human communication, but like I said that doesn't count as 'easy'.


I'm not trying to claim this is practical with current technology.


That's because natural data has low entropy

But let's say every paragraph only offers 1 bit of entropy. Then a 160 bit hash gives you fuzzy accuracy up to 160 paragraphs. After that you'll have to extend the hash with hints to guide which sequence of paragraphs you're looking for, & hints for where the typos are

ofc, 100x compression of English text doesn't require this amount of compute to decompress: https://en.wikipedia.org/wiki/Hutter_Prize

It's also impractical since most compression use cases want to put the work in compression, & have decompression be straight forward

edit: 100x was misreading, it's currently 8.6x http://prize.hutter1.net/#prev


This is very abstract, but I believe that since both input and output are very close in program space compared to the total size of PS, the heuristics to map between them are going to be of a manageable size. This is based on an intuition about a single source for all activity in this universe.


"Paragraph" is highly optimistic. IIRC, each English character has about 1 bit of entropy.


Shannon estimated 0.6 to 1.3: https://cs.fit.edu/~mmahoney/dissertation/entropy1.html

For the sake of argument I figured I'd be highly optimistic. The linked prize shows practical evidence of 1GB of Wikipedia being compressed below 1 bit per character (& that's with a limit of 50 hours cpu time, 10GB memory, & 100GB disk)


It's not, and that's why decryption of unknown ciphertext is even harder


This is what made it a funny thought experiment to me. I was doing some space related stuff a while back, and when you're dealing with something as small as 100mb, and you add the constraint of up to a light-year or more of latency, which makes error correction immensely costly, having a quantum processor receiver find the collision of a unique hash could be faster than transmitting the data with errors.

There's probably a distance where the latency in light years is longer than it takes for a quantum processor with some shortcuts to exhaust the search space for the collision on the hash - and potential hashing algorithms that contain information and hints about a hyperplane that generates the original data over the field of possibilities. To a quantum computer with sufficient qubits at a long enough latency distance away, the 32-bit hash of a 2048bit RSA key may "arrive" faster than transmitting the whole key.

It feels like a fundamental misunderstanding of something in communication theory, but this was the thinking that prompted it.


Using the pigeonhole principle, consider how many 100 megabit values must, on average, hash to the same 32 bit value. The answer is a huge number (I could be wrong but it seems to me it might be 2 ^ (100,000,000 - 32)?). This is a bit of a paradox in cryptographic hashing, where finding a collision is difficult, but the number of collisions is enormous.


Yeah, you're fundamentally missing what I wrote, as well as pigeonhole principle and the basics of information theory.

Let's say you replace the quantum computer with an oracle. You have an infinitely long dictionary, and you can turn to any page and see your hash, as well as every possible input that could form that hash. There's billions and billions of options that all match your hash - how could you pick which one is definitely the "right" one?

You're no closer to knowing what the transmitted message was, because of the astronomical number of inputs that you now have to sift through.

[ Technically, you're (length of hash in bits) closer. Say you have 10 bits of hash, or 1024 values. There are (on average) 2 11-bit inputs that create that hash, 4 12-bit inputs, 8 13-bit inputs, and so on, following the powers of 2. At input of N=30 bits, you have 2^20 =~ 1 million possible inputs with that exact same hash. A kilobit? 2^990, or (in binary) one followed by 990 zeroes. ]


The counterintuitive part is that there's a massive number of possible collisions for a given plaintext in a cryptographic hash, just spread over such a large field and growing with the input size, so that it's still very unlikely you will ever find one. I haven't implemented a secure hashing algorithm from scratch, so this aspect would have come out in the exercise.

Even working with security and crypto, the presumption of a cryptographic hash from a standard algorithm is that it is universally unique for the data it hashes - barring someone finding collisions, and then it gets deprecated. Most people deal with them at a higher level of abstraction.

Hash collisions have been used in a few high profile attacks in the last decade, most notably when some SSL certificates were still using MD5, and the most recent smart contract issue where the contract was only validating the last 4 bytes of a cryptographic hash and someone produced a key that exploited that with a collision. Unfortunately the people who don't make those mistakes are both very rare, and often dynamic to work with.

The idea that there are many, many potential collisions for a given input to a cryptographic hash doesn't come up much unless someone manages to find single a collision in one. I'd even posit nobody outside academic cryptography circles is including sha256 collisions in their threat scenarios right now.

However, the absurd/imaginary/hypothetical/counterfactual I was raising is that a future quantum computer can produce all outputs of that length in a reasonable amount of time. The question was more about what was sufficient to reconstruct the data, and it was funny to know something was specifically wrong but not know why.

However, I appreciate the time taken to illustrate it.


Fortunately, Fossil can now use SHA3-256 as well as SHA-1 (even in the same repository if necessary, perhaps due to upgrading an older repository), and I think it also has a hard mode to detect if SHA-1 collisions seem likely.

(I think git doesn't allow a repository to have multiple kind of hashes; from what I understand, only a single algorithm must be used.)

Since SHA-1 and SHA3-256 have different hash lengths, you can tell which one it is. It doesn't work so well if the length is the same; one way to fix it would be to use a multicodec prefix. (My own implementation (currently incomplete) internally uses the multicodec numbers to identify the hash algorithms, but these multicodec numbers are never stored in the repository; instead, they are only used as the argument for the function that computes the hash, which can also be used in other programs.)


Unfortunately Fossil uses a fast SHA hash for the password hashes of user accounts in the database, rather than a key derivation function, which is disappointing.


My fossil knowledge might be a fossil itself, but isn’t the database used for everything?

If someone gets hold of the hashes, they already have everything. So, what's the threat that is enabled by poor pw hashing?

Anyways, passwords should always be hashed with good password hashing functions (not all kdf are good for that) even if it is not strictly necessary. Just in case.


My idea is a different hash construction, a 2D construction, which has an infinite internal state and infinite output length. Each row and each column also has a sequence number input, and there are two mixing functions (each of which has a finite input and an output of equal size to the input); part of the result is propagated horizontally and part of it vertically, so each cell has two inputs, and the input includes both the message block and the sequence number, which overwrites a part of the internal state (the rest of which has been mixed with the other state). Each output block (the output blocks are vertical and the message blocks are horizontal, which is why it is slow) is then truncated and concatenated to produce the final output, after padding is added at the end of the message (including the length of the message) to mix it up even more. If the message length is m and the hash length is h, then it requires O(h) space and O(mh) time.

If ChaCha20 is in use, then it is easy to see (by the kind of calculations made by ChaCha20) that if the input is zero then the output will also be zero; therefore, the initial sequence number should be 1 and not 0. You can also compute a secondary hash, which must be misaligned with the first one, so that if a collision is found in one block it will not propagate to the other blocks (unless a collision can be found within two consecutive blocks, in which case you would hope that the extra dimension, since it is a 2D construction, can avoid the collision).


Reading your post I’m sure you know far more about information theory than I do, so forgive me if this is a stupid question, but…

1. How is infinite internal state and output size possible? There has to be some actual limit for internal state, at least, right? Or else you’ll just run out of memory?

2. Wouldn’t a larger output size risk leaking data? The chance of collision becomes lower, but it also seems to toy with what I understand about why we use cryptographic hashes without ridiculous output size.

3A. What happens if you want a 1 MB hash of a 1 KB file?

3B. How predictable does a super long hash become?

At what point is it no longer a one-way hash?


1. Because you will probably be truncating the output to a short finite hash, and because each block of internal state only leads in one direction (it affects all later ("up") blocks of internal state for all later ("right") message blocks, but not previous ("left" and "down")), you can optimize it by only implementing the part that you need. So, yes, an actual implementation will be limited, but if you have enough memory and enough time then the limit can be as high as you want to be. (Maybe a diagram might explain it better; I am not sure that this explanation is any good.) (SHAKE also has infinite output (but finite internal state), and when using SHAKE also you would truncate the output to a finite size.)

2. A larger output size might risk leaking data, although I would think it would be difficult.

3A. Like I described, it then needs O(1MB) space and O(1GB) time, so it will be slow. However, I do not expect you should need a hash that long.

3B. I don't know; probably about as predictable as any random number generator, if the hash is designed correctly. (I only describe a construction, and the hash algorithm design involves more than that.)


Git was created 16 years ago. The impending breakage of SHA-1 was known even at that time, just like how MD5 had been broken before it.

I'm honestly still shocked that updating the hashing algorithm wasn't built into Git from day one. I really wonder why. Did people think this wouldn't happen? Were they so in love with the performance of C/C++ being able to pass around 20 byte hashes on the stack without worrying about a more complicated structure (eg a collection of variable length hashes)?

It just seems like such a massive and foreseeable oversight that's going to cause pain for some time to come for really no reason at all.


SHA-1 will still work fine for the purpose of git. It is just no longer considered secure for cryptographic operations, such as digital signatures; that doesn't mean you can't use it for other purposes, like git does. Using it is still fine and will remain fine.

Making the hashing algorithm exchangeable would have introduced complexity in software that is already complex, and made it less efficient (one of the reasons git was created was speed for large projects like the Linux kernel), for no real purpose. If you want to change the algorithm, given that you will break compatibility with all existing repositories, tools, and clients, you might as well fork git, because you are changing too much.

I don't see the point of migrating to SHA-256. Collisions are still very unlikely to happen accidentally, and sure, if you want to do damage to a repository you can create one on purpose, but you could just as well commit some hooks that contain malware, or alter the git history in any way you want, so what's the point?


Collisions definitely do matter for git security: many people pin explicit git hashes for their dependencies, and thus they can be tricked into running malicious forks. This requires placing a chosen commit in the git repo (so unlike a second-preimage break, it does not mean that you could attack repos you have no control over), but that's not an unrealistic threat model overall.


What is the threat model though?

I don't think it's possible to create a collision that's also executable code which adds a security hole or anything.

So what exactly would they achieve with the collision?

And how do they push these gigantic files that have the hash collisions to a server? The upload time would be significant.


1) People systematically underestimate the possibility of creating collisions that still do something "interesting", like being polyglots (files that can be interpreted in multiple formats, executable or otherwise). See PoC||GTFO, specifically anything by Ange Albertini, for examples; grep https://github.com/angea/pocorgtfo/blob/master/README.md for "MD5". I specifically recommend this writeup: https://github.com/angea/pocorgtfo/blob/master/writeups/19/R... .

1bis) You can use an existing collision to create new collisions. People seem to think you need to generate all the work again from scratch; this is not true. See PoC||GTFO for proof by example.

1cis) The files do not need to be gigantic. See PoC||GTFO for proof by example.

2) You can do the collision in advance, and publish the malicious version later. What it accomplishes is that the concept of "this Git hash unambiguously specifies a revision" no longer works, and one of them can be malicious.

3) The standard should be "obviously safe beyond a reasonable doubt", not "not obviously unsafe to a non-expert". By the latter standard, pretty much any random encryption construction is fine. (The examples I gave use MD5, not SHA-1, but that's a matter of degrees.)

4) SHA-256 was published years before git first was.


What do you mean by the `bis` and `cis` suffixes to your entry labels?


It's just a subdivision; it might as well have said 1a, 1b... -- but "bis" and "cis/tris" (and possibly tetrakis) tend to emphasize that they're addenda, not equal points.


It should normally be "bis" and "ter".

The Latin for "once, twice, thrice, four times, five times" is "semel, bis, ter, quater, quinquies". ("Bis" and "ter" are the only really short ones.)

It's moderately common in European standards and bureaucracy to use "bis" and "ter" for "version/revision 2" and "version/revision 3", respectively. For example https://en.wikipedia.org/wiki/List_of_ITU-T_V-series_recomme...


Huh, good point; I wonder if my mind mixed up the org chem with the numbers (likely) or if that's some kind of unique Belgian affectation.


Ah, gotcha. I thought I recognized the prefixes from Organic, but I don't think I've seen it used like this here. Neat!


The possible attack is to prepare 2 versions of a commit, both resulting in the same commit id. Then later on, after the project is successful/etc, swap out the commit with the second version, while keeping the other commits intact.

Granted, the file that the commit touches would need to be not touched in other commits. That's not out of the question in a typical software project - maybe a file in the utils folder which is only written once and never changed?

> I don't think it's possible to create a collision that's also executable code

You can include an unreadable binary blob in the commit. Tweak the blob to find the collision while keeping the code the way the attack requires.


> swap out the commit

What's the method for doing this? Does a "git push" replace objects with identical hashes on the remote? Or a "git pull" replace identical hashes on the local repo?

I suspect finding a hash collision is only the first difficult part of actually pulling this off. You may need direct write access to the file system of the target. And even then, everyone else that has already fetched the repo may not be impacted. At which point collisions become moot, because you can rewrite the entire git history however you want.


History teaches us: if a system isn't hardened against something, we can assume it's possible. If a Git server isn't specifically hardened against that, it might still be tricked into updating the file by an adversarial client. Or an attacker can temporarily add hooks that will replace the file on the server. Or an integration testing system might have write access to the server repo.


> Granted, the file that the commit touches would need to be not touched in other commits.

That's not how git works. The commit contains the entire tree. You could prepare two separate repositories such that `git checkout deadbeef0001deadbeef` in one checks out the linux kernel and in the other checks out ILOVEYOU.exe.


You're right. A commit id points to a commit object, which points to a tree object and subsequently to individual blob objects. That makes the attack considerably harder: you need to find a collision between two blob objects, both of which are executable and don't look suspicious.


That one is nasty.


The files don't need to be gigantic. You could, for example, have a binary config file which in one colliding version encodes a potentially dangerous debugging setting, e.g., "allow_unauthenticated_rpc = false", but in the other has it set to "true".


A denial of service of sorts? (Something broken and unusable is delivered instead, as distinct from something usable but maliciously so.)

I agree that the chances of ever getting a second pre-image that not only makes sense, but does so in some malicious way may as well be zero, surely?


The part about the 2nd pre-image chances being effectively zero may be true, but the nasty cases described upthread don't need a 2nd pre-image. (You could do a lot worse with one, granted!)


In addition to the attack described in a sibling comment, when a hashing algorithm has been broken in some way, it is safe to assume that other, more advanced collision attacks will soon be discovered.


> It is just no longer considered secure for cryptographic operations, such as digital signatures; that doesn't mean that you can't use it for other purposes, like git does.

Git effectively uses its hashes as a component of a digital signature scheme, in the form of signed tags and signed commits. The GPG signature covers the commit object, which identifies the content (tree) by its hash.
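
To make that concrete, here is a rough sketch (not Git's actual code; the tree/parent ids are made-up placeholders) of how an object name is computed and what the signed bytes contain:

    import hashlib

    # A commit object body, roughly as Git stores it; the tree/parent ids
    # below are placeholder 40-hex strings.
    body = (
        b"tree aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\n"
        b"parent bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb\n"
        b"author A U Thor <a@example.com> 1600000000 +0000\n"
        b"committer A U Thor <a@example.com> 1600000000 +0000\n"
        b"\n"
        b"commit message\n"
    )
    # Git object ids are SHA-1 over "<type> <size>\0<content>".
    commit_id = hashlib.sha1(b"commit %d\x00" % len(body) + body).hexdigest()
    print(commit_id)
    # A signature on this commit covers text like `body`, which names the
    # tree and parent only by their SHA-1 ids, so its guarantee is capped
    # by SHA-1's collision resistance.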


> SHA-1 will still work fine for the purpose of git. It is just no longer considered secure for cryptographic operations, such as digital signatures; that doesn't mean that you can't use it for other purposes, like git does. Using it that way is still fine and will always be fine.

This suggests git does not rely on its hash for security properties, which seems false? What is the purpose of pinning revisions or signing git tags?


Honestly I find these rationalizations around the use of SHA-1 annoying and counter-productive. The rule is simple: don't use SHA-1. If you already use SHA-1, migrate away from it. You know that plenty of software out there that interfaces with git expects the commit hash to be unique. Is it a security risk? Maybe, maybe not. I don't care to find out.

It doesn't matter until it starts mattering. If the Git devs had done the right thing over a decade ago we wouldn't be having this discussion. The longer they wait the more painful the migration will be.

SHA-2 was published in 2001, git was released in 2005 and now we're in 2021 and we're having this discussion again. The first concrete attack was released in "early 2005" according to wikipedia, so there's really no excuse.

Just do it, make a major change where you replace SHA-1 with SHA-256 and call it a day. It's going to be painful for a few months and then we'll move on.

For me these discussions demonstrate the immaturity of software engineering. In other industries regulators would've banned the use of SHA-1 and you couldn't get certified if you used it.

Do electronic engineers regularly try to argue "well ok RoHS says we can't have lead solder in this product but frankly for this one it's fine the casing is completely waterproof and there are no risks for the customer"? No, they don't. If the spec says no lead, then either it's no lead or you can't sell your product. End of story.

SHA-1 is the lead solder of software engineering. Only acceptable for military use.


> You know that plenty of software out there that interfaces with git expects the commit hash to be unique

You also know that plenty of software out there that interfaces with git has hardcoded assumptions (like, for example, the assumption that the commit hash will be exactly 40 characters long). Some tools parse the output of git log and other commit-bearing commands to make decisions. Will changing git to SHA-256 create new unforeseen security risks due to breakage of those tools (for example, by only grabbing the first 40 characters of a SHA-256 digest instead of all 64 or by just outright crashing)? Maybe, maybe not.
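
A toy illustration of that failure mode (hypothetical snippet, not any particular tool): a pattern that assumes exactly 40 hex characters will silently take the first 40 characters of a 64-character SHA-256 id.

    import re

    line = "commit " + "ab" * 32        # a 64-hex-char SHA-256-style id
    m = re.search(r"[0-9a-f]{40}", line)
    print(m.group(0))                   # only the first 40 hex chars: silently wrong
    print(len(m.group(0)))              # 40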

IMO, the git integration breakage that would accompany migrating to SHA-256 would create more security risks than staying with SHA-1.

At this point it's almost like you want a new tool/new command. `git` vs. `git2`. New projects use git2, existing projects use git (or something like that). Otherwise confusion and backwards-compatibility breakage will abound.


> New projects use git2, existing projects use git (or something like that). Otherwise confusion and backwards-compatibility breakage will abound.

Like uh, python2 and python3? ;-p


The current plan is for Git to essentially use SHA-1 hashes as aliases for SHA-256, using a lookup table. This would mean that any given SHA-1 hash would, in a particular repository, map to one and only one SHA-256 hash, which is then used to retrieve the object. Eventually the SHA-1 hashes would be deprecated (via a per-repository mode switch, so every Git user or host would migrate according to their preferences).

https://git-scm.com/docs/hash-function-transition/
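
Conceptually it's something like the toy sketch below (just the idea, not Git's actual data structures or names): objects live under their SHA-256 name, and legacy SHA-1 names are resolved through a per-repository translation table first.

    # Toy sketch of the alias idea; the keys are placeholder strings.
    objects = {"sha256-of-blob": b"blob contents"}       # sha256 name -> data
    sha1_to_sha256 = {"sha1-of-blob": "sha256-of-blob"}  # per-repo alias table

    def lookup(name: str) -> bytes:
        if name in sha1_to_sha256:        # legacy SHA-1 name: translate first
            name = sha1_to_sha256[name]
        return objects[name]

    print(lookup("sha1-of-blob"))         # resolves via the alias table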


The difference is that this breakage would be immediately visible. All code that mishandles these hashes would immediately break. With collisions it can remain undetected for a long time, potentially until somebody smart with bad intentions finds a way to break your system in some creative way.

And again, if we had done this when it should have been done, i.e. pre-2010, we wouldn't be having this discussion. The longer we wait the more painful the migration will be whenever somebody manages to actually bruteforce collisions for git commits. We're not there. Yet.


If the lead solder is actually functional in some way then they may have to attempt to find an exception under RoHS. They could attempt to define the usage so as to have an application not covered by RoHS for example.

This analogy is kind of confusing because RoHS is an imposed standard. A user of SHA1 is expected to make their own decision about appropriate usage. They might reasonably continue the usage of SHA1 for their specific use case. The real world is full of such compromises.


> SHA-1 will still work fine for the purpose of git.

So why are they changing it? That's pretty strong evidence it's not fine. I found this Stackoverflow question, "Why does Git use a cryptographic hash function?" [1], which points to [2]. Note: pretty much every DVCS uses a cryptographic hash function. That doesn't seem like an accident.

Reading through some of these old posts and threads it seems like performance was the main factor combined with the expectation that SHA1 collisions just wouldn't be an issue. The latter I find to be surprisingly naive.

[1]: https://stackoverflow.com/questions/28792784/why-does-git-us...

[2]: https://ericsink.com/vcbe/html/cryptographic_hashes.html


And, to that point, I'm not really convinced that a cryptographic hashing algorithm is really a great choice for git.

It is nice that it checks off the boxes for even distribution of hashes, but there's a bunch of other hashing algorithms that can do that without the performance penalty inherent in crypto hashes. For example, FNV seems like a good fit for something like git.
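
For reference, a minimal sketch of 64-bit FNV-1a (a fast, non-cryptographic hash) with the standard offset basis and prime:

    def fnv1a_64(data: bytes) -> int:
        # Standard 64-bit FNV-1a parameters.
        h = 0xcbf29ce484222325                            # offset basis
        for byte in data:
            h ^= byte
            h = (h * 0x100000001b3) & 0xFFFFFFFFFFFFFFFF  # FNV prime, mod 2^64
        return h

    print(hex(fnv1a_64(b"hello")))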


Is hashing a significant bottleneck in any git deployment? I'd expect that the diffing would be vastly more expensive for instance.

Besides, don't many modern CPUs support things like SHA-256 in hardware?


FNV is really cool and has a reasonable quality given its simplicity. But it does have issues (sticky state, and I think the avalanche characteristics also weren't great) that are solved by only slightly more complex hashing algorithms.


> SHA-1 will still work fine for the purpose of git.

So why are so many corporations, individuals, and orgs working hard to protect against it?

Hint: because it's not actually fine.


For someone to be able to break your repo using sha1 collisions, they need to be able to commit to it.

If you don't trust someone, don't let them commit to your repo.

> Were they so in love with the performance of C/C++ being able to pass around 20 byte hashes on the stack without worrying about a more complicated structure (eg a collection of variable length hashes)?

The hashes show up everywhere. They're how every object in git is identified. They make it onto disk. They're not just commit ids.

Changing hashes changes the disk format.


They might not need to commit to your repo to break it using SHA-1 collisions.

They might, for example, compute a collision, commit the malicious version to a private GitHub repository, and then send you a patch with the non-malicious version via email, or by sending you a merge request on GitLab. When you accept and merge in their changes and then push to GitHub, perhaps GitHub will recognize the hash of one of your blobs as a hash that it already knows the blob for (from the private commit), and so it will ignore whatever data you send it for that blob.

Maybe that particular scenario won't work — I haven't tried it — but there are hundreds of possible angles of attack. Fundamentally the security of Git is based on the idea that hashes don't collide.


The correctness of git is based on hashes not colliding within a repo.

This just means GitHub may need to give up on a cost-saving measure like dedup between forks. It's a GitHub issue, not a git issue.


Agreed. There are lots of other scenarios, though.


Yeah, I don't think any of these proposed attacks work without write access to the repo, at which point the game is pretty much over already.


There are plenty of workflows where untrusted people have their proposed changes audited and then written to a git repo.


How do they get through the auditing?

If I'm looking at a PR which contains a hash collision with code that does a `sudo rm -rf --no-preserve-root /` and I don't accept the malicious code, the hash collision becomes irrelevant.


Git is designed around content-addressed storage, and while you can cater for hash migrations, it gets pretty messy design-wise. Git's core data structures were also designed really quickly to patch a very urgent need of the kernel hackers. I doubt it has anything to do with being in love with passing 20-byte strings on the stack. The fine git folks have produced a rather detailed document about the hash migration (https://git-scm.com/docs/hash-function-transition/) and it is not particularly simple to do this.


Git was not intended (AFAIK) to be cryptographically secure. Being unsuitable for crypto != being unsuitable for other uses.


Surely signed tags and signed commits in git are supposed to be cryptographically secure? [0]

Doesn't the security of those signatures depend on the security of the SHA-1 hashes that are being signed?

[0] https://git-scm.com/book/en/v2/Git-Tools-Signing-Your-Work


No - after all, the most common use of digital signatures is to sign documents that can be easily tampered with. All the security is in the signature, not the content being signed.


This makes no sense. Sure: you can generate an X.509 certificate that says whatever you want, but the point is that you can validate the signature and see that it's a forgery. In the case of a hash-addressed system like git, the problem is that the signature is over a collision, so it no longer certifies the thing it's supposed to certify. Git uses the hash as a shorthand for a revision, including its entire history--so yes, it is using the hash that way.

By that logic, would MD5 be fine? MD4? CRC32?


A digital signature usually signs (~encrypts) a hash of the content. So asking what exactly signing a commit or tag entails is a very valid question. I would expect that signing a tag is only as strong as SHA-1, since a tag is essentially a label for a commit hash. For signing commits I have no clue, but would be quite interested as well.


Linus posted about it on Google+ in 2017. I haven't re-read it yet, but I remember one of the ideas floating around HN at the time was to just have two hashes per commit. That is, two insecure hashes may be secure together for git's purposes. Even though we can generate collisions for MD5 and SHA-1, it would be much more difficult to have a file generate an arbitrary collision for both at the same time.

Here is a link to Linus's post on the internet archive

https://web.archive.org/web/20170717192607/https://plus.goog...

And here are some hn posts around the same time

https://news.ycombinator.com/item?id=13733481

https://news.ycombinator.com/item?id=13719368


> That is, two insecure hashes may be secure together for git's purposes.

Generally that's not true for Merkle–Damgård (e.g. sha1, sha2) hashes - afaik the difficulty of finding a collision in the more difficult hash is the same (up to big-oh) as finding it in both hashes.


I don't know about git, but fossil already does that, if the file that the collision is generated for is not the manifest file. The manifest file and all other files are identified by their SHA-1 hash (in newer versions SHA3-256 is also supported), but the manifest file also has an R card, which is the MD5 hash of all of the other files. This means that if there is a collision, it will be detected and the file will be rejected.


> I'm honestly still shocked that updating the hashing algorithm wasn't built into Git from day one. I really wonder why.

The first basically functional version of Git was created in less than a month by Linus Torvalds, immediately after Bitkeeper revoked the license/permission for linux kernel developers to use that proprietary source control tool for free. Linus took a look at Monotone and Mercurial IIRC, but they were not nearly fast enough at that time, so he created Git to just do what the linux kernel development community needed.

This isn't the first time Linus did something that academics said was a terrible mistake, and then his creation took over the freaking world, because it works so freaking well. (And it still does today, 16 years later. Maybe tomorrow it won't, I dunno. But it probably will.)


> Linus took a look at Monotone and Mercurial IIRC, but they were not nearly fast enough at that time

That's incorrect for Mercurial: it was started by another kernel developer (Matt Mackall) at the same time, for the same reasons (but they ended up with different trade-offs, which makes Mercurial still very interesting and relevant in a post-GitHub world).


"We note that classical collisions and chosen-prefix collisions do not threaten all usages of SHA-1."


> going to cause pain

What kind of attack on a git repo are you worried about?


Subrepo hash substitution could enable a supply-chain attack


It is a law of nature that all successful products find themselves not accounting for some predictable problems.


FYI signed git commits and tags weren't a thing back then.


> I'm honestly still shocked that updating the hashing algorithm wasn't built into Git from day one.

The name literally means stupid person.


I have a question.

Is it not possible to create a good cryptographic hashing algorithm that results in 128bit or 160bit hash?

It seems all the modern, secure ones are 256 bits or larger. Is that because 160 bits is too easy to compute collisions for any algorithm, or are we just not able to fully realize the entropy that can be recorded into 160 (or even 128) bits?


> Is it not possible to create a good cryptographic hashing algorithm that results in 128bit or 160bit hash?

No, it is provably not possible to do that. An N-bit hash function has at most N/2 bits of collision resistance (the birthday bound). So a 128-bit hash function (like MD5) offers at most 64 bits of security against collisions by a classical attacker, even if it weren't broken in other ways (as MD5 is). A 160-bit hash function like SHA-1 would have at most 80 bits of security against collision attacks on a classical computer.

Both MD5 and SHA-1 are broken in other ways, but you could for example use SHA-2 or SHA-3 in truncated mode if you wanted a "secure" 128-bit or 160-bit hash output. Indeed there are standard operating modes for doing so, meant for when you need a drop-in replacement for MD5 or SHA-1.

But fundamentally, 64-bit or 80-bit security is too low for any except some highly specific use cases. And even then the extra bits of a stronger hash will rarely kill you, so why bother? Just use SHA-2 or SHA-3 and don't worry about it.

In cases where you can demonstrate that you only care about preimage resistance and not collision resistance, a 128-bit hash would be sufficient. However, collision attacks often crop up in unexpected places, or when your protocol is used in ways you didn't design for. Better to just double the hash size and not worry about it.
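
For illustration, a sketch in Python of getting a 160-bit output from a modern hash; plain truncation of the digest is shown only to make the point (NIST also standardizes truncated variants such as SHA-224 and SHA-512/256):

    import hashlib

    data = b"some content"
    # 160-bit output by truncating SHA-256 (illustration only).
    digest_160 = hashlib.sha256(data).digest()[:20]
    print(digest_160.hex(), len(digest_160) * 8, "bits")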


(Jan 7, 2020)


The chosen-prefix collision has a complexity of 2^63.4 (2020).


Looks like they needed 1.5 million GPU hours to find a collision with a random hash.


I’ve wondered for a while if it is viable to use multiple hashes, or nested hashes, to mitigate these vulnerabilities, for example:

    message + hash1(message) + hash2(message)
    message + hash1(hash1(message))

To my lay understanding, it would provide multiple chained validation steps, but I’m curious if there are any obvious flaws with this model.
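
A minimal Python sketch of the two constructions above, with sha256/sha512 standing in for hash1/hash2 (just example choices):

    import hashlib

    msg = b"message"

    # Variant 1: message, plus two different hashes of it, concatenated.
    tagged_1 = msg + hashlib.sha256(msg).digest() + hashlib.sha512(msg).digest()

    # Variant 2: message, plus a nested (double) hash.
    tagged_2 = msg + hashlib.sha256(hashlib.sha256(msg).digest()).digest()

    print(len(tagged_1), len(tagged_2))  # 7 + 32 + 64, 7 + 32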


This is a common lay-person approach to "strengthening" cryptography, but it is not as effective as using a better algorithm instead.

For example, SHA256 and SHA512 require approximately the same number of "cycles per byte" when using modern CPU instruction sets. So 2× SHA256 is 2× slower than 1× SHA512. On some processor models, 1× SHA512 is faster than 1× SHA256!

Of course, things aren't always this simple, but you get the idea.

Similarly, for some classes of attacks, it's only 2× as much work to crack 2× SHA256 as it would take to crack 1× SHA256. However, for these attacks it is (2^256)× more work to crack SHA512, which takes the difficulty increase from "slightly more" to "absolutely impossible in this physical universe".


If the hash function does its job at mixing input, this is no different than applying just the final hash.

(With the technical caveat that doing at least two serial hashes does improve the properties of the hash function by defending against length extension attacks. That's why things like HMAC do multiple chained hashes. But it does zilch for collision or preimage resistance.)
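
For instance, HMAC is exactly such a chained construction, H(K ^ opad || H(K ^ ipad || m)); with Python's standard library:

    import hashlib
    import hmac

    key = b"secret key"
    msg = b"message"
    # HMAC-SHA256: two chained hash invocations over keyed pads.
    tag = hmac.new(key, msg, hashlib.sha256).hexdigest()
    print(tag)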


Some past threads:

SHA-1 collisions now cost $45k [pdf] - https://news.ycombinator.com/item?id=23350223 - May 2020 (62 comments)

The first chosen-prefix collision for SHA-1 - https://news.ycombinator.com/item?id=21979333 - Jan 2020 (352 comments)

Abusing SHA-1 collisions for Chromium updates - https://news.ycombinator.com/item?id=20114809 - June 2019 (36 comments)

A SHA-1 chosen-prefix collision attack - https://news.ycombinator.com/item?id=19907127 - May 2019 (71 comments)

From Collisions to Chosen-Prefix Collisions Application to Full SHA-1 [pdf] - https://news.ycombinator.com/item?id=19878917 - May 2019 (18 comments)

SHA-1 Collision Detection on GitHub.com - https://news.ycombinator.com/item?id=13917990 - March 2017 (90 comments)

Linus' reply on Git and SHA-1 collision - https://news.ycombinator.com/item?id=13719368 - Feb 2017 (262 comments)

When Will We See Collisions for SHA-1? (2012) - https://news.ycombinator.com/item?id=13719079 - Feb 2017 (1 comment)

Announcing the first SHA-1 collision - https://news.ycombinator.com/item?id=13713480 - Feb 2017 (485 comments)

How would Git handle a SHA-1 collision on a blob? - https://news.ycombinator.com/item?id=13547348 - Feb 2017 (5 comments)

Why it’s harder to forge a SHA-1 certificate than to find a SHA-1 collision - https://news.ycombinator.com/item?id=10778773 - Dec 2015 (43 comments)

The Cost of Creating Collisions Using SHA-1 - https://news.ycombinator.com/item?id=8629906 - Nov 2014 (9 comments)

When Will We See Collisions for SHA-1? - https://news.ycombinator.com/item?id=4618069 - Oct 2012 (51 comments)


> Our work shows that SHA-1 is now fully and practically broken for use in digital signatures

Yet 2nd preimage resistance is intact. Zzz.


Old collision (2020). ;-)


Obligatory link: https://shattered.io/


That was an earlier break; this article is about https://sha-mbles.github.io/


The article states that the collision found is something different from the one found by Google et al.

It's a chosen-prefix attack.


Of course it has its own website ;)


that's it, it's over, git is finished



