Git archive checksums may change

vtbassmatt · on Jan 30, 2023

Hey folks. I'm the product manager for Git at GitHub. We're sorry for the breakage, we're reverting the change, and we'll communicate better about such changes in the future (including timelines).

Also posted here: https://github.com/bazel-contrib/SIG-rules-authors/issues/11...

rsc · on Jan 30, 2023

Thanks for the quick rollback.

I want to encourage you to think about locking in the current archive details, at least for archives that have already been served. Verifying that downloaded archives have the expected checksum is a critical best practice for software supply chain security. Training people to ignore checksum changes is training them to ignore attacks.

GitHub is a strong leader in other parts of supply chain security, and it can lead here too. Once GitHub has served an archive with a given checksum, it should guarantee that the archive has that checksum forever.

matthewcroughan · on Jan 31, 2023

I've just had a thought. When GitHub do update the hashing for better compression, everyone relying on the tar hash will update their hashes. This is the ultimate opportunity to change the tar contents, effect the supply chain, introduce vulnerabilities, and have everyone trust you. Something like Nix which computes the NAR Hash (the result of the tar contents) will not be effected by this, since it only cares about the content. I think this is much better than worrying about an unlikely tar vulnerability. In a system that only trusts the tar hashes, the original source is not able to take advantage of better compression over time, without massive risk of supply chain attack. If you think you can hand me a tarball that can run arbitrary code, for any version of tar that has ever existed, please give it to me so I can experiment with exploits, and I'll buy you a drink of your choice at FOSDEM if you're there!

rsc · on Jan 31, 2023

You're not wrong, but you're also not being realistic.

Nix is not the only system that takes this approach. The Go modules "directory hash" is roughly equivalent, although we defined it in terms of somewhat more standard tooling: it is the output of

    sha256sum $(find . -type f | sort) | sha256sum

I am not here advocating that everyone switch to this basic directory hash either, because it's not a solution to the more general problem that many systems are solving, namely validating _any_ downloaded file, not just file archives.

There are widespread, standard tools to run a SHA256 over a downloaded file, and those tools work on _any_ downloaded file. Essentially every programming language ships with or has easily accessible libraries to do the same. In contrast, there are not widespread, standard tools or libraries for the "NAR Hash" nor the Go "directory hash". Even if there were, such tools would need to be able to parse every kind of file that people might be downloading as part of a build, not just tar files.

It's a good solution in limited cases such as Nix and Go modules, but it's not the right end-to-end solution for all cases.

matthewcroughan · on Jan 31, 2023

When you say it is not the right end-to-end solution for all cases, I am wondering what case you have in mind that a NAR Hash would not be suitable for.

If you adopt Nix fully, the .narinfo file that cache.nixos.org (a Nix substituted) serves that is signed, contains both the NAR Hash and the hash of the NAR Archive File as well. Additionally, NAR packs and unpacks deterministically, and you can read the implementation in the Nix thesis.

A .narinfo file looks like this:

```

StorePath: /nix/store/xvp2wr01fi27j0ycxqmdg6q4frsiv82s-libnotify-0.8.1 URL: nar/0a4jjqxwjcnnaia76l64drq9bjw7jczgmrirzshgp0bnw621f1c9.nar.xz Compression: xz FileHash: sha256:0a4jjqxwjcnnaia76l64drq9bjw7jczgmrirzshgp0bnw621f1c9 FileSize: 24324 NarHash: sha256:02bh3qjxgph5g9di3q553k87w4kbc4drmflkfz9knqbp9jip98c5 NarSize: 101776 References: 7ncncvnr864iangwbvbgbanx1r6wpf79-gdk-pixbuf-2.42.10 i4dqcpppyyq5yqcvw95mv5s11yfyy8pf-glib-2.74.3 xvp2wr01fi27j0ycxqmdg6q4frsiv82s-libnotify-0.8.1 yzjgl0h6a3qh1mby405428f16xww37h0-glibc-2.35-224 Deriver: 2vjs6q5j5vqckcwsvmh5lajvx3p7arkj-libnotify-0.8.1.drv Sig: cache.nixos.org-1:IqCAJROaqNx4TthRv9V47/dM7KP4sR+bBWBfL+9xSqQHAezcfczYdJhKj8nl5l+iFnj8O4uTIJMWNOcwVq8+AA==

```

rsc · on Jan 31, 2023

> If you adopt Nix fully, ...

The case where Nix is not adopted fully is the one I have in mind.

matthewcroughan · on Jan 31, 2023

This is the only case then?

rsc · on Jan 31, 2023

My point is about (1) the broader ecosystem of tools that may need to interoperate and have easy access to "SHA256 the whole file" and (2) the fact that not everything is a tar file that the Nix tools can process. So yes, that's the "only" case.

matthewcroughan · on Feb 1, 2023

So what about the IPFS CAR format? https://car.ipfs.io/, it would fulfill a lot of what I expect from NAR too. NAR or CAR, I don't care, I believe the content is what matters, not the container format.

If I have a box with an apple in it, I don't care about the box, I care about the apple inside. If it's not an apple, I don't want to eat it.

bentley · on Jan 30, 2023

I would also appreciate stronger advertising of the ability to turn a Git tag into a GitHub release and upload stable source code files to it. Maybe even a button in the GitHub releases interface to “generate source tarball and attach as stable tarball to this release.”

misnome · on Jan 31, 2023

But this isn’t a great solution, because afterwards there is now three, or four source download links, some of which are stable.

Not to mention, forcing people to use GitHub releases instead of just tags (which excludes every mirror of somewhere else)

mathstuf · on Jan 31, 2023

I agree this would be great. However, it should also stop you from providing useless tarballs (as `/archive/` does today) if:

- you use autoconf (or any other tool(s) that require generating code into the source archive; or - you have submodules (to which `git archive` is completely blind).

Note that `git-archive-all`[1] can help as long as your submodules don't do things like `[attr]custom-attr` in their `.gitattributes` as it is only allowed in the top-level `.gitattributes` file and cannot be added to the tree otherwise.

[1]https://github.com/roehling/git-archive-all

account42 · on Jan 31, 2023

Yeah, it would be nice if you could disable the generated archive links for releases or at least de-emphasize them.

matthewcroughan · on Feb 2, 2023

https://floxdev.com/blog/hash-collision

vtbassmatt · on Jan 30, 2023

We updated our Git version which made this change for the reasons explained. At the time we didn't foresee the impact. We're quickly rolling back the change now, as it's clear we need to look at this more closely to see if we can make the changes in a less disruptive way. Thanks for letting us know.

phphphphp · on Jan 30, 2023

Consumers often mistake hasn’t changed for a commitment to never change: any sufficiently large product will be littered with these kind of implicit commitments made by the product to consumers that nobody has visibility into. You’re unfortunate that we were all relying on this commitment you’ve never made, but the quick reversion is the best we can hope for. People will theorise how this could have been avoided but c’est la vie — easy mistake that you’ve responded well to.

dharmab · on Jan 30, 2023

Hyrum's Law:

With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody.

modderation · on Jan 31, 2023

Incidentally, I'm still waiting for GitHub to implement Spacebar Heating: https://m.xkcd.com/1172/

nickitolas · on Jan 30, 2023

FWIW according to https://github.com/bazel-contrib/SIG-rules-authors/issues/11... a commitment was made, although in an exchange in some support ticket, and not in documentation.

VWWHFSfQ · on Jan 30, 2023

At this point they'll be stuck on old git for all of eternity unless they just roll their own archive/compress step out of band so the old hashes still work. Yikes.

rfoo · on Jan 31, 2023

New git has a flag for keeping the old behavior, so it's not as bad.

groestl · on Jan 30, 2023

They could also brownout the implied contract over a longer timespan.

jeffbee · on Jan 30, 2023

They could use the old behavior for archives in which all the inputs predate the changeover.

johnklos · on Jan 30, 2023

[flagged]

VWWHFSfQ · on Jan 30, 2023

Ironic that "open source packaging systems" rely on proprietary Microsoft hosting and distribution to function.

I think you meant _poorly implemented_ open source packaging systems.

bentley · on Jan 30, 2023

Of the 11,656 packages in OpenBSD’s package repos, 2,984 are built from source originally hosted on GitHub or Sourceforge. That’s a full 25%.

Moralize all you want about where these upstreams should host their software, but why claim that the downstream package manager is “poorly implemented” to fetch source code from those hosts? Your complaint was not technical—you imply the proprietariness of Microsoft servers is the problem (although open source servers like GitLab also have the problem of unstable checksums)—but HTTPS is HTTPS.

VWWHFSfQ · on Jan 30, 2023

I'm fairly certain that Homebrew doesn't function without GitHub. The index (and Cargo's index, and probably others) is hard-coded to be hosted on GitHub.

denom · on Jan 31, 2023

Exactly this.

Tbh, i would expect some developer working on this feature to have an ‘a-ha’ moment: ‘I’m a homebrew user … hey wait a minute’

ilyt · on Jan 31, 2023

> Moralize all you want about where these upstreams should host their software, but why claim that the downstream package manager is “poorly implemented” to fetch source code from those hosts?

Because it should validate checksums of content of the tarball, instead of just the outside blob.

Then:

* you don't care about compression method or implementation * you don't care about archive method or implementation * your system works just as well for "download a tarball" as for "shallow copy the remote repo

chlorion · on Jan 30, 2023

I am not sure how you could not rely on GitHub when packaging code that is hosted on GitHub.

My personal Gentoo ebuilds for example contain a URI variable that points to the GitHub auto-generated archives for projects that use GitHub for hosting. What am I supposed to do in this case?

The only option here is to setup a mirror and have a backup of the data. Packages that are in the official Gentoo repository do get mirrored, but overlays do not, and most people probably don't have the ability, time or money to setup their own mirroring service.

I agree that relying on GitHub sucks to be clear, but I don't think we can blame package managers for fetching code from it when people are hosting their projects there!

pabs3 · on Jan 31, 2023

Software Heritage backs up GitHub, they aren't really designed to be a mirror though.

https://www.softwareheritage.org/

Some GitHub repos also end up on archive.org, but not systematically.

pxc · on Jan 30, 2023

> I think you meant _poorly implemented_ open source packaging systems.

or under-resourced ones. If the upstream source only appears on GitHub, without formal release tarballs, your only options as a downstream packager are literally to get the source from GitHub or host your own mirror of every source tarball you build yourself.

Denvercoder9 · on Jan 30, 2023

Or get the source code using Git, which actually (by design) guarantees that its checksums are stable.

ecnahc515 · on Jan 31, 2023

Downloading a source tarball is significantly cheaper on both sides than git. A source tarball is 100% served from CDNs, whereas I don't believe the same isn't quite true for git (even over https).

pxc · on Jan 31, 2023

That's a good point.

It's way more resource-intensive and much slower, which is why it's not preferred in Nixpkgs, for example.

But it's also vulnerable to the same problem in that your package manager's build system is still dependent on GitHub. It will take more to screw you up, but a whole GitHub outage, for example, will definitely still hurt.

ilyt · on Jan 31, 2023

It's not depending on github-specific functionality tho. You can just redirect it to another mirror of git repo of the project

SSLy · on Jan 30, 2023

Most source-based pkg managers build from release tarballs, or even preprocessed tarballs after doing autoconf

bentley · on Jan 30, 2023

Sadly there has been a sharp uptick in software that provides no release tarballs anymore. With the rise of GitHub many upstreams choose to make a tag and let people download the autogenerated tarballs, despite the fact that they won’t contain preprocessed autoconf or (more importantly) any Git submodules.

The situation is deteriorating further as some projects make no releases at all, assuming users will add the project’s own package mirror to the user’s trusted package repositories, or use Docker.

JoshTriplett · on Jan 30, 2023

> many upstreams choose to make a tag and let people download the autogenerated tarballs

Which is fine as long as you rely on the hash of the tag rather than the hash of the tarball.

> despite the fact that they won’t contain preprocessed autoconf

This is a feature; run `autoreconf -vfi` at build time, so that you don't depend on the maintainer's idiosyncratic autotools setup and local macros, and so that you can reliably regenerate it all if you want to change configure.ac or Makefile.am.

bentley · on Jan 30, 2023

> This is a feature; run `autoreconf -vfi` at build time, so that you don't depend on the maintainer's idiosyncratic autotools setup and local macros, and so that you can reliably regenerate it all if you want to change configure.ac or Makefile.am.

On a package bulk build machine, that’s a lot (like, a lot) of wasted CPU cycles multiplied by the thousands of packages that use autoconf. For the majority of packages that don’t patch configure.ac or Makefile.am, it’s nicer to use a preprocessed tarball and check that you can reproduce the same autoconf output when adding the package to the package manager, because then it only happens once.

JoshTriplett · on Jan 31, 2023

There are a lot of things that waste cycles on a build machine, but they're worth the reproduciblity. I think that problem would be better solved via caching, ideally.

I would hazard a guess that there are far fewer people these days who download a tarball and `./configure` `make` `make install` than there are distros (who often need to patch) and developers (who will be working from git anyway).

pabs3 · on Jan 31, 2023

Having autocruft in the default tarball is a design flaw in autotools, you should never ship prebuilt files in git nor in source tarballs. I think `make distcheck` should by default put the autotools files into a separate foo-1.2.3.4-2023-01-31-autocruft.tar alongside the real source tarball generated by git-archive.

pabs3 · on Jan 31, 2023

The distros always run `autoreconf` these days, so that they can verify that they can still build the build system from the source configure.ac/Makefile.am files.

mdouglass · on Jan 30, 2023

We are seeing an npm install failure inside our docker builds pointing at a github URL with a SHA change. Is this possibly related?

  #15 [dev-builder 4/7] RUN --mount=type=secret,id=npm,dst=/root/.npmrc npm ci
  #0 4.743 npm WARN deprecated [email protected]: The querystring API is considered Legacy. new code should use the URLSearchParams API instead.
  #0 8.119 npm WARN tarball tarball data for http2@https://github.com/node-apn/node-http2/archive/apn-2.1.4.tar.gz (sha512-ad4u4I88X9AcUgxCRW3RLnbh7xHWQ1f5HbrXa7gEy2x4Xgq+rq+auGx5I+nUDE2YYuqteGIlbxrwQXkIaYTfnQ==) seems to be corrupted. Trying again.
  #0 8.164 npm ERR! code EINTEGRITY
  #0 8.169 npm ERR! sha512-ad4u4I88X9AcUgxCRW3RLnbh7xHWQ1f5HbrXa7gEy2x4Xgq+rq+auGx5I+nUDE2YYuqteGIlbxrwQXkIaYTfnQ== integrity checksum failed when using sha512: wanted sha512-ad4u4I88X9AcUgxCRW3RLnbh7xHWQ1f5HbrXa7gEy2x4Xgq+rq+auGx5I+nUDE2YYuqteGIlbxrwQXkIaYTfnQ== but got sha512-GWBlkDNYgpkQElS+zGyIe1CN/XJxdEFuguLHOEGLZOIoDiH4cC9chggBwZsPK/Ls9nPikTzMuRDWfLzoGlKiRw==. (72989 bytes)
  #0 8.176 
  #0 8.177 npm ERR! A complete log of this run can be found in:
  #0 8.177 npm ERR!     /root/.npm/_logs/2023-01-30T23_19_36_986Z-debug-0.log
  #15 ERROR: process "/bin/sh -c npm ci" did not complete successfully: exit code: 1

This was working earlier today and the docker build/package.json haven't changed.

andrewguenther · on Jan 30, 2023

Yes, this is the exact issue being described

mdouglass · on Jan 30, 2023

That's what I thought, but I assumed with the rollback an hour plus ago, it wouldn't still be happening. That was off a build just a few minutes ago (actually repeated it in between the time I posted my original message and this reply and it happened again).

VWWHFSfQ · on Jan 30, 2023

Most likely a caching layer at GitHub still has the pre-rollback archive.

voidbip · on Jan 31, 2023

Just want to second this. Still seeing an issue in our build right now that seems related.

``` Building aws-sdk-cpp[core,dynamodb,kinesis,s3]:x64-linux... -- Downloading https://github.com/aws/aws-sdk-cpp/archive/a72b841c91bd421fb... -> aws-aws-sdk-cpp-a72b841c91bd421fbb6deb516400b51c06bc596c.tar.gz... [DEBUG] To include the environment variables in debug output, pass --debug-env [DEBUG] Feature flag 'binarycaching' unset [DEBUG] Feature flag 'manifests' = off [DEBUG] Feature flag 'compilertracking' unset [DEBUG] Feature flag 'registries' unset [DEBUG] Feature flag 'versions' unset [DEBUG] 5612: popen( curl --fail -L https://github.com/aws/aws-sdk-cpp/archive/a72b841c91bd421fb... --create-dirs --output /home/*redacted*/vcpkg/downloads/aws-aws-sdk-cpp-a72b841c91bd421fbb6deb516400b51c06bc596c.ta r.gz.5612.part 2>&1) [DEBUG] 5612: cmd_execute_and_stream_data() returned 0 after 12643779 us Error: Failed to download from mirror set: File does not have the expected hash: url : [ https://github.com/aws/aws-sdk-cpp/archive/a72b841c91bd421fb... ] File path : [ /home/*redacted*/vcpkg/downloads/aws-aws-sdk-cpp-a72b841c91bd421fbb6deb516400b51c06bc596c.tar.gz.5612.part ] Expected hash : [ 9b7fa80ee155fa3c15e3e86c30b75c6019dc1672df711c4f656133fe005f104e4a30f5a99f1c0a0c6dab42007b5695169cd312bd0938b272c4c7b05765ce3421 ] Actual hash : [ 503d49a8dc04f9fb147c0786af3c7df8b71dd3f54b8712569500071ee24c720a47196f4d908d316527dd74901cb2f92f6c0893cd6b32aaf99712b27ae8a56fb2 ] ```

kris-nova · on Jan 30, 2023

Thanks for the update! There is only 1 internet to watch and learn from. We are all in this together. <3

denom · on Jan 30, 2023

In my particular use-case, I'm using a set of local dev tools hosted as a homebrew tap.

The build looks up the github tar.gz release for each tag and commits the sha256sum of that file to the formula

What's odd is that all the _historical_ tags have broken release shasums. Does this mean the entire set of zip/tar.gz archives has been rebuilt? That could be a problem, as perhaps you cannot easily back out of this change...

lozenge · on Jan 30, 2023

They never really stored them, they were always generated by some code (maybe with a cache layer in front). The code changed in a way that changed the bytes in the tar.gz without affecting their contents-when-extracted.

crote · on Jan 30, 2023

The trick here is that a Github release is in essence simply a tag of a specific commit. There is no need to build archives in advance, as they can be dynamically generated from the git repo.

However, if you change the compression algorithm used to generate the archive, it'll result in a different checksum! The content is the same, but the archive is not.

Denvercoder9 · on Jan 30, 2023

> Does this mean the entire set of zip/tar.gz archives has been rebuilt?

They are probably generated on-demand (and cached) from the Git repository, not prebuilt.

scyrybdis · on Jan 30, 2023

I think the zip/tar.gz archives are being created on the fly when you download them, probably with a caching layer in front.

tinus_hn · on Jan 30, 2023

Pretty bizarre this ever was stable in the first place.

Unfortunately for this kind of service you need to actively fiddle with the bytes to prevent people from relying on an implementation detail like this and prevent them from digging you into a too big to fail api stability hole.

denom · on Jan 31, 2023

Alternatively, they could extract the compression code and maintain it for repo tags created before the git algo update release date.

Isn’t that the only humane course given all that depends on this?

ecnahc515 · on Jan 31, 2023

That's my thought as well. They could also potentially retroactively generate the source tarballs using the old method for every possible repository/tag on Github, store it, and serve that, and then only generate it on-demand for new tags, but I doubt they'll do that. They might though, given this is what led to the problem in the first place (ie; the on-demand generation vs generating on push+storing).

Kwpolska · on Jan 31, 2023

That seems wasteful. Many projects do not actively advertise the GitHub tag downloads, and instead have their own stored and stable tarballs (or other distributions). And I suppose many users of those auto-generated downloads don’t care about their checksums.

tinus_hn · on Feb 1, 2023

The solution is simple: for a monthly payment of $1 per gigabyte the downloads are stable. Otherwise they are not.

Kwpolska · on Feb 4, 2023

Who should pay for this? The developer of the project might not care about the stability, but the maintainers of the various Linux distros might. Should the developer pay from their own pocket to make the Linux folks happy? Should there be some pool of all the Linux distros to collect the fees and pay for these projects?

vlovich123 · on Jan 30, 2023

Hyrum's Law strikes again. It kind of doesn't matter what you document. If you weren't randomizing your checksum previously [1], you can't just spring this on the community and blame it for the fallout. I'm more shocked that there's resistance from the GitHub team saying "but we documented this isn't stable". Default stance for the team should be rollback & reevaluate an alternate path forward when the scope is this wide (e.g. only generating the new tarballs for future commits going forward).

[1] Apparently googlesource did do this and just had people shift to using GitHub mirrors to avoid this problem.

blueflow · on Jan 30, 2023

But look at it from the other side. Users that don't read your documentation and expect your software to work like they imagined are just a huge pain in the ass.

vlovich123 · on Jan 30, 2023

Fact of life: the vast majority of your users do not read your documentation (or do not do so carefully enough that what you put in your docs is an ironclad proof that all users adhere to). That's literally what Hyrum's law is about. Of course, you can choose to do whatever you want. It's valuable to recognize of course that you're trading off good will from your users with whatever technical improvement is getting made. Sometimes it's appropriate and inevitable (e.g. old behavior is just wrong or harmful and better to cut off). In the vast majority of cases though it's better to just have a better process in place to manage this with minimal disruption, identifying and communicating with broken users, and only then making that change.

blueflow · on Jan 30, 2023

Thats support you could expect if you paid for it.

vlovich123 · on Jan 30, 2023

Look. Even vcpkg broke which is a Microsoft product. I agree that there can be a continuum some times, but can we agree that this specific instance isn't anything like that? Even without vcpkg, the list of things impacted are anything that depends on Bazel, homebrew, conan, etc. The blast radius is quite wide regardless of documentation.

ilyt · on Jan 31, 2023

Aint nobody give a shit about you if you aren't bringing five or six figures as customer. Nobody is stopping rewrite that happened to break undocumented stuff you relied on if you $10/mo.

This case is different as breakage probably affected github/microsoft themselves

ZephyrBlu · on Jan 30, 2023

You just described >90% of users. Everyone does this for something, most people do it for most things.

You minimally read the docs, get something working and then leave it alone. Of course you're going to be pissed off when an implicit assumption which has been stable for a long time is broken.

grepfru_it · on Jan 31, 2023

>Of course you're going to be pissed off when an implicit assumption which has been stable for a long time is broken.

This accurately describes my beef with golang

missingdays · on Jan 31, 2023

Yes, but if you implement the checksum algorithm for GitHub archives, shouldn't you read the documentation about archives checksum?

JonChesterfield · on Jan 31, 2023

Turns out scripts contain download an archive from github and check against a hardcoded checksum copy&pasted into that script. All of those broke. None of the authors will have looked up exactly how github had calculated said checksum.

dataflow · on Jan 30, 2023

I don't think expecting users to go look for a user manual on each website whose links they download from is a realistic expectation.

blueflow · on Jan 30, 2023

Worse, you can't expect other people to host your data for free, forever. If you want your data distributed, you need to check first if the platform is suitable for your purposes.

dataflow · on Jan 30, 2023

I don't believe paid users saw any different behavior here?

lupire · on Jan 31, 2023

If you don't want users, feel free to ignore them.

throwawaylinux · on Jan 31, 2023

If your product supports some particular behavior, it will be used regardless of what you document.

Microsoft was once renown for bug-compatibility so as not to break their users. The new wave of movers and breakers would forget that wisdom at their peril.

mr_toad · on Jan 30, 2023

Give a man a fish and he’ll assume he’s entitled to a lifetime supply of free fish.

dataflow · on Jan 30, 2023

This has nothing to do with free vs. paid? The question is whether giving someone 99 of the same fish entitles them to expect the 100th one you throw in to be the same kind of fish, whether they paid for it or not.

kkirsche · on Jan 30, 2023

This. You have to draw the line somewhere. Was this specific choice that line? Maybe not, but sometimes users aren’t right and changes just need to occur to ensure other asks from the same users can be delivered.

ilyt · on Jan 31, 2023

I'd imagine they broke their own stuff doing it, considering npm broke on it

KyeRussell · on Jan 31, 2023

Do you work for Google?

hobofan · on Jan 31, 2023

This isn't even a case of "we didn't documented this".

I know that the Bazel team reached out to GitHub in the past to get a confirmation that this behaviour could be relied on, and only after that was confirmed did they set that as recommendation across their ecosystem.

nilsbunger · on Jan 30, 2023

This is especially true of something like a git SHA, which is drilled into your head as THE stable hash of your code and git tree at a certain state. It should be expected that lots of tools use it as an identifier -- heck, I've done so myself to confirm which version of a piece of software is deployed on a particular machine, etc.

Denvercoder9 · on Jan 30, 2023

The Git commit hashes didn't change (that'd actually be a serious problem). The hash of a compressed archive of the contents of a Git commit changed.

c4mpute · on Jan 30, 2023

Yes, but not in this bug. I guess lots of people missed that distinction: The stable git SHA hash is the commit hash, which is an hash over gits internal representation of the commit object (containing a tree of all file hashes, and parents' hashes).

The hash that pops out of 'git archive' has nothing whatsoever to do with the commit hash and was historically stable more or less by accident: git feeds all files to 'tar' in tree order (which is fixed) and (unless you specify otherwise) always uses gzip with the same options. Since they no longer use gzip but an internal call to zlib, compression output will look different but will still contain the same tar inside.

That people have relied on this archive hash being stable is an indication of a major problem imho, because it might mean that people in their heads project integrity guarantees from the commit hash (which has such guarantees) onto the archive hash (which doesn't have those guarantees). I would suggest randomizing the archive hash on purpose by introducing randomness somewhere, so that people no longer rely on it.

thirtyseven · on Jan 31, 2023

The people that this broke weren't directly depending on the output of git archive being stable, but were assuming that the response data for a particular URL would stay constant. Maybe not a great idea either but not entirely unreasonable IMO.

nilsbunger · on Jan 31, 2023

Oh interesting. But if an archive hash isn’t stable, how is it meant to be used? What’s it good for?

c4mpute · on Jan 31, 2023

In git, there is no intended use for it.

That people use it comes from how releases were usually published (independent of any version control system) as tgz/zip archives on some project website or ftp server. Websites and ftp servers were often mirrored to e.g. ISP or university mirrors because bandwith was scarce and CDNs were expensive/absent. To make sure that your release download from the university of somestrangeplace ftp matches the official release, you would compare the archive hash from the official project website with the hash of the archive you downloaded (bonus points for a GPG signature on the archive).

This then got automated by build/install/package tools to check the package downloaded from some mirror against the hash from the package description. Then GitHub happened, where GitHub replaced the mirror servers, serving autogenerated 'git archive' output instead of static files. And thats where things went wrong here...

vlovich123 · on Jan 30, 2023

To be fair this isn't the git SHA. This is the generated archive (apparently dynamically per request) when you ask for a source tarball.

daniealapt · on Jan 30, 2023

https://xkcd.com/1172/

sneak · on Jan 30, 2023

It's Microsoft. Just as the Apple of today is not the Apple of ten years ago, the GitHub today is not the GitHub of ten years ago. It's literally different people.

The people who made the things you love have mostly moved on, and the brand is being run by different people with different values now.

There's a little bit of an argument that such things are a bait-and-switch, but such is the nature of a large and multigenerational corporation.

naikrovek · on Jan 30, 2023

The Microsoft of today isn't the Microsoft of 10 years ago, either, but that doesn't stop anyone from assuming that today's Microsoft is the same as the Microsoft of 10 years ago.

the logic people use to blame Microsoft is intense, man. literally any logical leap is valid except one that absolves Microsoft of anything, no matter how small.

katbyte · on Jan 31, 2023

Trust is lost quickly and easily and earned back slowly with great difficulty

naikrovek · on Jan 31, 2023

yeah I don't trust slashdot people either.

the number of times the Microsoft-haters are just straight-up factually wrong in their justifications for their complaints is way too high for me to trust them ever again in my life.

lucb1e · on Jan 30, 2023

I didn't even know I should be depending on compression, file ordering, created-at file metadata, etc. being stable when pressing 'download repository as zip' (if I understand correctly what this is about, since the article doesn't really say). Perhaps it could be stable due to caching for a while after you first press it, but when it gets re-generated? I'm very surprised this was reproducible to begin with, given how much trouble other projects have with that.

For projects where I verify the download, gpg seems to be what all of them use (thinking of projects like etesync and restic here). Interesting that so many people relied on a zip being generated byte-for-byte identically every time.

slaymaker1907 · on Jan 31, 2023

I once had a small issue with a deployment at work because of ordering issues within a zip file. That order is important with Spring since that determines which classes are initialized first.

groestl · on Jan 31, 2023

One of the first things I check with every jvm packaging/deployment tool I investigate: does it preserve classpath ordering. Some offenders think -jar * is enough.

rfoo · on Jan 31, 2023

> gpg seems to be what all of them use

GPG signs a hash of the message with the private key, and you verify that the signature matches the file hash.

Oh wait, what hash? :clown:

leoh · on Jan 31, 2023

Many tools set mtime to zero to avoid checksum drift

philipwhiuk · on Jan 31, 2023

There are lots of methods to solve this problem - I imagine this was just easiest at the time given it appeared to work. Bazel devs on the list are discussing the best approach going forward - a simple change is to upload a fixed copy as a release artifact.

frankjr · on Jan 30, 2023

GitHub will need to revert this change. They've just crippled pretty much every "from source" package manager out there.

metrognome · on Jan 30, 2023

Per the post, this was a change to git itself: https://github.com/git/git/commit/4f4be00d302bc52d0d9d5a3d47...

forgotpwd16 · on Jan 30, 2023

What was the thought behind this change?

georgyo · on Jan 30, 2023

If you read the commit message you would see that it is up drop a third party dependency.

forgotpwd16 · on Jan 31, 2023

Yeah, read that. Just don't understand, if git already had an internal gzip implementation, why wasn't it used since it was added?

ilyt · on Jan 31, 2023

Because not everyone refactors whole codebase after adding one function that might be useful somewhere else.

I'd imagine motivation for this change in particular is multiplatform use, not every platform just have gzip in path.

fweimer · on Jan 30, 2023

They could just produce tar output and compress that using system gzip. The “git archive” tool supports many output formats.

acdha · on Jan 30, 2023

If those tools incorrectly assume an API contract which doesn't exist, isn't the right answer to fix those tools?

kentonv · on Jan 30, 2023

In theory, sure, that's what we'd do in an ideal world.

In the real world it will take millions of dollars of eng labor just to update the hashes to fix everything that's currently broken and millions more to actually implement something better and move everyone over to it.

This isn't worth it, GitHub needs to just revert the change and then engineer a way to keep hashes stable going forward.

groestl · on Jan 30, 2023

See also: https://daniel.haxx.se/blog/2013/03/23/why-no-curl-8/

"The amount of work done “out there” on hundreds or thousands of applications for a single little libcurl tweak can be enormous. The last time we bumped the ABI, we got a serious amount of harsh words and critical feedback and since then we’ve gotten many more users!"

kzrdude · on Jan 31, 2023

I know it's superficial but I think the problem would have been reduced if they used a download URL that looked like github.com/archive.php?project=rust&version=deadbeef it's just something that sends a signal and a different expectation on the same artifact.

kzrdude · on Jan 31, 2023

Well, Github presents a file that looks like it comes from a file server, an old "ftp" archive or so. So they model it on that. Already published versions and tar balls should not change in those systems.

I think everyone knows these files are generated on the fly, but it comes from old habits.

nick__m · on Jan 30, 2023

I prefer that tool be adapted to be more resilient and not depend on github particular implementation.

swarfield · on Jan 30, 2023

Using SHA hashes when building guarantees that the code that you are building is what you think it is. How else would you verify dependencies like this, GPG signatures would have the same issue if you change the underlying bits.

Denvercoder9 · on Jan 30, 2023

I wouldn't check the hash of the compressed archive, but of the actual files themselves. It's a bit more metadata, but it's also a lot more robust, and allows you to detect changes after unpacking as well.

bentley · on Jan 30, 2023

It’s generally a bad idea to process (extract) a tarball of unknown provenance. Verifying the tarball is from a known source beforehand mitigates the risk of, say, a malicious tarball that exploits a tar or gzip 0‐day.

shakow · on Jan 30, 2023

But then that's the role of the httpS query with which you will fetch your data.

And if you don't trust your http layer and/or Github's certificate, then you should not trust their archive anyway.

bentley · on Jan 30, 2023

> And if you don't trust your http layer and/or Github's certificate, then you should not trust their archive anyway.

The nice thing about checksumming the tarball is that once you’ve done so, it doesn’t matter whether you trust GitHub or the HTTPS layer or not.

GitHub and its HTTPS cert provide no protection against a compromised project re‐tagging a repo with malicious source, or even deleting and re‐uploading a stable release tarball with something malicious.

viraptor · on Jan 31, 2023

The certificate guarantees the source of the file, not the trust you should put in its contents. I can upload malware as a github project release file and https doesn't change that you shouldn't download/run it.

For software distribution this actually sometimes goes the other way - debian/ubuntu uses http (no s) for their packages, because the content itself is signed by the distribution and this way you can easily cache it at multiple levels.

shakow · on Jan 31, 2023

> I can upload malware as a github project release file and https doesn't change that you shouldn't download/run it.

If you can't trust the archive published by the owner themselves, you are already screwed; a stable hash will just make sure that you trust harder that you are, indeed, downloading contaminated code.

I'm not sure most people here understand how checksums/hashs work, what they protect you against, and what they don't.

c4mpute · on Jan 31, 2023

Software published via GitHub isn't really "published by the owner". The owner typically doesn't control what GitHub does and doesn't always control his own GitHub account.

It isn't only that people don't know what checksums, hashes, and signatures do, it is also problematic that they blindly trust or ignore middlemen. Most supply chain "security" is security theater, almost never is something vetted end-to-end.

ilyt · on Jan 31, 2023

Or just contains 100TB of zeroes

shakow · on Jan 30, 2023

By checking the hash of the extracted files. The hash of the archive is dependent on the order in which the file were compressed, the compression, some metadata, etc.

catiopatio · on Jan 30, 2023

That’s expensive, complicated, exposes a greater attack surface, and requires new tooling to maintain considerably more complex metadata covering the full contents of source archives.

For the entire multi-decade history of open source, the norm has been — for very good reason — that source archives are immutable and will not change.

The solution here isn’t to change the entire open source ecosystem.

Denvercoder9 · on Jan 30, 2023

> For literally the entire multi-decade history of open source, the norm has been — for very good reason — that source archives are immutable and will not change.

Well, the norm has been that maintainers generated and distributed a source archive, and that archive being immutable. That workflow is still perfectly fine with GitHub and not impacted by this change.

The problem is that a bunch of maintainers stopped generating and distributing archives, and instead started relying on GitHub to automatically do that for them.

account42 · on Jan 31, 2023

> That workflow is still perfectly fine with GitHub

It would be perfectly fine if you could prevent GitHub from linking the autogenerated archives from the releases or at least distinguish them in a way that makes it clear that they are not immutable maintainer-generated archives.

ilyt · on Jan 31, 2023

The problem was people assuming github works like that - saves a archive of every commit, which is obviously silly if you think about it (why save it if you can regenerate it on a whim from any commit you want?)

jraph · on Jan 31, 2023

You are speaking about release archives. GitHub's "Download as zip" feature is not the same thing as this multi decade-history of open source thing you are talking about.

I always thought zip archives from this feature was generated on the fly, maybe cached, because I don't expect GitHub to store zip archive for every commit of every repository.

I'm actually surprised many important projects are relying on a stable output from this feature, and that this output was actually stable.

bentley · on Jan 30, 2023

Indeed. I remember when Canonical was heavily pushing bzr and others were big fans of Mercurial. Glad my package manager maintainers didn’t waste time writing infrastructure to handle those projects at the repository level. Nobody had to, because providing source tarballs was the norm.

shakow · on Jan 31, 2023

> That’s expensive, complicated,

That sounds like prejudice. Just as a test, I cloned the git repo, which took 29 seconds, then took its hash with `guix hash`, which took 0.387ms.

I think that if you can't handle a 0.4s delay in a build, you have problem problems.

bentley · on Jan 31, 2023

Package builders work on the scale of thousands of packages. The increased time and CPU usage multiplies greatly.

“Complicated” is indisputable. Cloning a repository is absolutely complicated. Fetching a single file over HTTPS is as simple as it gets, these days.

shakow · on Jan 31, 2023

And you really believe that downloading & extracting a source .tar.gz and compiling it will have a run time much shorter than 0.4s?

Just executing the ./configure will take more than that.

bentley · on Jan 31, 2023

> And you really believe …

Huh? What I fully believe is that downloading a source tarball over HTTPS, verifying its checksum, and extracting it will take less time than cloning the repository from Git, then verifying the checksum of all files—which you said would take 29 seconds plus 0.4s.

shakow · on Jan 31, 2023

My point is that either spending 0.08s computing the md5 of the zip (I just measured) or 0.3s computing the hash of the repo does not matter the slightest if you are managing software repos, as just extracting the source and preparing to build it will be an order of magnitude slower.

ArchOversight · on Jan 30, 2023

a git checkout of the code at that particular tag hasn't changed. Just the tarball that git archive generates has.

vlovich123 · on Jan 30, 2023

The two main problems are:

A) How do you catch tarballs that have extra files injected that aren't part of your manifest

B) What does the performance of this look like? Certainly for traditional HDDs this is going to kill performance, but even for SSDs I think verifying a bunch of small files is going to be less efficient than verifying the tarball.

ArchOversight · on Jan 30, 2023

A wouldn't be an issue since you are checking out a git tag.

B would just be a normal git checkout, which already validates that all the objects are reachable and git tags (and commits for that matter) can be signed, and since the sha1 hash is signed as well it validates that the entire tree of commits has not been tampered with. So as long you trust git to not lie about what it is writing to disk, you have a valid checkout of that tag.

And if you do expect it to lie, why do you expect tar to not lie about what it is unpacking?

vlovich123 · on Jan 30, 2023

I know GitHub had asked that clones from package manager use shallow clones. It wouldn't surprise me if downloading tarballs is similarly beneficial to GitHub because it's trivially cacheable in a CDN and thus lowers their operational footprint to support package managers.

ilyt · on Jan 31, 2023

Well, the simplest way would be to make checksum after decompression, that doesn't need per file verify and relies on files being put in same order into tar file.

The other method would be having Manifest file with checksum of every file inside the tar and compare that in-flight, could be simple "read from tar, compare to hash, write to disk" (with maybe some tmpfiles for the bigger ones)

vlovich123 · on Jan 31, 2023

It’s not just about the integrity of the files you’re processing, but also the integrity of the archive itself. If you extract the tarball from a random place, there’s a larger security risk. Now granted HTTPS probably mitigates a lot of it, but cert pinning isn’t that common so MITM attacks aren’t thaaat theoretical.

ilyt · on Jan 31, 2023

You can do validation in flight during extraction. Signed file manifests are how distros like Debian did it since forever, althought in their cases its two step process, the packages themselves contain their own signature and whole directory tree also gets signed (to avoid shenaningans like "attacker putting older, still vulnerable, but signed version into the repo)

duped · on Jan 30, 2023

Ok, now guarantee that.

ErikCorry · on Jan 30, 2023

This seems like a weak argument.

Firstly SHA is not a secure hash.

Secondly if your build step involves uploading data to a third party then allowing them to transform it as they see fit and then checksumming the result then it's not really a reproducible build. For all you know, Github inserts a virus during the compression of the archive.

What am I missing?

ajross · on Jan 30, 2023

> Firstly SHA is not a secure hash.

It's... literally the Secure Hash Algorithm. (Yes, yes, SHA-1 was broken a while back, but SHA and derivatives were absolutely intended to provide secure collision resistance).

I think you're mixing things up here. Github didn't change the SHA-1 commit IDs in the repositories[1]. They changed the compression algorithm used for (and thus the file contents of) "git archive" output. So your tarballs have the same unpacked data but different hashes under all algorithms, secure or not.

> Secondly if your build step involves uploading data to a third party then allowing them to transform it as they see fit and then checksumming the result then it's not really a reproducible build. For all you know, Github inserts a virus during the compression of the archive.

Indeed. So you take and record a SHA-256 of the archive file you are tagging such that no one can feasibly do that!

Again, what's happened here is that the links pointing to generated archive files that projects assumed were immutable turned out not to be. It's got nothing to do with security or cryptography.

[1] Which would be a whole-internet-breaking catastrophe, of course. They didn't do that and never will.

chlorion · on Jan 31, 2023

>Firstly SHA is not a secure hash.

This is incorrect, but even if it were true, you could use whatever your hash of choice is instead. Gentoo for example can use whatever hash you like, such as blake2, and the default Gentoo repo captures both the sha512 and blake2 digests in the manifest.

Sha1 is still used for security purposes anyways, even though it really shouldn't be!

Signing git commits still relies on sha1 for security purposes, which I think many people don't realize.

Commit signing only signs the commit object itself, other objects such as the trees, blobs and tags are not involved directly in the signature. The commit object contains sha1 hashes to it's parents, and to a root tree. Since trees contain hashes of all of their items, it creates a recursive chain of hashes of the entire contents of the repo during that point in time!

So signed commits rely entirely on the security of sha1 for now!

You may have already knew all of this about git signing but I thought it might be interesting to mention.

blueflow · on Jan 30, 2023

1) SHA-256 is reasonably secure

2) The checksum assures you that the file you have is the same your upstream looked at

ErikCorry · on Jan 31, 2023

1) Ah of course, this is SHA256, my mistake.

2) If I and the upstream are both looking at a file that was generated by Github then the Sha may match, but that doesn't prove we weren't both owned by Github.

Perhaps what I am missing is that this isn't part of a reproducible build scenario. There's no attempt to ensure that the file Github had built is the one I would build with the same starting point.

blueflow · on Jan 31, 2023

If you trust your upstream, then the checksum is enough. If you don't trust your upstream, its sort of an RCE anyways.

IanCal · on Jan 30, 2023

I think the reproducible build part is about projects that depend on these outputs. The goal is ensuring you and I have both pulled exactly the same dependencies.

Zababa · on Jan 31, 2023

They're all waiting for your pull requests.

naikrovek · on Jan 30, 2023

the change was to git, not GitHub.

nick__m · on Feb 4, 2023

Sorry, I missread the Github annonce and incorrectly interpreted it.

pxc · on Jan 30, 2023

Nixpkgs' so-called binary cache actually also caches source tarballs. Any Nix users out there who ran updates during the change?

Did cache hits save you? Did cache misses break your builds?

anderskaseorg · on Jan 30, 2023

Nixpkgs’s fetchFromGitHub function hashes the contents of GitHub archives after unpacking, so it’s unaffected.

pxc · on Jan 31, 2023

I should have remembered this! Nixpkgs committers are consistently mindful of things like this in code reviews.

clhodapp · on Jan 30, 2023

I could be wrong but believe that nix should be safe for the most part because it does a recursive hash of the stuff it cares about on the extraction of these archives.

jkachmar · on Jan 30, 2023

didn’t realize this had happened until i logged off of my work computer & saw someone had shared this thread in a group chat.

looks like we were completely unaffected, as no one made any updates to derivations referencing GitHub sources in a way that invalidated old entries (i.e. no version bumps, new additions, etc.).

WayToDoor · on Jan 30, 2023

https://github.com/orgs/community/discussions/45830#discussi...

> Hey folks. I'm the product manager for Git at GitHub. We're sorry for the breakage, we're reverting the change, and we'll communicate better about such changes in the future (including timelines).

skobovm · on Jan 30, 2023

I wonder what monetary loss in productivity was due to this change. We noticed this issue a bit before noon, tracked it down to GH, sent out company-wide comms notifying others of the problem, filed tickets with GH, had to modify numerous repos across multiple teams, and now it's 3pm and I'm here reading about it.

It's crazy how such a seemingly innocuous change, like this, could lead to such widespread loss in productivity across the globe.

misnome · on Jan 30, 2023

Our conda-forge package builds broke. We had someone declare to us that tag downloads were never stable, just releases. This seems to be the opposite of the known truth about the previous status quo - but does go some way to demonstrating how little the state of the actual guarantees for this system were understood.

wildfire · on Jan 30, 2023

See https://github.com/orgs/community/discussions/45830 for the fallout.

kelnos · on Jan 31, 2023

The thing I don't get is how this ever worked.

The change was upstream from git itself, and it was to use the builtin (zlib-based) compression code in git, rather than shelling out to gzip.

But would the gzip binary itself give reproducible results across versions of gzip (and zlib)? Intuition seems to suggest it wouldn't, at least not always. And if not, was the "strategy" just to never update gzip or zlib on GitHub's servers? That seems like a non-starter...

FeepingCreature · on Jan 31, 2023

gzip is 28 years old. I don't think the output changes anymore.

account42 · on Jan 31, 2023

There is no reason to believe that it won't. Even after 28 years, there could be improvements merged for the compressor. Or perhaps especially after 28 years - we have a lot more memory now but it is slower when compared to our CPUs than it used to be so there is most likely room for tuning. Similar for patches that make use of newer CPU instructions - why would you expect them to take care to produce the exact same output rather than just the best compression possible for a perf budget.

ihattendorf · on Jan 31, 2023

That's the whole point, it wasn't an enforced contract but just happened to not change in a long time so it was assumed to be part of the contract. The majority of users don't know how exactly GitHub is serving these archives, they just assume (incorrectly, but reasonably) if they download from this URL they'll always get the same archive bit for bit. That assumption has grown stronger and stronger over time the longer they remained the same, until today.

jzelinskie · on Jan 30, 2023

Does anyone have the motivation for why the git project wants to use their own implementation of gzip? Did this implementation already exist and was being used for something else?

I understand wanting fewer dependencies, but gut-reaction is that it's a bad move in the unsafe world of C to rewrite something that already has a far more audited, ubiquitous implementation.

nemetroid · on Jan 30, 2023

They're still using zlib to do the heavy lifting. It's not a large patch.

https://public-inbox.org/git/1328fe72-1a27-b214-c226-d239099...

capableweb · on Jan 30, 2023

> So the internal implementation takes 17% longer on the Linux repo, but

> uses 2% less CPU time. That's because the external gzip can run in

> parallel on its own processor, while the internal one works sequentially

> and avoids the inter-process communication overhead.

> What are the benefits? Only an internal sequential implementation can

> offer this eco mode, and it allows avoiding the gzip(1) requirement.

It seems like they changed it because it uses less CPU, which makes sense in a "we're a global git hosting company" perspective, but less so for users who run the command themselves. They intentionally made it 17% slower to save 2% of CPU time, which probably makes sense at their scale, but for every user who run the command locally to lose 17% more of time?

Twirrim · on Jan 31, 2023

This was a change in the upstream git project, I don't think it came from GitHub necessarily?

Looks like the author is the maintainer of "Git for Windows", and similar, which I can imagine makes for a reasonable argument for reducing dependencies. zlib is already a library dependency, just use that instead of needing people to bundle up a gzip binary along with git, too.

https://lore.kernel.org/git/pull.145.git.gitgitgadget@gmail....

pixl97 · on Jan 30, 2023

Because they pay for the 2% CPU time, not for the 17% local time. In theory the user also pays for 2% less CPU time, but they are much less likely to be CPU limited in their build processes.

Of course 17% more time may not really be that much for most processes. Are we talking about 17% more of a second or of an hour?

jeffbee · on Jan 30, 2023

It seems like if they really wanted to save CPU they'd be caching the outputs. I fail to see why they would be recompressing years-old release tags. This seems like optimization at the wrong level.

That's without even mentioning the absurdity of saving 2% CPU but still using zlib.

semiquaver · on Jan 31, 2023

“Their own” implementation is just zlib, already in use throughout git since the dawn of the project for other purposes like blob storage [1].

Depending on how you measure it, zlib might be considered significantly more ubiquitous than gzip itself. At any rate it’s certainly no less battle tested.

[1] https://git-scm.com/book/en/v2/Git-Internals-Git-Objects

groestl · on Jan 30, 2023

I think "Drop the dependency on gzip" for something like Git trumps a bit more exposure (which can be mitigated with thorough reviews).

Aissen · on Jan 31, 2023

It was publicly known that Github was breaking automatic git archives consistency for many years. Here is a bug on a project to stop relying on fake github archives (as opposed to stable git-archive(1)):

https://bugzilla.tianocore.org/show_bug.cgi?id=3099

At some point it was impossible to go a few weeks (or even days) without a github archive change (depending on which part of the "CDN" you hit), I guess they must have stabilized it at some point. Here is an old issue before GitHub had a community issue tracker:

https://github.com/isaacs/github/issues/1483

I am glad this is getting more attention, maybe now github will finally have a stable endpoint for archives.

doubleunplussed · on Jan 30, 2023

Ah, this will presumably break some Arch Linux AUR packages. Preparing for bug reports.

elesiuta · on Jan 30, 2023

I always anticipated something like this could happen and it bothered me enough to create my own workflow [1] to archive, hash, and attach it to each release automatically for my AUR package. I can see how most people wouldn't notice/bother with such a small detail though, so I am not at all surprised by the fallout this caused.

[1] https://github.com/elesiuta/picosnitch/blob/master/.github/w...

frankjr · on Jan 30, 2023

Yep, it has already broken labwc for me.

    ==> Validating source files with b2sums...
        labwc-0.6.1.tar.gz ... FAILED
    ==> ERROR: One or more files did not pass the validity check!

lopkeny12ko · on Jan 31, 2023

I can't fathom how no one internally at Microsoft-Github realized how widespread the breakage would be before rolling this out to all public users.

Surely, Microsoft-Github's own internal builds would have started failing as a result of this change? Or do they not even canary releases internally at all?

ilyt · on Jan 31, 2023

I can

"didn't read every commit in new version of git, realized after the fact"

medellin · on Jan 30, 2023

Im thinking of all the bazel build rules that are about to break from my last company. Someone will have a fun day updating hundreds of hashes.

ErikCorry · on Jan 30, 2023

Do they let Github generate the archives as one of the build rules instead of performing the archival and compression locally and uploading the result?

medellin · on Jan 30, 2023

Correct. Silly stuff like this happens when you don’t have systems in place that make it easy to store your own artifacts. Additionally a lot of people just want to get things done as quick as possible even if you have the tools in place.

jart · on Jan 30, 2023

If they're using multiple URLs like a good Bazel user then they shouldn't be impacted.

thirtyseven · on Jan 30, 2023

The setup instructions for almost [1] every [2] major [3] rule set [4] only provide one (GitHub) url in the Starlark blob you're supposed to copy and paste, so hard to blame users here.

[1] https://github.com/bazelbuild/rules_jvm_external/releases/ta...

[2] https://github.com/bazelbuild/rules_python/releases/tag/0.17...

[3] https://github.com/bazelbuild/rules_java/releases/tag/5.4.0

[4] https://github.com/bazelbuild/rules_scala

jart · on Jan 30, 2023

I agree. The Bazel developers failed in their leadership.

SuperSandro2000 · on Jan 31, 2023

From a distro maintainer perspective every project that is only buildable with bazel is an absolute nightmare.

jart · on Jan 31, 2023

I created https://gist.github.com/jart/082b1078a065b79949508bbe1b7d8ef... to solve that, by turning bazel projects into makefiles. The problem is the bazel team has broken the apis that make it possible so many times since then because they reacted very negatively to the idea.

medellin · on Jan 30, 2023

They did where applicable but i know that not all of them had multiple

jart · on Jan 30, 2023

Well now they know why it's so important. https://github.com/bazelbuild/bazel/commit/ed7ced0018dc5c5eb...

UncleOxidant · on Jan 30, 2023

Lol... I was being burned by this just about an hour ago. Cloned a repo, did a build of the project (which uses bezel to fetch dependencies) and it reported errors due to mismatch in expected checksums.

hamandcheese · on Jan 31, 2023

The fact that this is causing problems seems like a flaw in Bazel, imo. Nix, for example, calculates a hash of the contents of a tarball, rather than a hash of the tarball itself.

rfoo · on Jan 31, 2023

Yep, Nix not affected at all is pretty impressive.

On the other hand this goes against the "verify before parse" principle so I have mixed feelings on Nix's approach.

Foxboron · on Jan 31, 2023

They don't really do any source authentication at all. There is no strategy for checking gpg/minisign/whatever signatures and fetching keys to validate these things.

ArchOversight · on Jan 30, 2023

I remember a similar breakage happening before due to internal git changes, and thought it was common knowledge to upload your own signed tarballs for releases.

rektide · on Jan 30, 2023

Now please give us compression options beyond gzip? :) Some zstd & lz4 please?

metrognome · on Jan 30, 2023

I wonder if this incident will encourage our industry to build more robust forms of artifact integrity verification, or if we will instead codify the status quo of "we guarantee repos to be archived deterministically." To me, the latter seems like a more troubling precedent.

bentley · on Jan 30, 2023

We’ve regressed from the previous norm of open source projects providing stable source tarballs with fixed checksums, sometimes even with cryptographic signatures.

reindeerer · on Jan 31, 2023

That norm still exists, and it's offered by Github in form of Github Releases feature as well.

It's the downstream tooling ( i.e. all the builds and package managers ) that need to clean their act up.

JonChesterfield · on Jan 31, 2023

If the source tar changes, how do you propose the downstream tooling distinguishes between data corruption, MITM attack and upstream deciding to change the number without notifying anyone?

reindeerer · on Feb 6, 2023

That's the whole point, source tars when properly versioned don't change. And you can get unchanged versions from any mirror in the world. sha256 of linux-2.6.10 release is 404e33da7c1bf271e0791cd771d065e19a2b1401ef8ebb481a60ce8ddc73e131, it wont change

rswail · on Jan 31, 2023

This is being driven in industry by the push by US FedGov (via NIST) to have supply chain verification after the recent hacks.

POTUS issued an EO and NIST have been following up, leading to the promotion of schemes such as spdx https://tools.spdx.org/app/about/

Where I work is also required to start documenting our supply chain as part of the (new, replacing PCI-DSS) PCI-SFF certification requirements, which requires end-to-end verification of artifacts that are deployed within PCI scope.

So really, the arguments about CPU time etc are basically silly. The use of SHA hashes for artifacts that don't change will be a requirement for anyone building industrial software, or supplying to government, or in the money transacting business.