I wonder if this incident will encourage our industry to build more robust forms of artifact integrity verification, or if we will instead codify the status quo of "we guarantee repos to be archived deterministically." To me, the latter seems like a more troubling precedent.
We’ve regressed from the previous norm of open source projects providing stable source tarballs with fixed checksums, sometimes even with cryptographic signatures.
If the source tarball changes, how do you propose downstream tooling distinguishes between data corruption, a MITM attack, and upstream deciding to change the contents without notifying anyone?
That's the whole point: properly versioned source tarballs don't change, and you can get identical copies from any mirror in the world. The SHA-256 of the linux-2.6.10 release is 404e33da7c1bf271e0791cd771d065e19a2b1401ef8ebb481a60ce8ddc73e131, and it won't change.
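A minimal sketch of what that stability buys downstream tooling: pin the digest and verify a copy fetched from any mirror. (Python's hashlib here; the filename is a placeholder, and the comment above doesn't say which compressed variant of the release the quoted digest covers, so treat that as an assumption.)

```python
import hashlib
import sys

# Digest quoted above for the linux-2.6.10 release tarball.
EXPECTED_SHA256 = "404e33da7c1bf271e0791cd771d065e19a2b1401ef8ebb481a60ce8ddc73e131"

def sha256_of(path: str) -> str:
    """Stream the file through SHA-256 so large tarballs need not fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

if __name__ == "__main__":
    path = sys.argv[1]  # e.g. a linux-2.6.10 tarball fetched from any mirror
    actual = sha256_of(path)
    if actual == EXPECTED_SHA256:
        print("OK: matches the pinned release digest")
    else:
        print(f"MISMATCH: got {actual}")
        sys.exit(1)
```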
This is being driven in industry by the US federal government's push (via NIST) for supply-chain verification after the recent hacks.
POTUS issued an executive order and NIST has been following up, leading to the promotion of schemes such as SPDX: https://tools.spdx.org/app/about/
Where I work, we're also required to start documenting our supply chain as part of the new PCI SSF certification requirements (the successor to PA-DSS), which require end-to-end verification of artifacts deployed within PCI scope.
So really, the arguments about CPU time etc. are basically silly. SHA hashes over artifacts that don't change will be a requirement for anyone building industrial software, supplying to government, or working in the money-transacting business.
Oh, I'm not arguing that using checksums (SHA, for example) for integrity verification is a bad idea. That's what they're designed for, after all.
However, I do think it's a bad idea to require the content of compressed archives to be deterministic. tar has never specified an ordering for its contents. Compression algorithms are parameterized to trade time against space, so their output should not be expected to be stable across implementations or versions either. Both of these points apply to zip as well. Yet we now have a situation where we depend on both the archive format and the compression algorithm to produce byte-identical output. If we expect archives to behave this way in general, we set a bad precedent for all sorts of systems, not just git and GitHub.
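A quick illustration of that last point, as a sketch using Python's gzip module (standing in for whatever compressor an archive service actually runs): the same input bytes, compressed with three perfectly valid parameter choices, yield three different compressed streams, so a hash over the compressed archive pins implementation details rather than content.

```python
import gzip
import hashlib

# Identical "archive contents" in all three cases.
data = b"identical archive contents\n" * 1000

# Three valid parameter choices: compression level and header mtime both
# change the byte stream without changing the content.
a = gzip.compress(data, compresslevel=1, mtime=0)
b = gzip.compress(data, compresslevel=9, mtime=0)
c = gzip.compress(data, compresslevel=9, mtime=1234567890)

print(hashlib.sha256(a).hexdigest() == hashlib.sha256(b).hexdigest())  # False
print(hashlib.sha256(b).hexdigest() == hashlib.sha256(c).hexdigest())  # False

# Every variant still decompresses to exactly the same contents.
assert gzip.decompress(a) == gzip.decompress(b) == gzip.decompress(c) == data
print("all variants decompress identically")
```

Which is exactly why pinning a checksum works fine for a frozen release artifact, where the bytes are produced once and mirrored forever, but breaks as soon as anything regenerates the "same" archive on the fly.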