
> sign archives with email/___domain/document certificates

I do a bit of web archival for fun, and have been thinking about something.

Currently I save the response body, the response headers, and the request headers for the data I archive from the net.

But I was thinking that maybe, instead of just saving that, I could go a level deeper and preserve the actual TCP packets and the TLS key exchange as well.

And then I might be able to get a lot of data provenance “for free”. Because if, in some decades, we look back at the saved TCP packets and TLS material, we would see that these packets were signed with a certificate chain that matches what that website was serving at the time. Assuming, of course, that they haven’t accidentally leaked their private keys in the meantime, that the CA hasn’t gone rogue since, etc.

To me it would make sense to build out web archival infra that preserves the CA chain and enough context to be able to see later that it was valid. And if many people across the world save the right parts, we don’t have to trust each other in order to verify that the data the other saved really was sent by the website our archives say it was from.

For example, maybe I only archived a single page from some ___domain, and you saved a whole bunch of other pages from that ___domain around the same time, so the same certificate chain was used in the responses to both of us. Then I can know that the data you say you archived from them really was served by their server, because I have the certificate chain I saved to verify that.
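
For what it's worth, even without keeping raw packets, the chain a server presented at archive time can be recorded with stock tools. A minimal sketch (example.org is just a stand-in for whatever ___domain is being archived):

    # Save the certificate chain the server presented, with the capture time in the filename
    openssl s_client -connect example.org:443 -servername example.org -showcerts </dev/null \
      > chain_$(date -u +%Y%m%dT%H%M%SZ).txt

The output mixes PEM blocks with s_client's handshake notes, so it's a record of the handshake rather than a clean bundle, but it's enough to compare chains between archives later.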




In terms of tooling, there's scoop[0], which does a lot of the capture part of what you're thinking about. The files it creates include request headers and responses, TLS certificates, PDFs and screenshots, and it has support for signing the whole thing as proof of provenance.

Overall though I think archive.org is probably sufficient proof that a specific page had certain content on a certain day for most purposes today.

0. https://github.com/harvard-lil/scoop


The idea is good. As far as I understand TLS, however, the cert / asymmetric key is only used to prove the identity/authenticity of the cert, and thus of the host, for this session.

But the main content is not signed / checksummed with it, only with a symmetric session key, so one could probably manipulate the content in the packet dump anyway.

I read about a Google project named SXG (Signed HTTP Exchanges) that might do related stuff, albeit likely requiring the assistance of the publisher.


"TLS-N", "TLS Sign", and maybe a couple others were supposed to add non-repudiation.

But they didn't really go anywhere:

https://security.stackexchange.com/questions/52135/tls-with-...

https://security.stackexchange.com/questions/103645/does-ssl...

There are some special cases that do provide non-repudiation, like (I think) the DKIM headers used to sign e-mails.

For that, `tcpdump` with `SSLKEYLOGFILE` will probably get you started on capturing what you need.
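
A minimal sketch of that setup, assuming Linux, a curl build whose TLS library honours `SSLKEYLOGFILE`, and example.org as a placeholder target (the tshark preference is `tls.keylog_file` in recent versions, `ssl.keylog_file` in older ones):

    # 1. Capture the raw packets for the session
    sudo tcpdump -i any -w example.pcap host example.org &

    # 2. Make the request while logging the TLS session secrets
    SSLKEYLOGFILE=keys.log curl -o page.html https://example.org/

    # 3. Stop the capture, then decrypt and inspect it with the saved secrets
    kill %1
    tshark -r example.pcap -o tls.keylog_file:keys.log -Y http
    # (use -Y http2 if the server negotiated HTTP/2)

The caveat above still applies, though: the key log lets anyone decrypt (and, in principle, forge) the session data, so this is a capture format, not proof of provenance.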


To extend this to archival integrity without cooperation from the server/host, you'd need the client to sign the received bytes.

But then you need the client to be trusted, which clashes with distributing the archiving effort.

Hypothetically, what about trusted orgs standing up an endpoint that you could feed a URL, then receive back attestation from them as to the content, then include that in your own archive?

Compute and network traffic are pretty cheap, no?

So if it's just grabbing the same content you are, signing it, then throwing away all the data and returning you the signed hash, that seems pretty scalable?

Then anyone could append that to their archive as a certificate of authenticity.
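
A rough sketch of what such an endpoint might do per request (everything here is hypothetical: the org's keypair, the filenames, and the choice of a plain SHA-256 + RSA signature over the digest and a timestamp):

    # Fetch the same URL the archivist reports
    curl -s https://example.org/page > content.bin

    # Record digest + fetch time, then sign that statement with the org's private key
    sha256sum content.bin | awk '{print $1}' > statement.txt
    date -u +%Y-%m-%dT%H:%M:%SZ >> statement.txt
    openssl dgst -sha256 -sign org_key.pem -out attestation.sig statement.txt

    # Return statement.txt + attestation.sig; content.bin can be thrown away

Anyone holding the org's public key could later run `openssl dgst -sha256 -verify org_pub.pem -signature attestation.sig statement.txt` and compare the digest to their own archived bytes.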


Reminds me of timestamp protocol and timestamp authorities.

Not quite the same problem, but similar enough to have a similar solution. https://www.ietf.org/rfc/rfc3161.txt
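
For reference, a minimal RFC 3161 round trip with the `openssl ts` tooling (the TSA URL is a placeholder for whichever timestamp authority you trust, and you need its CA certificate to verify the reply):

    # Build a timestamp request over the archived file
    openssl ts -query -data archive.warc -sha256 -cert -out request.tsq

    # Send it to a timestamp authority
    curl -s -H "Content-Type: application/timestamp-query" \
         --data-binary @request.tsq https://tsa.example.com/ > reply.tsr

    # Inspect and verify the signed timestamp
    openssl ts -reply -in reply.tsr -text
    openssl ts -verify -data archive.warc -in reply.tsr -CAfile tsa_ca.pem

This proves the file existed at a point in time, not that it came from a particular server, so it complements rather than replaces the provenance ideas above.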


Unfortunately, the standard TLS protocol does not provide a non-repudiation mechanism.

It works by using public key cryptography and key agreement to get both parties to agree on a symmetric key, and then uses the symmetric key to encrypt the actual session data.

Any party who knows the symmetric key can forge arbitrary data, and so a transcript of a TLS session, coupled with the symmetric key, is not proof of provenance.

There are interactive protocols that use multi-party computation (see for example https://tlsnotary.org/) where there are two parties on the client side, plus an unmodified server. tlsnotary only works for TLS1.2. One party controls and can see the content, but neither party has direct access to the symmetric key. At the end, the second party can, by virtue of interactively being part of the protocol, provably know a hash of the transaction. If the second party is a trusted third party, they could sign a certificate.

However, there is not a non-interactive version of the same protocol - you either need to have been in the loop when the data was archived, or trust someone who was.

The trusted third party can be a program running in a trusted execution environment (but note pretty much all current TEEs have known fault injection flaws), or in a cloud provider that offers vTPM attestation and a certificate for the state (e.g. Google signs a certificate saying an endorsement key is authentically from Google, and the vTPM signs a certificate saying a particular key is restricted to the vTPM and only available when the compute instance is running particular known binary code, and that key is used to sign a certificate attesting to a TLS transcript).

I'm working on a simpler solution that doesn't use multiparty computation, and provides cloud attestation - https://lemmy.amxl.com/c/project_uniquonym https://github.com/uniquonym/tls-attestproxy - but it's not usable yet.

Another solution is if the server will cooperate with a TLS extension. TLS-N (https://eprint.iacr.org/2017/578.pdf) does exactly this, and gives a trivial solution for provenance.


As important as cryptography is, I also wonder how much of it is trying to find technical solutions for social problems.

People are still going to be suspicious of each other, and service providers are still going to leak their private keys, and whatnot.


You may be interested in Reclaim Protocol and perhaps zkTLS. They have something very similar going and the sources are free.

https://github.com/reclaimprotocol

https://drive.google.com/file/d/1wmfdtIGPaN9uJBI1DHqN903tP9c...

https://www.reclaimprotocol.org/

https://docs.lighthouse.storage/lighthouse-1/zktls


It’s an interesting idea for sure. Some drawbacks I can think of:

- bigger resource usage. You will need to maintain a dump of the TLS session AND an easily extractable version

- difficulty of verification. OpenSSL / BoringSSL / etc. will all evolve and, say, completely remove support for TLS versions, ciphers, and TLS extensions… This might make many dumps unreadable in the future, or require the exact same version of a given piece of software to read them. Perhaps adding the decoding binary to the dump would help, but then you’d run into Linux backward-compatibility issues.

- compression issues: new compression algorithms will be discovered and could reduce storage usage, but you’ll have a hard time benefiting from them, since encrypted TLS streams look random to compression software.

I don’t know. I feel like it’s a bit overkill — what are the incentives for tampering with this kind of data?

Maybe a simpler way of going about it would be to build a separate system that does the « certification » after the data is dumped; combined with multiple orgs actually dumping the data (reproducibility), this should be enough to prove that a dataset is really what it claims to be.



