
> sign archives with email/___domain/document certificates

I do a bit of web archival for fun, and have been thinking about something.

Currently I save the response body, the response headers, and the request headers for the data I archive from the net.

But I was thinking that maybe, instead of just saving that, I could go a level deeper and preserve the actual TCP packets and the TLS key exchange as well.

And then I might be able to get a lot of data provenance “for free”. Because if, in some decades, we look back at the saved TCP packets and TLS material, we would see that these packets were signed with a certificate chain that matches what that website was serving at the time. Assuming, of course, that they haven’t accidentally leaked their private keys in the meantime, that the CA hasn’t gone rogue since, etc.

To me it would make sense to build out web archival infra that preserves the CA chain and enough context to be able to see later that it was valid. And if many people across the world save the right parts, we don’t have to trust each other in order to verify that the data the other saved really was sent by the website our archives say it was from.

For example, maybe I only archived a single page from some ___domain, and you saved a whole bunch of other pages from that ___domain around the same time, so the same certificate chain was used in the responses to both of us. Then I can know that the data you say you archived from them really was served by their server, because I have the certificate chain I saved to verify that.
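
For what it's worth, even without keeping raw packets, the chain a server presented at archive time can be recorded with stock tools. A minimal sketch (example.org is just a stand-in for whatever ___domain is being archived):

    # Save the certificate chain the server presented, with the capture time in the filename
    openssl s_client -connect example.org:443 -servername example.org -showcerts </dev/null \
      > chain_$(date -u +%Y%m%dT%H%M%SZ).txt

The output mixes PEM blocks with s_client's handshake notes, so it's a record of the handshake rather than a clean bundle, but it's enough to compare chains between archives later.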




In terms of tooling, there's scoop[0], which does a lot of the capture part of what you're thinking about. The files it creates include request headers and responses, TLS certificates, PDFs and screenshots, and it has support for signing the whole thing as proof of provenance.

Overall though I think archive.org is probably sufficient proof that a specific page had certain content on a certain day for most purposes today.

0. https://github.com/harvard-lil/scoop


The idea is good. As far as I understand TLS, however, the cert / asymmetric key is only used to prove the identity/authenticity of the cert, and thus of the host, for this session.

But the main content is not signed / checksummed with it, only with a symmetric session key, so one could probably manipulate the content in the packet dump anyway.

I read about a Google project named SXG (Signed HTTP Exchanges) that might do related stuff, albeit likely requiring the assistance of the publisher.


"TLS-N", "TLS Sign", and maybe a couple others were supposed to add non-repudiation.

But they didn't really go anywhere:

https://security.stackexchange.com/questions/52135/tls-with-...

https://security.stackexchange.com/questions/103645/does-ssl...

There are some special cases that do provide non-repudiation, like (I think) the DKIM headers used to sign e-mails.

For that, `tcpdump` with `SSLKEYLOGFILE` will probably get you started on capturing what you need.
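
A minimal sketch of that setup, assuming Linux, a curl build whose TLS library honours `SSLKEYLOGFILE`, and example.org as a placeholder target (the tshark preference is `tls.keylog_file` in recent versions, `ssl.keylog_file` in older ones):

    # 1. Capture the raw packets for the session
    sudo tcpdump -i any -w example.pcap host example.org &

    # 2. Make the request while logging the TLS session secrets
    SSLKEYLOGFILE=keys.log curl -o page.html https://example.org/

    # 3. Stop the capture, then decrypt and inspect it with the saved secrets
    kill %1
    tshark -r example.pcap -o tls.keylog_file:keys.log -Y http
    # (use -Y http2 if the server negotiated HTTP/2)

The caveat above still applies, though: the key log lets anyone decrypt (and, in principle, forge) the session data, so this is a capture format, not proof of provenance.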


To extend this to archival integrity without cooperation from the server/host, you'd need the client to sign the received bytes.

But then you need the client to be trusted, which clashes with distributing the archiving effort.

Hypothetically, what about trusted orgs standing up an endpoint that you could feed a URL, then receive back attestation from them as to the content, then include that in your own archive?

Compute and network traffic are pretty cheap, no?

So if it's just grabbing the same content you are, signing it, then throwing away all the data and returning you the signed hash, that seems pretty scalable?

Then anyone could append that to their archive as a certificate of authenticity.
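
A rough sketch of what such an endpoint might do per request (everything here is hypothetical: the org's keypair, the filenames, and the choice of a plain SHA-256 + RSA signature over the digest and a timestamp):

    # Fetch the same URL the archivist reports
    curl -s https://example.org/page > content.bin

    # Record digest + fetch time, then sign that statement with the org's private key
    sha256sum content.bin | awk '{print $1}' > statement.txt
    date -u +%Y-%m-%dT%H:%M:%SZ >> statement.txt
    openssl dgst -sha256 -sign org_key.pem -out attestation.sig statement.txt

    # Return statement.txt + attestation.sig; content.bin can be thrown away

Anyone holding the org's public key could later run `openssl dgst -sha256 -verify org_pub.pem -signature attestation.sig statement.txt` and compare the digest to their own archived bytes.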


Reminds me of timestamp protocol and timestamp authorities.

Not quite the same problem, but similar enough to have a similar solution. https://www.ietf.org/rfc/rfc3161.txt
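
For reference, a minimal RFC 3161 round trip with the `openssl ts` tooling (the TSA URL is a placeholder for whichever timestamp authority you trust, and you need its CA certificate to verify the reply):

    # Build a timestamp request over the archived file
    openssl ts -query -data archive.warc -sha256 -cert -out request.tsq

    # Send it to a timestamp authority
    curl -s -H "Content-Type: application/timestamp-query" \
         --data-binary @request.tsq https://tsa.example.com/ > reply.tsr

    # Inspect and verify the signed timestamp
    openssl ts -reply -in reply.tsr -text
    openssl ts -verify -data archive.warc -in reply.tsr -CAfile tsa_ca.pem

This proves the file existed at a point in time, not that it came from a particular server, so it complements rather than replaces the provenance ideas above.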


Unfortunately, the standard TLS protocol does not provide a non-repudiation mechanism.

It works by using public key cryptography and key agreement to get both parties to agree on a symmetric key, and then uses the symmetric key to encrypt the actual session data.

Any party who knows the symmetric key can forge arbitrary data, and so a transcript of a TLS session, coupled with the symmetric key, is not proof of provenance.

There are interactive protocols that use multi-party computation (see for example https://tlsnotary.org/) where there are two parties on the client side, plus an unmodified server. tlsnotary only works for TLS1.2. One party controls and can see the content, but neither party has direct access to the symmetric key. At the end, the second party can, by virtue of interactively being part of the protocol, provably know a hash of the transaction. If the second party is a trusted third party, they could sign a certificate.

However, there is not a non-interactive version of the same protocol - you either need to have been in the loop when the data was archived, or trust someone who was.

The trusted third party can be a program running in a trusted execution environment (but note pretty much all current TEEs have known fault injection flaws), or in a cloud provider that offers vTPM attestation and a certificate for the state (e.g. Google signs a certificate saying an endorsement key is authentically from Google, and the vTPM signs a certificate saying a particular key is restricted to the vTPM and only available when the compute instance is running particular known binary code, and that key is used to sign a certificate attesting to a TLS transcript).

I'm working on a simpler solution that doesn't use multiparty computation, and provides cloud attestation - https://lemmy.amxl.com/c/project_uniquonym https://github.com/uniquonym/tls-attestproxy - but it's not usable yet.

Another solution is if the server will cooperate with a TLS extension. TLS-N (https://eprint.iacr.org/2017/578.pdf) does exactly this, and gives a trivial solution for provenance.


As important as cryptography is, I also wonder how much of it is trying to find technical solutions for social problems.

People are still going to be suspicious of each other, and service providers are still going to leak their private keys, and whatnot.


You may be interested in Reclaim Protocol and perhaps zkTLS. They have something very similar going and the sources are free.

https://github.com/reclaimprotocol

https://drive.google.com/file/d/1wmfdtIGPaN9uJBI1DHqN903tP9c...

https://www.reclaimprotocol.org/

https://docs.lighthouse.storage/lighthouse-1/zktls


It’s an interesting idea for sure. Some drawbacks I can think of:

- bigger resource usage. You will need to maintain a dump of the TLS session AND an easily extractable version

- difficulty of verification. OpenSSL / BoringSSL / etc. will all evolve and, say, completely remove support for TLS versions, ciphers, and TLS extensions… This might make many dumps unreadable in the future, or require the exact same version of a given piece of software to read them. Perhaps adding the decoding binary to the dump would help, but then you’d run into Linux backward-compatibility issues.

- compression issues: new compression algorithms will be discovered and could reduce storage usage, but you’ll have a hard time benefiting from them, since encrypted TLS streams look random to compression software.

I don’t know. I feel like it’s a bit overkill — what are the incentives for tampering with this kind of data?

Maybe a simpler way of going about it would be to build a separate system that does the « certification » after the data is dumped; combined with multiple orgs actually dumping the data (reproducibility), this should be enough to prove that a dataset is really what it claims to be.



