If you've got one certificate with a Subject Alternative Name entry for each host, they'd see them all. If you use SNI with a separate certificate per host, the domains might still appear in Certificate Transparency logs. A wildcard certificate can help conceal the exact subdomain.
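The wildcard point can be made concrete: a cert for `*.example.com` matches any single-label subdomain, so the specific hostname never appears in the certificate or in CT logs. A minimal sketch of the matching rule (simplified from RFC 6125; the function name is mine):

```python
def wildcard_matches(pattern: str, hostname: str) -> bool:
    """Simplified RFC 6125-style wildcard match: '*' may only replace
    the entire left-most label, and matches exactly one label."""
    p_labels = pattern.lower().split(".")
    h_labels = hostname.lower().split(".")
    if len(p_labels) != len(h_labels):
        return False
    if p_labels[0] == "*":
        # All remaining labels must match exactly.
        return p_labels[1:] == h_labels[1:]
    return p_labels == h_labels

# The wildcard cert reveals only the parent ___domain, not which
# subdomain was visited:
print(wildcard_matches("*.example.com", "secret-project.example.com"))  # True
print(wildcard_matches("*.example.com", "a.b.example.com"))             # False: one label only
```

Note that because `*` only covers one label, deeply nested subdomains still need their own cert (or a nested wildcard), which leaks back into CT logs.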
2. How do you avoid this becoming a supercookie that badly impacts privacy? Zero-knowledge proofs provide some help here: there are ways to create an ID that rotates on a fixed schedule and differs per site, such that different IDs can't be correlated with each other. That prevents long-term and cross-site tracking while still providing enough to rate-limit per natural person.
3. How do you stop people selling their identity to scrapers? This is a hard one to solve, but there are protocols that make it harder without giving up sensitive information or being interactively involved on an ongoing basis.
Unfortunately, the standard TLS protocol does not provide a non-repudiation mechanism.
It works by using public key cryptography and key agreement to get both parties to agree on a symmetric key, then using that symmetric key to encrypt the actual session data.
Any party who knows the symmetric key can forge arbitrary data, and so a transcript of a TLS session, coupled with the symmetric key, is not proof of provenance.
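The forgery point is easy to demonstrate with a shared-key MAC, which is the style of integrity protection TLS records get once the handshake completes (illustrative sketch, not the actual TLS record format):

```python
import hmac
import hashlib

# In TLS, both endpoints derive the same session keys from the handshake.
session_key = b"negotiated-by-both-client-and-server"

def mac_record(key: bytes, record: bytes) -> bytes:
    # Symmetric authentication: the SAME key both creates and verifies tags,
    # so a valid tag only proves "someone with the key" produced it.
    return hmac.new(key, record, hashlib.sha256).digest()

genuine = b"HTTP/1.1 200 OK\r\n\r\nreal server response"
forged = b"HTTP/1.1 200 OK\r\n\r\nfabricated response"

tag_real = mac_record(session_key, genuine)
tag_forged = mac_record(session_key, forged)

# The client (or anyone later shown the key) can mint a tag that verifies
# just as well as the server's, so transcript + key proves nothing about origin.
print(hmac.compare_digest(tag_forged, mac_record(session_key, forged)))  # True
```

Contrast this with a digital signature, where only the holder of the private key can produce a valid tag; that asymmetry is exactly what TLS lacks at the record layer and what the schemes below try to bolt on.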
There are interactive protocols that use multi-party computation (see for example https://tlsnotary.org/) in which there are two parties on the client side, plus an unmodified server. TLSNotary only works with TLS 1.2. One party controls and can see the content, but neither party has direct access to the symmetric key. At the end, the second party, by virtue of having interactively taken part in the protocol, provably knows a hash of the transcript. If the second party is a trusted third party, they could sign a certificate.
However, there is not a non-interactive version of the same protocol - you either need to have been in the loop when the data was archived, or trust someone who was.
The trusted third party can be a program running in a trusted execution environment (but note pretty much all current TEEs have known fault injection flaws), or in a cloud provider that offers vTPM attestation and a certificate for the state (e.g. Google signs a certificate saying an endorsement key is authentically from Google, and the vTPM signs a certificate saying a particular key is restricted to the vTPM and only available when the compute instance is running particular known binary code, and that key is used to sign a certificate attesting to a TLS transcript).
Another solution is for the server to cooperate via a TLS extension. TLS-N (https://eprint.iacr.org/2017/578.pdf) does exactly this, and makes provenance trivial.
You can still apply other limits by IP. 429s tend to slow scrapers down, and they mean you spend a lot less on bandwidth and compute when they get too aggressive. Monitor and adjust the regex list over time as needed.
Note that if SEO is a goal, this does make you vulnerable to blackhat SEO: someone could fake the UA of a search engine you care about and eat its 6 req/minute quota with fake bots. You could treat Google differently.
This approach won't solve for the case where the UA is dishonest and pretends to be a browser - that's an especially hard problem if they have a large pool of residential IPs and emulate / are headless browsers, but that's a whole different problem that needs different solutions.
For Google, just read their publicly published list of crawler IPs. It’s broken into three JSON files by category: one for Googlebot (the web crawler), one for special requests such as those from Google Search Console, and one for special crawlers related to things like Google Ads.
You can ingest this IP list periodically and set rules based on those IPs instead. That makes you immune to the blackhat SEO tactic you mentioned. In fact, you could outright block Googlebot UA strings that don’t match the IPs, without harming SEO, since those UA strings are being spoofed ;)
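Once you've fetched the JSON, verifying a claimed Googlebot is just a containment check with the stdlib `ipaddress` module. The prefixes below are samples in the shape of Google's published files, not guaranteed current - fetch the real list periodically rather than hardcoding:

```python
import ipaddress

# Illustrative stand-in for a fetched googlebot.json-style payload
# ("prefixes" entries carry either "ipv4Prefix" or "ipv6Prefix").
crawler_json = {
    "prefixes": [
        {"ipv4Prefix": "66.249.64.0/27"},
        {"ipv6Prefix": "2001:4860:4801:10::/64"},
    ]
}

networks = [
    ipaddress.ip_network(p.get("ipv4Prefix") or p.get("ipv6Prefix"))
    for p in crawler_json["prefixes"]
]

def is_listed_crawler(ip: str) -> bool:
    """True if the client IP falls inside any published crawler prefix."""
    addr = ipaddress.ip_address(ip)
    # Mixed-version containment (v4 addr vs v6 net) is simply False.
    return any(addr in net for net in networks)

print(is_listed_crawler("66.249.64.5"))   # inside the sample range
print(is_listed_crawler("203.0.113.7"))   # a spoofed "Googlebot" UA from elsewhere
```

In practice you'd refresh `networks` on a timer (Google doesn't publish a change frequency, so daily is a common guess) and fail open or closed depending on how much you trust your cache.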
I don't think analogies need to be historically accurate to be useful - and often there is something to the modern usage of them even if the original basis is a misrepresentation. Analogies and parables serve as a shortcut to aid understanding in communication, and that is valuable even if they are entirely apocryphal.
Similar examples:
* The tragedy of the commons - this is an extremely useful parable to invoke a game-theoretic / social behaviour concept. However, it is apparently historically inaccurate - at small scales, societies were able to protect and look after common lands, because of other societal factors such as reputation; the actual loss of the commons came not from overuse by the unlanded but from people who already had their own land taking more and fencing it off. Nevertheless, in larger scale societies, the principle of over-exploitation of shared resources in the absence of a mechanism to prevent that is a real and valid concern, so the analogy has a lot of value.
* Probably apocryphal fables like 'The boy who cried wolf' have immediate meaning when it comes to concepts like alarm fatigue.
* Many religious analogies have persisted in societies that don't hold the original religious belief. 'Holy Grail', for example, as an analogy for a desirable outcome.
* Concepts from popular fiction sometimes become analogies too. "Golden Path" for example.
Not every analogy makes sense in every circumstance, but they are useful as a mutually understood shorthand to convey concepts.
I'm not sure 'privacy violation' is necessarily the right term to help people understand why long-term non-repudiation is an undesirable property for some people.
It comes down to this: if a third party gets access to your emails (e.g. through a server compromise), should they be able to prove to a fourth party that the emails are legitimately yours rather than completely faked? Non-repudiation through strong DKIM keys enables exactly that.
Example: the third party is a ransomware gang that releases your emails because you didn't pay after your email server was compromised. The fourth party is a journalist who doesn't trust the ransomware gang and would love to publish a juicy story about your company, but doesn't want to risk their reputation, or a defamation case, if the gang simply invented the emails.
Non-repudiation is virtually always undesirable in general-purpose messaging systems. Revealing to a stranger whether a message is valid is a concession to that stranger, not a benefit to the email's owner. This property is called "deniability" and most secure messaging systems go way out of their way to have it.
It is better to ask your internal recruiter / HR department to inform the candidate of your feedback (if you work for a big enough company). It is also good practice to always have a panel, not just the hiring manager, doing interviews.
So the candidate gets feedback along the lines of: "Thank you for participating in our interview process. Unfortunately, our panel decided you weren't the best fit for position X at this time, because ...reasons.... Under company policy, we won't accept further applications from you for one year from today, but we would encourage you to apply for a role with us in the future".
There is a chance they will reply to HR and argue, but it is HR's job to be polite but firm: the decision is already made, they can apply again in one year, and nothing gets passed back to the hiring manager.
The key is to think long term and about the company as a whole: the candidate who gets helpful feedback and is treated fairly is more likely to apply again in the future (after the mandatory cooling-off period), when they might have more skills and experience from working somewhere else. There is a finite qualified labour pool no matter where you are based, and having the goodwill even of rejected candidates is a competitive advantage. The message should be "not now" rather than "not ever" (although of course, if they do go on some kind of rampage, they can turn the "not now" into "not ever" - that's a bridge-burning move). If a tiny percentage go on a rampage, but the company protects its people from it and has lots of counteracting positive sentiment from prospective and actual staff, then it's still a net positive.
I think it's right to base the assessment of whether something is a walled garden on how easy it is for outsiders to access, and how easy it is to leave and take your community with you.
For viewing, I think you are doing well - your own ___domain name, which you can host where you like, and which currently doesn't impose many restrictions on who can view without signing up to anything.
But part of your community engagement is having the community submit changes to you. Doing that via GitHub is a walled garden: you can't make a PR, or even search the code, without a GitHub account. They say you are only allowed one free account, so one identity only, and I've heard credible reports they actively enforce this by IP matching and the like, banning people they suspect of having two accounts.
Moving off GitHub isn't always easy either: you'd need to retrieve all your PRs, and then the people who engage with you through their GitHub accounts would need to migrate their method of engagement.
So GitHub is absolutely a walled garden, and if you have a public GitHub, it is part of how you engage with your community.
Walled gardens do have the benefit of more people being in them - there is some barrier to entry to signing up on a random Gitea or Forgejo instance - but then you are beholden to the policies of the walled garden.
Fair point - I will add a note to the top that if you don't want to contribute via GitHub, you can send me a note to [email protected]. I will make the change myself.
Also imagine you are a company with a reputation for hiring people - inducing them to leave their current job - and then often dismissing them quickly afterwards.
That would give many great prospective employees pause before applying to work there, because you are asking them to give up a good thing and take a chance on your company, without commitment.
Although the model weights themselves are also outputs of the training, and interestingly the companies that train models tend to claim model weights are copyrighted.
If a set of OpenAI model weights ever leaks, it will be interesting to see whether OpenAI claims they are subject to copyright. Surely it would be a double standard if distributing model weights is a copyright violation, but the outputs of model inference are not subject to copyright. If they can only have one of the two, the latter might matter more to OpenAI than protecting leaked weights.