Which would mean nothing at all if any of the "training models on code doesn't require permission in the first place" theories (Fair Use or otherwise) is true, and pretty much all current models collapse into illegality if none of those theories holds.
You can’t use a license to bind people who don't need a license.
Yes, or there should be a ROBOTS.TXT file that describes how the code in a directory may be indexed or used by machines (e.g. malware scanning okay, no LLM training, etc.). But you're probably correct that such rights should just be covered by the license itself.
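No such standard exists today, but as a rough illustration, a directory-level machine-use policy in that spirit might look something like this (the use-case directives are entirely hypothetical; only the User-agent/Allow/Disallow shape is borrowed from robots.txt):

    # Hypothetical machine-use policy for this directory (no such standard exists)
    User-agent: *
    Allow: indexing
    Allow: malware-scanning
    Disallow: llm-training
    Disallow: dataset-redistribution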
Your reciprocity suggestion could also work, since it would mean any LLM trained on even a single file of GPL code would be "poisoned" in the sense that any code it produced would also be GPL. This would make people wary of using the LLM for writing code, and thus would make the creators of it wary of training it on the code.
Humans are also not allowed to simply regurgitate GPL code verbatim, even if they do it by committing the code to memory. There's a reason clean-room implementations are a thing; sometimes even the appearance of someone possibly remembering code verbatim is risky enough to warrant extra care. That said, usually the risks are acceptable even without extra measures, because you can hold the humans responsible.
Now, just because you don't understand a language model and call its fitting procedure 'learning' doesn't mean that it is doing anything even remotely similar. And even if it does, the model has no legal responsibilities, so if you want to distribute it, you as the distributor need to assume responsibility for all the copyrights such an act could possibly infringe.
There are measures you can take to try to prevent information from any one code base from entering the model verbatim, such as severely restricting the model size or carefully obfuscating the data, but to my knowledge nobody has used any method that gives a proven measure of security.
If an AI model regurgitates GPL-licensed code verbatim, that code is already protected by copyright, and there is no need to update the GPL to cover it specifically.
This can be solved with automated tools that review your code for copyright violations.
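For what it's worth, here is a crude sketch of what such a tool might do, assuming you have a local corpus of GPL code to compare against (the paths, the 30-token threshold, the .c-only glob, and the whole approach are made up for illustration; a real scanner would do much more, e.g. normalize formatting and identifiers and report which file and license each match came from):

    # Hypothetical sketch: flag generated code that reproduces long verbatim
    # token runs from a local corpus of GPL-licensed files.
    from pathlib import Path

    WINDOW = 30  # consecutive tokens that count as a suspicious verbatim run

    def token_windows(text, size=WINDOW):
        tokens = text.split()
        for i in range(len(tokens) - size + 1):
            yield " ".join(tokens[i:i + size])

    def build_index(corpus_dir):
        index = set()
        for path in Path(corpus_dir).rglob("*.c"):  # .c only, for brevity
            index.update(token_windows(path.read_text(errors="ignore")))
        return index

    def flag_verbatim(generated_code, index):
        return [w for w in token_windows(generated_code) if w in index]

    # Usage (paths hypothetical):
    # index = build_index("/corpora/gpl")
    # hits = flag_verbatim(Path("suggestion.c").read_text(), index)
    # print(len(hits), "suspicious verbatim windows")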
But to be honest, this is a non-issue. Copilot (and co) rarely outputs copyrighted code verbatim, and when it does, it's usually too small or too trivial to fall under copyright protection. I made a prediction last year that no one would get sued for using Copilot-generated code, and I believe it has held so far[0].
I understand language models quite well. I have published research on this at renowned conferences.
I also know how they learn. And I know how the biological brain learns. The differences are just technical. The underlying concept is really the same: both learn by adjusting the strengths of the connections between neurons based on what they see.
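To make that concrete, here is a toy sketch (plain Python, made-up numbers) of what "adjusting connection strengths based on what is seen" looks like for a single artificial neuron; it is obviously a cartoon of both the artificial and the biological case:

    # Toy example: one linear neuron learning from made-up (input, target) pairs
    # by nudging each connection weight in proportion to the error it just saw.
    examples = [([1.0, 0.0], 1.0), ([0.0, 1.0], 0.0)]  # hypothetical data
    weights = [0.1, 0.1]
    lr = 0.5  # learning rate

    for _ in range(100):
        for inputs, target in examples:
            output = sum(w * x for w, x in zip(weights, inputs))
            error = output - target
            # Each connection strength is adjusted based on what was just seen.
            weights = [w - lr * error * x for w, x in zip(weights, inputs)]

    print(weights)  # converges toward [1.0, 0.0] on this toy data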
Legal responsibility is something different. That is up to politics to decide.
My argument rests purely on the fact that what humans do and what machines do is extremely similar; the differences are just technical and don't really matter here. This is often misunderstood by people who think that machines don't really learn like humans.
Computers aren’t humans. There’s no reason licenses should treat them the same way when completely different considerations of fairness and economic impact apply.
You can very much be sued for working on a proprietary project and then trying to reproduce it somewhere else. That's why clean-room reimplementations have to make sure that no contributor has seen the original code.
I think it's an open question whether that would actually work. I would guess that if the courts decided it did, we'd see a GPL 4 with that sort of clause.