
The GPL should really be updated to say that any code produced from machine learning on GPL code is also GPL licensed.



Which would mean nothing at all if any of the "training models on code doesn't require permission in the first place" theories (fair use or otherwise) is true; and pretty much all current models collapse into illegality if none of those theories holds.

You can’t use a license to bind people who don't need a license.


Yes, or there should be a robots.txt-style file that describes how the code in a directory may be indexed or used by machines (e.g. malware scanning okay, no LLM training). But you're probably correct that such rights should just be covered by the license itself.
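For illustration, it might look something like this, following the plain-text key/value conventions of robots.txt (the file name and every directive here are invented for the sake of the example; no such standard exists today):

    # AI-USE.TXT (hypothetical) - machine-usage policy for this directory
    User-Agent: *
    Allow: indexing
    Allow: malware-scanning
    Disallow: llm-training

Like robots.txt, this would only be advisory; a crawler that ignores it faces no technical barrier, which is why tying the terms to the license itself is probably the stronger move.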

Your reciprocity suggestion could also work, since it would mean any LLM trained on even a single file of GPL code would be "poisoned" in the sense that any code it produced would also be GPL. This would make people wary of using the LLM for writing code, and thus would make the creators of it wary of training it on the code.


You also don't say that if a human has learned from GPL code, all code that human produces in the future is GPL licensed.


Humans are also not allowed to simply regurgitate GPL code verbatim, even if they do it by committing the code to memory. There's a reason clean-room implementations are a thing; sometimes even the appearance of someone possibly remembering code verbatim is risky enough to warrant extra care. That said, the risks are usually acceptable even without extra measures, because you can hold the humans responsible.

Now, just because you don't understand a language model and call its fitting procedure 'learning' doesn't mean it is doing anything even remotely similar. And even if it is, the model has no legal responsibilities, so if you want to distribute it, then you as the distributor need to assume responsibility for all the copyrights such an act could possibly infringe.

There are measures you can take to try to prevent the information from any one code base from entering the model verbatim, such as severely restricting the model size or carefully obfuscating the data, but to my knowledge nobody has used a method that gives any proven measure of security.


If an AI model regurgitates GPL licensed code verbatim, that code is already protected by copyright and there is no need to update the GPL to cover it specifically.


Except when the GPL header is missing from said code, so the user has no idea it is protected and/or the developer has no idea it was stolen.


If the license is missing, it is a violation of the license and there is legal recourse. Just like if someone did the same thing manually.


Yes, and that's my point.

The AI won't include it, the user won't know it should be GPL, and the developer won't know it was stolen.

This is a whole can of problematic worms.


This can be solved with automated tools that review your code for copyright violations.
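The core of such a tool could be very simple: compare token windows of the generated code against a corpus of known GPL sources. A naive sketch, assuming you keep a local mirror of GPL code to scan against (the file names, the 30-token window, and the crude tokenizer are all arbitrary choices for illustration, not any existing tool's API):

    # verbatim_check.py - hypothetical sketch: flag spans of generated
    # code that appear verbatim in a corpus of GPL-licensed files.
    import re
    from pathlib import Path

    WINDOW = 30  # tokens; arbitrary threshold, a real tool would tune this

    def tokens(text):
        # crude tokenizer: words/identifiers and single punctuation marks
        return re.findall(r"\w+|[^\w\s]", text)

    def ngrams(toks, n=WINDOW):
        return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    def build_corpus_index(corpus_dir):
        index = set()
        for path in Path(corpus_dir).rglob("*.c"):
            index |= ngrams(tokens(path.read_text(errors="ignore")))
        return index

    def flag_verbatim(generated_code, index):
        # any shared WINDOW-token span is a candidate copyright hit
        return [g for g in ngrams(tokens(generated_code)) if g in index]

    if __name__ == "__main__":
        idx = build_corpus_index("gpl_corpus/")  # hypothetical local mirror
        hits = flag_verbatim(Path("generated.c").read_text(), idx)
        print(f"{len(hits)} verbatim {WINDOW}-token spans found")

Real license scanners are far more sophisticated (normalizing whitespace, renaming variables, matching at scale), but the shape of the check is the same.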

But to be honest, this is a non-issue. Copilot (and co.) rarely outputs copyrighted code verbatim, and when it does, the output is usually too small or too trivial to fall under copyright protection. I made a prediction last year that no one would get sued for using Copilot-generated code, and I believe it has held so far[0].

[0] https://news.ycombinator.com/item?id=31849027


I understand language models quite well; I have published research on this at renowned conferences.

I also know how they learn, and I know how the biological brain learns. The differences are just technical. The underlying concept is really the same: both learn by adjusting the connection strengths between neurons based on what they see.
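The kind of adjustment meant here fits in a few lines: a single connection strength nudged by gradient descent to reduce error. A toy sketch, purely illustrative of the concept rather than how any production model is actually trained:

    # One connection "strength" w, nudged to reduce error on an example.
    w = 0.5                # current connection strength
    x, target = 1.0, 2.0   # input and desired output
    lr = 0.1               # learning rate

    for _ in range(20):
        prediction = w * x
        error = prediction - target
        grad = error * x   # gradient of 0.5 * error**2 w.r.t. w
        w -= lr * grad     # adjust the strength a little

    print(w)               # approaches 2.0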

Legal responsibility is something different. That is up to politics to decide.

My argument rests purely on the fact that what humans do and what machines do is extremely similar; the differences are just technical and don't really matter here. This is often misunderstood by people who think that machines don't really learn like humans do.


Computers aren’t humans. There’s no reason licenses should treat them the same way when completely different considerations of fairness and economic impact apply.


You can very much be sued for working on a proprietary project and then trying to reproduce it somewhere else. That's why clean-room reimplementations have to make sure that no contributor has seen the original code.


Even better, the model itself should be GPL licensed, since it's clearly a derivative work.


I think it's an open question whether that would actually work. My guess is that if the courts decided it did, we'd see a GPL 4 with that sort of clause.


Currently the US says that nothing generated by AI has copyright at all; it is all public ___domain.


This is an interesting idea.



