Hacker News

Some people don’t put their code on GitHub because they object to GitHub’s ToS, especially the clauses pertaining to analysis and use by Copilot and the various other uses to which Microsoft may see fit to put the code.

Will Microsoft see this as a free license to use all of PyPI?




The Copilot team could pull the code from PyPI and use it to train their models, regardless of whether it's on GitHub. If you don't want AI trained on your code, then either don't publish it, or publish it somewhere that forbids (and preferably enforces a ban on) AI companies indexing or training on it. But good luck with that... it's public code. Don't publish it if you don't want humans or machines to read it.


They could do that, just as someone could pull videos and songs from YouTube and use them to train a model. It's public content, so if people don't want humans and machines to access it, they shouldn't publish it on a public platform.

One can argue about ToS and copyright, about different interpretations of fair use, derivative works, DRM protections, and so on. Usually people are not interested in discussing the finer details of those things. Most people seem to want to perceive content as either public or not public, and in that case YouTube is just as public as PyPI.


If Microsoft could legally do that, then why is that clause present in the GitHub ToS?


Because you're pushing the code to GitHub, so they need to enumerate their rights in terms of what they can do with it once you push it there. But if you publish your code to PyPI, the relevant ToS is the PyPI ToS, which has no such clause and doesn't forbid either PyPI or others from using the code as they'd like (and as mentioned in other comments in this thread, the ToS actually explicitly grants others the right to republish the code).


Because most companies/people are (AFAIK) under the impression that training on public data falls under "fair use" because it's substantially transformative; and in the case that it doesn't, you've already agreed to it on GitHub.

It's a fallback clause: "fair use" is irrelevant if you've already given GitHub permission to use the code. By adding that clause, they ensure you can no longer argue that it's not fair use to use the code you put on GitHub after agreeing to their terms.


Out of my own curiosity, which clause in the GitHub TOS are you referring to?


I don’t know, but it seems to be this one: <https://news.ycombinator.com/item?id=37425482>


The GPL should really be updated to say that any code produced from machine learning on GPL code is also GPL licensed.


Which would mean nothing at all if any of the "training models on code doesn't require permission in the first place" theories (fair use or otherwise) is true; and pretty much all current models collapse into illegality if none of those theories holds.

You can’t use a license to bind people who don't need a license.


Yes, or there should be a robots.txt-style file that describes how the code in a directory may be indexed or used by machines (e.g. malware scanning okay, no LLM training, etc.). But you're probably correct that such rights should just be covered by the license itself.
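As a purely hypothetical sketch, such a per-directory policy file (the name `AI.TXT` and every directive below are invented for illustration; no such standard exists) might look like:

```
# AI.TXT — hypothetical machine-use policy for this directory
Allow: malware-scanning
Allow: search-indexing
Disallow: llm-training
Disallow: code-generation-corpora
```

Like robots.txt, this would only work to the extent crawlers chose to honor it; it would carry no legal force on its own, which is why covering these rights in the license itself is probably the stronger option.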

Your reciprocity suggestion could also work, since it would mean any LLM trained on even a single file of GPL code would be "poisoned" in the sense that any code it produced would also be GPL. This would make people wary of using the LLM to write code, and would thus make its creators wary of training it on that code.


But we also don't say that if a human has learned from GPL code, all code that human produces in the future is GPL licensed.


Humans are also not allowed to simply regurgitate GPL code verbatim, even if they do it by committing the code to memory. There's a reason clean-room implementations are a thing; sometimes even the appearance of someone possibly remembering code verbatim is risky enough to warrant extra care. That said, usually the risks are acceptable even without extra measures, because you can hold the humans responsible.

Now, just because you don't understand a language model and call its fitting procedure 'learning' doesn't mean that it is doing anything even remotely similar. And even if it is, it has no legal responsibilities, so if you want to distribute it, then you as the distributor need to assume responsibility for all the copyrights such an act could possibly infringe.

There are measures you can take to try to prevent the information from any one code base from entering the model verbatim, such as severely restricting the model size or carefully obfuscating the data, but to my knowledge nobody has used a method that gives any proven measure of security.


If an AI model regurgitates GPL licensed code verbatim, that code is already protected by copyright and there is no need to update the GPL to cover it specifically.


Except when the GPL header is missing from said code and the user has no idea it is protected, and/or the developer has no idea it was stolen.


If the license is missing, it is a violation of the license and there is legal recourse. Just like if someone did the same thing manually.


Yes, and that's my point.

The AI won't include it, the user won't know it should be GPL, and the developer won't know it was stolen.

This is a whole can of problematic worms.


This can be solved with automated tools that review your code for copyright violations.
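An automated checker along those lines could, as a rough sketch, compare token n-grams of new code against a corpus of known GPL code and flag high verbatim overlap. The function names, the n-gram size, and the tiny corpus below are all illustrative assumptions, not any real tool:

```python
def ngrams(tokens, n=8):
    # All windows of n consecutive tokens; empty set if the input is shorter than n.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(candidate_src, corpus_srcs, n=8):
    """Fraction of the candidate's token n-grams found verbatim in the corpus."""
    cand = ngrams(candidate_src.split(), n)
    if not cand:
        return 0.0
    corpus = set()
    for src in corpus_srcs:
        corpus |= ngrams(src.split(), n)
    return len(cand & corpus) / len(cand)

# Hypothetical usage: flag snippets where most 8-grams match known GPL code.
known_gpl = ["static int gpl_func ( int x ) { return x * 42 ; }"]
suspect = "static int gpl_func ( int x ) { return x * 42 ; }"
print(overlap_ratio(suspect, known_gpl))  # identical text, so the ratio is 1.0
```

A real scanner would also need tokenization robust to whitespace and identifier renaming, which is roughly what commercial license-compliance tools attempt.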

But to be honest, this is a non-issue. Copilot (and co.) rarely outputs copyrighted code verbatim, and when it does, it's usually too small or too trivial to fall under copyright protection. I made a prediction last year that no one would get sued for using Copilot-generated code, and I believe it has held so far[0].

[0] https://news.ycombinator.com/item?id=31849027


I understand language models quite well. I have published research on this at renowned conferences.

I also know how they learn, and I know how the biological brain learns. The differences are just technical. The underlying concept is really the same: both learn by adjusting the connection strengths between neurons based on what they see.

Legal responsibility is something different. That is up to politics to decide.

My argument rests purely on the fact that what humans do and what machines do is extremely similar; the differences are just technical and don't really matter here. This is often misunderstood by people who think that machines don't really learn like humans do.


Computers aren’t humans. There’s no reason licenses should treat them the same way when completely different considerations of fairness and economic impact apply.


You can very much be sued for working on a proprietary project and then trying to reproduce it somewhere else. That's why clean-room reimplementations have to make sure that no contributor has seen the original code.


Even better, the model itself should be GPL licensed, since it's clearly a derivative work.


I think it's an open question whether that would actually work. I would guess that if the courts decided it did, we'd see a GPL 4 with that sort of clause.


Currently the US says that nothing generated by AI has copyright at all; it is all public ___domain.


This is an interesting idea.


They are within their rights to make that choice, but when you publish a package to PyPI you agree to its terms, which give anyone the right to mirror, distribute, and otherwise use the code you've published.


Do the rights granted in the PyPI terms provide everything you need to comply with the GitHub terms? Ultimately it's tricky to understand what GitHub really means by its terms (they say "User-Generated Content" a lot).


The terms of PyPI give anyone the right to mirror. The question is whether they give anyone the right to mirror on GitHub.


Clearly yes. They don't state any such restriction.

Code whose license includes a "can't be placed on GitHub" restriction is definitely not open source, regardless of what other terms the license purports to have.


GitHub does more with the code than just mirror it, so having the right to mirror is not enough.

If trying to prevent your software from being used to create proprietary software makes it not open source, is the GNU GPL not an open-source license?


> Some people don’t put their code on GitHub, since they object to GitHub’s ToS, especially those pertaining to analysis and use by Copilot

Copilot was trained on Github code under a “training models doesn’t require permission” theory before there was anything about it in the ToS, and basically every other large model has taken a similar approach to publicly-accessible data of all kinds.

> Will Microsoft see this as a free license to use all of PyPI?

Microsoft doesn’t think they need a license for model training.



