It's legally a grey area. It might even be fair use. Facts themselves are not pr...

tivert · 2025-01-27T17:47:46 1738000066

> Facts themselves are not protected by copyright.

But don't LLMs encode language, not facts?

> If there's no unauthorized reproduction/copying then it's not a copyright issue.

I'm pretty sure copyright holders have gotten the models to regurgitate their copyrighted works verbatim, or nearly so.

MRtecno98 · 2025-01-27T20:18:04 1738009084

We don't know what LLMs encode because we don't know what the model weights represent.

On the second point, it depends on how the models were made to reproduce text verbatim. If I copy-paste someone's article into MS Word, I have technically made Word reproduce the text verbatim, but obviously that's not Word's fault. If I asked an LLM explicitly to list the entire Bee Movie script it would probably do it, which means it was trained on it, but that's through a direct and clear request to copy the original verbatim.

lmm · 2025-01-28T01:16:50 1738027010

> If I copy-paste someone's article into MS Word, I have technically made Word reproduce the text verbatim, but obviously that's not Word's fault. If I asked an LLM explicitly to list the entire Bee Movie script it would probably do it, which means it was trained on it, but that's through a direct and clear request to copy the original verbatim.

But that clearly means that the LLM already has the Bee Movie script inside it (somehow), which would be a copyright violation. If MS Word came with an "open movie script" button that let you pick a movie and get the script for it, that would clearly be a copyright violation. Of course, if the user inputs something then that's different - that's not the software shipping whatever it is.

cdblades · 2025-01-27T21:49:31 1738014571

That's not a fair comparison. The user in the Word example already had access to the infringing content, so they could copy it and then paste it into Word.

But the LLM has to have that copy, verbatim, to produce it, as you acknowledge.

If Dropbox were hosting and serving IP from Paramount, Paramount would be able to submit a DMCA request to get that data removed.

Not only can you not submit a DMCA request to ChatGPT, they can't actually obey one.

tivert · 2025-01-28T05:08:13 1738040893

> If I asked an LLM explicitly to list the entire Bee Movie script it would probably do it, which means it was trained on it, but that's through a direct and clear request to copy the original verbatim.

Huh? The "request" part doesn't matter. What you describe is exactly like if someone ships me a hard drive with a file containing "the entire Bee Movie script" that they were not authorized to copy: it's copyright infringement before and after I request the disk to read out the blocks with the file.

bee_rider · 2025-01-27T21:43:55 1738014235

I mean, it is IP law, this stuff was all invented to help big corps support their business models. So, it is impossible to predict what any of it means until we see who is willing to pay more to get their desired laws enforced. We’ll have to wait for more precedent to be purchased before us little people can figure out what the laws are.

hackingonempty · 2025-01-27T21:47:32 1738014452

Copies are made in the formation of the training corpus and in the memory of the computers during training, so there's definitely a copyright issue. Could be fair use though.

icedchai · 2025-01-27T22:13:34 1738016014

Is there also a copyright issue with search engines?

hackingonempty · 2025-01-27T22:49:29 1738018169

No, the DMCA amended the law to give search engines (and automated caches and user-generated content sites) safe harbor from infringement if they follow the takedown protocol.