> you can't copyright something that your AI generated
Seems like a loophole: if I generate synthetic data with a model trained on copyrighted works, is the synthetic data copyright-free? So I can later train models on it?
You can't "launder" copyright away like that. The court will see straight through it. See "What color are your bits?" at https://ansuz.sooke.bc.ca/entry/23
There are over 200K language modeling datasets on Hugging Face. I bet a large portion of them were generated with LLMs, and all LLMs to date have been trained on copyrighted data. So they are all tainted.
But philosophically, I wonder if it's all right to block that. It technically follows the definition of copyright: it does not carry the expression, but borrows abstractions and facts. That's exactly what is allowed.
If we move to block synthetic data, then anyone can be accused of infringement when they reuse abstractions learned somewhere else. Creativity would not be possible.
On the other hand, models trained on synthetic data will never regurgitate the originals, because they never saw them.