> So you don't see how having an automated black box that takes copyrighted material as input and provides a competing alternative that can't be proven to come from the input goes against the idea of copyright protections?
Semantically, this is the same as a human reading all of Tom Clancy and then writing a fast-paced action/war/tension novel.
Is that in breach of copyright?
Copyright protects the expression of an idea. Not the idea.
> Copyright protects the expression of an idea. Not the idea.
Copyright laws were written before LLMs. Just because a new technology can completely bypass the law doesn't mean that it's okay.
If I write a novel, I deserve credit for it, and I deserve the right to sell it and to prevent somebody else from selling it under their own name. If I were allowed to just copy any book and sell it, I could sell it for much less, because I didn't spend a year writing it. And the author would be screwed: people would buy my version (cheaper) and might never even hear of the original author (say, if my process of copying everything is good enough and I build a "Netflix of stolen books").
Now, if I take the book, have it automatically translated by a program, and sell it under my name, that's also illegal, right? Even though it may be harder to detect: say I translate a Spanish book into Mandarin; someone would have to realise that I "stole" the Spanish book. But we wouldn't want this to be legal, would we?
An LLM does the same thing in a way that is much harder to detect. In the era of LLMs, if I write a technical blog, nobody will ever see it, because they will get the information from the LLM that trained on my blog. If I open source my code, nobody will ever see it if they can just ask their LLM to write an entire program that does the same thing. But chances are the LLM couldn't have done it without having trained on my code. So the LLM is "stealing" my work.
You could say "the solution is to not open source anything", but that's not enough: art (movies, books, paintings, ...) fundamentally has to be shown, and can therefore be trained on. LLMs bring us towards a point where it won't matter whether code is open source, source available, or proprietary: if you manage to train your LLM on that code (even proprietary code that was illegally leaked), you'll have essentially stolen it in a way that may be impossible to detect.
How in the world does that sound like a desirable future?