
So who’s gonna sue an AI company, asserting that all the code it produces is GPL because it was trained on GPL code?


How is training a model on GPL code and then having it write code any different to having a human read GPL code and then write code?

Unless there's a specific copyright claim over a specific piece of code that was copied and published, it's hard to see how the GPL has any relevance.


Because, unlike humans, LLMs reliably reproduce exact excerpts from their training data. It's very easy to get image generation models to spit out screenshots from movies.
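
A minimal sketch of the kind of probe behind such claims (generate() and gpl_source.c are placeholders, not any real API): prime the model with the opening of a file suspected to be in its training set, then measure the longest verbatim run in what comes back.

    import difflib

    def generate(prompt):
        # Placeholder: call whatever code model you're testing here
        # and return its completion as a string.
        raise NotImplementedError

    def longest_verbatim_overlap(output, reference):
        # Longest exact character run shared by the model's output
        # and the original source file.
        m = difflib.SequenceMatcher(None, output, reference)
        match = m.find_longest_match(0, len(output), 0, len(reference))
        return output[match.a:match.a + match.size]

    reference = open("gpl_source.c").read()  # file suspected to be in the training set
    prompt = reference[:200]                 # prime the model with its opening lines
    output = generate(prompt)

    overlap = longest_verbatim_overlap(output, reference)
    if len(overlap) > 100:
        print("verbatim excerpt reproduced:")
        print(overlap)

If hundreds of characters come back verbatim, that's reproduction, not "learning like a human".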

That doesn't mean that all of the output from an LLM trained on GPL code is a derivative work (and therefore GPL'd too).

A model that provably engages in systematic, difficult-to-detect plagiarism must itself be considered plagiaristic.

I see that argument over and over, and I don't understand how people can think it makes sense.

"My clipboard learned the code, just like a human would. So it should be fine to copy-paste anything and call it my own".

"How is killing a human any different to killing a computer?"

"If humans can vote, why couldn't computers vote as well?"

Can we start at "humans are not computers", maybe?


> Can we start at "humans are not computers", maybe?

Sure. Then it stands to reason that "computers" are not bound by human laws. So an LLM that finds a piece of copyrighted work out there on the internet, downloads it, and republishes it has not broken any law? It certainly can't be prosecuted.

My original point was that copyright is (amongst other things) about protecting distribution and derivative-work rights. I'm not seeing a coherent argument that feeding a copyrighted work (that you obtained legally) into a machine is breaching anyone's copyright.


> So an LLM that finds a piece of copyrighted work out there on the internet, downloads it, and republishes it has not broken any law?

Are you even trying? A gun that kills a person has not broken any law? It certainly can't be prosecuted.

> I'm not seeing a coherent argument that feeding a copyrighted work (that you obtained legally) into a machine is breaching anyone's copyright.

So you don't see how having an automated black box that takes copyrighted material as input and produces a competing alternative that can't be proven to come from that input goes against the whole idea of copyright protection?


> So you don't see how having an automated black box that takes copyrighted material as input and produces a competing alternative that can't be proven to come from that input goes against the whole idea of copyright protection?

Semantically, this is the same as a human reading all of Tom Clancy and then writing a fast-paced action/war/tension novel.

Is that in breach of copyright?

Copyright protects the expression of an idea. Not the idea.


> Copyright protects the expression of an idea. Not the idea.

Copyright laws were written before LLMs. The fact that a new technology can completely bypass the law doesn't mean that it's okay.

If I write a novel, I deserve credit for it, and I deserve the right to sell it and to prevent somebody else from selling it under their own name. If I were allowed to just copy any book and sell it, I could sell it much cheaper because I didn't spend a year writing it. And the author would be screwed: people would buy my version (cheaper) and might never even hear of the original author (say, if my copying process is good enough and I build a "Netflix of stolen books").

Now if I take the book, have it automatically translated by a program, and sell it under my name, that's also illegal, right? Even though it may be harder to detect: say I translate a Spanish book into Mandarin; someone would need to realise that I "stole" the Spanish book. But we wouldn't want this to be legal, would we?

An LLM does that in a way that is much harder to detect. In the era of LLMs, if I write a technical blog, nobody will ever see it, because they will get the information from the LLM that trained on my blog. If I open-source code, nobody will ever see it if they can just ask their LLM to write an entire program that does the same thing. But chances are that the LLM couldn't have done it without having trained on my code. So the LLM is "stealing" my work.

You could say "the solution is to not open-source anything", but that's not enough: art (movies, books, paintings, ...) fundamentally has to be shown, and can therefore be trained on. LLMs bring us towards a point where none of those concepts will matter, whether open source, source-available, or proprietary: if you manage to train your LLM on that code (even proprietary code that was illegally leaked), you'll have essentially stolen it in a way that may be impossible to detect.

How in the world does that sound like a desirable future?


> A gun that kills a person has not broken any law? It certainly can't be prosecuted.

Yeah dude… it's an inanimate object.


Maybe I need to spell it out: my point is that the one responsible is the human behind the gun... or behind the LLM. The argument that "an LLM cannot do anything illegal because it is not a human" is nonsense: it is operated by a human.

I feel like nobody cares. It sucks, I know. Like climate change, biodiversity loss, the energy crisis.

Feels like we're pretty much screwed. Doesn't mean it's not a problem.



