Personally I think trained models are derived works of all the training data. Ju...

JAlexoid · on Feb 13, 2024

You're trying to use words without the legal context here. The legal definition of words isn't 1:1 with our colloquial usage.

Translation of a book is non-transformative and retains the original author's artistic expression.

As a counterexample: if you write an essay about Picasso's Guernica painting, it is derivative according to our colloquial use of the term, but legally it's an original work.

fenomas · on Feb 13, 2024

Wikipedia:

> In copyright law, a derivative work is an expressive creation that includes major copyrightable elements of ... the underlying work

A trained model fails that on two counts, doesn't it? Both the "includes" part, and the fact that a model is itself not an expressive work of authorship.

frabcus · on Feb 15, 2024

I'm not sure. If it fails, then I reckon a binary compiled from source code fails too.

There's nothing creative about what a compiler does; it is automatic, just like the training run of an LLM.

And no part of the original source code is in the binary output.

And yet, binaries are a derived work from the source code that went into them.

So something is up! I am not a lawyer though.

fenomas · on Feb 16, 2024

> And no part of the original source code is in the binary output.

It's not about whether the binary includes the raw text of the source, but whether it copies the expressive content. Anything expressive (i.e. copyrightable) in a compiled binary must have come from the source code, so that's what makes it a derived work.

But the same isn't true of LLMs, which are more like "data about their inputs" than "a transformed version of their inputs".

thuuuomas · on Feb 13, 2024

Curating training data is an exercise in editorial judgement.

fenomas · on Feb 14, 2024

If a trained model doesn't meet the definition of a derivative work, then it doesn't matter whether its training data was curated.