
Also, OpenAI only started making deals (and mostly with news publishers) after the NYT lawsuit.

https://www.npr.org/2025/01/14/nx-s1-5258952/new-york-times-...

They didn't even consider doing this before. They still, as far as I know, haven't paid a dime for any book, or for any art beyond stock photography.

The lawsuit is still ongoing; if OpenAI loses, it might spell doom for the legal production and use of LLMs as a whole. There isn't enough open, free data out there to make state of the art AI.




> There isn't enough open, free data out there to make state of the art AI.

But there are models trained on legal content (like Wikipedia or StackOverflow). Also, no human needs to read millions of pirated books to become intelligent.


> But there are models trained on legal content (like Wikipedia or StackOverflow)

Literally all of them are trained on Wikipedia and SO. But /none/ of them are /only/ trained on Wikipedia and SO. They need much more than that.

> Also, no human needs to read millions of pirated books to become intelligent.

Obviously, LLM architectures that were inspired by GPT-2/3 are not learning like humans.

There has never been anything remotely good in the world of LLMs that could be said to have been trained on a moderate, more human-scoped amount of data. They're all trained on trillions of tokens.

Models trained on fewer than 1T tokens are experimental jokes with no real practical use.

You'll notice that even so-called "open data" LLMs like OLMo are, in fact, also trained on copyrighted data: datasets like Common Crawl claim fair use over anything that can be accessed from a web browser.

And then there's the whole notion of laundered data by training on synthetic data generated by another LLM. All the so-called "open" LLMs include a very significant amount of LLM-generated data. If you agree to the notion that LLMs trained on copyrighted work are a form of IP infringement and not fair use, then training on their output is just data laundering and doesn't fix the issue.


> If you agree to the notion that LLMs trained on copyrighted work are a form of IP infringement and not fair use, then training on their output is just data laundering and doesn't fix the issue.

It's fuzzy. I could imagine a situation where a primary LLM trained on copyrighted material is itself a legal hazard and can't be released, but its carefully monitored and filtered output could be declared copyright-safe, and then used to train a copyright-safe secondary LLM.
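
A minimal sketch of what that two-stage setup could look like, assuming a hypothetical primary_generate() call standing in for the primary LLM and a crude long-n-gram-overlap check standing in for "carefully monitored and filtered". Every name here is illustrative, not any real product's API:

    # Sketch only: generate text with a "primary" model, keep outputs that pass
    # a crude copyright-overlap filter, and save them as training data for a
    # "secondary" model. primary_generate() is a hypothetical stand-in.
    import json

    def primary_generate(prompt: str) -> str:
        """Stand-in for the primary LLM (e.g. an API call); not a real API."""
        raise NotImplementedError("plug in the primary model here")

    def ngram_set(text, n=8):
        words = text.split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    def looks_copyright_safe(candidate, protected_corpus, max_overlap=0.05):
        # Reject candidates that share too many long n-grams with protected text.
        cand = ngram_set(candidate)
        if not cand:
            return True
        for doc in protected_corpus:
            if len(cand & ngram_set(doc)) / len(cand) > max_overlap:
                return False
        return True

    def build_secondary_dataset(prompts, protected_corpus,
                                out_path="secondary_train.jsonl"):
        with open(out_path, "w") as f:
            for p in prompts:
                text = primary_generate(p)
                if looks_copyright_safe(text, protected_corpus):
                    f.write(json.dumps({"prompt": p, "completion": text}) + "\n")

The secondary LLM would then be fine-tuned only on the filtered file, never on the primary model's original corpus; whether a court would treat that filtering as enough to break the chain of infringement is exactly the open question above.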



