[flagged] OpenAI destroyed a trove of books used to train AI models (businessinsider.com)
34 points by nitramm on May 9, 2024 | 35 comments



Feels weird that the Authors Guild is so keen to make public the names of the former OpenAI employees who created the datasets. That seems entirely unnecessary and irrelevant to the case. If what they did was part of their sanctioned work for OpenAI, they are not the ones responsible for how it was used.


Is "I was just following orders" a viable defense here?


It might be. Is it reasonable to assume the material was cleared by the legal department if your manager came and told you to download books from this long list and clean them up for training purposes?


Not really, no. I've been in that situation, and we were all aware that we were committing piracy in the company's name.


Were you committing a criminal offence or tort, and was it in your personal or corporate capacity?


Well, since we were asked to locate material to pirate, which the company intended to sell, it would be criminal; but the criminal act would be the sale of the material, not the procurement of it, as I read the law.

There might have been a possibility of a claim that we employees asked to download the material were committing a criminal offense under 'possession in the course of business with intent to commit an act of infringement', but even there I think it'd be the VPs who had the stupid idea that were committing infringement, not us lowly employees asked to acquire the material.

About the best I suspect the CPS could hope for would be to use that threat as a cudgel to get someone to admit who suggested the idea. But I'd have happily told them, and I left the company very shortly after.


Until OpenAI takes the blame, why would the Authors Guild not pursue their claim?


"These datasets, created by former employees who are no longer with OpenAI, were last used in 2021 and deleted due to non-use in 2022." -

Deleting the dataset because of non-use sounds completely implausible. It says the dataset is 67B tokens, which is less than 1TB of data. Why would you bother to delete it when keeping it would cost more or less nothing?
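A back-of-the-envelope check of that size claim, assuming roughly four bytes of English text per token (a common rule of thumb, not a figure from the article):

    tokens = 67e9
    bytes_per_token = 4  # rough average for English prose
    print(tokens * bytes_per_token / 1e12, "TB")  # ~0.27 TB, well under 1 TB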


If something is a legal liability to have, "non-use" probably starts about two seconds after you finish using it (in their case, once training finished and they only needed to keep the weights).


Exactly. The reason that they stopped using the dataset and the reason they deleted it are likely the same - legal liability.

Obviously it's not in their interest to state that, but when this is the best alternative explanation they can offer, they might as well have.

Personally I support the use of these books for training AI, but I think this needs to be decided in court and/or with legislation, not swept under the carpet.


>Personally I support the use of these books for training AI

I need a better understanding of Postgres. Can you write me an A-Z book on Postgres? I won't actually buy it from you, just grab the text, train a model, and get the model to answer all the questions I have. But the book would be super helpful, please. Oh, I'm also going to sell a service on this model too... because I like money.

I get that sarcasm isn't exactly a great form of debate, but it felt suitable here. I ABSOLUTELY understand why people don't want AI training on their books.


Thanks, that's a very interesting perspective which I hadn't previously considered.


Assuming they train newer models using output from older versions, isn't that data they collected from "books1" and "books2" still encoded in the weights of their current models?

This also raises the question: does OpenAI really honor its privacy controls and refrain from using user information to train its models if the user opts out? It seems most companies are operating in "ask forgiveness rather than permission" mode as they scramble to stay competitive in the AI race. [0]

[0] https://news.ycombinator.com/item?id=40127106


> Assuming they train newer models using output from older versions, isn't that data they collected from "books1" and "books2" still encoded in the weights of their current models?

Sure, in much the same way that if you save a JPEG at 75% quality, the data in the image is still encoded. But if you repeat a lossy encoding over and over without saving the original, well... Wikipedia has a nice visualization of what happens: https://en.wikipedia.org/wiki/Generation_loss
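You can see it for yourself with a minimal sketch, assuming Pillow is installed; the input filename is a placeholder. (Pure re-encoding at a fixed quality settles down quickly, so a tiny geometric change per pass is added to keep the loss compounding.)

    from PIL import Image

    img = Image.open("original.png").convert("RGB")  # placeholder input
    w, h = img.size
    for _ in range(100):
        img.save("gen.jpg", quality=75)  # lossy encode; no original kept
        img = Image.open("gen.jpg")      # decode the degraded copy
        # a small resize each pass keeps the loss compounding
        img = img.resize((w + 1, h)).resize((w, h))
    img.save("after_100_generations.jpg")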


They do not train newer models on output from older versions. In fact, they deliberately try to filter out LLM-generated text from their scraped internet datasets.


> The unsealed letter from OpenAI's lawyers, which is labeled "highly confidential - attorneys' eyes only," says that the use of "books1" and "books2" for model training was discontinued in late 2021 and that the datasets were deleted in mid-2022 because of their nonuse. The letter goes on to say that none of the other data used to train GPT-3 has been deleted and offers attorneys for the Authors Guild access to those other datasets.

That sounds like ass-covering, and maybe destruction of evidence. If the data is destroyed, won't it be much harder to prove which books they violated copyright on and to figure out the damages owed?


There was one scenario like this I came up with when brainstorming copyright issues. In the past, articles I read said you could keep a backup copy of a copyrighted work as long as it stayed with you. Moving one copy in a different direction than the other might be a violation, since it could count as distribution of copies. But we might be able to make one digital copy of a physical work for our own use. How to use that?

Problem statement for a poor person's pre-training: where do you get lots of data if nobody is providing it as datasets and we can't share them? And for multimodal models? And with less copyright risk?

The idea was to buy lots of encyclopedia sets, school curricula, picture books... used media at low prices (especially bin sales) full of information. Digitize them with book scanners. Keep the digital copies and throw away the physical copies. Now you have a huge training set of legally acquired data, obtained dirt cheap, with preprocessing that allows for cheaper labor.
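As a sketch of that digitization step, assuming a book scanner producing page images and the pytesseract OCR wrapper (all paths are placeholders):

    from PIL import Image
    import pytesseract

    # one scanned page in, one plain-text file out
    page = Image.open("scans/encyclopedia_vol1/page_001.png")
    text = pytesseract.image_to_string(page)
    with open("corpus/encyclopedia_vol1/page_001.txt", "w") as out:
        out.write(text)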

From there, use the copies in places like Japan, where it's legal to use them for AI training so long as one has legal access. This also incentivizes the owner to pre-train the model themselves, so there's no distribution of original works. I also envisioned people partly training models with their data, handing the weights off to another company, which adds its data, and so on: daisy-chain the process with only the models, not the copyrighted works, being distributed. My copyright amendment added preprocessing and copying for just this purpose, to avoid these ridiculous hacks.
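A minimal sketch of that hand-off, assuming the HuggingFace transformers stack; the checkpoint names and training text are placeholders:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("checkpoint-from-previous-party")
    model = AutoModelForCausalLM.from_pretrained("checkpoint-from-previous-party")

    # one continued pre-training step on this party's own scanned books
    batch = tok("text from this party's own scanned books...", return_tensors="pt")
    loss = model(**batch, labels=batch["input_ids"]).loss  # causal-LM objective
    loss.backward()
    torch.optim.AdamW(model.parameters(), lr=1e-5).step()

    model.save_pretrained("checkpoint-for-next-party")  # the weights move on,
    tok.save_pretrained("checkpoint-for-next-party")    # the books do not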

To be clear, I wouldn't do this without first consulting several lawyers on what was legally clear. I'd rather not be destroying books and filling landfills. It is pure speculation I arrived at because of how overly strong copyright is in my country. It assumes we can use copyrighted works (a) for personal use and (b) with a physical-to-digital conversion. If not, I also thought those would be easier rights to fight for.

However, legal hacks like scanning used works to train on in other countries might be all we have if legal systems don't adapt to what society is currently doing with copyrighted works. I mean adapt in a way that's fair to all sides rather than benefiting only one. I'm up for compromise; the pro-copyright side usually hasn't been, though.


So when is the Authors Guild going to realize that you can pretrain a base model on public ___domain material, and then, once it's distributed, anyone can fine-tune it on whatever books they want to get their own version that writes in the style of X? On commodity GPUs, no less. In five years, when current-gen GPUs are cheap and there are hundreds of one-click fine-tuning apps, this will seem like an absurd waste of time.
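That cheap fine-tune already looks something like this sketch, assuming the peft library's LoRA support; the base model name is a placeholder for any model pretrained on public ___domain text, and the target modules assume a LLaMA-style architecture:

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    base = AutoModelForCausalLM.from_pretrained("public-domain-base-model")
    cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
    model = get_peft_model(base, cfg)
    model.print_trainable_parameters()  # a tiny fraction of the full model

    # ...fine-tune on whatever books you like, then ship only the adapter:
    model.save_pretrained("writes-like-author-X")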


Isn't copyright about reproducing and/or distributing?

Does training a model count as reproduction or distribution?

Or can copyright say something about how I consume my books? Can copyright prohibit me from lighting a fire with, or whiping my behind with, say, Harry Potter and the Goblet of Fire?


> Or can copyright say something about how I consume my books?

It cannot.

This whole kerfuffle isn't about OpenAI buying a bunch of books and disposing of them in a non-copyright-friendly way, though.


Questions like these are so far below a level of reasonable discourse that they are detrimental.


Well, you usually cannot sell derivative works as your own.


But is training a model "selling derivatives"? And if so, of what?


If it is, that's copyright infringement. The pending litigation is, in part, to resolve the dispute about whether or not they are "selling derivatives." I'm not sure why you conflate your own personal lack of knowledge about these matters with a good argument against the copyright holders.


Judging from your derogatory replies, you hold a strong opinion on the matter.

Is that opinion based on facts and actual cases? Because as far as I know, the entire reason we have these lawsuits, as presented in TFA, is to find out how the law stands on the exact questions that I pose.

So far, it seems copyright (in its current form) isn't suitable for (content) creators to prohibit LLM researchers and providers from training models on their creations.


Which opinion?


I am not sure if you meant 'wipe' or 'whip', but under the assumption of the former, I do not recommend Harry Potter and the Goblet of Fire for the canonical wiping of the ass. Harry Potter and the Philosopher's Stone is where the game is, and always was, to the chagrin of many a copyright troll. I speak directly from experience.


I wish there was an unflag button


Flagging this submission seems weird; it's genuinely interesting tech news worth a discussion. On some articles I see a link to "Vouch", but not for this one.


Never fun when you're about to land on an article and automatically hit a paywall. Do better, BI.



Isn't all/most of their training data copyrighted anyway?

We just have to say it's fair use, because it is useful to everyone. Maybe just require them to open their model.


Yup. The big pretense we maintain as an industry is that all of the data for these models is somehow legitimate. It's all illegal. But what are you gonna do about it?


> Yup. The big pretense we maintain as an industry is that all of the data for these models is somehow legitimate. It's all illegal. But what are you gonna do about it?

I feel the tech industry took the proverb "it's better to ask forgiveness than permission," then dropped the "forgiveness" part.


I think anyone who wants to opt out of being in the training data for LLMs should be able to, just as anyone who doesn't want their website indexed by Google can opt out.
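For web content, that opt-out already exists for crawlers that honor robots.txt; OpenAI documents a "GPTBot" user agent for this. A minimal sketch using Python's standard-library robot parser (the site URL is a placeholder):

    from urllib.robotparser import RobotFileParser

    # the two lines a site owner would add to their robots.txt
    rules = ["User-agent: GPTBot", "Disallow: /"]

    rp = RobotFileParser()
    rp.parse(rules)
    print(rp.can_fetch("GPTBot", "https://example.com/my-novel"))     # False
    print(rp.can_fetch("Googlebot", "https://example.com/my-novel"))  # True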



