Hacker News

All LLMs are trained on the same basic blob of data - mostly in English, mostly pirated books and stuff.

That's wrong.

Many LLMs are trained on synthetic data produced by other LLMs. (Indirectly, they may be trained on pirated books. Sure. But not directly.)


Likely the case for established model makers, but barring illegal use of outputs from other companies' models, a "first generation" model would still need human-written text as a basis, no?

Why illegal? The more open models (or at least open-weight models) generally allow using their outputs; the details depend on the license.

But yes, 'first generation' models would be trained on human text almost by definition. My comment was only to contradict the claim that 'all LLMs' are trained from stolen text, by noting that some LLMs aren't trained (directly) on human text at all.
