Hacker News

All LLMs are trained on the same basic blob of data - mostly in English, mostly pirated books and stuff.

That's wrong.

Many LLMs are trained on synthetic data produced by other LLMs. (Indirectly, they may be trained on pirated books. Sure. But not directly.)


Likely the case for established model makers, but barring illegal use of outputs from other companies' models, a "first generation" model would still need human-written text as a basis, no?

Why illegal? The more open models (or at least open-weight models) generally allow using their outputs; the details depend on the license.

But yes, 'first generation' models would be trained on human text almost by definition. My comment was only to contradict the claim that 'all LLMs' are trained from stolen text, by noting that some LLMs aren't trained (directly) on human text at all.
