
As applied ML practitioners, we currently get to choose how to use synthetic data vs "real" data, and in what proportions. This can be a valuable tool in our kit. To the degree that data in the wild becomes an unlabeled mix of the two, we functionally lose the ability to make those choices for any given model.
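To make that concrete, here is a minimal sketch of the kind of control I mean, assuming the real and synthetic pools are stored separately. The function name `mix_by_proportion` and its parameters are made up for illustration and don't come from any library.

```python
import random


def mix_by_proportion(real, synthetic, synthetic_fraction, n, seed=0):
    """Draw n training examples with a chosen share of synthetic data.

    Hypothetical helper: it only works because we know which pool each
    example came from. With an unlabeled mix there is no such knob.
    """
    if not 0.0 <= synthetic_fraction <= 1.0:
        raise ValueError("synthetic_fraction must be between 0 and 1")
    rng = random.Random(seed)
    n_synthetic = round(n * synthetic_fraction)
    n_real = n - n_synthetic
    # Sample with replacement from each labeled pool, then shuffle together.
    batch = rng.choices(synthetic, k=n_synthetic) + rng.choices(real, k=n_real)
    rng.shuffle(batch)
    return batch
```

Something like `mix_by_proportion(real_docs, synthetic_docs, 0.3, 1000)` gives a batch that is 30% synthetic by construction. Once scraped data is an unknown blend, the effective fraction is whatever the web happens to contain.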

> Eventually the models won't work and people will notice

For any product dependent on these models, that sounds like a pretty negative outcome ... and entirely consistent with my concern that "[t]here's a very real possibility that using generative models more (and publishing their outputs) can make these models worse in the future."

> and either create new training data or go back to only using older data to train

LLMs currently learn about entities and concepts in the world basically via their training text. So this breaks our ability to update a model on more recent topics of discourse without also shifting its real vs synthetic proportions.

> there will always be people who ... will still create new art and new writing, which will seed the training data

But if we aren't able to consistently separate the human-generated and machine-generated content, model training won't be able to place any extra weight on the human-generated stuff. The mere fact that human-generated output doesn't disappear entirely doesn't remove these issues.

The analogy is loose, but click fraud creates data exhaust that closely resembles the behavior of a real user, and it can meaningfully disrupt one's ability to optimize for clicks or to know how many actual end users interacted with an item of interest. The fact that some nonzero portion of the clicks are real doesn't erase these problems. And that's in a system which doesn't create the kind of feedback loop described above.



