
It's likely a result of the interplay between the image-generation and caption/description-generation aspects of the model. The earliest diffusion-based image generators used a 'bag of words' model for the caption (see this musing on the topic and on DALL-E 3: https://old.reddit.com/r/slatestarcodex/comments/16y14co/sco...), whereby 'a woman chasing a bear' becomes `['a', 'a', 'bear', 'chasing', 'woman']`.
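
To make that concrete, here's a toy sketch of the bag-of-words transformation (the whitespace tokenization and sorting are illustrative, not any particular model's actual preprocessing). The point is that the token multiset is identical for 'a woman chasing a bear' and 'a bear chasing a woman', so the caption no longer encodes who is chasing whom:

    def bag_of_words(caption: str) -> list[str]:
        # Lowercase, split on whitespace, and return the sorted token multiset;
        # word order (and hence subject/object roles) is discarded.
        return sorted(caption.lower().split())

    print(bag_of_words("a woman chasing a bear"))
    # ['a', 'a', 'bear', 'chasing', 'woman']
    print(bag_of_words("a bear chasing a woman"))
    # ['a', 'a', 'bear', 'chasing', 'woman']  <- same bag, opposite meaning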

That's good enough to describe compositions that are well represented in the training set, but it tends to lock in to those common representations at the expense of rarer but still possible ones (the 'woman chasing a bear' above).




Tried it myself. Couldn't do it. It's a nice test case!



