It is a challenge for these models to generate images of counterintuitive or unusual situations that aren't depicted in the training set. For example, if you ask for a small cube sitting on top of a large cube, you'll likely get the correct result on the first attempt. Ask for a large cube on a small cube and you'll probably get an image of them side-by-side, or with the small cube on top instead. The models can generalize in impressive ways, but that generalization is still limited.
A while ago my daughter wanted an image of Santa pulling a sleigh with a reindeer in the driver's seat holding the reins. We tried dozens of different prompts and DALL-E 3 could not do it.
It's likely a result of the interplay between the image generation and caption/description generation aspects of the model. The earliest diffusion-based image generators used a 'bag of words' model for the caption (see musing regarding this and DALL-E 3: https://old.reddit.com/r/slatestarcodex/comments/16y14co/sco...), whereby 'a woman chasing a bear' would turn into `['a', 'a', 'chasing', 'bear', 'woman']`.
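To make that lossy step concrete, here's a minimal Python sketch. This isn't the actual DALL-E pipeline, just an illustration of what a bag-of-words caption throws away: once word order is gone, 'a woman chasing a bear' and 'a bear chasing a woman' condition the generator on exactly the same tokens, so the 'who is chasing whom' relation has to come from the prior over training images rather than from the caption.

```python
# Hypothetical illustration of a bag-of-words caption representation.
# Sorting is just one way to show an order-free multiset of tokens.

def bag_of_words(caption: str) -> list[str]:
    """Lowercase, split on whitespace, and sort, discarding word order."""
    return sorted(caption.lower().split())

a = bag_of_words("a woman chasing a bear")
b = bag_of_words("a bear chasing a woman")

print(a)       # ['a', 'a', 'bear', 'chasing', 'woman']
print(b)       # ['a', 'a', 'bear', 'chasing', 'woman']
print(a == b)  # True -- both prompts collapse to the same bag of tokens,
               # so the relational structure is lost before generation.
```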
That's good enough to describe compositions that are well represented in the training set, but it will likely lock in to those common representations at the expense of rarer but still possible ones (the 'woman chasing a bear' above).
Being able to generate content that has minimal presence in the training set is arguably an emergent, desirable behavior that could be seen as a form of intelligence.