From my experience, the thing that makes AI image gen hard to use is nailing specificity. I often find myself having to resort to generating all of the elements I want in an image separately and then comping them together in Photoshop. This isn't a bad workflow, but it is tedious (I often equate it to putting coins in a slot machine, hoping it 'hits').
Generating good images is easy, but generating good images with very specific instructions is not. For example, try getting Midjourney to generate a shot of a road from the side (i.e. standing on the shoulder of a road, taking a photo of the shoulder on the other side, with the road crossing the frame from left to right)... you'll find Midjourney only wants to generate images of roads coming at the "camera" from the vanishing point. I even tried feeding in an example image with the correct framing for Midjourney to analyze, to help inform what prompts to use, but this still did not produce the expected output. This is obviously not the only framing + subject combination that models struggle with.
For people who use image generation as a tool within a larger project's workflow, this hurdle makes the tool swing back and forth between "game-changing technology" and "major time sink".
If this example prompt/output is an honest demonstration of SD3's attention to specificity, especially as it pertains to the framing and composition of objects + subjects, then I think it's definitely impressive.
For context, I've used SD (via ComfyUI), Midjourney, and DALL-E. All of these models + UIs have shared this issue to varying degrees.
It's very difficult to get text-to-image generation to do much better than this, because you need extremely detailed text in the training data. I think a better approach would be to give up on text as the sole input.
> I often find myself having to resort to generating all of the elements I want out of an image separately and then comp them together with photoshop. This isn't a bad workflow, but it is tedious
Then the models should be developed to accelerate that workflow.
i.e. you should be able to say: layer one is this text prompt plus this camera angle; layer two is some mountains you cheaply modeled in Blender; layer three is a sketch you drew of today's anime girl.
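Something in this direction already exists in the SD ecosystem via ControlNet, where several conditioning images can be stacked alongside the text prompt. Here is a minimal sketch using the diffusers library; the checkpoint IDs and input file names are assumptions for illustration, not a recipe:

```python
# Rough sketch of "layered" conditioning with multiple ControlNets in diffusers.
# File names and model IDs below are placeholders/assumptions.
import torch
from PIL import Image
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

# Layer two: a depth map rendered from a rough Blender scene (hypothetical file).
depth_map = Image.open("blender_mountains_depth.png").resize((512, 512))
# Layer three: a hand-drawn character sketch (hypothetical file).
sketch = Image.open("character_sketch.png").resize((512, 512))

# One ControlNet per conditioning "layer".
controlnets = [
    ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16),
    ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-scribble", torch_dtype=torch.float16),
]

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnets,
    torch_dtype=torch.float16,
).to("cuda")

# Layer one: the text prompt carries style and camera intent.
result = pipe(
    prompt="wide-angle shot from the road shoulder, mountains behind, anime character in frame",
    image=[depth_map, sketch],
    controlnet_conditioning_scale=[1.0, 0.8],  # per-layer influence
    num_inference_steps=30,
)
result.images[0].save("layered_generation.png")
```

The per-layer conditioning scales are the closest thing today to the "layer opacity" knob you'd want in that workflow, though it's still a long way from true compositing control.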
Totally agree. I am blown away by that image. Midjourney is so bad at anything specific.
On the other hand, SD has just not matched the quality of the images I get from Midjourney. I don't think the people who dispute this know what they're talking about.