That's _amazing_.

I imagine this doesn't look impressive to anyone unfamiliar with the scene, but this was absolutely impossible with any of the older models. Though I still want to know whether it does this reliably: so many other things are already left to chance that if I also need to hit a one-in-ten chance of the composition being right, it still might not be very useful.




It’s the transformer making the difference. The original Stable Diffusion uses convolutions, which are bad at capturing long-range spatial dependencies. The diffusion transformer chops the image into patches, mixes them with a positional embedding, and then just passes them through multiple transformer layers, as in an LLM. At the end, the model unpatchifies (yes, that term is in the source code) the patch tokens to produce output as a 2D image again.
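Roughly, in PyTorch-ish code, patchify/unpatchify look something like this (names, shapes, and defaults are mine for illustration; the real SD3/DiT source differs in detail):

  import torch
  import torch.nn as nn

  class PatchifyDemo(nn.Module):
      def __init__(self, channels=4, patch=2, dim=512, grid=32):
          super().__init__()
          # a strided conv is the usual trick for cutting the latent into non-overlapping patches
          self.proj = nn.Conv2d(channels, dim, kernel_size=patch, stride=patch)
          self.pos = nn.Parameter(torch.zeros(1, grid * grid, dim))  # learned positional embedding
          self.patch, self.channels = patch, channels

      def patchify(self, x):                          # x: (B, C, H, W) latent
          tokens = self.proj(x)                       # (B, dim, H/p, W/p)
          tokens = tokens.flatten(2).transpose(1, 2)  # (B, N, dim): one token per patch
          return tokens + self.pos[:, :tokens.shape[1]]

      def unpatchify(self, tokens, h, w):             # tokens: (B, N, C*p*p) after the final projection
          p, c = self.patch, self.channels
          x = tokens.reshape(-1, h // p, w // p, p, p, c)
          x = torch.einsum("bhwpqc->bchpwq", x)       # fold the patch grid back into a 2D image
          return x.reshape(-1, c, h, w)

Everything between those two calls is just a stack of standard transformer blocks over the token sequence.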

The transformer layers perform self-attention between all pairs of patches, allowing the model to build a rich understanding of the relationships between areas of an image. These relationships extend into the dimensions of the conditioning prompts, which is why you can say “put a red cube over there” and it actually is able to do that.
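As a toy illustration (assumed dimensions, not SD3's actual ones): a single attention layer already gives every patch token a weight for every other patch token and for every prompt token in the same sequence.

  import torch
  import torch.nn as nn

  dim, n_patches, n_text = 512, 1024, 77
  attn = nn.MultiheadAttention(embed_dim=dim, num_heads=8, batch_first=True)

  patch_tokens = torch.randn(1, n_patches, dim)  # e.g. a 64x64 latent cut into 2x2 patches
  text_tokens = torch.randn(1, n_text, dim)      # prompt embeddings projected to the same width

  seq = torch.cat([patch_tokens, text_tokens], dim=1)  # (1, 1101, dim)
  out, weights = attn(seq, seq, seq, average_attn_weights=True)
  print(weights.shape)  # (1, 1101, 1101): every token attends to every other token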

I suspect that the smaller model versions will do a great job of generating imagery but may not follow the prompt as closely; that’s just a hunch.


Convolutions are bad at long range spatial dependencies? What makes you say that - any chance you have a reference?


Convolution filters attend to a region around each pixel, not to every other pixel (or patch, in the case of DiT). In that way, they are not good at establishing long-range dependencies. The U-Net in Stable Diffusion does add self-attention layers, but these operate only in the lower-resolution parts of the model. The DiT model does away with convolutions altogether, going instead with a linear sequence of blocks containing self-attention layers. The dimensionality is constant throughout this sequence of blocks (i.e. there is no downscaling), so each block gets a chance to attend to all of the patch tokens in the image.

One of the neat things they do with the diffusion transformer is enable creating smaller or larger models simply by changing the patch size. Smaller patches require more GFLOPs, but the attention is finer grained, so you would expect better output.
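Back-of-the-envelope (assuming a 64x64 latent, just as a typical SD-scale example): halving the patch size quadruples the token count, and self-attention cost grows roughly with the square of that.

  latent = 64  # e.g. a 512x512 image encoded into a 64x64 latent
  for patch in (8, 4, 2):
      tokens = (latent // patch) ** 2
      print(f"patch {patch}: {tokens:4d} tokens, ~{tokens**2:,} attention pairs")
  # patch 8:   64 tokens, ~4,096 attention pairs
  # patch 4:  256 tokens, ~65,536 attention pairs
  # patch 2: 1024 tokens, ~1,048,576 attention pairs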

Another neat thing is how they apply conditioning and the time step embedding. Instead of adding these in a special way, they simply inject them as tokens, no different from the image patch tokens. The transformer model builds its own notion of what these things mean.
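A hedged sketch of what "conditioning as tokens" means (my own names and shapes, just to make the idea concrete): the timestep gets a sinusoidal embedding, everything is projected to the model width, and it all gets concatenated into one flat sequence.

  import math
  import torch
  import torch.nn as nn

  def timestep_embedding(t, dim):
      # standard sinusoidal embedding of the diffusion timestep
      half = dim // 2
      freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
      args = t[:, None].float() * freqs[None]
      return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)

  dim = 512
  t_proj = nn.Linear(dim, dim)     # timestep embedding -> token space
  cond_proj = nn.Linear(768, dim)  # text-encoder features -> token space

  patch_tokens = torch.randn(2, 1024, dim)
  t_token = t_proj(timestep_embedding(torch.tensor([10, 500]), dim)).unsqueeze(1)  # (2, 1, dim)
  cond_tokens = cond_proj(torch.randn(2, 77, 768))                                 # (2, 77, dim)

  seq = torch.cat([t_token, cond_tokens, patch_tokens], dim=1)  # one flat sequence for the transformer
  print(seq.shape)  # torch.Size([2, 1102, 512])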

This implies that you could inject tokens representing anything you want. With the U-Net architecture in Stable Diffusion, for instance, we have to hook onto the side of the model to control it in various, somewhat hacky, ways. With DiT, you would just add your control tokens and fine-tune the model. That’s extremely powerful and flexible, and I look forward to a whole lot more innovation happening simply because training in new concepts will be so straightforward.


My understanding of this tech is pretty minimal, so please bear with me, but is the basic idea something like this?

Before: Evaluate the image in a little region around each pixel against the prompt as a whole -- e.g. how well does a little 10x10 chunk of pixels map to a prompt about a "red sphere and blue cube". This is problematic because maybe all the pixels are red but you can't "see" whether it's the sphere or the cube.

After: Evaluate the image as a whole against chunks of the prompt. So now we're looking at a room, and then we patch in (layer?) a "red sphere" and then do it again with a "blue cube".

Is that roughly the idea?


It kinda makes sense, doesn't it? What are the largest convolutions you've heard of -- 11 x 11 pixels? Not much more than that, surely? So how much can one part of the image influence another part 1000 pixels away? But I am not an expert in any of this, so an expert's opinion would be welcome.


Yes, it makes some sense. Many popular convnets operate on 3x3 kernels, but the number of channels increases per layer. That, coupled with the fact that the receptive field grows with each layer and lets convnets effectively see the whole image relatively early in the model's depth (especially with pooling operations, which enlarge the receptive field rapidly), makes this intuition questionable. Transformers, on the other hand, operate on attention, which lets them weight each patch dynamically, but it's not clear to me that this lets them attend to all parts of the image in a way fundamentally different from convnets.
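For what it's worth, the receptive-field arithmetic is easy to check (layer configurations below are assumed, just to illustrate the point about pooling):

  def receptive_field(layers):
      # layers: list of (kernel, stride) pairs, applied in order
      rf, jump = 1, 1
      for k, s in layers:
          rf += (k - 1) * jump
          jump *= s
      return rf

  print(receptive_field([(3, 1)] * 10))                 # 21: ten plain 3x3 convs see only 21 pixels
  print(receptive_field([(3, 1), (3, 1), (2, 2)] * 4))  # 76: add 2x2 pooling and it grows much faster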


I put the prompt into ChatGPT and it seemed to work just fine: https://imgur.com/LsRM7G4


You got lucky! Here's a thread where I attempted the same just now: https://imgur.com/a/xiaiKXp

It has a lot of difficulty with the orientation of the cat and dog, and by the time it gets them in the right positions, the triangle is lost.


I dislike the look of ChatGPT images so much. The photo-realism of Stable Diffusion impresses me a lot more, for some reason.


This is just stylistic, and I think it’s because ChatGPT knows a bit “better” that there aren’t very many literal photos of abstract floating shapes. Adding “studio photography, award winner” produced results quite similar to SD imo, but it does negatively impact the accuracy. On the other side of the coin, “minimalist textbook illustration” definitely seems to help the accuracy, which I think is soft confirmation of the thought above.

https://imgur.com/a/9fO2gxN

EDIT: I think the best approach is simply to separate out the terms in separate phrases, as that gets more-or-less 100% accuracy https://imgur.com/a/JGjkicQ

That said, we should acknowledge the point of all this: SD3 is just incredibly incredibly impressive.


This is adjustable via the API, but not in ChatGPT. The API offers styles of "vivid" and "natural", but ChatGPT only uses "vivid".
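If you hit the images API directly (assuming the current OpenAI Python client; the prompt below is just a placeholder), it's a single parameter:

  from openai import OpenAI

  client = OpenAI()
  result = client.images.generate(
      model="dall-e-3",
      prompt="a red sphere balancing on a blue cube, minimalist textbook illustration",
      style="natural",   # "vivid" (what ChatGPT uses) or "natural"
      size="1024x1024",
  )
  print(result.data[0].url)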


It looks terrible to me though, very basic rendering and as if it’s lower resolution then scaled up.


What was difficult about it?


From my experience, the thing that makes AI image gen hard to use is nailing specificity. I often find myself having to resort to generating all of the elements I want in an image separately and then compositing them together in Photoshop. This isn't a bad workflow, but it is tedious (I often equate it to putting coins in a slot machine, hoping it 'hits').

Generating good images is easy, but generating good images with very specific instructions is not. For example, try getting Midjourney to generate a shot of a road from the side (i.e. standing on the shoulder of a road, taking a photo of the shoulder on the other side, with the road crossing the frame from left to right). You'll find Midjourney only wants to generate images of roads coming at the "camera" from the vanishing point. I even tried feeding it an example image with the correct framing to analyze, to help inform what prompts to use, but this still did not produce the expected output. This is obviously not the only framing + subject combination that model(s) struggle with.

For people who use image generation as a tool within a larger project's workflow, this hurdle makes the tool swing back and forth from "game changing technology" to "major time sink".

If this example prompt/output is an honest demonstration of SD3's attention to specificity, especially as it pertains to framing and composition of objects and subjects, then I think it's definitely impressive.

For context, I've used SD (via ComfyUI), Midjourney, and DALL-E. All of these models and UIs have shared this issue to varying degrees.


It's very difficult to improve text-to-image generation much beyond this, because you need extremely detailed text in the training data, so I think a better approach would be to give up on it.

> I often find myself having to resort to generating all of the elements I want out of an image separately and then comp them together with photoshop. This isn't a bad workflow, but it is tedious

The models should be developed to accelerate that workflow, then.

I.e. you should be able to say: layer one is this text prompt plus this camera angle, layer two is some mountains you cheaply modeled in Blender, and layer three is a sketch you drew of today's anime girl.
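You can approximate that today with multi-ControlNet in diffusers, though it's exactly the "hook onto the side of the U-Net" approach mentioned upthread rather than native control tokens. A rough sketch (model IDs, file names, and scales are illustrative):

  import torch
  from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
  from PIL import Image

  depth_cn = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16)
  scribble_cn = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-scribble", torch_dtype=torch.float16)

  pipe = StableDiffusionControlNetPipeline.from_pretrained(
      "runwayml/stable-diffusion-v1-5",
      controlnet=[depth_cn, scribble_cn],
      torch_dtype=torch.float16,
  ).to("cuda")

  mountains_depth = Image.open("blender_mountains_depth.png")  # layer two: cheap Blender geometry
  character_sketch = Image.open("character_scribble.png")      # layer three: the hand-drawn sketch

  image = pipe(
      "anime girl standing in front of distant mountains, golden hour",  # layer one: the text prompt
      image=[mountains_depth, character_sketch],
      controlnet_conditioning_scale=[0.6, 1.0],
  ).images[0]
  image.save("composite.png")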


Totally agree. I am blown away by that image. Midjourney is so bad at anything specific.

On the other hand, SD has just not been at the level of image quality I get from Midjourney. I don't think the people who counter this know what they are talking about.

Can't wait to try this.


Previous systems could not compose objects within the scene correctly, at least not to this degree. What changed to allow for this? Could this be a heavily cherry-picked example? I guess we will have to wait for the paper and model to find out.


From the original paper with this technique:

> We introduce Diffusion Transformers (DiTs), a simple transformer-based backbone for diffusion models that outperforms prior U-Net models and inherits the excellent scaling properties of the transformer model class. Given the promising scaling results in this paper, future work should continue to scale DiTs to larger models and token counts. DiT could also be explored as a drop-in backbone for text-to-image models like DALL·E 2 and Stable Diffusion.

Afaict the answer is that combining transformers with diffusion in this way means the models can (feasibly) operate in a much larger, more linguistically complex space. So it’s better at spatial relationships simply because it has more computational “time” or “energy” or “attention” to focus on them.

Any actual experts want to tell me if I’m close?


Would be nice if it were just more attention. There could be something else going on, though.



