
It’s the transformer making the difference. Original Stable Diffusion uses convolutions, which are bad at capturing long-range spatial dependencies. The diffusion transformer chops the image into patches, mixes them with a positional embedding, and then passes them through multiple transformer layers, as in an LLM. At the end, the model unpatchifies (yes, that term is in the source code) the patch tokens to produce a 2D image again.
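
Roughly what that looks like (a minimal PyTorch sketch, not the actual DiT source; shapes and names are made up for illustration):

    import torch

    # Cut a (B, C, H, W) latent into non-overlapping p x p patches, flatten each
    # patch into a token, then later fold the tokens back into an image.
    def patchify(x, p):
        B, C, H, W = x.shape
        x = x.reshape(B, C, H // p, p, W // p, p)
        x = x.permute(0, 2, 4, 3, 5, 1)            # (B, H/p, W/p, p, p, C)
        return x.reshape(B, (H // p) * (W // p), p * p * C)

    def unpatchify(tokens, p, C, H, W):
        B, N, D = tokens.shape
        x = tokens.reshape(B, H // p, W // p, p, p, C)
        return x.permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)

    x = torch.randn(2, 4, 32, 32)                  # e.g. a latent image
    tokens = patchify(x, p=2)                      # (2, 256, 16) patch tokens
    tokens = tokens + torch.randn(1, 256, 16)      # stand-in positional embedding
    # ... tokens would pass through a stack of transformer blocks here ...
    out = unpatchify(tokens, p=2, C=4, H=32, W=32) # back to (2, 4, 32, 32)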

The transformer layers perform self-attention between all pairs of patches, allowing the model to build a rich understanding of the relationships between areas of an image. These relationships extend into the dimensions of the conditioning prompts, which is why you can say “put a red cube over there” and it actually is able to do that.
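
For intuition, self-attention over the patch tokens boils down to something like this (illustrative shapes and random weights, not the real model):

    import torch
    import torch.nn.functional as F

    # One self-attention step over N patch tokens of width d: every token is
    # compared with every other token, so the (N, N) attention map lets any
    # patch influence any other patch regardless of distance.
    N, d = 256, 64
    tokens = torch.randn(1, N, d)
    q = tokens @ torch.randn(d, d)     # stand-ins for learned projections
    k = tokens @ torch.randn(d, d)
    v = tokens @ torch.randn(d, d)
    attn = F.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)   # (1, N, N)
    out = attn @ v                                                 # (1, N, d)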

I suspect the smaller model versions will still do a great job of generating imagery but won’t follow the prompt as closely. That’s just a hunch, though.




Convolutions are bad at long range spatial dependencies? What makes you say that - any chance you have a reference?


Convolution filters attend to a region around each pixel, not to every other pixel (or patch, in the case of DiT). In that way, they are not good at establishing long-range dependencies. The U-Net in Stable Diffusion does add self-attention layers, but these operate only in the lower-resolution parts of the model. The DiT model does away with convolutions altogether, going instead with a linear sequence of blocks containing self-attention layers. The dimensionality is constant throughout this sequence of blocks (i.e. there is no downscaling), so each block gets a chance to attend to all of the patch tokens in the image.
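
As a rough sketch of that constant-dimensionality stack (illustrative PyTorch, not the reference implementation):

    import torch
    import torch.nn as nn

    # A DiT-style block: the token sequence keeps the same (B, N, d) shape
    # through every block, so there is no downscaling and each block can
    # attend over all patch tokens.
    class Block(nn.Module):
        def __init__(self, d, heads=8):
            super().__init__()
            self.norm1 = nn.LayerNorm(d)
            self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
            self.norm2 = nn.LayerNorm(d)
            self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

        def forward(self, x):                        # x: (B, N, d)
            h = self.norm1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]
            return x + self.mlp(self.norm2(x))       # still (B, N, d)

    blocks = nn.Sequential(*[Block(384) for _ in range(12)])
    print(blocks(torch.randn(2, 256, 384)).shape)    # torch.Size([2, 256, 384])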

One of the neat things they do with the diffusion transformer is enable creating smaller or larger models simply by changing the patch size. Smaller patches require more GFLOPs, but the attention is finer-grained, so you would expect better output.
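
Back-of-envelope (made-up latent size, but the scaling is the point): halving the patch size quadruples the token count, and self-attention cost grows roughly with the square of that.

    H = W = 32                      # e.g. a 32x32 latent
    for p in (8, 4, 2):
        n = (H // p) * (W // p)     # number of patch tokens
        print(p, n, n * n)          # patch size, tokens, ~attention cost
    # 8 ->  16 tokens ->    256
    # 4 ->  64 tokens ->   4096
    # 2 -> 256 tokens ->  65536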

Another neat thing is how they apply conditioning and the time step embedding. Instead of adding these in a special way, they simply inject them as tokens, no different from the image patch tokens. The transformer model builds its own notion of what these things mean.
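
In code, that "everything is a token" idea looks roughly like this (a sketch assuming in-context conditioning, i.e. embeddings simply concatenated to the patch sequence; names and sizes are illustrative):

    import torch

    B, N, d = 2, 256, 384
    patch_tokens = torch.randn(B, N, d)
    t_token = torch.randn(B, 1, d)       # embedded diffusion timestep
    cond_tokens = torch.randn(B, 77, d)  # e.g. projected text-encoder outputs

    # One sequence, no special pathways: the transformer sees conditioning,
    # timestep, and image patches as ordinary tokens.
    seq = torch.cat([t_token, cond_tokens, patch_tokens], dim=1)  # (B, 1+77+N, d)
    # seq -> transformer blocks; afterwards only the last N tokens are unpatchified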

This implies that you could inject tokens representing anything you want. With the U-Net architecture in Stable Diffusion, for instance, we have to hook onto the side of the model to control it in various, somewhat hacky, ways. With DiT, you would just add your control tokens and fine-tune the model. That’s extremely powerful and flexible, and I look forward to a whole lot more innovation happening simply because training in new concepts will be so straightforward.


My understanding of this tech is pretty minimal, so please bear with me, but is the basic idea something like this?

Before: Evaluate the image in a little region around each pixel against the prompt as a whole -- e.g. how well does a little 10x10 chunk of pixels map to a prompt about a "red sphere and blue cube"? This is problematic because maybe all the pixels are red, but you can't "see" whether it's the sphere or the cube.

After: Evaluate the image as a whole against chunks of the prompt. So now we're looking at a room, and then we patch in (layer?) a "red sphere" and then do it again with a "blue cube".

Is that roughly the idea?


It kinda makes sense, doesn't it? What are the largest convolutions you've heard of -- 11 x 11 pixels? Not much more than that, surely? So how much can one part of the image influence another part 1000 pixels away? But I am not an expert in any of this, so an expert's opinion would be welcome.


Yes, it makes some sense. Many popular convnets operate on 3x3 kernels, but the number of channels increases per layer. This, coupled with the fact that the receptive field grows with each layer (especially with pooling operations, which expand it rapidly), lets convnets essentially see the whole image relatively early in the model's depth, which makes this intuition questionable. Transformers, on the other hand, operate on attention, which lets them weight each patch dynamically, but it's not clear to me that this allows them to attend to all parts of the image in a way different from convnets.
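
To put some numbers on the receptive-field point (toy layer stack, standard recurrence rf += (k - 1) * jump, jump *= stride):

    rf, jump = 1, 1
    layers = [(3, 1), (3, 2)] * 5    # alternating stride-1 / stride-2 3x3 convs
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    print(rf)                        # 125 pixels wide after only 10 layers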



