previous systems could not compose objects within the scene correctly, not to th...

bbor · on Feb 22, 2024

From the original paper with this technique:

  We introduce Diffusion Transformers (DiTs), a simple transformer-based backbone for diffusion models that outperforms prior U-Net models and inherits the excellent scaling properties of the transformer model class. Given the promising scaling results in this paper, future work should continue to scale DiTs to larger models and token counts. DiT could also be explored as a drop-in backbone for text-to-image models like DALL·E 2 and Stable Diffusion.

Afaict the answer is that combining transformers with diffusion models in this way means that the models can (feasibly) operate in a much larger, more linguistically complex space. So it’s better at spatial relationships simply because it has more computational “time” or “energy” or “attention” to spend on them.
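To make "token counts" from the quote concrete, here is a rough sketch of how a transformer backbone might turn an image latent into tokens. It is an illustration, not the paper's code: the `patchify` helper is hypothetical, and the 4x32x32 latent shape and the patch sizes 8, 4, 2 are assumptions chosen for the example.

```python
import numpy as np

def patchify(latent, patch):
    """Split a (C, H, W) latent into a sequence of flattened patch tokens.

    Hypothetical helper for illustration only.
    """
    c, h, w = latent.shape
    assert h % patch == 0 and w % patch == 0, "patch size must divide H and W"
    gh, gw = h // patch, w // patch
    # (C, gh, patch, gw, patch) -> (gh, gw, C, patch, patch)
    x = latent.reshape(c, gh, patch, gw, patch).transpose(1, 3, 0, 2, 4)
    # One row per patch, each row is the flattened patch contents.
    return x.reshape(gh * gw, c * patch * patch)

# Assumed latent shape: 4 channels, 32x32 spatial grid.
latent = np.arange(4 * 32 * 32, dtype=np.float32).reshape(4, 32, 32)

# Smaller patches give more tokens for the same latent.
token_counts = {p: patchify(latent, p).shape[0] for p in (8, 4, 2)}
print(token_counts)  # {8: 16, 4: 64, 2: 256}
```

Halving the patch size quadruples the number of tokens, and self-attention compares every token with every other, so compute per layer grows quickly. That is one way to read "more attention" here, though it does not by itself show why spatial relationships improve.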

Any actual experts want to tell me if I’m close?

lucidrains · on Feb 23, 2024

would be nice if it were just more attention. there could be something else though