Sorry if I described it confusingly. What I meant to say was: with StyleGAN, you have one forward pass (~150ms). With diffusion models, you need at least 25 forward passes (25 x ~150ms). You're right that chips will get faster, but those speedups won't trickle down to the market segment near and dear to my heart: tinkerers like you and me who just want to play around with a model at home without needing to spend thousands of dollars or rent a supercomputer.
In reality though, diffusion models are probably fast and lightweight enough that (with patience) you'll be able to generate some neat stuff yourself. At least, if you have an Nvidia GPU, or get lucky with a Colab instance. I never had one, though, so a lot of the time I was limited to the models I could run with CPU inference only. I was often delighted to discover that CPU inference gets you quite far! But with 25 forward passes instead of 1, it would be 25x more painful to play around with them: on the order of waiting 15+ minutes per attempt, rather than seeing things happen in ~25s. The activation energy adds up, and I'm keen to keep ML as accessible as possible for people who just want to play, since playing is a key step toward "Ok, I guess this ML stuff is worth taking more seriously. Let me dive in..."
That's not to dismiss diffusion models whatsoever. I just had a slight twinge of sadness that being able to generate interpolation videos (one of the coolest things you can possibly do with generative image models) might be out of reach of people without GPUs (I was one of them).
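To make the cost difference concrete, here's a rough back-of-envelope sketch of an interpolation loop. Everything in it is a made-up placeholder (gan_generator and denoise_step aren't any real model's API); it's only meant to show where the 25x multiplier comes from:

    # Sketch only: placeholder models, not a real StyleGAN or diffusion sampler.
    import torch

    def gan_generator(z):
        # Stand-in for a StyleGAN-style generator: one forward pass -> one image.
        return torch.tanh(z)

    def denoise_step(x, z, t):
        # Stand-in for one diffusion denoising step; a real sampler runs ~25+ of these.
        return x - 0.1 * (x - z)

    def interpolation_frames(z_start, z_end, n_frames=60, diffusion_steps=0):
        # Linearly interpolate between two latents and render each point.
        # GAN: 1 forward pass per frame -> 60 passes for a 60-frame video.
        # Diffusion: diffusion_steps passes per frame -> 60 * 25 = 1500 passes.
        frames = []
        for i in range(n_frames):
            alpha = i / (n_frames - 1)
            z = (1 - alpha) * z_start + alpha * z_end
            if diffusion_steps == 0:
                frames.append(gan_generator(z))
            else:
                x = torch.randn_like(z)  # start each frame from noise
                for t in range(diffusion_steps):
                    x = denoise_step(x, z, t)
                frames.append(x)
        return frames

    z_a, z_b = torch.randn(512), torch.randn(512)
    gan_video = interpolation_frames(z_a, z_b)                            # ~60 forward passes
    diffusion_video = interpolation_frames(z_a, z_b, diffusion_steps=25)  # ~1500 forward passes

On a CPU, that gap is roughly the difference between a video you can render over lunch and one you leave running overnight.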