
Maybe that's a lot to ask, but can someone explain to me, or point me to material for gaining a better understanding of, the principles behind diffusion probabilistic models? My knowledge of statistics is just too basic to even read these papers.



Not a lot to ask at all. The go-to person (and one of the pioneers of this type of model) is Alexia: https://twitter.com/jm_alexia

The article you want is probably this one: https://ajolicoeur.wordpress.com/the-new-contender-to-gans-s...

I would launch into a thorough explanation, but it would likely not be correct in every detail, because it's been around nine months since I was immersed in DDPM-type models. But broadly speaking, with normal training, your goal is to train a model (show it a bunch of examples) until it can guess the right answer most of the time in one try. Except that "guessing the right answer" is actually an easier problem than generating an image, because the model usually gives you its top-N guesses, so it says "I think it's a snake or a dog or an apple."

Whereas with generative images, it's much harder to come up with a technique that can be "sort of correct": if you generate a StyleGAN image, it either looks cool or looks like crap, and it's rather difficult to automatically take a crappy output and turn it into something that looks cool. (The "automatic" is key; there are manual, human-guided techniques that I'm quite fond of, and I'm amazed no one's turned them into a Photoshop-type plugin yet, but the field of ML seems to compete on and care about fully automatic solutions right now. For some reason.)

DDPM is the inverse: you have a trained model, you start with noise, and the output gets progressively closer to a cool-looking result over multiple refinement steps (i.e. multiple forward passes). That's as much as I remember, I'm afraid.
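Roughly, though, the sampling loop looks something like this toy sketch (the noise-prediction network `model`, the beta schedule, and the image shape are all illustrative assumptions, not anything from a particular codebase):

    import torch

    # Hypothetical noise schedule; real implementations pick their own.
    T = 1000
    betas = torch.linspace(1e-4, 0.02, T)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    def sample(model, shape=(1, 3, 32, 32)):
        x = torch.randn(shape)                       # start from pure noise
        for t in reversed(range(T)):                 # many small denoising steps
            eps = model(x, t)                        # model predicts the noise in x at step t
            coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
            mean = (x - coef * eps) / torch.sqrt(alphas[t])
            noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
            x = mean + torch.sqrt(betas[t]) * noise  # add a bit of noise back, except at the last step
        return x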


What are the manual / human-guided techniques you speak of?


I wanted to do a thorough writeup, but I never got around to it. Here are a bunch of examples of me using those techniques though: https://twitter.com/theshawwn/status/1182208124117307392

The dota community seemed to like it. :)

https://www.reddit.com/r/DotA2/comments/dfv0z3/im_making_a_n...

Basically, it was an interactive editor where you could slightly move along StyleGAN latent directions, combined with Peter Baylies' reverse encoder to morph the image toward a specific face on demand.
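In spirit it boiled down to something like this (all the names here, the generator G, the latent code w, the direction vector, are placeholders for illustration, not the actual tool):

    # w: latent code for the current image (e.g. recovered by projecting a
    # target face into the generator's latent space with a reverse encoder).
    # direction: a latent direction (smile, age, pose, ...). All placeholders.
    def nudge(G, w, direction, step=0.1):
        w_edited = w + step * direction   # small move along one latent direction
        return G(w_edited)                # re-render and show it to the human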

It was instantly so much better than any automated solution. It felt like being a pilot in front of the controls of a cockpit.


I can try to provide a high level intuition. For actual experts, please forgive my errors of simplification.

Drawing samples from "simple" distributions like the normal distribution is computationally easy - you can do it in microseconds. Sampling an arbitrary 1-D distribution is a bit harder: you have to invert its cumulative distribution function so that a uniform random variable can be pushed through the inverse to produce samples, or fall back on something like rejection sampling when that inverse isn't available.
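For example, inverse-transform sampling for a distribution whose inverse CDF is known in closed form (an exponential here, just to make it concrete):

    import numpy as np

    # Inverse-transform sampling for an Exponential(rate) distribution:
    # CDF F(x) = 1 - exp(-rate * x), so F^{-1}(u) = -ln(1 - u) / rate.
    def sample_exponential(rate, n):
        u = np.random.uniform(size=n)     # uniform samples on [0, 1)
        return -np.log(1.0 - u) / rate    # push them through the inverse CDF

    samples = sample_exponential(rate=2.0, n=10000)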

Sampling a high-D distribution (such as the distribution of images) is even harder - you need to learn a mapping from this high-D space back to "tractable" densities. This imposes some pretty harsh "optimization bottlenecks" when trying to contort the manifold of images onto a normal distribution. The whole point of the exercise is that the transformation respects a valid probability distribution, so you can start from the normal distribution and apply this mapping to get a valid sample from the image distribution. In practice this is pretty hard, and the quality of samples tends to be lower than that of other deep generative models which use fewer parameters.
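As a cartoon of that idea (not any particular model - just a single invertible affine map standing in for a learned stack of them), sampling and likelihood look like this:

    import numpy as np

    # Toy "flow": an invertible affine map x = f(z) = z * scale + shift.
    # Real models stack many learned invertible layers; this is only a cartoon.
    scale, shift = 2.0, 1.0

    def sample(n, d):
        z = np.random.randn(n, d)         # easy part: sample the base normal
        return z * scale + shift          # push it through the invertible map

    def log_prob(x):
        z = (x - shift) / scale           # invert the map
        log_base = -0.5 * (z**2 + np.log(2 * np.pi)).sum(axis=1)
        log_det = -x.shape[1] * np.log(scale)   # change-of-variables correction
        return log_base + log_det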

Now, instead of learning such a complex, hard-to-optimize transformation to valid densities, what if we instead learn a function E(x) that outputs a scalar "energy"? The energy is low for "realistic" images and high for unrealistic ones. Kind of like inverse probability, except it's not normalized - the energy value tells you nothing about likelihood unless you know the energy for every other possible image. This tends to be easier than learning densities, because the functional form of the energy function is unconstrained.

Furthermore, not knowing likelihoods doesn't stop you from getting a "realistic" image: all you need to do is descend the gradient, x -= d/dx E(x), which takes you to an image with lower energy (i.e. more realistic). Under certain procedures (e.g. adding some noise to the gradient step), this can be thought of as actually equivalent to sampling from a valid probability distribution, even though you can't compute its likelihood analytically.
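A minimal sketch of that sampling procedure - Langevin dynamics, with a made-up quadratic energy standing in for a learned network:

    import numpy as np

    # Toy energy function: low energy near the origin. In practice E(x) would be
    # a learned neural network scoring "realism"; this quadratic is a stand-in.
    def energy(x):
        return 0.5 * np.sum(x**2)

    def grad_energy(x):
        return x                          # d/dx of 0.5 * ||x||^2

    def langevin_sample(dim, steps=1000, step_size=0.01):
        x = np.random.randn(dim)          # start anywhere
        for _ in range(steps):
            noise = np.random.randn(dim)
            # gradient step toward lower energy, plus injected noise
            x = x - step_size * grad_energy(x) + np.sqrt(2 * step_size) * noise
        return x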

The diffusion probabilistic model you refer to can be thought of as such a model - the more steps you take (i.e. the more compute you spend), the better the quality of the samples.

GANs can be thought of as a one-pass neural network amortization of the end result of that iterative process, but unlike MCMC methods, they cannot be "iteratively refined" with additional compute. Their sample quality is limited to whatever the generator spits out, no matter how much extra compute you have.
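In code terms, the contrast with the loop above is just a single call (G here is a placeholder for any trained generator):

    import torch

    # GAN sampling: one forward pass, no iterative refinement knob to turn.
    def gan_sample(G, latent_dim=512):
        z = torch.randn(1, latent_dim)   # latent noise
        return G(z)                      # one shot; quality is whatever G gives you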


This sounds like you're describing energy-based models, not diffusion models.


Ah, thanks. I had mistakenly assumed the diffusion process here was comparable to, or a drop-in replacement for, the Langevin dynamics used with energy-based models.


I went and read the paper in more detail. Yeah, my original comment was way off-base, except insofar as you can draw a parallel with normalizing flows as iterated refinement (similar to iterated de-noising) and see DDPM as a more unconstrained form of that.

But at a surface level, there isn't a clear connection between DDPM and energy-based models.



See my comment above. I highly recommend this talk by Stefano Ermon from Stanford: https://www.youtube.com/watch?v=8TcNXi3A5DI



