My understanding of this tech is pretty minimal, so please bear with me, but is the basic idea something like this?

Before: Evaluate the image in a little region around each pixel against the prompt as a whole -- e.g. how well does a little 10x10 chunk of pixels map to a prompt about a "red sphere and blue cube". This is problematic because maybe all the pixels in that chunk are red, but you can't "see" whether they belong to the sphere or the cube.

After: Evaluate the image as a whole against chunks of the prompt. So now we're looking at a room, and then we patch in (layer?) a "red sphere" and then do it again with a "blue cube".

Is that roughly the idea?