The map-to-photo translation only seems to work because the limited training set was built from the exact photographic tiles corresponding to the vector map. The model is reproducing memorized features, not features that could actually be inferred from the map.
In all of these cases, the software is supplying details that are merely plausible in context, on the basis of its prior training, rather than details that are somehow known to be right. One analogy might be asking a human painter to complete a partial portrait of a person. The painter might guess at the person's likely posture and plausible items of clothing based on the information in the unfinished portrait, but of course the real person who sat for it might have been wearing something else entirely. The fact that the completion is plausible and self-consistent doesn't mean that it's correct.
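The point about memorization can be made concrete with a toy sketch. The "model" below (all names and data are hypothetical, invented for illustration) simply looks up the nearest memorized map-to-photo training pair. Specifics in its output, such as "3 parked cars", exist only in the memorized photos; nothing in the input map encodes them, so they are supplied from training priors rather than inferred:

```python
# Toy "map-to-photo" model that has memorized a tiny set of
# (map tile, photo tile) training pairs. Hypothetical data.
TRAINING_PAIRS = {
    "road|building": "asphalt, brick facade, 3 parked cars",
    "road|park":     "asphalt, green lawn, 2 oak trees",
    "water|road":    "river, concrete embankment, 1 moored boat",
}

def distance(a: str, b: str) -> int:
    """Crude distance between map tiles: count of mismatched features."""
    return len(set(a.split("|")) ^ set(b.split("|")))

def complete(map_tile: str) -> str:
    """Return the photo paired with the nearest memorized map tile."""
    nearest = min(TRAINING_PAIRS, key=lambda k: distance(k, map_tile))
    return TRAINING_PAIRS[nearest]

# A query tile the model never saw: the output still contains confident
# specifics (a facade, parked cars) copied from the closest training pair.
print(complete("road|building|plaza"))
# → asphalt, brick facade, 3 parked cars
```

The output is plausible and self-consistent, but the parked cars are an artifact of the training set, not a fact about the queried location; a real neural translation model fails in the same way, just less legibly.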