How much can we trust the thinking trace? At best it echoes what's in its training set, and Anthropic has shown that it's not necessarily an accurate account of how the model actually gets to its answer.
I tried this with what I thought was a very generic street image in Bangkok. It guessed the city correctly, saying that "people are wearing yellow which is used to honor the monarchy". Wow, cool. I checked the image again and there's a small Thai flag it didn't mention at all. It seems just as plausible, even likely, that it picked up on that instead.
I trust the thinking trace to show me the Python it runs.
(Though interestingly, I believe there are cases where it can run Python without showing you, which is frustrating, especially as I don't fully understand what those cases are. But I showed other evidence that it can do this without EXIF.)
In your example there, I wouldn't be at all surprised if it used the flag without mentioning it. The non-code parts of the thinking traces are generally suspect.
I couldn't attach the chat directly since it's a temporary chat.