It's pretty funny to test AI models in-distribution. But they fail horribly once you push them a bit[1].

I recently made LLMs play Minesweeper, and ALL the LLMs I tested had a pretty bad win-to-loss ratio. Like, the only model that won more than 3 times was R1 (mind you, there were 50 games).

[1] https://snats.xyz/pages/articles/minesweeper_bench.html