
It's pretty funny to test AI models in-distribution, but they fail horribly once you push them a bit beyond it [1].

I recently made LLMs play Minesweeper, and every LLM I tested had a pretty bad win-to-loss ratio. The only model that won more than 3 times was R1 (mind you, there were 50 games).

[1] https://snats.xyz/pages/articles/minesweeper_bench.html



