I asked ChatGPT about playing chess: it says tests have shown it makes an illegal move within 10-15 moves, even if prompted to play carefully and not make any illegal moves. It'll fail within the first 3 or 4 if you ask it to play reasonably quickly.
That means it can literally never win a chess match, given that an intentional illegal move is an immediate loss.
It can't beat a human who can't play chess.
It literally can't even lose properly.
It will disqualify itself every time.
--
> It shows clearly where current models shine (problem-solving)
Yeh - that's not what's happening.
I say that as someone that pays for and uses an LLM pretty much every day.
--
Also - I didn't fact check any of the above about playing chess.
I choose to believe.
Preventing an LLM from making illegal moves should be very simple: provide it with tool access to something that tells it if a move is legal or not, then watch it iterate in a loop until it finds a move that it is allowed to make.
I expect this would dramatically improve the chess-playing abilities of the competent tool-using models, such as o3.
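A minimal sketch of that loop, assuming the python-chess library; ask_llm_for_move is a hypothetical stand-in for whatever model API you're actually calling:

    import chess

    def ask_llm_for_move(fen: str, feedback: str) -> str:
        # Hypothetical stand-in for the actual LLM API call.
        raise NotImplementedError

    def get_legal_move(board: chess.Board, max_tries: int = 5) -> chess.Move:
        # Ask, validate, and feed the rejection back until the move is legal.
        feedback = ""
        for _ in range(max_tries):
            san = ask_llm_for_move(board.fen(), feedback)
            try:
                return board.parse_san(san)  # raises ValueError on illegal or garbled SAN
            except ValueError:
                feedback = f"{san} is not a legal move in this position. Pick another."
        # Give up and play any legal move rather than forfeiting.
        return next(iter(board.legal_moves))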
Nope. The list is very limited. For the starting position:
a3, a4, b3, b4, ..., h3, h4,
Na3, Nc3, Nf3, Nh3
That's 20 moves. The count grows a bit in the early middlegame, then drops again in the endgame. There do exist rather artificial positions with more than 200 legal moves, but the average number of legal moves in a position is around 40.
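Easy to verify with python-chess, if you'd rather trust code than either of us:

    import chess

    board = chess.Board()
    print(board.legal_moves.count())                       # 20
    print(sorted(board.san(m) for m in board.legal_moves))
    # ['Na3', 'Nc3', 'Nf3', 'Nh3', 'a3', 'a4', ..., 'h3', 'h4']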
I mentally counted the starting moves as 8 pawns × 2 = 16 pawn moves and 2 knights × 2 = 4 knight moves, but then I doubled it for both sides to get 40 (which with hindsight was obviously wrong), and then assumed that once the pawns had moved a bit there would be more options from non-pawn pieces.
With an upper bound of ~200 in edge cases, listing all possible moves wouldn't take up much room in the context at all. I wonder if it would give better results, too.
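Something like this, say (assuming SAN is the notation the model copes with best):

    import chess

    def legal_moves_for_prompt(board: chess.Board) -> str:
        # Render every legal move; even ~200 short strings is only a few hundred tokens.
        return "Legal moves: " + ", ".join(sorted(board.san(m) for m in board.legal_moves))

    print(legal_moves_for_prompt(chess.Board()))
    # Legal moves: Na3, Nc3, Nf3, Nh3, a3, a4, b3, b4, ...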
You could also constrain the output grammar to legal moves, but if we're comparing its chess performance to humans', it would be unfair to not let it think.
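A crude way to get both: let it ramble, then accept only the last thing in its output that parses as a legal move. Real grammar-constrained decoding would mask illegal tokens at sampling time instead; this is just a sketch, same python-chess assumption as above:

    import re
    import chess

    def extract_final_move(board: chess.Board, model_text: str) -> str | None:
        # Scan the model's free-form "thinking" backwards for a legal SAN move.
        legal = {board.san(m) for m in board.legal_moves}
        for token in reversed(re.findall(r"[\w+=#-]+", model_text)):
            if token in legal:
                return token
        return None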
I have tried playing chess with ChatGPT a couple of times recently, and I found it was making illegal moves after about 4 or 5 moves.
The first few could be resolved by asking it to check its moves. After a few more, I was having to explain that knights can jump and therefore can't be blocked. It was also trying to move pieces that weren't there, onto squares already occupied by its own pieces, and asking it to review was not getting anywhere.
10-15 moves is very optimistic, unless it’s counting each move by either side, i.e., White moves 5-8 times and Black moves 5-8 times. Even that seems optimistic, but the lower end could be right.
I just tried again, and ChatGPT did much better. A notification said it was using GPT-4o mini, and it reached move 10 for White (me) before it lost the plot.
It didn't get much further with suggestions to review. Also, the small ASCII board it generated was incorrect much earlier, but it sometimes plays without that, so I let that go.