On a similar note, I just updated LLM Chess Puzzles repo [1] yesterday.
The fact that gpt-4.5 solves 85% of them correctly is unexpected and somewhat scary (assuming the model was not trained on them).
So... it failed to solve the puzzle? That seems distinctly unimpressive, especially for a puzzle with a fixed start state and a limited set of possible moves.
I asked ChatGPT about playing chess: it says tests have shown it makes an illegal move within 10-15 moves, even if prompted to play carefully and not make any illegal moves. It'll fail within the first 3 or 4 if you ask it to play reasonably quickly.
That means it can literally never win a chess match, given that an intentional illegal move is an immediate loss.
It can't beat a human who can't play chess. It literally can't even lose properly. It will disqualify itself every time.
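For what it's worth, the kind of harness those tests imply is easy to sketch. Here's a rough version using the python-chess library (the library choice, the function name, and the SAN-string interface are my assumptions, not something from the tests mentioned above):

    import chess  # pip install python-chess

    def referee_move(board: chess.Board, san: str):
        """Parse the model's reply in SAN; return the Move if it's legal, else None."""
        try:
            return board.parse_san(san)  # raises a ValueError subclass on illegal/ambiguous moves
        except ValueError:
            return None

    board = chess.Board()   # standard starting position
    llm_reply = "Nf3"       # placeholder for whatever the model answered
    move = referee_move(board, llm_reply)
    if move is None:
        print("Illegal move -> immediate forfeit")
    else:
        board.push(move)    # accept the move and keep playing

The point is just that legality is trivially checkable, so the "disqualifies itself" failure mode is easy to measure.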
--
> It shows clearly where current models shine (problem-solving)
Yeh - that's not what's happening.
I say that as someone that pays for and uses an LLM pretty much every day.
--
Also - I didn't fact check any of the above about playing chess. I choose to believe.
Interesting. Personally, my thought process went like this:
- Check obvious, wrong moves.
- Ask what I need to keep in order to win the game, even if there's just the black king left. The answer is that I need all 3 pieces to eventually win, even if there's only the black king on the board.
- So any move that makes me lose my pawn or rook results in failure.
- So the only thing I can do with the rook is move it vertically. Any horizontal move allows Black to take my pawn. The king and pawn don't have many options, and they all result in losing the pawn or basically skipping a turn while making the situation a little worse, which makes a mate on the next move unlikely.
- Taking the pawn with the rook results in losing the rook, which is just as bad.
- Let's look at the square next to the pawn. I'd still be protecting my pawn, but my rook is in danger. But if Black takes the rook, I can just move my pawn forward to get a mate. If they don't, I can move the rook forward and get a mate. Solved.
So I skipped the trying-to-run-a-program and googling parts, not because they didn't come to my mind, but because I wanted a different kind of challenge than the challenge of extracting information from the internet or of running an unfamiliar piece of software.
It's weird to me that the author says this behavior feels human because it's nothing like how I solve this puzzle.
At no point during my process would I be counting pixels in the image. It feels very clearly like a machine that mimics human behavior without understanding where that behavior comes from.
"o3 does not just spit out an answer. It reasons. It struggles. It switches tools. It self-corrects. Sometimes it even cheats, but only after exhausting every other option. That feels very human."
I've never met a human player that suddenly says 'OK, I need Python to figure out my next move'.
I'm not a good player, usually I just do ten minute matches against the weakest Stockfish settings so as not to be annoying to a human, and I figured this one out in a couple of minutes because there are very few options. Taking with the rook doesn't work, taking with the pawn also doesn't, so it has to be a non-taking move, and the king can't do anything useful so it has to be the rook and typically in these puzzles it's a sacrifice that unlocks the solution. And it was.
So we are talking about the OpenAI o3 model, right?
My thought process was, let me try out a few moves that attack Black. There aren't that many, so... Kb8? No, not allowed. Kb7? No, not allowed. Pa7? No, that doesn't really help. Ra7? No, that doesn't lead to anything useful. Ra6? Oh, yeah, that works.
But I wasn't thinking in text, I was thinking graphically. I was visualizing the board. It's not beyond the realm of possibility that you can tokenize graphics. When is that coming?
On a side note, I tried to get Codex + o3 to make an existing sidebar toggleable with Tailwind CSS, and it made an abomination full of bugs. This is a classic "boilerplate" task I'd expect it to be able to do. Not sure if I'm doing it wrong but... slightly more direct instructions to o4-mini and it managed. The cost was astronomical though, compared to Anthropic.
This is cool, I built an AI Chess Coach to analyze games: https://chesscoach.dev/
LLMs are not chess engines, similar to how they don’t really calculate arithmetic. What’s new? Carry on.
O3 is massively underwhelming and is obviously tuned to be sycophantic.
Claude reigns supreme.
BTW, can someone tell me how you know who you are here? I'm reading:
> Chess Puzzle Checkmate in 2 White
does it mean we are white, or does it mean we're trying to checkmate white?
Is this that impressive, considering these models have probably been trained on numerous books/texts analyzing thousands of games (including Morphy's)?
I remember reading that gpt-3.5-turbo-instruct was oddly good at chess - would be curious what it outputs as the next two moves here.
Nice puzzle with a twist of Zugzwang. Took me about 8 minutes, but it's been decades since I was doing chess.
Where does this obsession with giving binary logic tasks to LLMs come from? New LLM breakthroughs are about handling fuzzy logic and imprecise requirements, and producing vague, human-realistic outputs. Who cares how well it can add integers or solve chess puzzles? We have decades of computer science on those topics already.
Because it's trained on human data.
I've committed the 03 (zero-three) instead of o3 (o-three) typo too, but can we rename it in the title please?
I just tried the same puzzle in o3 using the same image input, but tweaked the prompt to say “don’t use the search tool”. Very similar results!
It spent the first few minutes analyzing the image and cross-checking various slices of the image to make sure it understood the problem. Then it spent the next 6-7 minutes trying to work through various angles to the problem analytically. It decided this was likely a mate-in-two (part of the training data?), but went down the path that the key to solving the problem would be to convert the position to something more easily solvable first. At that point it started trying to pip install all sorts of chess-related packages, and when it couldn’t get that to work it started writing a simple chess solver in Python by hand (which didn’t work either). At one point it thought the script had found a mate-in-six that turned out to be due to a script bug, but I found it impressive that it didn’t just trust the script’s output - instead it analyzed the proposed solution and determined the nature of the bug in the script that caused it. Then it gave up and tried analyzing a bit more for five more minutes, at which point the thinking got cut off and displayed an internal error.
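For the curious, the hand-rolled solver it was attempting can be sketched in a few lines with the python-chess library. This is my own rough version, assuming the puzzle is a mate-in-two for the side to move; the function names and the FEN placeholder are mine, not anything o3 actually wrote:

    import chess  # pip install python-chess

    def gives_mate(board, move):
        """True if playing `move` delivers checkmate immediately."""
        board.push(move)
        mate = board.is_checkmate()
        board.pop()
        return mate

    def mate_in_two(fen):
        """Return a first move that forces mate in at most two, or None."""
        board = chess.Board(fen)
        for first in list(board.legal_moves):
            board.push(first)
            if board.is_checkmate():               # mate in one also counts
                board.pop()
                return first
            replies = list(board.legal_moves)
            forced = bool(replies)                 # stalemate refutes the line
            for reply in replies:
                board.push(reply)
                if not any(gives_mate(board, m) for m in list(board.legal_moves)):
                    forced = False                 # this defence has no mating answer
                board.pop()
                if not forced:
                    break
            board.pop()
            if forced:
                return first
        return None

    # fen = "..."  # the puzzle's FEN would go here; it isn't given in the thread
    # print(mate_in_two(fen))

Brute force is plenty here, since positions like this one only have a handful of legal moves to search.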
15 minutes total, didn’t solve the problem, but fascinating! There were several points where if the model were more “intelligent”, I absolutely could see it reasoning it out following the same steps.