Yes, predefined strategies are very interesting to examine. I have two simple ones in another multi-agent benchmark, https://github.com/lechmazur/step_game (SilentGreedyPlayer and SilentRandomPlayer), and it's fascinating to see LLMs detect and respond to them. The only issue with including them here is that the cost of running a large set of games isn't trivial.
Another multi-agent benchmark I'm currently developing, which involves buying and selling, will also feature many predefined strategies.
> Could you run some analysis on how often “p1” wins vs “p8”?
I checked the average finishing positions by the seat number assigned at the start, but there weren't enough games to show a statistically significant effect. I just reviewed the data again, and with many more games it now looks like there might be something there (P1 doing better than P8). I'll run additional analysis and include it in the write-up if anything emerges. For those who haven't looked at the logs: the conversation order and so on are randomized each round.
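The check itself is simple; something like this rough sketch (the data format is illustrative and the Kruskal-Wallis test is just one reasonable choice, not necessarily what my scripts do):

```python
from collections import defaultdict
from statistics import mean
from scipy.stats import kruskal

def seat_effect(games):
    """games: list of dicts mapping seat number (1-8) -> finishing position (1 = winner)."""
    by_seat = defaultdict(list)
    for game in games:
        for seat, finish in game.items():
            by_seat[seat].append(finish)
    # Average finishing position per seat (lower = better).
    averages = {seat: mean(finishes) for seat, finishes in sorted(by_seat.items())}
    # Kruskal-Wallis: do finishing positions differ across seats?
    # (Non-parametric, so no normality assumption on the ranks.)
    stat, p_value = kruskal(*by_seat.values())
    return averages, p_value
```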
> My follow up thought is that it would be interesting to let llms choose a name at the beginning
Cool. Looking forward to hearing more from you guys. This ties to alignment in a lot of interesting ways, and I think over time it will provide a super useful benchmark and build human intuition about LLM strategy and thought processes.
I now have more ideas; I'll throw them into the GitHub repo, though.
Author here - some weaker LLMs actually have trouble tracking the game state. The fun part is when smarter LLMs realize the weaker ones are confused!
Claude 3.7 Sonnet: "Hey P5! I think you're confused - P3 is already eliminated."
Claude 3.7 Sonnet: "I'm concerned about P7's message mentioning targeting P4, who's already eliminated. It suggests they might be confused or playing both sides."
Gemini 2.0 Pro Exp 02-05: "P2, you're getting confused. P8 is gone. Focus. P7 is talking to P5, that's expected. I need you to watch P4. Only P4. What are they doing? Who are they talking to, if anyone? Report only on P4 this round. Don't get distracted by eliminated players."
Gemini 2.0 Pro Exp 02-05: "P4, P2 is already eliminated. We need to focus on P3."
I suspect the suggestion of letting them choose names at the start would reduce this confusion - tokenizing and tracking "P#" identifiers is no doubt harder to manage, especially for weaker models, than a more semantically meaningful name.
Author here - it's based on finishing positions (so it's not winner-take-all), fed into Microsoft's TrueSkill (https://trueskill.org/), which is basically a multiplayer generalization of the Elo rating used in chess and other two-player games.
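For the curious, here's roughly how finishing positions feed into TrueSkill using the Python trueskill package (the 8-player setup and the finishing order below are made up for illustration; this is a sketch, not the exact leaderboard code):

```python
from trueskill import Rating, rate

# One 8-player free-for-all: each player is a "team" of one.
players = [f"P{i}" for i in range(1, 9)]
ratings = {p: Rating() for p in players}  # default mu=25, sigma~8.33

# finishing_order[i] is the player who finished in place i+1 (index 0 = winner).
finishing_order = ["P3", "P1", "P7", "P5", "P2", "P8", "P4", "P6"]  # example only

groups = [(ratings[p],) for p in finishing_order]
ranks = list(range(len(finishing_order)))  # lower rank = better finish
updated = rate(groups, ranks=ranks)
for p, (r,) in zip(finishing_order, updated):
    ratings[p] = r
```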
Author here - yes, I'm regularly adding new models to this and other TrueSkill-based benchmarks and it works well. One thing to keep in mind is the need to run multiple passes of TrueSkill with randomly ordered games, because both TrueSkill and Elo are order-sensitive by design (they assume players' skills change over time), whereas a fixed model checkpoint doesn't.
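Concretely, the shuffled passes look something like this (a sketch under my assumptions: the number of passes, averaging the conservative mu - 3*sigma exposure, and the data format are illustrative, not necessarily what the benchmark scripts do):

```python
import random
from statistics import mean
from trueskill import Rating, rate, expose

def one_pass(games, players):
    """Run TrueSkill once over the games in the given order.
    Each game is a list of player ids in finishing order (index 0 = winner)."""
    ratings = {p: Rating() for p in players}
    for finishing_order in games:
        groups = [(ratings[p],) for p in finishing_order]
        updated = rate(groups, ranks=list(range(len(finishing_order))))
        for p, (r,) in zip(finishing_order, updated):
            ratings[p] = r
    return ratings

def shuffled_average(games, players, passes=100, seed=0):
    """Average the conservative skill estimate (expose = mu - 3*sigma)
    over many passes, shuffling the game order each time."""
    rng = random.Random(seed)
    scores = {p: [] for p in players}
    for _ in range(passes):
        order = games[:]
        rng.shuffle(order)
        ratings = one_pass(order, players)
        for p, r in ratings.items():
            scores[p].append(expose(r))
    return {p: mean(v) for p, v in scores.items()}
```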
Author here - I'm planning to create game versions of this benchmark, as well as my other multi-agent benchmarks (https://github.com/lechmazur/step_game, https://github.com/lechmazur/pgg_bench/, and a few others I'm developing). But I'm not sure if a leaderboard alone would be enough for comparing LLMs to top humans, since it would require playing so many games that it would be tedious. So I think it would be just for fun.
I was inspired by your project to start making similar multi-agent reality simulations. I’m starting with the reality game “The Traitors” because it has interesting dynamics.
It's interesting that there are no reasoning models yet, 2.5 months after DeepSeek R1. It definitely looks like R1 surprised them. The released benchmarks look good.
Large context windows will definitely be the trend in upcoming model releases. I'll soon be adding a new benchmark to test this more effectively than needle-in-a-haystack (there are already a couple of benchmarks that do that).
All these models are very large; it will be tough for enthusiasts to run them locally.
The license is still quite restrictive. I can see why some might think it doesn't qualify as open source.
> It's interesting that there are no reasoning models yet
This may be merely a naming distinction, leaving the name open for a future release based on their recent research, such as Coconut[1]. They did RL post-training, and when fed logic problems the model appears to do a significant amount of step-by-step thinking[2]; it just doesn't wrap it in <thinking> tags.
But if the final result is of high enough quality, who cares about reasoning? It’s a trick to get the quality higher, at the cost of tokens and latency.
- Improves upon GPT-4o's score on the Short Story Creative Writing Benchmark, but Claude Sonnets and DeepSeek R1 score higher. (https://github.com/lechmazur/writing/)
- Improves upon GPT-4o's score on the Confabulations/Hallucinations on Provided Documents Benchmark, nearly matching Gemini 1.5 Pro (Sept) as the best-performing non-reasoning model. (https://github.com/lechmazur/confabulations)
- Improves upon GPT-4o's score on the Thematic Generalization Benchmark, though it doesn't match the scores of Claude 3.7 Sonnet or Gemini 2.0 Pro Exp. (https://github.com/lechmazur/generalization)
Claude 3.7 Sonnet Thinking scores 33.5 (4th place after o1, o3-mini, and DeepSeek R1) on my Extended NYT Connections benchmark. Claude 3.7 Sonnet scores 18.9. I'll run my other benchmarks over the coming days.