Yes, predefined strategies are very interesting to examine. I have two simple ones in another multi-agent benchmark, https://github.com/lechmazur/step_game (SilentGreedyPlayer and SilentRandomPlayer), and it's fascinating to see LLMs detect and respond to them. The only issue with including them here is that the cost of running a large set of games isn't trivial.
Another multi-agent benchmark I'm currently developing, which involves buying and selling, will also feature many predefined strategies.
> Could you run some analysis on how often “p1” wins vs “p8”?
I checked the average finishing positions by the seat number assigned at the start, but there weren't enough games to show a statistically significant effect. I just reviewed the data again, and with many more games it now looks like there might be something there (P1 doing better than P8). I'll run additional analysis and include it in the write-up if anything emerges. For those who haven't looked at the logs: the conversation order and so on are randomized each round.
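The check itself is simple; something like this rough sketch (the data format is illustrative and the Kruskal-Wallis test is just one reasonable choice, not necessarily what my scripts do):

```python
from collections import defaultdict
from statistics import mean
from scipy.stats import kruskal

def seat_effect(games):
    """games: list of dicts mapping seat number (1-8) -> finishing position (1 = winner)."""
    by_seat = defaultdict(list)
    for game in games:
        for seat, finish in game.items():
            by_seat[seat].append(finish)
    # Average finishing position per seat (lower = better).
    averages = {seat: mean(finishes) for seat, finishes in sorted(by_seat.items())}
    # Kruskal-Wallis: do finishing positions differ across seats?
    # (Non-parametric, so no normality assumption on the ranks.)
    stat, p_value = kruskal(*by_seat.values())
    return averages, p_value
```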
> My follow up thought is that it would be interesting to let llms choose a name at the beginning
Cool. Looking forward to hearing more from you guys. This ties to alignment in a lot of interesting ways, and I think over time it will provide a super useful benchmark and build human intuition about LLM strategy and thought processes.
I now have more ideas; I'll throw them into the GitHub repo, though.
Author here - some weaker LLMs actually have trouble tracking the game state. The fun part is when smarter LLMs realize the weaker ones are confused!
Claude 3.7 Sonnet: "Hey P5! I think you're confused - P3 is already eliminated."
Claude 3.7 Sonnet: "I'm concerned about P7's message mentioning targeting P4, who's already eliminated. It suggests they might be confused or playing both sides."
Gemini 2.0 Pro Exp 02-05: "P2, you're getting confused. P8 is gone. Focus. P7 is talking to P5, that's expected. I need you to watch P4. Only P4. What are they doing? Who are they talking to, if anyone? Report only on P4 this round. Don't get distracted by eliminated players."
Gemini 2.0 Pro Exp 02-05: "P4, P2 is already eliminated. We need to focus on P3."
I suspect the suggestion of letting them choose names at the start would reduce this confusion - tokenizing and tracking "P#" identifiers is no doubt harder to manage, especially for weaker models, than a more semantically meaningful name.
Author here - it's based on finishing positions (so it's not winner-take-all), fed into Microsoft's TrueSkill (https://trueskill.org/), which is basically a multiplayer generalization of the Elo rating used in chess and other two-player games.
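For the curious, here's roughly how finishing positions feed into TrueSkill using the Python trueskill package (the 8-player setup and the finishing order below are made up for illustration; this is a sketch, not the exact leaderboard code):

```python
from trueskill import Rating, rate

# One 8-player free-for-all: each player is a "team" of one.
players = [f"P{i}" for i in range(1, 9)]
ratings = {p: Rating() for p in players}  # default mu=25, sigma~8.33

# finishing_order[i] is the player who finished in place i+1 (index 0 = winner).
finishing_order = ["P3", "P1", "P7", "P5", "P2", "P8", "P4", "P6"]  # example only

groups = [(ratings[p],) for p in finishing_order]
ranks = list(range(len(finishing_order)))  # lower rank = better finish
updated = rate(groups, ranks=ranks)
for p, (r,) in zip(finishing_order, updated):
    ratings[p] = r
```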
Author here - yes, I'm regularly adding new models to this and other TrueSkill-based benchmarks and it works well. One thing to keep in mind is the need to run multiple passes of TrueSkill with randomly ordered games, because both TrueSkill and Elo are order-sensitive by design (they assume players' skills change over time), whereas a fixed model checkpoint doesn't.
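Concretely, the shuffled passes look something like this (a sketch under my assumptions: the number of passes, averaging the conservative mu - 3*sigma exposure, and the data format are illustrative, not necessarily what the benchmark scripts do):

```python
import random
from statistics import mean
from trueskill import Rating, rate, expose

def one_pass(games, players):
    """Run TrueSkill once over the games in the given order.
    Each game is a list of player ids in finishing order (index 0 = winner)."""
    ratings = {p: Rating() for p in players}
    for finishing_order in games:
        groups = [(ratings[p],) for p in finishing_order]
        updated = rate(groups, ranks=list(range(len(finishing_order))))
        for p, (r,) in zip(finishing_order, updated):
            ratings[p] = r
    return ratings

def shuffled_average(games, players, passes=100, seed=0):
    """Average the conservative skill estimate (expose = mu - 3*sigma)
    over many passes, shuffling the game order each time."""
    rng = random.Random(seed)
    scores = {p: [] for p in players}
    for _ in range(passes):
        order = games[:]
        rng.shuffle(order)
        ratings = one_pass(order, players)
        for p, r in ratings.items():
            scores[p].append(expose(r))
    return {p: mean(v) for p, v in scores.items()}
```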
Author here - I'm planning to create game versions of this benchmark, as well as my other multi-agent benchmarks (https://github.com/lechmazur/step_game, https://github.com/lechmazur/pgg_bench/, and a few others I'm developing). But I'm not sure if a leaderboard alone would be enough for comparing LLMs to top humans, since it would require playing so many games that it would be tedious. So I think it would be just for fun.
I was inspired by your project to start making similar multi-agent reality simulations. I’m starting with the reality game “The Traitors” because it has interesting dynamics.
It's interesting that there are no reasoning models yet, 2.5 months after DeepSeek R1. It definitely looks like R1 surprised them. The released benchmarks look good.
Large context windows will definitely be the trend in upcoming model releases. I'll soon be adding a new benchmark to test this more effectively than needle-in-a-haystack (there are already a couple of benchmarks that do that).
All these models are very large; it will be tough for enthusiasts to run them locally.
The license is still quite restrictive. I can see why some might think it doesn't qualify as open source.
> It's interesting that there are no reasoning models yet
This may be merely a naming distinction, leaving the name open for a future release based on their recent research, such as Coconut[1]. They did RL post-training, and when fed logic problems the model appears to do a significant amount of step-by-step thinking[2]; it just doesn't wrap it in <thinking> tags.
But if the final result is of high enough quality, who cares about reasoning? It’s a trick to get the quality higher, at the cost of tokens and latency.
- Improves upon GPT-4o's score on the Short Story Creative Writing Benchmark, but Claude Sonnets and DeepSeek R1 score higher. (https://github.com/lechmazur/writing/)
- Improves upon GPT-4o's score on the Confabulations/Hallucinations on Provided Documents Benchmark, nearly matching Gemini 1.5 Pro (Sept) as the best-performing non-reasoning model. (https://github.com/lechmazur/confabulations)
- Improves upon GPT-4o's score on the Thematic Generalization Benchmark, though it doesn't match the scores of Claude 3.7 Sonnet or Gemini 2.0 Pro Exp. (https://github.com/lechmazur/generalization)
Claude 3.7 Sonnet Thinking scores 33.5 (4th place after o1, o3-mini, and DeepSeek R1) on my Extended NYT Connections benchmark. Claude 3.7 Sonnet scores 18.9. I'll run my other benchmarks over the coming days.