It’d be great if someone would do that with the same data and prompt to other mo...

simonw · 2025-02-28T00:01:48 1740700908

Good call. Here's the same exact prompt run against:

GPT-4o: https://gist.github.com/simonw/592d651ec61daec66435a6f718c06...

GPT-4o Mini: https://gist.github.com/simonw/cc760217623769f0d7e4687332bce...

Claude 3.7 Sonnet: https://gist.github.com/simonw/6f11e1974e4d613258b3237380e0e...

Claude 3.5 Haiku: https://gist.github.com/simonw/c178f02c97961e225eb615d4b9a1d...

Gemini 2.0 Flash: https://gist.github.com/simonw/0c6f071d9ad1cea493de4e5e7a098...

Gemini 2.0 Flash Lite: https://gist.github.com/simonw/8a71396a4a219d8281e294b61a9d6...

Gemini 2.0 Pro (gemini-2.0-pro-exp-02-05): https://gist.github.com/simonw/112e3f4660a1a410151e86ec677e3...

Topfi · 2025-02-28T08:06:05 1740729965

Thanks for sharing. Purely as a matter of personal preference, I think the Gemini models did best on this task, which fits with my experience using Google's models to summarize extensive, highly specialized text. The Gemini 2.0 models also do especially well on needle-in-a-haystack-type tests, in my experience.

iamjs · 2025-02-28T00:44:46 1740703486

At a glance, none of these appear to be meaningfully worse than GPT-4.5.

mastercheif · 2025-02-28T03:53:59 1740714839

Seeing the other models, I actually come away impressed with how well GPT-4.5 is organizing the information and how well it reads. I find it a lot easier to quickly parse. It's more human-like.

unoti · 2025-02-28T05:14:39 1740719679

I noticed 4o mini didn't follow the directions to quote users. My favourite part of the 4.5 summary was how it quoted Antirez. 4o mini brought out the same quote, but failed to attribute it as instructed.

Agentlien · 2025-02-28T08:29:09 1740731349

It's fascinating: while this does mean it strays from the given example, I actually feel the result is a better summary. The 4.5 version is so long you might as well just read the whole thread yourself.

jwr · 2025-02-28T04:11:59 1740715919

I actually think the Claude 3.7 Sonnet summary is better.

hexa00 · 2025-02-28T19:59:01 1740772741

Yeah, I liked it too, especially at a tenth of the price lol

NitpickLawyer · 2025-02-28T07:53:19 1740729199

Interesting, thanks for doing this. I'd say that (at a glance) for now it's still worth using more passes with smaller models rather than one pass with 4.5.

Now, if you wanted to generate training data, I could see wanting the best answers possible, where even slight nuances would matter. 4.5 seems to adhere to instructions much better than the others. You might get the same result by generating n samples and having a mixture of models "reflect" on them, but then again you might not. Going through thousands of generations manually is also costly.

Agentlien · 2025-02-28T08:27:12 1740731232

Compared to GPT-4.5, I prefer the GPT-4o version because it is less wordy. It summarizes and gives the gist of the conversation rather than reproducing it along with commentary.