> completely irrelevant and I’m only interested in writing (chat, stories, etc) ...

afro88 · 2025-01-26T18:39:44 1737916784

Here's a better link: https://eqbench.com/creative_writing.html

The R1 sample reads way better than anything else on the leaderboard to me. Quite a jump.

polynomial · 2025-01-28T09:40:13 1738057213

Why is the main character named Rhys in most (?) of them? The Llama[1], Claude[3], Mistral[4] & DeepSeek-r1[5] samples all name the main character Rhys, even though that was nowhere specified in the prompt. GPT-4o gives the character a different name[6]. Gemini[2] names the bookshop person Rhys instead! Am I just missing something really obvious? I feel like I'm missing something big that's right in front of me.

[1] https://eqbench.com/results/creative-writing-v2/meta-llama__...

[2] https://eqbench.com/results/creative-writing-v2/gemini-1.5-f...

[3] https://eqbench.com/results/creative-writing-v2/claude-3-opu...

[4] https://eqbench.com/results/creative-writing-v2/mistralai__M...

[5] https://eqbench.com/results/creative-writing-v2/deepseek-ai_...

[6] https://eqbench.com/results/creative-writing-v2/gpt-4o-2024-...

exikyut · 2025-01-27T10:05:50 1737972350

Completely agree.

The only measurable flaw I could find was the errant use of an opening quote (‘) in

> He huffed a laugh. "Lucky you." His gaze drifted to the stained-glass window, where rain blurred the world into watercolors. "I bombed my first audition. Hamlet, uni production. Forgot ‘to be or not to be,' panicked, and quoted Toy Story."

It's pretty amazing I can find no fault with the actual text. No grammar errors, I like the writing, it competes with the quality and engagingness of a large swath of written fiction (yikes), I wanna read the next chapter.

NitpickLawyer · 2025-01-27T11:47:21 1737978441

> It's pretty amazing I can find no fault with the actual text. No grammar errors, I like the writing, it competes with the quality and engagingness of a large swath of written fiction (yikes), I wanna read the next chapter.

The lack of "gpt-isms" is really impressive IMO.

comrade1234 · 2025-01-26T15:52:41 1737906761

Those outputs are really good and come from deepseek-R1 (I assume the full version, not a distilled version).

R1 is quite large (685B params). I’m wondering if you can make a distilled R1 without the coding and math content. 7B works well for me locally. When I go up to 32B I seem to get worse results - I assume it’s just timing out in its think mode… I haven’t had time to really investigate though.