Yeah, we evaluated several models for grading ~1 year ago and concluded Mixtral was the best choice for us: it yielded the best results among the models we could self-host, which let us distribute the load of grading 1.2M+ answers over several GPU servers.
We would have liked to pick a neutral model like Gemini, which was fast, reliable and low cost; unfortunately it gave too many poor answers good grades [1]. If we had to pick a new grading model now, hopefully the much-improved Gemini Flash 2.0 would yield better results.
There are a lot of interesting options. Gemini 2 Flash isn't ready yet (the current limits are 10 RPM and 1500 RPD), but it could definitely work. An alternative might be using a fine-tuned model - I've heard good things about OpenAI fine-tuning with even a few examples.
Honestly, the fact that you used an LLM to grade the answers at all is enough to make me discount your results entirely. That it showed an obvious preference for the model with which it shares weights is just a symptom of the core problem, which is that you had to pick a model to trust before you even ran the benchmarks.
The only judges that matter at this stage are humans. Maybe someday, when we have models that humans agree are reliably good, you could use them to judge lesser-but-cheaper models.
Yup, I did an experiment a long time ago where I wanted a best-of-2 setup. I had Wizard, Mistral & Llama. They would each generate a response, and I would pass the responses to all 3 models to vote, in a fresh prompt with no reference to the previous conversation. 95%+ of the time they each voted for their own response, even when there was clearly a better one. LLM as a judge is a joke.
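Here's roughly what the setup looked like, as a minimal sketch; the chat(model, prompt) helper is hypothetical and you'd wire it to whatever inference endpoints you're actually running:

```python
import random

MODELS = ["wizard", "mistral", "llama"]

def chat(model: str, prompt: str) -> str:
    # Hypothetical helper: send `prompt` to `model` and return its text reply.
    raise NotImplementedError("wire this up to your own inference endpoint")

def run_round(question: str) -> dict:
    # Each model answers the question; the later vote prompt never says who wrote what.
    answers = {m: chat(m, question) for m in MODELS}

    # Shuffle and relabel answers A/B/C so judges can't key off position or author.
    labels = ["A", "B", "C"]
    order = MODELS[:]
    random.shuffle(order)
    label_to_model = dict(zip(labels, order))

    ballot = (
        "Question:\n" + question + "\n\n"
        + "\n\n".join(f"Answer {k}:\n{answers[m]}" for k, m in label_to_model.items())
        + "\n\nReply with the single letter of the best answer."
    )

    votes = {}
    for judge in MODELS:
        # Fresh prompt, with no reference to the judge's earlier generation turn.
        choice = chat(judge, ballot).strip()[:1].upper()
        votes[judge] = label_to_model.get(choice, "invalid")
    return votes  # e.g. {"wizard": "wizard", ...} across many rounds shows self-preference
```

Even with the answers anonymized and shuffled like this, each judge kept picking its own output.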
The Mixtral grading model calculates the original starting votes; these can then be further influenced by users voting on their preferred answers, which affects the leaderboard standings.
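Conceptually it's just a combined score per answer, something like this simplified sketch (illustrative field names and straight addition, not our actual schema or weighting):

```python
from dataclasses import dataclass

@dataclass
class Answer:
    model: str
    grader_votes: int  # starting votes assigned by the grading model
    user_votes: int    # net up/down votes from users, applied on top

def leaderboard(answers: list) -> list:
    # Rank models by combined score: grader baseline plus user adjustments.
    scores = {}
    for a in answers:
        scores[a.model] = scores.get(a.model, 0) + a.grader_votes + a.user_votes
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```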
It should be noted that Mixtral 8x7B didn't grade its own model very high (11th); its standout was grading Microsoft's WizardLM2 pretty high at #2. That's not entirely without merit, as at the time of its release it was Microsoft's most advanced model and the best open-source LLM available [1]. We also found it generated great, high-quality answers, so I'm surprised it isn't used more; it's only OpenRouter's 15th most-used model this month [2], although it's had very little marketing behind it, essentially just an announcement blog post.
Whilst nothing is perfect, we're happy with the grading system as it's still able to distinguish good answers from bad ones, good models from bad ones, and which topics models perform poorly on. Some of the grades are surprising, since we all have preconceptions about where models should rank before the results come in. That's also why it's important to have multiple independent benchmarks, especially ones LLMs aren't optimized for; I've often been disappointed by how some models perform in practice vs how well they do on benchmarks.
Either way, you can inspect the different models' answers yourself by paging through the popular questions [3]:
[1] https://pvq.app/posts/individual-voting-comparison#gemini-pr...