related: I imagine in the future we might have several "expert" LLMs, and a wrapper could delegate tasks to them as needed, as if each were a "tool". That way we get segregation of expertise: each individual model can excel at one single thing.
A prover model might be used as a tool in the coming future.
that's nice, but imagine first having models that are experts in specific domains. routing seems to be the easy part (just feed the available models as tools to your wrapper LLM)
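A minimal sketch of the "models as tools" idea: the wrapper picks an expert by name, here faked with keyword matching in place of an actual router LLM. All the tool names and the matching logic are hypothetical stand-ins.

```python
# Hypothetical registry of expert models exposed to the wrapper as "tools".
EXPERT_TOOLS = {
    "math_expert": "proofs and symbolic math",
    "code_expert": "programming questions",
    "general": "fallback for everything else",
}

def route(task: str) -> str:
    """Stand-in for the wrapper LLM's tool choice: keyword matching only."""
    t = task.lower()
    if any(k in t for k in ("prove", "integral", "theorem")):
        return "math_expert"
    if any(k in t for k in ("bug", "function", "compile")):
        return "code_expert"
    return "general"

print(route("prove that sqrt(2) is irrational"))   # math_expert
print(route("why won't this function compile?"))   # code_expert
```

In a real system the `route` step would itself be an LLM call with the tool descriptions in its prompt; the structure stays the same.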
MoE models route each token, in every transformer layer, to a set of specialized feed-forward networks (fully-connected perceptrons, basically), based on a score derived from the token's current representation.
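The per-token routing described above can be sketched in a few lines of NumPy: a gating matrix scores the experts from the token's representation, and the top-k experts' FFN outputs are mixed by those scores. Sizes and weights here are arbitrary toy values, not any real model's.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

# Toy weights: one gating matrix, plus a tiny two-layer FFN per expert.
W_gate = rng.normal(size=(d_model, n_experts))
experts = [(rng.normal(size=(d_model, 32)), rng.normal(size=(32, d_model)))
           for _ in range(n_experts)]

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_layer(token):
    # Score every expert from the token's current representation...
    scores = softmax(token @ W_gate)
    # ...route the token to only the top-k experts...
    chosen = np.argsort(scores)[-top_k:]
    # ...and combine their FFN outputs, weighted by the gate scores.
    out = np.zeros_like(token)
    for i in chosen:
        w1, w2 = experts[i]
        out += scores[i] * (np.maximum(token @ w1, 0) @ w2)
    return out

print(moe_layer(rng.normal(size=d_model)).shape)  # (16,)
```

The key point for this thread: the gate picks per token, per layer, inside one model, so the "experts" never correspond to human-legible domains.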
No. The experts are not separately trained, and while they may store different concepts, they are not meant to be experts in specific human-legible domains. There are, however, technologies for routing requests to different domain-expert LLMs, or even to fine-tuning adapters, such as RouteLLM.
First off, they are basically completely different technologies, so it would be disingenuous to act like it's an apples-to-apples comparison.
But a simple way to see it is that when you pick between multiple large models with different strengths, you have a larger number of parameters to work with (e.g. DeepSeek R1 + V3 + Qwen + LLaMA ends up being about 2 trillion total parameters to pick from), whereas "picking" the experts in an MoE gives you a smaller number of total parameters (e.g. R1 is 671 billion, Qwen is 235 billion).
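A back-of-the-envelope check of that "2 trillion" figure, using approximate public parameter counts in billions (the V3 and LLaMA sizes are my additions from public figures; exact numbers vary by model version):

```python
# Approximate total parameters (billions) of each model in the router pool.
router_pool = {"DeepSeek R1": 671, "DeepSeek V3": 671, "Qwen3": 235, "Llama 3.1": 405}
total_b = sum(router_pool.values())
print(f"router pool total: ~{total_b}B")  # close to 2 trillion to pick from

# An MoE's experts, by contrast, all live inside one model's budget,
# e.g. R1's 671B total, of which only a fraction is active per token.
```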
Many models that use test-time compute are MoEs, but "test-time compute" generally refers to reasoning about the prompt or problem the model is given, not to reasoning about which model to pick, and I don't think anyone has released an LLM router under that name.
> related: I imagine in the future we might have several "expert" LLMs, and a wrapper could delegate tasks to them as needed, as if each were a "tool". That way we get segregation of expertise: each individual model can excel at one single thing.
In the future? I'm pretty sure people do that already.
No, I disagree. I would want ChatGPT to abstract away the expert models (a biochemistry model, a coding model, a physics model), and maybe o3 would use these models as tools to come up with an answer.
The point is that a separate expert model would be better in its own field than a single model that tries to be good at everything. Intuitively it makes sense, and in practice I have seen anecdotes where fine-tuning a small model on ___domain data makes it lose coherence on other topics.