related: I imagine in the future we might have several "expert" LLMs, and a wrapper could delegate tasks to them as needed, as if each were a "tool". That way we get segregation of expertise: each individual model can excel at one single thing.
A prover model might be used as a tool in the coming future.
that's nice, but imagine first having models that are experts in specific domains. routing seems to be the easy part (just feed the available models as tools to your wrapper LLM)
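A minimal sketch of the "models as tools" idea: the wrapper picks an expert by name, here faked with keyword matching in place of an actual router LLM. All the tool names and the matching logic are hypothetical stand-ins.

```python
# Hypothetical registry of expert models exposed to the wrapper as "tools".
EXPERT_TOOLS = {
    "math_expert": "proofs and symbolic math",
    "code_expert": "programming questions",
    "general": "fallback for everything else",
}

def route(task: str) -> str:
    """Stand-in for the wrapper LLM's tool choice: keyword matching only."""
    t = task.lower()
    if any(k in t for k in ("prove", "integral", "theorem")):
        return "math_expert"
    if any(k in t for k in ("bug", "function", "compile")):
        return "code_expert"
    return "general"

print(route("prove that sqrt(2) is irrational"))   # math_expert
print(route("why won't this function compile?"))   # code_expert
```

In a real system the `route` step would itself be an LLM call with the tool descriptions in its prompt; the structure stays the same.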
MoE models route each token, in every transformer layer, to a set of specialized feed-forward networks (fully-connected perceptrons, basically), based on a score derived from the token's current representation.
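The per-token routing described above can be sketched in a few lines of NumPy: a gating matrix scores the experts from the token's representation, and the top-k experts' FFN outputs are mixed by those scores. Sizes and weights here are arbitrary toy values, not any real model's.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

# Toy weights: one gating matrix, plus a tiny two-layer FFN per expert.
W_gate = rng.normal(size=(d_model, n_experts))
experts = [(rng.normal(size=(d_model, 32)), rng.normal(size=(32, d_model)))
           for _ in range(n_experts)]

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_layer(token):
    # Score every expert from the token's current representation...
    scores = softmax(token @ W_gate)
    # ...route the token to only the top-k experts...
    chosen = np.argsort(scores)[-top_k:]
    # ...and combine their FFN outputs, weighted by the gate scores.
    out = np.zeros_like(token)
    for i in chosen:
        w1, w2 = experts[i]
        out += scores[i] * (np.maximum(token @ w1, 0) @ w2)
    return out

print(moe_layer(rng.normal(size=d_model)).shape)  # (16,)
```

The key point for this thread: the gate picks per token, per layer, inside one model, so the "experts" never correspond to human-legible domains.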
No. The experts are not separately trained, and while they may store different concepts, they are not meant to be experts in specific human-legible domains. There are, however, technologies for routing requests to different domain-expert LLMs, or even to fine-tuning adapters, such as RouteLLM.
First off, they are basically completely different technologies, so it would be disingenuous to act like it's an apples-to-apples comparison.
But a simple way to see it is that when you pick between multiple large models with different strengths, you have a larger number of parameters to work with (e.g. DeepSeek R1 + V3 + Qwen + LLaMA ends up being about 2 trillion total parameters to pick from), whereas "picking" the experts in an MoE gives you a smaller number of total parameters (e.g. R1 is 671 billion, Qwen is 235 billion).
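A back-of-the-envelope check of that "2 trillion" figure, using approximate public parameter counts in billions (the V3 and LLaMA sizes are my additions from public figures; exact numbers vary by model version):

```python
# Approximate total parameters (billions) of each model in the router pool.
router_pool = {"DeepSeek R1": 671, "DeepSeek V3": 671, "Qwen3": 235, "Llama 3.1": 405}
total_b = sum(router_pool.values())
print(f"router pool total: ~{total_b}B")  # close to 2 trillion to pick from

# An MoE's experts, by contrast, all live inside one model's budget,
# e.g. R1's 671B total, of which only a fraction is active per token.
```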
Many models that use test-time compute are MoEs, but "test-time compute" generally refers to reasoning about the prompt or problem the model is given, not to reasoning about which model to pick, and I don't think anyone has released an LLM router under that name.
> related: I imagine in the future we might have several "expert" LLMs, and a wrapper could delegate tasks to them as needed, as if each were a "tool". That way we get segregation of expertise: each individual model can excel at one single thing.
In the future? I'm pretty sure people do that already.
No, I disagree. I would want ChatGPT to abstract away the expert models (a biochemistry model, a coding model, a physics model), and maybe o3 would use these models as tools to come up with an answer.
The point is that a separate expert model would be better in its own field than a single model that tries to be good at everything. Intuitively it makes sense, and in practice I have seen anecdotes where fine-tuning a small model on ___domain data makes it lose coherence on other topics.