It’s really surprising how well this works. Intuitively this illustrates how over-parameterized many LLMs are, or conversely how under-trained they might be.
The drop and rescale method outlined in the paper makes the latent space increasingly sparse, which in turn allows weights to be merged without much interference or degradation.
My instinct is that while merging models will have some use cases, ultimately these insights will lead to innovations in training and architecture that have the same result but with better computational efficiency.
For example instead of training an 8x7b mixture of experts then merging, just incorporate the sparsity constraint while pre-training a single 7b model (somehow).
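For the curious, the drop-and-rescale operation itself is tiny. Here is a minimal PyTorch sketch, paraphrasing the idea rather than quoting the paper's code (the 0.8 drop rate and function name are just illustrative):

    import torch

    def drop_and_rescale(base: torch.Tensor, finetuned: torch.Tensor, p: float = 0.8) -> torch.Tensor:
        """Drop a random fraction p of the fine-tuning delta, rescale the rest."""
        delta = finetuned - base                                 # the "parameter delta"
        keep = torch.bernoulli(torch.full_like(delta, 1.0 - p))  # keep roughly 20% of the entries
        return base + delta * keep / (1.0 - p)                   # rescale survivors so the expected delta is unchanged

Because the rescaled sparse delta matches the original delta in expectation, most of the fine-tuned behavior survives even though most entries are zeroed.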
I tend to think there's an explore/exploit trade-off in model scale. Animals, including humans, have lots of extra neurons when they are young, which are then shed once the learning phase settles down. And this makes sense: thinking is energy intensive, so it's more efficient to sparsify. And, of course, we see a similar dynamic in ML: you can train a big model then prune it and do much better than you would by just training the smaller model directly.
I've got some geometric handwaving for why this works, as well. It's easier to find a low-energy solution when you have more parameters... Sparse solutions are higher energy, and thus require longer walks (and more commitment) during training.
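To make the "train a big model then prune it" step concrete, here is a hedged sketch of plain magnitude pruning (illustrative function and sparsity level, not any particular paper's recipe):

    import torch

    def magnitude_prune(weight: torch.Tensor, sparsity: float = 0.9) -> torch.Tensor:
        """Zero out all but the largest-magnitude weights."""
        k = max(1, int(weight.numel() * (1.0 - sparsity)))     # how many weights survive
        threshold = weight.abs().flatten().topk(k).values[-1]  # smallest surviving magnitude
        return weight * (weight.abs() >= threshold)            # mask out everything below the threshold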
Do you know if there is anybody who has made it their mission to shrink models to extremes? It feels like the sorta thing somebody would get really obsessed with doing, akin to those people who make executables out of a few bytes or shrink network payloads.
Some time ago, Andrej Karpathy posted a YouTube video about building GPT from scratch. Using the same dataset (a 1MB text file with all of Shakespeare’s works), the video goes from a very basic model to something like GPT-2.
What really stood out to me is that, while the typical wisdom is “more data -> better model”, in the video they use the exact same data for all the models, so all the gains are achieved only through improved algorithms and computing power.
With that in mind, I wonder if at some point someone will figure out an algorithm for a model that can be trained on just an English dictionary and get a decent LLM-equivalent model that can have a basic conversation. Given the small size of the data (a dictionary), I assume the model would be pretty small and quite fast to run, even on older or smaller machines.
Not for DL models, but I was exploring doing this in a specific setting (I've posted about this earlier). For small-sized models (for some reasonable definition of size, e.g., depth of a decision tree, or # trees in a gradient boosting forest, or # non-zero coefficients in a linear model), I realized that you could make them even smaller while retaining their accuracy by selectively presenting training data to them, i.e., ignore some training data points, repeat certain others. See [1] - the x-axis shows the original model size and the y-axis shows the model size obtained by this process at the same or better accuracy.
Interestingly, this meant that the conventional wisdom that the test and train distributions have to be identical for optimal held-out performance is not true at small model sizes. It is true as models grow larger, and I was explicitly able to show this. See [2], where the x-axis is model size, and the y-axis measures (on a scale of 0-1) how close the optimal training distribution is to the test distribution. The different lines are for different datasets.
These images are from the paper here [3]. I have a library too [4] that is in need of updates; it works today as-is, but please use the latest minor release if this is of interest [5]!
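For readers who just want the flavor of the idea without the library, here is a rough generic sketch (not the paper's actual method or the library's API): resample the training set so some points are dropped and others repeated, then look for the smallest tree that still matches the baseline's validation accuracy.

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

    # baseline: a relatively deep tree trained on the data as-is
    target_acc = DecisionTreeClassifier(max_depth=8, random_state=0).fit(X_tr, y_tr).score(X_val, y_val)

    rng = np.random.default_rng(0)
    best_depth = 8
    for depth in range(1, 8):
        for _ in range(50):  # try several resampled training distributions
            idx = rng.choice(len(X_tr), size=len(X_tr), replace=True)  # drop some points, repeat others
            small = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr[idx], y_tr[idx])
            if small.score(X_val, y_val) >= target_acc:
                best_depth = depth
                break
        if best_depth < 8:
            break

    print(f"baseline depth: 8, smallest matching depth found: {best_depth}")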
Ooh, parameter golfing! Maybe I will get into ML one day after all.
On a similar note, I once found a paper that explained the computability of artificial neural networks from the ground up, showing how few pieces you need to build a NAND gate... and of course once you have that you have everything.
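As a toy illustration of that point (my own example, not the paper's): a single artificial neuron with hand-picked weights computes NAND, and from NAND you can build any Boolean circuit.

    def nand_neuron(x1: int, x2: int) -> int:
        """A single neuron with hand-chosen weights and a step activation."""
        w1, w2, bias = -2, -2, 3
        return 1 if w1 * x1 + w2 * x2 + bias > 0 else 0

    for a in (0, 1):
        for b in (0, 1):
            print(a, b, nand_neuron(a, b))  # prints the NAND truth table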
> For example instead of training an 8x7b mixture of experts then merging, just incorporate the sparsity constraint while pre-training a single 7b model (somehow).
I'm thinking it would help reduce the network demands for gradient updates if you merge from time to time. That could unlock distributed training, like SETI@Home.
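Something like federated averaging, perhaps. A hedged sketch of the "merge from time to time" idea, with names of my own choosing rather than any existing framework's API: workers train locally and only exchange averaged weights every K steps, instead of shipping gradients on every step.

    import copy
    import torch

    def average_replicas(models: list[torch.nn.Module]) -> dict:
        """Average the weights of several locally trained replicas."""
        avg = copy.deepcopy(models[0].state_dict())
        for key in avg:
            stacked = torch.stack([m.state_dict()[key].float() for m in models])
            avg[key] = stacked.mean(dim=0).to(avg[key].dtype)
        return avg

    # every K local steps, each worker would reload the merged weights and keep training:
    # merged = average_replicas(workers)
    # for w in workers: w.load_state_dict(merged)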
I can see that. Unpredictability is the root of much anxiety. In academia, I'm the opposite. I'm thrilled by mystery and the unknown. It gives me this deep sense of profundity and gravitas bordering on religious experience. I'd love to learn more about that phenomenon. Anyway, it reminds me of that famous Newton quote:
"I do not know what I may appear to the world, but to myself I seem to have been only like a boy playing on the sea-shore, and diverting myself in now and then finding a smoother pebble or a prettier shell than ordinary, whilst the great ocean of truth lay all undiscovered before me."
That's certainly an interesting viewpoint on life, but I suspect it's a very rare one simply because most people would never have considered it and those who had would tend towards the curious.
You seem to be trapped in the interminable middle.
I don't feel trapped. There is going to be (and has been in ways) a small reckoning when people realize magic (ai) and all its unpredictability is very difficult to manage.
There is no field of "probabilistic UX" for example. How do you provide a consistent user experience when the underlying engine of your application is inconsistent?
Same goes for QA, testing, root cause analysis.
Adding features can have exponential side effects that cannot be predicted, which can be deadly at scale. Both figuratively and literally depending on how the technology is adopted.
> small reckoning when people realize magic (ai) and all its unpredictability is very difficult to manage.
Deploy it in the right places first. Most people don't realize it's the arts where this works the best. They're too focused on LLMs and reasoning, but the first verticals that work will all be image, video, audio, and games.
If it's imperfect, the human creator driving the creation can easily repair it. Nobody dies, no business is lost. Millions of hours are saved. Large, capital-intensive businesses get disrupted and democratized.
> There is no field of "probabilistic UX" for example. How do you provide a consistent user experience when the underlying engine of your application is inconsistent?
Sure there is. Every single in person store you go into and how staff are trained to interact with customers. It’s hard, and yet there are clear experts in it.
Mate, why would you need this explained in detail to you? You claimed that there is no field of probabilistic UX and that it would be difficult to "provide a consistent user experience when the underlying engine of your application is inconsistent" -- the other posters just gave you two highly analogous examples where humans have been dealing with inconsistent UX for thousands of years.
The horse is a horse. They were a major part of human transportation and trade networks for a very long time.
The point is that humans have millennia of experience working with and relying on animal intelligence. Animals actually provide an interesting model for contextualizing artificial intelligence; the parallels are inescapable once you start looking at it.
The equivalency breaks down quickly, and even when it's appropriate, it doesn't paint a rosy picture of what interacting with ai will be.
If using an app goes from the modern equivalent of mindless clicking (which has taken billions of dollars to get there, but we've all seen babies use iPads) to the equivalent of personally training and directing an animal, we're going to experience a massive degradation of quality in user experience because of it.
And re-training these models is magnitudes harder than redesigning a predictable UX around user data.
Not to mention people are not promising these models as horses pulling a carriage or walking with a man on top but as doctors and lawyers and psychiatrists and engineers and drone operators.
Sadly, deterministic ux code is unable to do everything. So typically we hire people to do the things computers can't do on their own. Those people have a user interface (speech and text), and a tendency to not perfectly execute your instructions, because the prescribed task is often under specified.
Likewise, humans have used animals to extend their capabilities for millennia, and, like other people, animals have their own sets of capabilities, needs, and shortcomings.
What they have in common, and I think ai will be like this as well, is that to get stuff done well you need to understand the capabilities and strengths of the mind you are working with. Working with minds requires some level of relationship with those minds.
If you have use cases where a CRUD approach is plenty, then you don't need a mind in the implementation, and that's totally fine! Let a mess of JavaScript do what a mess of JavaScript is good at. (though you'll certainly need a mind to implement it).
But we already work with minds that are experts - and smarter and more experienced than we are - in medicine, law, and so on. We are used to knowing that good results in these areas are built on building a bridge with another mind.
I'm curious why you would expect that we would ever want to replace, say, a psychiatrist with a button press? The expressivity possible with deterministic software is simply insufficient to the complexity of the problem space.
So are you saying that it's a good thing to have machines that are supposed to be tools tailor made for our needs that need as much consideration and prodding as an animal needs? Because that feels like hell. Software is already often frustrating enough with smug UX "designers" imposing their changes for the sake of change on users.
Whether anyone thinks it's a "good thing" or not is kind of irrelevant.
You're free to try and build regular deterministic software that can do the things GPT-4 or Midjourney can.
Hint: Some of the greatest minds tried this way for decades and failed. It's so bad that we abandoned GOFAI for NNs in NLP long before the emergence of the likes of GPT. We use connectionist neural networks today because they work, not because they were our first choice.
The plain truth of the matter is that all the General Real-World Intelligences we know, whether human, animal or now even silicon work this way, with a "probabilistic UI". In fact, the idea that it can work any other way has succeeded only in the realm of fiction and it wasn't for a lack of trying.
That isn't sufficient justification to force these tools into things that require some strict determinism. We've already seen cases of companies being forced to compensate users because their fancy AI toy was unpredictable and gave a user incorrect information: https://www.theguardian.com/world/2024/feb/16/air-canada-cha...
That was only customer service with little harm done, hammering these kinds of tools into medicine, defense or law is going to risk even worse consequences, and I fully expect people drawing false equivalences between animal intelligence and current generation AI will just attempt to dodge responsibility for their reprehensible decisions when they inevitably cause losses of life.
Lowering the stakes a bunch and coming back down to say, apps, if I have to talk to my phone to get it to do stuff that I used to just be able to do without having to hold a conversation with it, it'd be a clear downgrade. The reason Google Assistant and Siri on phones remained mostly just gimmicks is that it's just faster to search for something or enter an appointment manually than to ask for it to schedule it when you have a physical interface in your hand already. Forcing things like that might work out in the short term for our tech oligopolies, but it's only going to further increase the push for breaking them up.
As an example, I presented on the current state of LLMs to a group of non-techies at my workplace about how we could leverage LLMs to enhance our productivity, and the discussion with others who had tried them also concluded that they were simply too unpredictable to trust for anything where a human isn't immediately checking the output.
I fully expect that all these irresponsible false equivalencies will lead to software, and especially AI usage, being heavily regulated in the same way that construction or flight is.
>That isn't sufficient justification to force these tools into things that require some strict determinism.
By and large, LLMs are not being used for endeavours that strictly require or utilize determinism.
>We've already seen cases of companies being forced to compensate users because their fancy AI toy was unpredictable and gave a user incorrect information:
Regular old human customer service has given me incorrect information before and will continue to do so. If the harm is great enough, it has always been the company responsible for damages.
>That was only customer service with little harm done, hammering these kinds of tools into medicine, defense or law is going to risk even worse consequences
Medicine, law or defense are not areas that require or utilize strict determinism and in fact, incorrect diagnoses and botched operations kill at least thousands of people every year.
Perfect does not exist here so you don't need perfect to improve upon existing methods. If GPT-4 can give more accurate diagnostics than the average doctor then you are only killing more people by not utilizing it as part of the diagnosis process.
>if I have to talk to my phone to get it to do stuff that I used to just be able to do without having to hold a conversation with it, it'd be a clear downgrade.
Ok... where has that happened? What things do you now have to do with an LLM that you could do on your own before?
>The reason Google Assistant and Siri on phones remained mostly just gimmicks
The reason they remained gimmicks is that google assistant and Siri are not competent enough to do anything that isn't strictly hard-coded in.
>Regular old human customer service has given me incorrect information before and will continue to do so. If the harm is great enough, it has always been the company responsible for damages.
>Medicine, law or defense are not areas that require or utilize strict determinism and in fact, incorrect diagnoses and botched operations kill at least thousands of people every year.
>Perfect does not exist here so you don't need perfect to improve upon existing methods. If GPT-4 can give more accurate diagnostics than the average doctor then you are only killing more people by not utilizing it as part of the diagnosis process.
When a human makes a mistake, the human can be found responsible and in the vast majority of cases just having been informed about the mistake will ensure that they don't make that mistake again. Through our empathy we intrinsically understand enough of human intelligence to consider people to be relatively predictable. However the same cannot be said of AI. A current generation medical AI could simultaneously ace exams for getting an MD and confidently lie to patients about basic health knowledge for no discernable reason, on par with intentionally giving an untreated dysfunctional schizophrenic an MD.
See the recent discourse around the use of AI by Israel when it comes to defense, where clearly many people have concerns about using the AI to "morality wash", because all the humans involved can claim to just be following orders. Even if the AI is picking targets more precisely than a human would, if a human messes up, you can sack them, when an AI messes up, you just blame it on the AI, claim to fix it, and then move on to the next time it makes a mistake. You can't really hold the creators of the AI responsible, because they couldn't have predicted the failure mode, and you can't hold the person who executed the fire order responsible because they're just following the orders they've been told to follow and the information they've been told is supposed to hold merit.
>Ok...where has that happened ? What things do you now have to do with an LLM that you could on your own before?
The comment chain started with asking how such a "probabilistic UI" could be done. It hasn't been done yet, but it isn't exactly a stretch to believe that it'll eventually happen.
We've already had several examples of companies similarly getting rid of "legacy" interfaces to push their "modern" interface. Similarly with UX "designers" and their shitty design languages, eg the way so many websites are optimized for mobile screens, wasting most of the screen space on desktops/laptops.
>The reason they remained gimmicks is that google assistant and Siri are not competent enough to do anything that isn't strictly hard-coded in.
I see that we'll just have to agree to disagree, since I, and everyone I know, disagree on this. Even if Google Assistant or Siri were literal human secretaries, they'd be completely useless if they required being within arm's reach of a device where you can just type and read. Speech is simply not an efficient enough communication method, and if you can type, it's far more efficient to tap a few buttons (which most of us can learn to do with little conscious thought) and fill out some fields than to effectively chat with the software.
>When a human makes a mistake, the human can be found responsible
Where corporations or similar entities (law firms, Hospitals, etc, the vast majority of human workers) are concerned, humans are rarely held accountable. Getting fired is not being held accountable. That's just getting rid of a liability and is an option available for AI as well.
If neglect of proper quality processes causes a shipment of products to blow up, the compensation for a new product, possible medical bills, and other damages is almost never paid by the negligent human(s) but by the corporation they worked for.
A corporation can be sued into bankruptcy while its employees and even its CEO and/or founder are financially just fine.
>and in the vast majority of cases just having been informed about the mistake will ensure that they don't make that mistake again.
Perhaps. Perhaps not. It is far from guaranteed.
>However the same cannot be said of AI. A current generation medical AI could simultaneously ace exams for getting an MD and confidently lie to patients about basic health knowledge for no discernable reason, on par with intentionally giving an untreated dysfunctional schizophrenic an MD.
AI can have failure modes different from humans. This is true. This is also something we have dealt with and currently deal with. Animals don't have the same failure modes as humans either. We work with them still because they are useful all the same.
And frankly, this is becoming far less of an issue as LLMs are being scaled up. I do not think GPT-4 is really susceptible to this particular issue you have mentioned.
>Even if Google Assistant or Siri were literal human secretaries, they'd be completely useless if they required being within arms reach of a device where you can just type and read. Speech is simply not an efficient enough communication method.
Secretaries exist still for a reason. If you had one accessible via speech on your phone, you would do far more than what Siri and Google Assistant are capable of.
But sure, everyone will have different utilities for one.
> too unpredictable to trust for anything where a human isn't immediately checking the output.
Not just any human either, but one with enough specific ___domain knowledge to understand subtle failures.
It’s why I strongly recommend that junior developers, or anyone new to a language or framework, not use Copilot.
I call this the Babysitter Problem. There’s probably a more official academic term, but in general I don’t think any of these VC-funded AI startups are asking the right questions when it comes to UX.
Yep, this was a presentation to scientists so concerns about subtle failures came up.
Even when discussing the potential of applying LLMs to help non-native speakers fix grammar errors, there was the major caveat that the output would have to be carefully checked to ensure that it did not subtly change the meaning, which is something even I struggled with on my early papers, despite the benefits of being a native speaker with natural general intelligence and ___domain knowledge.
He's saying all minds are probabilistic UX. We handle them fine, and have built whole worlds on them.
The predictability and determinism of digital systems is the detour that is perhaps the flash in the pan. My sense is that the information ecology is about to get MUCH more ecological and fuzzy...
Differential equations are mathematically quite simple, yet they are mostly intractable and the systems they describe chaotic and unpredictable. The mathematics in AI works similarly, being deceptively simple yet giving birth to incredible complexity.
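A toy illustration of that "simple rule, unpredictable behaviour" point, using the logistic map (a discrete-time cousin of those differential equations; r = 3.9 puts it in the chaotic regime):

    def iterate_logistic(x: float, r: float = 3.9, steps: int = 50) -> float:
        """Iterate the logistic map x -> r * x * (1 - x)."""
        for _ in range(steps):
            x = r * x * (1 - x)
        return x

    print(iterate_logistic(0.2000000))  # two nearly identical starting points...
    print(iterate_logistic(0.2000001))  # ...end up in completely different places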
I think it's a bit like saying microcode in CPU underpins high-level programming languages - kind of, but good luck understanding what a program does based on its microcode translation. OTOH you can mess with microcode and maybe get better performance, but you won't really know before you try. It's similar with ML - much more like alchemy than science...
Lots of reasons I suppose. Outside script-kiddying with mergekit, it takes some serious knowledge to properly train anything, and expensive amounts of compute to actually do it or even run it in the end. It's not the most accessible thing.
For classical methods, once you've got the algorithm nailed, it will work 100% of the time. For probabilistic methods, you do get better results most of the time, but they can also screw up randomly for no reason, so their deployability in production is hell on wheels. It's infuriating at times.
Still can't argue against it being very fascinating.
Summary: Fine-tune a foundation model for a task. How did all the weights change? That set of changes is called the "parameter delta." These changes are highly redundant. You can carefully (use DARE to) revert like 80% of them, yet maintain fine-tuned task accuracy! But only if the tuned weights didn't shift much; otherwise DARE fails. Maybe you can make an LM polymath by melting together many fine-tunes of some base model. No GPU needed.
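The "melting together" step is then just adding each fine-tune's dropped-and-rescaled delta back onto the shared base. A rough sketch, assuming plain dicts of floating-point weight tensors (my paraphrase, not the paper's code); it is ordinary CPU tensor arithmetic, hence no GPU needed:

    import torch

    def merge_finetunes(base: dict, finetunes: list[dict], drop_p: float = 0.8) -> dict:
        """Merge several fine-tunes by summing their sparsified deltas onto the base."""
        merged = {k: v.clone() for k, v in base.items()}
        for ft in finetunes:
            for k in merged:
                delta = ft[k] - base[k]                                       # this fine-tune's parameter delta
                keep = torch.bernoulli(torch.full_like(delta, 1.0 - drop_p))  # drop most entries at random
                merged[k] += delta * keep / (1.0 - drop_p)                    # rescale the survivors
        return merged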
Image models support merging already. There exist thousands of StableDiffusion models for that reason alone. The downside is that almost all the models you see are now inbred. The community does talk about it and can see the effect this is having on image quality. A prominent example: you will see the exact same Japanese-looking girl's face from almost all models out there when you generate an image of a woman. Check out Civitai to see what I am talking about. It's not easy to train new models, but it's super easy to merge.
We can expect a similar explosion of LLMs using this technique. And later, at some point, perhaps the same degradation of quality. Or maybe all those LLMs will just be saying the same things by then.
Image model merging is kind of meh though. You can't really merge two SD models trained on different styles with different keywords and get a model that knows both independently.
Never done that myself, but I always thought that was the point of these merges. Maybe it doesn't understand keywords after the merge, but I think they do keep the styles in some form. What's the point of merges if that wasn't possible? People sometimes share how much of each model they merged, by a factor.
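For what it's worth, "merged by a factor" usually just means a straight weighted interpolation of the two checkpoints' weights, something along these lines (names illustrative):

    import torch

    def weighted_merge(state_a: dict, state_b: dict, alpha: float = 0.3) -> dict:
        """Interpolate two same-architecture checkpoints: alpha = 0.3 means 70% A, 30% B."""
        return {k: (1.0 - alpha) * state_a[k] + alpha * state_b[k] for k in state_a}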
Let's say I trained a model on the artworks of fiona staples that is invoked by typing "fiona staples style". Then I trained a model on the artworks of james daly that is invoked by "james daly style".
What I want when I merge such models is a model that can generate in the art style of either fiona or james independently, or a mix of both if I specify both keywords in the same prompt.
Currently, if you merge these 2 models and generate "busy city street, fiona staples style", you will not get a model that can generate works in fiona's style, you will just get a model that will generate an odd mix of fiona and daly even if you only specify one of them.
It means you either need to train a million different models for a million different concepts with no chance of cross usage (e.g x person wearing y clothes in z style will not be possible) or train those concepts at the same time, which becomes very cumbersome requiring a retrain of n+1 concepts on a fresh model anytime you want to introduce a new concept.
Oh, and training on x then training on y doesn't work in practice either, because the model will mostly forget x while learning y.
The weights need to be connected to something interesting. Pre-training is how they get them all connected up, and fine-tuning is how they find which weights it would be useful to change.
Pretraining determines the weights (i.e. the connections); fine-tuning lets you change some subset of the weights (e.g. the final layers) with a smaller chunk of task-specific data.
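A tiny PyTorch sketch of that "change some subset of the weights" point, with a toy architecture and illustrative sizes, where only the task head gets updated:

    import torch

    model = torch.nn.Sequential(
        torch.nn.Linear(512, 512),   # stand-in for pretrained layers (frozen)
        torch.nn.ReLU(),
        torch.nn.Linear(512, 10),    # task head (trainable)
    )
    for p in model[0].parameters():
        p.requires_grad = False      # keep pretrained weights fixed during fine-tuning

    optimizer = torch.optim.AdamW(
        [p for p in model.parameters() if p.requires_grad], lr=1e-4
    )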
It would be cool if there were a threshold in training from scratch after which this actually works for adding higher-level knowledge. So you would start training it like usual to get it to absorb generic language skills and reasoning, but save all the ___domain knowledge absorption for merging later.
Does this imply that some type of decentralized training mechanism is possible? Like an accumulation for ML models. I suspect in the limit
You will just have even more massive models, which will place even more demand on the hardware. I also wonder if new capabilities emerge from the merging that are not present in any one model.