To those curious about the tradeoffs between transformer and state space model layers, I highly recommend Sasha Rush's video on it: https://www.youtube.com/watch?v=dKJEpOtVgXc
They use less memory for inference but remember details less well. For instance, if you're implementing code and ask for edits, the model will forget that various functions are part of the script. Even transformers aren't perfect at this, and SSMs are worse. For many use cases that ability isn't needed as much, so the memory savings are the bigger lever.
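To put rough numbers on that tradeoff, here's a back-of-envelope sketch; all layer counts and dimensions below are made-up, illustrative values, not any particular model's specs:

    # Back-of-envelope comparison; the shapes here are illustrative assumptions,
    # not the specs of any particular model.
    def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
        # A transformer keeps a key and a value vector per token, per layer.
        return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

    def ssm_state_bytes(n_layers, d_inner, d_state, bytes_per_elem=2):
        # An SSM layer carries a fixed-size state no matter how many tokens it has seen.
        return n_layers * d_inner * d_state * bytes_per_elem

    print(kv_cache_bytes(32, 8, 128, 128_000) / 1e9)  # grows with context: ~16.8 GB at 128K tokens
    print(ssm_state_bytes(32, 8192, 16) / 1e9)        # constant: ~0.008 GB

The flip side, as you say, is that the fixed-size state has to be lossy about what it keeps.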
Has anyone gotten this to work on Linux using one or two 4090s? I get stuck on "Loading checkpoint shards: 71%" and then it bails. But weirdly, nvidia-smi shows plenty of VRAM available. My machine has 256 GB of RAM, so I don't think that's the problem either. Really excited to try this one.
It's great to see a full production-level model using Mamba. But when it comes to long-context benchmarks, I'd love to see performance as well as throughput. I was under the impression that Mamba has huge increases in throughput at the cost of modest losses in accuracy when using long contexts.
I would too -- long context has been such a red herring across providers, Claude 3 is the first I've seen that seems to genuinely have some sort of qualitative leap in noticing things.
It is worth noting I'm fairly sure there's no inherent theoretical decrease in accuracy at long contexts; the claimed theoretical change is an _increase_ in long-range accuracy at long contexts.
Every long context sucks right now. All the model providers benchmark on fact recall which is very limited. Actual ability to do anything complicated beyond 16k tokens is not present in any current model I have seen.
This is not current. GPT-4-Turbo (128k) has lossless recall to the first 64k input tokens and produces output indistinguishable from GPT-4 (32k), though both are limited to 4k output tokens.
Several downsides: Recall accuracy past the first 64k tokens suffers badly; Cost is astronomical; Response latency is too high for most interactive use-cases.
I would point out the astounding leap in input context in just one year. Should we assume effectively-infinite (RAG-free) context in the near-future?
This is grossly untrue in a way that denotes surface-level familiarity on several fronts
You're referring to the needle-in-a-haystack retrieval problem.
Which the person you're replying to explicitly mentioned is the only benchmark providers are using, for good reason.
Consider the "translate Moby Dick to comedic zoomer" problem. This does not even come remotely close to working unless I do it in maximum chunks of 5,000 tokens.
Consider the API output limit of 4096 tokens, across all providers.
And no, you shouldn't assume effectively infinite (RAG free) context in the near future. This time last year, Anthropic was demonstrating 120,000 token context. It released 200K a few weeks ago. And runtime cost scales with N^2.
It’s pretty good at blending the text chunks, though, up to a point. It’s like compression: after a while of passing in chunks, your running summary gets too generalized and you lose resolution.
Long context is great and all, but it sucks that all of these LLMs have really poor output length. If I feed something an entire book and ask for a comprehensive summary, then I'm expecting at least a full 3-page summary. I get that they try to force these things to be "concise" to save on compute, but good lord it's so annoying.
Have you tried asking it for a specific concrete length, like a number of words? I was also frustrated with concise answers when asking for long ones, but I found that the outputs improved significantly if I asked for e.g. 4000 words specifically. Further than that, have it break it down into sections and write X words per section.
Yes, all the possible length-extending custom instructions you can think of, plus multi-shot example prompts using multiple USER and GPT exchanges to define the format. I can get some reasonable-length responses out of it, but I've never seen them go over a page. GPT-4 seems to have a hard limit on how much it will output when you click "continue", and Claude Opus never goes over a page either. Another user pointed out using the API, which I have done in the past, but it's been a long while and I can't really justify the cost of using the advanced models via API for my general use.
Everyone's coalescing at a max of 4096 output tokens, about 12 "pages", via API (a page being 250 words, i.e. one double-spaced 8.5"x11" page).
To your point, it doesn't matter anyway: it's nigh impossible to get over 2K tokens of output with every trick and bit of guidance you can think of. (I got desperate when the 16K-output / 48-page models came out and tried to "make it work"; even completely deforming tricks, like making it number each line and write a reminder on every line that it should write 1,000 lines, don't work.)
I wouldn't say that; my latest big user story for making sure I'm handling huge inputs was "translate Moby Dick to zoomer". I can't give any service chunks larger than ~5K tokens, over API, without it failing.
(Miserably, too; I'd be fine if it gave a paragraph back. But at least on this "map" task, there's a critical point where there's so much input that the reward function ends up imitating the input instead of chatting.)
This one should have you covered :-) one out of every eight layers is a traditional Transformer layer, which should ensure precision, at least over short distances.
I mean "short" in comparison to the unlimited but lossy recall that the Mamba blocks provide. Transformers are limited to the context length, while Mamba can carry state along. It can remember things from much farther back, but its state is finite, so it must eventually drop things and/or lose precision.
> Jamba boasts an extensive context window of 256K tokens, equivalent to around 210 pages of text, while fitting up to 140K tokens on a single 80GB GPU.
I realize this is a big improvement, but it's striking how inefficient LLMs are: you need 80 GB of GPU memory to analyze less than 1 megabyte of data. That's a lot of bloat! Hopefully there's a lot of room for algorithmic improvements.
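For what it's worth, the bulk of that 80 GB is the weights, not the text. A rough accounting, assuming the reported ~52B total parameters and guessing at the attention layer count and KV shapes:

    # Rough accounting (assumed: ~52B total params; attention layer count and
    # KV head count/dim are guesses for illustration).
    weights_gb_fp16 = 52e9 * 2 / 1e9                     # ~104 GB in fp16, ~52 GB at 8-bit
    # Only a fraction of the layers are attention layers, so the 140K-token KV cache is small:
    kv_cache_gb = 2 * 4 * 8 * 128 * 140_000 * 2 / 1e9    # ~2.3 GB for 4 attention layers
    print(weights_gb_fp16, kv_cache_gb)

So the "bloat" is almost entirely parameters, and it barely grows as you add more text.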
It’s kinda simulating our brains, but not really. When I attempted to dig more into how neurons work, I realised it’s a massive chasm of difference. Very much worth doing if you haven’t (you might know far better than me; this is for people who don’t yet).
In terms of results:
Our brains work with 20 W of power and can be trained to compete with LLMs using a tiny fraction of the world’s data. They also have to keep you breathing and your blood pumping, and manage all the dangers of catching a ball near traffic. Or skiing, or poetry, or sunsets. And they remember stuff five minutes later and don’t need a training run that takes months.
We have SO many opportunities to improve the AI architecture it’s ridiculous. This is a good thing.
To be fair most of the brain is more like a pretrained model — it isn't being trained at any point after conception to keep your blood pumping or your lungs working, it does that out of the box roughly as soon as you sprout those organs (or the minute you're born, in the case of lungs). The training process was billions of years of evolution. And, well, given fairly persistent cross-cultural cognitive biases, I expect the conscious thought parts are starting from a pretrained model, too, and all we're doing in school is finetuning ;)
People don't understand that to simulate a single neuron, you need an entire neural network. So 70 billion parameters might at best be equivalent to a million neurons, and that is assuming your neural network architecture is akin to the connections between neurons. Considering the physical sparsity, you might need even more parameters to model the connections of a biological neural network, so fewer than a million neurons in practice.
The big (huge?) memory requirement is during training. These LLMs work with high-dimensional vectors; they calculate gradients with respect to those vectors, and they do updates that require the optimizer's state. If you have 3 particles in 3 dimensions and you need their forces, that creates 3 new 3D vectors, and once you update their positions along the forces they also carry momenta. Now generalize that simple 3-body physics to the typical 60-layer creatures inside an LLM, with vectors of several thousand dimensions, interactions/weights that scale like the squares of those vectors, and a total parameter count in the tens to hundreds of billions, then take derivatives and start keeping track of momenta. It is a feat of modern engineering that some groups can train such models efficiently. I hope we will see more of the training stories become public in the near future.
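A minimal sketch of that accounting for the parameter-related state alone (the usual mixed-precision + Adam recipe; activation memory comes on top and depends on batch and sequence size):

    # Parameter-related training state with Adam in mixed precision:
    # fp16 params + fp16 grads + fp32 master copy + two fp32 Adam moments.
    def training_state_gb(n_params):
        return (2 + 2 + 4 + 4 + 4) * n_params / 1e9   # bytes per parameter, summed

    print(training_state_gb(52e9))   # ~832 GB before any activations, hence sharding across many GPUs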
Not sure what you mean by wrong. I have never yet encountered a case where training an LLM (no matter what the architecture) had modest memory requirements; I was pointing out that the typical memory requirements for training are much higher still than the typical requirements for inference.
1. How many tokens can 'traditional' models (e.g. Mistral's 8x7B) fit on a single 80GB GPU?
2. How does quantization affect the single transformer layer in the stack? What are the performance/accuracy trade-offs that happen when so little of the stack depends on this bottleneck?
Mixtral 8x7b runs well (i.e., produces the correct output faster than I can read it) on a modern AMD or Intel laptop without any use of a GPU - provided that you have enough RAM and CPU cores. 32 GB of RAM and 16 hyperthreads are enough with 4-bit quantization if you don't ask too much in terms of context.
P.S. Dell Inspiron 7415 upgraded to 64 GB of RAM here.
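If anyone wants to reproduce this, a minimal sketch using the llama-cpp-python bindings; the GGUF filename is a placeholder for whichever 4-bit quant you downloaded:

    # Minimal sketch with llama-cpp-python (pip install llama-cpp-python).
    # The model_path is a placeholder, not a specific recommended file.
    from llama_cpp import Llama

    llm = Llama(
        model_path="mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf",
        n_ctx=4096,      # keep the context modest to stay within 32 GB of RAM
        n_threads=16,    # one per hyperthread
    )

    out = llm("Q: Summarize the Mamba architecture in two sentences. A:", max_tokens=128)
    print(out["choices"][0]["text"])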
Good! DNNs unlock semantics (parsing, transforming, producing). That's the basis of general intelligence, not encyclopedic random string recall. Models shouldn't burn ungodly quantities of compute emulating DDR5 with their working memory. We need machines that think better, not memorize well. We already have plenty of those.
Massive context windows, and their needle tests, are misguided. We won't reach human-level AGI by basically inventing a natural language RDBMS. Our resources should primarily target better reasoning systems for our models, reinforcement learning, etc.
If we can build a GPT4-level problem solving system that coincidentally also can't remember telephone numbers, I'll consider it major progress.
Memorization usually refers to training data. It's often useful to have something that can utilize instructions losslessly, which is the distinction between these models.
What if your field of vision were infinite and you were looking at an unrolled telephone book?
Would you need a device to remember the phone number? You wouldn't. You would need a method or algorithm to find the number, but there is no reason why that algorithm couldn't be part of the attention mechanism. The attention mechanism is akin to reading the entire phone book for every word you are about to say. It would be unreasonable to expect you to not find the right phone number eventually.
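That's essentially what scaled dot-product attention does. A toy version (numpy, toy sizes) to make the "read the whole phone book for every word" point concrete:

    import numpy as np

    # Toy scaled dot-product attention: every query scores every key,
    # which is why the cost grows with (sequence length)^2.
    def attention(Q, K, V):
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)                  # (n_queries, n_keys): every entry gets "read"
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V                             # weighted lookup over the whole "phone book"

    n, d = 8, 16                                       # 8 tokens, 16-dim heads (toy sizes)
    Q = K = V = np.random.randn(n, d)
    print(attention(Q, K, V).shape)                    # (8, 16)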
I’m pretty sure computational chemists have been combining NNs with Kalman filters for a while now… I recall the issue was that it was slow due to the N^2 size of the covariance matrix.
Jamba-v0.1-hybrid-MoE (16x6B?) is like giving a big NOS boost to a Mixtral-8x7B-tier LLM. If it really delivers 256K context, 3x longer, faster and cheaper than anything else, it should mean an end to the One Model To Rule Them All mindset for now. The big boys will have to offer some version of it as a separate but closely integrated sidekick to their hero offering.
On a side note: working over longer contexts also reminds me of MemGPT (https://github.com/cpacker/MemGPT)
I think a similar concept can be applied to Mamba architecture models too.
Does this mean that I can continue a chat without needing to send a full transcript? This feels like it could make inference a lot cheaper for multi-step dialogs.
Mamba is supported in llama.cpp, so it should be. (Edit: apparently this isn't strictly the Mamba architecture; it's a mix of Mamba and transformer layers, so it looks like it would have to be ported to llama.cpp.)
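Conceptually, yes: with a recurrent backbone you'd only need to checkpoint the state between turns rather than resend the transcript. A toy, self-contained sketch (the "model" here is an invented linear recurrence, purely for illustration; as far as I know, no serving API exposes this today):

    import numpy as np

    # Toy recurrent "model": the point is only that resuming a dialog needs the
    # saved state vector, not the full transcript. All weights/shapes are made up.
    rng = np.random.default_rng(0)
    d_state, vocab = 32, 256
    A = rng.standard_normal((d_state, d_state)) * 0.05
    B = rng.standard_normal((d_state, vocab)) * 0.05

    def step(token_id, state):
        x = np.zeros(vocab)
        x[token_id] = 1.0
        return A @ state + B @ x          # next state (a real model would also emit logits)

    state = np.zeros(d_state)
    for t in [1, 2, 3]:                    # first user turn
        state = step(t, state)

    saved = state.copy()                   # persist this between requests instead of the transcript

    state = saved                          # later: resume without resending tokens 1..3
    for t in [4, 5]:                       # next user turn
        state = step(t, state)
    print(state.shape)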
Would a 192 GB RAM Mac Studio, or even a 7950X with 192 GB of RAM, be practical for running this model for inference and possibly fine-tuning? Especially since I don't need very low latency, e.g. 1 token per second is fine for inference. I also have two 3090s.
You could run PyTorch on CPU, and with a ~12B-parameter activation pass it might even run relatively fast (8 tok/s?), but a q4 quant would also easily fit on 2x3090s and should run at >60 tok/s.
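Quick back-of-envelope for the 2x3090 claim (assuming ~52B total parameters; the overhead figure is a rough guess):

    total_params = 52e9
    q4_weights_gb = total_params * 0.5 / 1e9   # ~26 GB at ~4 bits/weight
    overhead_gb = 6                             # rough guess: activations, KV/SSM state, buffers
    print(q4_weights_gb + overhead_gb, "GB needed vs", 2 * 24, "GB of VRAM on 2x3090")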
I'm glad we're seeing exploration into scaling post-transformer LLM architectures, but I'm disappointed that it has a context window. That was kind of the selling point of Mamba (and SSM models in general), right? Linear scaling, because state + input = next state + output?
I'm not sure I follow fully; it is also the case for (handwaves) "traditional" LLMs that state + input = next state + output. It's just that the output grows, so as output becomes input, eventually state + input (equivalently, next state + output) exceeds the context size.
Re: linear scaling, that means the runtime cost is O(n) in context size, rather than the traditional transformer's O(n^2).
I think kelseyfrog meant that the state of a Mamba model is supposed to "remember" stuff even when it no longer has the actual tokens to reference. It might not be guaranteed to hang on to information about tokens from long ago, but at least in theory it's possible, whereas tokens from before the context window in a traditional LLM may as well never have existed.
I'm not following. State is a multi-dimensional vector and context is a list of tokens. State is perturbed by A and Bx(t), while context is appended to by sampling the predicted token distribution.
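For reference, the discretized linear SSM recurrence those symbols come from (h_t is the fixed-size state, x_t the current input, and A-bar/B-bar the discretized state matrices):

    h_t = \bar{A}\, h_{t-1} + \bar{B}\, x_t, \qquad y_t = C\, h_t

The context in a transformer, by contrast, is the growing list of tokens itself, which is exactly the distinction being drawn above.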
Jamba looks fabulous. Good performance for its size and much more efficient than the available open alternatives.
The key idea: one out of every eight transformer blocks in Jamba applies dot-product attention with quadratic cost, but the other seven out of eight apply a Mamba layer with linear cost. And the entire model is a mixture of experts (MoE), so only ~12B parameters are used at once for inference.
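A schematic of that layer pattern (the module names, layer count, and MoE placement below are a sketch based on the description, not AI21's actual code):

    # Schematic of the layer pattern described above; names/counts are illustrative.
    ATTN_EVERY = 8      # one attention mixer out of every eight layers
    MOE_EVERY = 2       # MoE replaces the dense MLP in every other layer

    def build_schedule(n_layers=32):
        schedule = []
        for i in range(n_layers):
            mixer = "attention" if i % ATTN_EVERY == 0 else "mamba"
            ffn = "moe (top-2 of 16 experts)" if i % MOE_EVERY == 1 else "dense mlp"
            schedule.append((mixer, ffn))
        return schedule

    for layer in build_schedule()[:8]:
        print(layer)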
Thank you to the folks at AI21 for making Jamba available!
Mamba came out of the same research group, Hazy Research, led by Chris Ré. This new "Jamba" model incorporating Mamba and dot-product attention layers has ~8x more parameters than the largest open Striped Hyena, and appears to work much better.
AGPLv3 is a fine license too. But most of the models nowadays come with bullshit licenses, like Llama 2 with its "acceptable use policy" enforced by the license: https://ai.meta.com/llama/use-policy/
This model should have much lower computational cost since only one out of eight layers is a traditional transformer layer with masked self-attention. Additionally, half of the Mamba layers are MoEs.
There was another one on the same thing, probably better https://news.ycombinator.com/item?id=39482428 (https://jackcook.com/2024/02/23/mamba.html)