To those curious about the tradeoffs between transformer and state space model layers, I highly recommend Sasha Rush's video on it: https://www.youtube.com/watch?v=dKJEpOtVgXc
They use less memory for inference but remember details less well. For instance, if you're implementing code and ask for edits, the model will forget that various functions are part of the script. Even transformers aren't perfect at this, and SSMs are worse. For many use cases that ability isn't needed as much, so the memory savings are the bigger lever.
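To put rough numbers on that tradeoff, here's a back-of-envelope sketch; all layer counts and dimensions below are made-up, illustrative values, not any particular model's specs:

    # Back-of-envelope comparison; the shapes here are illustrative assumptions,
    # not the specs of any particular model.
    def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
        # A transformer keeps a key and a value vector per token, per layer.
        return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

    def ssm_state_bytes(n_layers, d_inner, d_state, bytes_per_elem=2):
        # An SSM layer carries a fixed-size state no matter how many tokens it has seen.
        return n_layers * d_inner * d_state * bytes_per_elem

    print(kv_cache_bytes(32, 8, 128, 128_000) / 1e9)  # grows with context: ~16.8 GB at 128K tokens
    print(ssm_state_bytes(32, 8192, 16) / 1e9)        # constant: ~0.008 GB

The flip side, as you say, is that the fixed-size state has to be lossy about what it keeps.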
Has anyone gotten this to work on Linux using one or two 4090s? I get stuck on "Loading checkpoint shards: 71%" and then it bails. But weirdly, nvidia-smi shows plenty of VRAM available. My machine has 256 GB of RAM, so I don't think that's the problem either. Really excited to try this one.
It's great to see a full production-level model using Mamba. But when it comes to long-context benchmarks, I'd love to see performance as well as throughput. I was under the impression that Mamba has huge increases in throughput at the cost of modest losses in accuracy when using long contexts.
I would too -- long context has been such a red herring across providers, Claude 3 is the first I've seen that seems to genuinely have some sort of qualitative leap in noticing things.
It is worth noting I'm fairly sure there's no inherent theoretical decrease in accuracy at long contexts; the claimed theoretical change is an _increase_ in long-range accuracy at long contexts.
Every long context sucks right now. All the model providers benchmark on fact recall which is very limited. Actual ability to do anything complicated beyond 16k tokens is not present in any current model I have seen.
This is not current. GPT-4-Turbo (128k) has lossless recall to the first 64k input tokens and produces output indistinguishable from GPT-4 (32k), though both are limited to 4k output tokens.
Several downsides: Recall accuracy past the first 64k tokens suffers badly; Cost is astronomical; Response latency is too high for most interactive use-cases.
I would point out the astounding leap in input context in just one year. Should we assume effectively-infinite (RAG-free) context in the near-future?
This is grossly untrue in a way that denotes surface-level familiarity on several fronts
You're referring to the needle-in-a-haystack retrieval problem.
Which the person you're replying to explicitly mentioned is the only benchmark providers are using, for good reason.
Consider the "translate Moby Dick to comedic zoomer" problem. This does not even come remotely close to working unless I do it in maximum chunks of 5,000 tokens.
Consider the API output limit of 4096 tokens, across all providers.
And no, you shouldn't assume effectively infinite (RAG free) context in the near future. This time last year, Anthropic was demonstrating 120,000 token context. It released 200K a few weeks ago. And runtime cost scales with N^2.
It’s pretty good at blending the text chunks, though, up to a point. It’s like compression: after a while of passing in chunks, your running summary gets too generalized and you lose resolution.
Long context is great and all, but it sucks that all of these LLMs have really poor output length. If I feed something an entire book and ask for a comprehensive summary, then I'm expecting at least a full 3-page summary. I get that they try to force these things to be "concise" to save on compute, but good lord it's so annoying.
Have you tried asking it for a specific concrete length, like a number of words? I was also frustrated with concise answers when asking for long ones, but I found that the outputs improved significantly if I asked for e.g. 4000 words specifically. Further than that, have it break it down into sections and write X words per section.
Yes, all the possible length-extending custom instructions you can think of, plus multi-shot example prompts using multiple USER and GPT exchanges to define the format. I can get some reasonable-length responses out of it, but I've never seen them go over a page. GPT-4 seems to have a hard limit on how much it will output when you click "continue", and Claude Opus never goes over a page either. Another user pointed out using the API, which I have done in the past, but it's been a long while and I can't really justify the cost of using the advanced models via API for my general use.
Everyone's coalescing at a max of 4096 output tokens, about 12 "pages", via API (a page being 250 words, i.e. one double-spaced 8.5"x11" page).
To your point, it doesn't matter anyway: it's nigh impossible to get over 2K tokens of output with every trick and bit of guidance you can think of. (I got desperate when the 16K-output / 48-page models came out and tried to "make it work"; even completely deforming tricks, like making it number each line and write a reminder on every line that it should write 1,000 lines, don't work.)
I wouldn't say that; my latest big user story for making sure I'm handling huge inputs was "translate Moby Dick to zoomer". I can't give any service chunks larger than ~5K tokens, over API, without it failing.
(Miserably, too; I'd be fine if it gave a paragraph back. But at least on this "map" task, there's a critical point where there's so much input that the reward function ends up imitating the input instead of chatting.)
This one should have you covered :-) one out of every eight layers is a traditional Transformer layer, which should ensure precision, at least over short distances.
I mean "short" in comparison to the unlimited but lossy recall that the Mamba blocks provide. Transformers are limited to the context length, while Mamba can carry state along. It can remember things from much farther back, but its state is finite, so it must eventually drop things and/or lose precision.
> Jamba boasts an extensive context window of 256K tokens, equivalent to around 210 pages of text, while fitting up to 140K tokens on a single 80GB GPU.
I realize this is a big improvement, but it's striking how inefficient LLMs are: you need 80 GB of GPU memory to analyze less than 1 megabyte of data. That's a lot of bloat! Hopefully there's a lot of room for algorithmic improvements.
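For what it's worth, the bulk of that 80 GB is the weights, not the text. A rough accounting, assuming the reported ~52B total parameters and guessing at the attention layer count and KV shapes:

    # Rough accounting (assumed: ~52B total params; attention layer count and
    # KV head count/dim are guesses for illustration).
    weights_gb_fp16 = 52e9 * 2 / 1e9                     # ~104 GB in fp16, ~52 GB at 8-bit
    # Only a fraction of the layers are attention layers, so the 140K-token KV cache is small:
    kv_cache_gb = 2 * 4 * 8 * 128 * 140_000 * 2 / 1e9    # ~2.3 GB for 4 attention layers
    print(weights_gb_fp16, kv_cache_gb)

So the "bloat" is almost entirely parameters, and it barely grows as you add more text.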
It’s kinda simulating our brains, but not really. When I attempted to dig more into how neurons work, I realised it’s a massive chasm of difference. Very much worth doing if you haven’t (you might know far better than me; this is for people who don’t yet).
In terms of results:
Our brains work with 20 W of power and can be trained to compete with LLMs using a tiny fraction of the world’s data. They also have to keep you breathing and your blood pumping, and manage all the dangers of catching a ball near traffic. Or skiing, or poetry, or sunsets. And they remember stuff five minutes later and don’t need a training run that takes months.
We have SO many opportunities to improve the AI architecture it’s ridiculous. This is a good thing.
To be fair most of the brain is more like a pretrained model — it isn't being trained at any point after conception to keep your blood pumping or your lungs working, it does that out of the box roughly as soon as you sprout those organs (or the minute you're born, in the case of lungs). The training process was billions of years of evolution. And, well, given fairly persistent cross-cultural cognitive biases, I expect the conscious thought parts are starting from a pretrained model, too, and all we're doing in school is finetuning ;)
People don't understand that to simulate a single neuron, you need an entire neural network. So 70 billion parameters might at best be equivalent to a million neurons, and that is assuming your neural network architecture is akin to the connections between neurons. Considering the physical sparsity, you might need even more parameters to model the connections of a biological neural network, so fewer than a million neurons in practice.
The big (huge?) memory requirement is during training. These LLMs work with high-dimensional vectors; they calculate gradients with respect to those vectors, and they do updates that require the optimizer's state. If you have 3 particles in 3 dimensions and you need their forces, that creates 3 new 3D vectors, and once you update their positions along the forces they also carry momenta. Now generalize that simple 3-body physics to the typical 60-layer creatures inside an LLM, with vectors of several thousand dimensions, interactions/weights that scale like the squares of those vectors, and a total parameter count in the tens to hundreds of billions, then take derivatives and start keeping track of momenta. It is a feat of modern engineering that some groups can train such models efficiently. I hope we will see more of the training stories become public in the near future.
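A minimal sketch of that accounting for the parameter-related state alone (the usual mixed-precision + Adam recipe; activation memory comes on top and depends on batch and sequence size):

    # Parameter-related training state with Adam in mixed precision:
    # fp16 params + fp16 grads + fp32 master copy + two fp32 Adam moments.
    def training_state_gb(n_params):
        return (2 + 2 + 4 + 4 + 4) * n_params / 1e9   # bytes per parameter, summed

    print(training_state_gb(52e9))   # ~832 GB before any activations, hence sharding across many GPUs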
Not sure what you mean by wrong. I have never yet encountered a case where training an LLM (no matter what the architecture) had modest memory requirements; I was pointing out that the typical memory requirements for training are much higher still than the typical requirements for inference.
1. How many tokens can 'traditional' models (e.g. Mistral's 8x7B) fit on a single 80GB GPU?
2. How does quantization affect the single transformer layer in the stack? What are the performance/accuracy trade-offs that happen when so little of the stack depends on this bottleneck?
Mixtral 8x7b runs well (i.e., produces the correct output faster than I can read it) on a modern AMD or Intel laptop without any use of a GPU - provided that you have enough RAM and CPU cores. 32 GB of RAM and 16 hyperthreads are enough with 4-bit quantization if you don't ask too much in terms of context.
P.S. Dell Inspiron 7415 upgraded to 64 GB of RAM here.
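If anyone wants to reproduce this, a minimal sketch using the llama-cpp-python bindings; the GGUF filename is a placeholder for whichever 4-bit quant you downloaded:

    # Minimal sketch with llama-cpp-python (pip install llama-cpp-python).
    # The model_path is a placeholder, not a specific recommended file.
    from llama_cpp import Llama

    llm = Llama(
        model_path="mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf",
        n_ctx=4096,      # keep the context modest to stay within 32 GB of RAM
        n_threads=16,    # one per hyperthread
    )

    out = llm("Q: Summarize the Mamba architecture in two sentences. A:", max_tokens=128)
    print(out["choices"][0]["text"])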
Good! DNNs unlock semantics (parsing, transforming, producing). That's the basis of general intelligence, not encyclopedic random string recall. Models shouldn't burn ungodly quantities of compute emulating DDR5 with their working memory. We need machines that think better, not memorize well. We already have plenty of those.
Massive context windows, and their needle tests, are misguided. We won't reach human-level AGI by basically inventing a natural language RDBMS. Our resources should primarily target better reasoning systems for our models, reinforcement learning, etc.
If we can build a GPT4-level problem solving system that coincidentally also can't remember telephone numbers, I'll consider it major progress.
Memorization usually refers to training data. It's often useful to have something that can utilize instructions losslessly, which is the distinction between these models.
What if your field of vision were infinite and you were looking at an unrolled telephone book?
Would you need a device to remember the phone number? You wouldn't. You would need a method or algorithm to find the number, but there is no reason why that algorithm couldn't be part of the attention mechanism. The attention mechanism is akin to reading the entire phone book for every word you are about to say. It would be unreasonable to expect you to not find the right phone number eventually.
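That's essentially what scaled dot-product attention does. A toy version (numpy, toy sizes) to make the "read the whole phone book for every word" point concrete:

    import numpy as np

    # Toy scaled dot-product attention: every query scores every key,
    # which is why the cost grows with (sequence length)^2.
    def attention(Q, K, V):
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)                  # (n_queries, n_keys): every entry gets "read"
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V                             # weighted lookup over the whole "phone book"

    n, d = 8, 16                                       # 8 tokens, 16-dim heads (toy sizes)
    Q = K = V = np.random.randn(n, d)
    print(attention(Q, K, V).shape)                    # (8, 16)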
I’m pretty sure computational chemists have been combining NNs with Kalman filters for a while now… I recall the issue was that it was slow due to the N^2 size of the covariance matrix.
Jamba-v0.1-hybrid-MoE (16x6B?) is like giving a big NOS boost to a Mixtral-8x7B-tier LLM. If it really delivers 256K context, 3x longer, faster and cheaper than anything else, it should mean an end to the One Model To Rule Them All mindset for now. The big boys will have to offer some version of it as a separate but closely integrated sidekick to their hero offering.
On a side note: working over longer contexts also reminds me of MemGPT (https://github.com/cpacker/MemGPT)
I think a similar concept can be applied to Mamba architecture models too.
Does this mean that I can continue a chat without needing to send a full transcript? This feels like it could make inference a lot cheaper for multi-step dialogs.
Mamba is supported in llama.cpp, so it should be. (Edit: apparently this isn't strictly the Mamba architecture; it's a mix of Mamba and transformer layers, so it looks like it would have to be ported to llama.cpp.)
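Conceptually, yes: with a recurrent backbone you'd only need to checkpoint the state between turns rather than resend the transcript. A toy, self-contained sketch (the "model" here is an invented linear recurrence, purely for illustration; as far as I know, no serving API exposes this today):

    import numpy as np

    # Toy recurrent "model": the point is only that resuming a dialog needs the
    # saved state vector, not the full transcript. All weights/shapes are made up.
    rng = np.random.default_rng(0)
    d_state, vocab = 32, 256
    A = rng.standard_normal((d_state, d_state)) * 0.05
    B = rng.standard_normal((d_state, vocab)) * 0.05

    def step(token_id, state):
        x = np.zeros(vocab)
        x[token_id] = 1.0
        return A @ state + B @ x          # next state (a real model would also emit logits)

    state = np.zeros(d_state)
    for t in [1, 2, 3]:                    # first user turn
        state = step(t, state)

    saved = state.copy()                   # persist this between requests instead of the transcript

    state = saved                          # later: resume without resending tokens 1..3
    for t in [4, 5]:                       # next user turn
        state = step(t, state)
    print(state.shape)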
Would a 192 GB RAM Mac Studio, or even a 7950X with 192 GB of RAM, be practical for running this model for inference and possibly fine-tuning? Especially since I don't need very low latency, e.g. 1 token per second is fine for inference. I also have two 3090s.
You could run PyTorch on CPU, and with a ~12B-parameter activation pass it might even run relatively fast (8 tok/s?), but a q4 quant would also easily fit on 2x3090s and should run at >60 tok/s.
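Quick back-of-envelope for the 2x3090 claim (assuming ~52B total parameters; the overhead figure is a rough guess):

    total_params = 52e9
    q4_weights_gb = total_params * 0.5 / 1e9   # ~26 GB at ~4 bits/weight
    overhead_gb = 6                             # rough guess: activations, KV/SSM state, buffers
    print(q4_weights_gb + overhead_gb, "GB needed vs", 2 * 24, "GB of VRAM on 2x3090")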
I'm glad we're seeing exploration into scaling post-transformer LLM architectures, but I'm disappointed that it has a context window. That was kind of the selling point of Mamba (and SSM models in general), right? Linear scaling, because state + input = next state + output?
I'm not sure I follow fully; it is also the case for (handwaves) "traditional" LLMs that state + input = next state + output. It's just that the output grows, so as output becomes input, eventually state + input (equivalently, next state + output) exceeds the context size.
Re: linear scaling, that means the runtime cost is O(n) in context size, rather than the traditional transformer's O(n^2).
I think kelseyfrog meant that the state of a Mamba model is supposed to "remember" stuff even when it no longer has the actual tokens to reference. It might not be guaranteed to hang on to information about tokens from long ago, but at least in theory it's possible, whereas tokens from before the context window in a traditional LLM may as well never have existed.
I'm not following. State is a multi-dimensional vector and context is a list of tokens. State is perturbed by A and Bx(t), while context is appended to by sampling the predicted token distribution.
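For reference, the discretized linear SSM recurrence those symbols come from (h_t is the fixed-size state, x_t the current input, and A-bar/B-bar the discretized state matrices):

    h_t = \bar{A}\, h_{t-1} + \bar{B}\, x_t, \qquad y_t = C\, h_t

The context in a transformer, by contrast, is the growing list of tokens itself, which is exactly the distinction being drawn above.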
Jamba looks fabulous. Good performance for its size and much more efficient than the available open alternatives.
The key idea: one out of every eight transformer blocks in Jamba applies dot-product attention with quadratic cost, but the other seven out of eight apply a Mamba layer with linear cost. And the entire model is a mixture of experts (MoE), so only ~12B parameters are used at once for inference.
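A schematic of that layer pattern (the module names, layer count, and MoE placement below are a sketch based on the description, not AI21's actual code):

    # Schematic of the layer pattern described above; names/counts are illustrative.
    ATTN_EVERY = 8      # one attention mixer out of every eight layers
    MOE_EVERY = 2       # MoE replaces the dense MLP in every other layer

    def build_schedule(n_layers=32):
        schedule = []
        for i in range(n_layers):
            mixer = "attention" if i % ATTN_EVERY == 0 else "mamba"
            ffn = "moe (top-2 of 16 experts)" if i % MOE_EVERY == 1 else "dense mlp"
            schedule.append((mixer, ffn))
        return schedule

    for layer in build_schedule()[:8]:
        print(layer)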
Thank you to the folks at AI21 for making Jamba available!
Mamba came out of the same research group, Hazy Research, led by Chris Ré. This new "Jamba" model incorporating Mamba and dot-product attention layers has ~8x more parameters than the largest open Striped Hyena, and appears to work much better.
AGPLv3 is a fine license too. But most of the models nowadays come with bullshit licenses, like Llama 2 with its "acceptable use policy" enforced by the license: https://ai.meta.com/llama/use-policy/
This model should have much lower computational cost since only one out of eight layers is a traditional transformer layer with masked self-attention. Additionally, half of the Mamba layers are MoEs.
There was another one on the same thing, probably better https://news.ycombinator.com/item?id=39482428 (https://jackcook.com/2024/02/23/mamba.html)