For those hearing about Cerebras for the first time: they make a chipset that's similar to a GPU in matrix multiplication speed but way bigger (a whole wafer), so it can fit more transistors and memory onto one chip. They achieve this small LOC count because they don't need to shard across multiple devices, consolidate backprop on a central CPU, etc. Those tricks are usually what blow up a project from a single-architecture proof of concept into a robust training pipeline that can handle the billions of parameters in modern models. This is more akin to training a whole model on a single GPU because... it kind of is.
Even with a wafer scale chipset this approach has limits. You eventually will still need to shard to fit more parameters / use different training modalities / etc. I'd look at this more as a proof of concept for the ergonomics of what LLM training can look like when you have access to a much larger compute primitive versus a new state of the art in feature-equivalent clean code.
This does however feel a bit like that "big data" phenomenon, where most companies deploy ridiculously overcomplicated distributed data clusters when their actual problems could be handled much more simply, cheaply, and efficiently by a single server with a lot of RAM and a solution somewhere on the spectrum between "a bunch of UNIX pipes with standard UNIX text processing tools" and "tuned PostgreSQL" / "tuned in-memory SQLite".
That is: a lot of distributed big data processing tasks don't need to be distributed. Perhaps with beefy enough matrix multiplication chips, a lot of "big ML" tasks won't need to be distributed either.
I'm a big believer in the approach you're laying out too. Bugs are much easier to diagnose, crash reporting is more straightforward, and you don't need augmented services to consolidate everything at the output layer. That said - I've been predicting a shift back to simple architectures for a while now and it hasn't really come to pass. Maybe there's too much pressure or too many financial incentives to solve increasing complexity with yet more complexity?
At the end of the day I do believe ergonomics are going to win out. I think that's in large part why PyTorch won out over TensorFlow and JAX; it provided the just-in-time computation that let people find bugs and visualize results more easily, without having to `compile()` everything down to a static computation graph. Hardware seems like a natural place for that abstraction layer - but maybe the silver bullet will really be on the software side, since we already have so many "low-RAM"-equivalent ML devices in the wild. Cheaper to string things together after the fact vs. shipping net new hardware.
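To make the ergonomics point concrete, here's a toy sketch of the eager, define-by-run style that made debugging so much nicer (nothing Cerebras-specific, just plain PyTorch):

```python
import torch

# Eager "define-by-run": intermediates are ordinary tensors you can print,
# plot, or drop a breakpoint on; no static graph compilation step required.
x = torch.randn(4, 8, requires_grad=True)
w = torch.randn(8, 3, requires_grad=True)
h = torch.relu(x @ w)
print(h.shape, h.mean().item())   # inspect mid-computation
h.sum().backward()                # autograd runs on whatever actually executed
print(w.grad.norm().item())
```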
The other thing to keep in mind is that Transformers may well end up being supplanted by more efficient alternatives with different hardware requirements.
For instance, right now there's a new crop of "linear RNNs" (RWKV, Mamba, retnet, etc.) claiming to be as good as Transformers for language modeling but with two advantages: their compute cost is O(n) instead of O(n²), and they don't need to keep past context in memory.
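To sketch what that means in practice (a toy linear recurrence, not RWKV or Mamba specifically): the model carries a fixed-size state that gets updated once per token, so compute grows linearly with sequence length and there's no KV cache of past tokens to hold in memory.

```python
import torch

d = 16
decay = torch.rand(d)        # hypothetical per-channel decay parameter
state = torch.zeros(d)       # fixed-size recurrent state (no growing KV cache)

def step(state, x_t):
    # h_t = decay * h_{t-1} + x_t : one O(d) update per token -> O(n*d) overall
    return decay * state + x_t

for x_t in torch.randn(128, d):  # a 128-token sequence of toy embeddings
    state = step(state, x_t)
```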
I don't know if these linear RNNs will actually supplant Transformers, but I do think hardware requirements are likely to change over time.
We don't have overcomplicated distributed training infra. It's pretty much as complicated as it needs to be.
Scaling vertically (having a bigger chip) is very hard. There are tons of tradeoffs when making the chip, and it's overall an insanely complex problem. That's why Cerebras is an 8-year-old company and yet you would still be hard-pressed to find anyone using them.
And even if you give me a Cerebras chip that works perfectly, it will still be much easier for me to buy two of those chips and link them together in distributed training mode, than it will be for Cerebras to build a chip that is 2x the size.
The scale of the current generation of clusters used to train models the size of GPT-4 is in the range of 25,000+ GPUs with 80GB of memory each, so no matter your chip size, complicated distributed infra is a necessity. Even assuming everything on Cerebras' marketing page is fully accurate, you would still need to distribute the training over 500+ of those massive chips to replicate a 25k GPU cluster.
I'd think that such a researcher would've already heard of adages like "you can't get nine women to give birth in one month", or "where there are six cooks, there's nothing to eat".
Or more directly, perhaps one should ask such a researcher, "if your team were to double in head count, would you do this project twice as fast?".
8 GPUs do a pretty bang-up job of doing 8 months of compute in a month :).
I think the broader point is that the last few decades of ML research show that more compute is better, both for iteration speed and for resulting performance. Distribution is just the natural outgrowth of that imperative once it reaches the limit of a single device/node. If Cerebras can address models at today's scale on one device, the immediate next step is "what can N of these devices do together to build models at tomorrow's scale?"
I still think work on improving single-core/device performance is worthwhile, as distribution will always be at best no better, and almost always strictly worse, due to coordination costs reducing efficiency. If two Cerebras chips can be glued together and achieve roughly 2x their performance, that's still more efficient than getting equivalent performance out of many more regular GPUs. Getting the hardware fast enough that you need just one device for your problem - that's a special case that will yield an extra win.
Which dot products did you do? I'll do the next one. Oh, that was John's, but he's away on vacation today. Let me take care of John's, and tomorrow we'll have a quick meeting to see which matrix he takes next. Sounds a lot like a bus :)
Agreed and thank you for also posting. I haven't incurred any cloud costs and my home server upgrades have been done for cheap, if you really look. I laughed because my "pipe" is as you put it, "Unix pipes with standard text processing". Hey, it works.
In fairness, it's more a case of companies incorrectly identifying their problem as a "big data" problem, usually due to ego or resume padding. If you genuinely have a "big data" problem, you probably do need distributed data clusters.
Are these chipsets anywhere in the price range that would make them feasible for consumer/pro-sumer use? I wasn't able to find anything related to pricing without contacting sales. With the numbers being thrown about, this would appear to be "enterprise-only".
For transformers, especially multi-device training pipelines, yes, the codebase can normally be 100k SLOC and require a team to maintain at industry scale. See e.g. Hugging Face Transformers, or the Megatron implementation they cite.
Cerebras is trying to show how easy it is to onboard onto a single IC and to demo their PyTorch integration.
But yeah, where's the wall-clock time comparison?! Surely they did one during development, and surely the sales team knows (or they will now that the article is published), yet there's not even a hint of what their throughput is like. For Cerebras to be this far along and not be plastering benchmarks everywhere is a bad sign. Maybe they're going to just die off like Graphcore.
Well thankfully my day job is actually ML research. :)
Lines of code is not strictly speaking a bottleneck for the next generation of models, but it ties into other objectives that are: researcher productivity, hardware efficiency, and model verifiability. GPT-5 might be another case of simply scaling up the existing transformer models, but the next step-function change in model quality is going to involve a lot more R&D about the right architectural primitives to choose before scaling up. And in that case lines of code do matter - because they allow simple concepts to be robustly tested, optimized, and iterated against. Doing that against a 50k-line monolith is a much harder task.
I don't understand why they're comparing the parameter sizes to lines of code.
AFAIK you can just increase the layer parameters of a 1B model to whatever you want? Like, the difference between a 1B and 175B model can be just changing a few numbers, and not adding any LOC at all?
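For instance, in a nanoGPT-style config the parameter count comes from a handful of hyperparameters, so scaling up really is editing a few numbers (the 1B values below are illustrative; the 175B ones are the published GPT-3 shape):

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    n_layer: int
    n_head: int
    n_embd: int
    block_size: int = 2048
    vocab_size: int = 50257

# roughly 1B-class model (illustrative numbers)
small = GPTConfig(n_layer=24, n_head=16, n_embd=2048)
# GPT-3 175B shape: 96 layers, 96 heads, 12288-dim embeddings
huge = GPTConfig(n_layer=96, n_head=96, n_embd=12288)
```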
LOC has never been a limitation for large models, it's been the compute+training data required.
Most of the LOC is spent on optimization, and they don't address MoE or anything fancy like that?
When you go from 1B to 175B, the model no longer fits in memory, so in practice you have to refactor the model using tensor/pipeline parallelism. That's why it goes from 600 to 20K LOC.
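To make it concrete, here's a rough sketch of where those extra lines come from: once a weight matrix doesn't fit on one device, even a plain linear layer has to be sharded and re-stitched (toy column-parallel version, using CPU "devices" as stand-ins for real GPUs/nodes):

```python
import torch
import torch.nn as nn

class ColumnParallelLinear(nn.Module):
    """Split the output dimension across devices; gather the results on forward."""
    def __init__(self, in_features, out_features, devices=("cpu", "cpu")):
        super().__init__()
        assert out_features % len(devices) == 0
        shard = out_features // len(devices)
        self.shards = nn.ModuleList(
            nn.Linear(in_features, shard).to(d) for d in devices
        )

    def forward(self, x):
        # each shard computes its slice of the output, then we concatenate;
        # in a real system this is an all-gather plus careful comms overlap
        outs = [lin(x.to(lin.weight.device)) for lin in self.shards]
        return torch.cat([o.to(outs[0].device) for o in outs], dim=-1)

y = ColumnParallelLinear(512, 1024)(torch.randn(4, 512))  # shape (4, 1024)
```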
It doesn't look like Cerebras mentioned the most important part: because a vastly more capable system trades away all that distributed-training complexity, they could refactor that 600-line model effortlessly and rerun.
They can watch different layers train and find out how to optimize training or quantization, etc.
It feels like they kinda missed the forest for the trees here. The article should have focused on model architecture optimization due to the small LoC and the system having ridiculous training capacity.
Distributed training infra/libs have made insane progress since the Megatron era.
I worked with the Megatron codebase to train larger-than-175B models a few years back; a lot of the boilerplate you find in those 20k LoC could be removed today by just importing DeepSpeed or other distributed training libs.
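To give a flavor, a minimal sketch of that style (config keys and values below are placeholders, and this is meant to run under the `deepspeed` launcher): most of the old hand-rolled boilerplate becomes a config dict plus one initialize call.

```python
import torch
import deepspeed

# run with: deepspeed train.py
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "zero_optimization": {"stage": 2},   # optimizer/gradient state sharding
    "fp16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
}

model = torch.nn.Linear(1024, 1024)  # stand-in for your actual transformer
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

# inside the training loop, the engine handles loss scaling, ZeRO sharding, etc.:
#   loss = compute_loss(engine(batch))
#   engine.backward(loss)
#   engine.step()
```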
Cerebras' point still stands though: even if you can get the LoC count down significantly nowadays, it's still a major PITA to debug those systems, deal with nodes crashing, tweak the architecture and the data-loading pipeline for high GPU utilization, optimize network bottlenecks, etc.
Scaling vertically first like Cerebras is doing surely makes that much easier.
On a tangentially related note, this is IMHO where OpenAI has built its moat: the training and inference stack they have refined over the last 6 years. They have good researchers, but so do MS, Google, and Meta. No one else, however, has the ability to train such large models with such ease. Same for the inference stack: being able to run GPT-3.5/4 in prod at the scale at which they are doing it is no joke, and I'm 100% convinced this is why Gemini is still not widely available a year after 3.5 came out.
Everyone knows Cerebras by their wafer-scale chips. The less understood part is the 12TB of external memory. That's the real reason why large models fit by default and you don't have to chop them up in software a la Megatron/DeepSpeed.
Strange that they don't mention the performance: how long does it take to do one step, and how does it compare to a similarly priced GPU cluster? Sure, simple code is good, but it needs to also be useful.
Would it ever be possible to run a GPT-{n} (n > 3) class model on a home computer without a GPU? I have a "good" laptop with 32GB of RAM and a good processor, but no GPU (I was never interested in gaming, crypto or ML), but I found GPT very useful and I'd prefer to run a local version instead of keep feeding OpenAI.
OpenHermes-2.5-Mistral-7B is better than GPT-3 (and scores even better than GPT-3.5-Turbo in human evaluations) and can even run on a Raspberry Pi or in the browser. On a laptop CPU it uses about 5GB of RAM (at 5-bit quantization) and runs around 20-30 tokens per second, which is very fast.
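The ~5GB figure checks out on a napkin, assuming roughly 5.5 effective bits per weight for a Q5_K_M-style quantization (numbers below are approximate):

```python
params = 7.2e9            # Mistral-7B parameter count, roughly
bits_per_weight = 5.5     # approximate effective rate of 5-bit k-quants
print(params * bits_per_weight / 8 / 1e9)  # ~5.0 GB, before KV cache/overhead
```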
I recommend downloading and running OpenHermes inside LM Studio. https://lmstudio.ai/
In LM Studio, search for OpenHermes. Pick the Q5_K_M version (this is the best quality/speed trade off). Then go to the chat tab.
On the chat tab, set the context length to 4096 (or up to 16k if you want longer context) and set the number of CPU cores you have under "Hardware Settings."
Select the model from the drop down and start chatting!
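If you'd rather script it than click through a GUI, the same quantized GGUF file runs with llama-cpp-python (the filename below is a placeholder; point it at whatever you downloaded):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./openhermes-2.5-mistral-7b.Q5_K_M.gguf",  # placeholder path
    n_ctx=4096,      # context length, same as the LM Studio setting
    n_threads=8,     # set to your physical core count
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize what a wafer-scale chip is."}]
)
print(out["choices"][0]["message"]["content"])
```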
Most laptops these days have a pretty sizable GPU on the same chip. IIRC Triton makes proper use of the Intel graphics, while AMD's equivalents work well with OpenCL out of the box. Apple's M1-M3 architecture saw some major speedups on llama.cpp etc. as well. Worth noting is that some may need special drivers: my Xeons from 2010 have support for executing OpenCL but needed extra drivers; no comment on modern processors.
Looks like hardware vs. software abstraction. Considering the perspective of an LLM startup: would you rather write 20k LOC of complex code that lets you more easily switch hardware platforms, or write 600 LOC of less complex code and be pinned to a single provider?
What I'm most interested in with abstraction is how easy it is to change something that doesn't fit neatly inside the abstraction framework. It looks like the model itself is pretty flexible since it's just plain PyTorch, but I couldn't immediately tell about other aspects of the training - for example, they have their own optimizer; what if I want to change something?
There are lots of "just one line of python" type frameworks that are fine if you want to do the one thing in the demo but are more complicated than just writing it yourself if you have to change something.
We do have reference optimizers implemented for use on our system and available in the `cerebras_pytorch` package, but this isn't because those are the only ones supported; rather, no vanilla PyTorch optimizer is currently `torch.compile()` compatible. The main difference is that we pre-initialize the optimizer state instead of doing it lazily in the first `step()`.
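Roughly, the lazy-vs-eager difference looks like this (toy illustration only, not the actual `cerebras_pytorch` internals):

```python
import torch

model = torch.nn.Linear(8, 8)
opt = torch.optim.Adam(model.parameters())
print(len(opt.state))  # 0 -- vanilla Adam creates exp_avg/exp_avg_sq lazily in step()

# Eager alternative: materialize the state up front so the traced/compiled
# graph has a fixed structure from the very first iteration.
for group in opt.param_groups:
    for p in group["params"]:
        opt.state[p] = {
            "step": torch.tensor(0.0),
            "exp_avg": torch.zeros_like(p),
            "exp_avg_sq": torch.zeros_like(p),
        }
print(len(opt.state))  # 2 -- state for weight and bias exists before any step()
```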
Stay far away from HuggingFace if you can. It's only battle-hardened if you stick to the absolute simplest, most boring stuff. Look at the number of open issues and skim through a few source code files and you'll understand.
Like LangChain, they were at the right place at the right time. That doesn’t make them good.
In the context of this blog post (reimplementing nanoGPT as-is without shenanigans, although training is a separate issue), transformers is straightforward enough. I do agree that going beyond the documentation demos makes things more complicated, but speaking from experience, IMO it's still less so than other implementations. AI is still sometimes necessarily complex.
It's certainly an order of magnitude easier to use something like transformers or diffusers than the original implementations provided by the original model trainers, and has a few good optimizations out of the box.
That's different from LangChain which is complex for the sake of being complex.
Good to hear that I'm not the only one thinking that. I read a lot of their code for one of my pet projects and thought that maybe it was me who didn't get it because I don't write that much Python.
We are talking about very big models whose training requires an enormous amount of hardware. Not sure how scalable Hugging Face Transformers is for training such models.
HF is okay for off the shelf prototypes and very quick, multi-hour hacks, and I greatly appreciate their contributions to the community, but unfortunately in my (and sadly, many other people's experiences), their code is a continuum of devilish nightmares for anything beyond that.
HF has been pretty good for training in my experience. Accelerate, DeepSpeed, and other improvements are available and the ecosystem is vibrant. One thing you might want to look into is writing your own training loop rather than using Autotrainer, but that's if you want to get your hands dirty and have finer control over how you call layers, your loss function, etc.
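For anyone curious what "write your own training loop" looks like, a minimal accelerate-based sketch (model and data below are stand-ins):

```python
import torch
from accelerate import Accelerator
from torch.utils.data import DataLoader, TensorDataset

accelerator = Accelerator()
model = torch.nn.Linear(128, 2)                       # stand-in for a real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loader = DataLoader(
    TensorDataset(torch.randn(256, 128), torch.randint(0, 2, (256,))),
    batch_size=32,
)

# prepare() moves everything to the right device(s) and wraps for distributed runs
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for x, y in loader:
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    accelerator.backward(loss)   # handles mixed precision / gradient sync
    optimizer.step()
```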
Inference is where it falls short really, and solutions like vLLM are much much faster.
Ignorant question: Why are we interested in training models much smaller than GPT-4? For academic reasons? I understand training in specific domains but isn’t that covered by fine tuning, with much less compute?
I'm curious if these low-code models matter. I understand that small codebases can be cached effectively, speeding up computations, but isn't data loading the bottleneck in training?
Furthermore, how important is the breadth of data in the dataset to getting the desired results? I was under the impression that the main reason these LLMs work is the massive data sets.
As such, are there data-breadth metrics to validate whether training on a given dataset is even worthwhile? (i.e., avoid sunk cost on a dataset that will yield a poorly performing LLM)
nanoGPT & micrograd are masterpieces of code. Truly god-level code.
Andrej Karpathy is truly a gem and I'm super grateful he still publishes videos showing his art.
Cerebras showing their distributed architecture on that same piece of code is impressive.
All of AI is a search for a god algorithm. An algorithm so simple it could be written on an A4 piece of paper in a 12px font, but with enough data and compute it can be more intelligent than entire cities of humans combined.
I would’ve been interested to learn how much it costs to train these models using their platform. Like, a 70b model - are we talking millions of dollars here?
I think the point is that LOC is not a terribly useful metric in that everything is 1 LOC at the highest level of abstraction. The business proposition here is that you don't need to write the LOCs for the underlying layers, they do it. The pitch here is that it's not as straightforward for large GPU clusters.
Sure, I get that. I've definitely seen demos of "Do X in Y LoC" that do X but offload all of the hard work of Y to some library. This is not that. This is intended to be a demo that shows you what you can do with one Cerebras module. And the result is that, by writing 565 LoC yourself, you can train and run an LLM the size of GPT-3.
In that sense, 565 LoC is a perfectly fair number. It doesn't count PyTorch, numpy, the Python interpreter, or any of the library modules that are imported, but I don't think anyone was touting it as anything more; for instance, Mo Gawdat has said that GPT-4 is probably ~4500 LoC. And, yes, that certainly involves much more infrastructure, and doing that dance of going from GPU to CPU to a completely other node, etc.
Just to add on, the project parallels nanoGPT, which itself touts its small LOC. It also uses PyTorch etc. In both cases, the actual model logic is in the quoted lines of code. So the comparison is apt for what it's recreating. (I don't know how fair the comparison to the other LOC figures mentioned is.)
Yes, transformers are very simple - but typically the additional lines of code are doing useful work. The comparison with NVIDIA Megatron is particularly ridiculous IMO.
I don't see the novelty/interesting bit in this article, personally.
Disclaimer: I'm a small investor in Cerebras.