
Can the inference piece be partitioned over multiple hosts?

Edit: partitioned or parallelized in a way that overcomes the network bottleneck

> prima.cpp is a distributed implementation of llama.cpp that lets you run 70B-level LLMs on your everyday devices: laptops, desktops, phones, and tablets (GPU or no GPU, it’s all good). With it, you can run QwQ-32B, Qwen 2.5-72B, Llama 3-70B, or DeepSeek R1 70B right from your local home cluster!

https://github.com/Lizonghang/prima.cpp
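
For intuition on why a layer-wise split can work over a home network, here's a rough back-of-envelope sketch in Python (my own numbers, not from the prima.cpp paper; I'm assuming Llama-3-70B dimensions, fp16 activations, and a hypothetical 4-device cluster):

    # Back-of-envelope: per-token network traffic when a 70B model is
    # split layer-wise (pipeline-style) across hosts. During decode,
    # only one hidden-state vector crosses each host boundary per token.
    HIDDEN_SIZE = 8192      # Llama-3-70B hidden dimension (assumed)
    BYTES_PER_ELEM = 2      # fp16 activations
    NUM_HOSTS = 4           # hypothetical home cluster of 4 devices
    BOUNDARIES = NUM_HOSTS - 1

    per_boundary = HIDDEN_SIZE * BYTES_PER_ELEM   # 16 KiB per token
    per_token = per_boundary * BOUNDARIES         # 48 KiB per token

    for tok_per_s in (5, 20, 100):
        mbit = per_token * tok_per_s * 8 / 1e6
        print(f"{tok_per_s:>4} tok/s -> {mbit:.1f} Mbit/s of activation traffic")

Even at 100 tok/s that's only about 40 Mbit/s, well within gigabit Ethernet or decent Wi-Fi, so for decode the bandwidth side is manageable; per-hop latency is the harder part.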


Pretty sure llama.cpp can already do that (it has an RPC backend that can split a model across machines).

I forgot to clarify that I meant dealing with the network bottleneck.

Just my two cents from experience: any sufficiently advanced LLM training or inference pipeline eventually discovers that the real bottleneck is the network!
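
To put rough numbers on that: how badly the network hurts depends on how you cut the model. A tensor-parallel split synchronizes inside every layer, while a pipeline split synchronizes only at host boundaries. A sketch, again assuming Llama-3-70B-ish dimensions (8192 hidden, 80 layers) and treating each all-reduce as moving roughly one hidden vector per token:

    # Rough per-token communication comparison for two ways to split a
    # 70B model across 2 hosts. All numbers are assumptions, not
    # measurements.
    HIDDEN = 8192    # hidden size (assumed)
    LAYERS = 80      # transformer layers (assumed)
    BYTES = 2        # fp16
    HOSTS = 2

    # Pipeline parallel: one activation vector per boundary per token.
    pp_bytes = HIDDEN * BYTES * (HOSTS - 1)
    pp_syncs = HOSTS - 1

    # Tensor parallel: ~2 all-reduces per layer (after attention and
    # after the MLP), each on the order of one hidden vector per token.
    tp_bytes = 2 * LAYERS * HIDDEN * BYTES
    tp_syncs = 2 * LAYERS

    print(f"pipeline: {pp_bytes / 1024:>6.0f} KiB/token, {pp_syncs:>3} sync points")
    print(f"tensor:   {tp_bytes / 1024:>6.0f} KiB/token, {tp_syncs:>3} sync points")

On a home network with ~1 ms round trips, 160 sync points per token cap you at roughly 6 tok/s before any compute happens, which is presumably why these home-cluster projects favor layer-wise splits.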


