
Can the inference piece be partitioned over multiple hosts?

Edit: partitioned or parallelized in a way that overcomes the network bottleneck

> prima.cpp is a distributed implementation of llama.cpp that lets you run 70B-level LLMs on your everyday devices: laptops, desktops, phones, and tablets (GPU or no GPU, it’s all good). With it, you can run QwQ-32B, Qwen 2.5-72B, Llama 3-70B, or DeepSeek R1 70B right from your local home cluster!

https://github.com/Lizonghang/prima.cpp
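
For intuition on why a layer-wise split can work over a home network, here's a rough back-of-envelope sketch in Python (my own numbers, not from the prima.cpp paper; I'm assuming Llama-3-70B dimensions, fp16 activations, and a hypothetical 4-device cluster):

    # Back-of-envelope: per-token network traffic when a 70B model is
    # split layer-wise (pipeline-style) across hosts. During decode,
    # only one hidden-state vector crosses each host boundary per token.
    HIDDEN_SIZE = 8192      # Llama-3-70B hidden dimension (assumed)
    BYTES_PER_ELEM = 2      # fp16 activations
    NUM_HOSTS = 4           # hypothetical home cluster of 4 devices
    BOUNDARIES = NUM_HOSTS - 1

    per_boundary = HIDDEN_SIZE * BYTES_PER_ELEM   # 16 KiB per token
    per_token = per_boundary * BOUNDARIES         # 48 KiB per token

    for tok_per_s in (5, 20, 100):
        mbit = per_token * tok_per_s * 8 / 1e6
        print(f"{tok_per_s:>4} tok/s -> {mbit:.1f} Mbit/s of activation traffic")

Even at 100 tok/s that's only about 40 Mbit/s, well within gigabit Ethernet or decent Wi-Fi, so for decode the bandwidth side is manageable; per-hop latency is the harder part.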


Pretty sure llama.cpp can already do that (it has an RPC backend that can split a model across machines).

I forgot to clarify that I meant dealing with the network bottleneck.

Just my two cents from experience: any sufficiently advanced LLM training or inference pipeline eventually discovers that the real bottleneck is the network!
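
To put rough numbers on that: how badly the network hurts depends on how you cut the model. A tensor-parallel split synchronizes inside every layer, while a pipeline split synchronizes only at host boundaries. A sketch, again assuming Llama-3-70B-ish dimensions (8192 hidden, 80 layers) and treating each all-reduce as moving roughly one hidden vector per token:

    # Rough per-token communication comparison for two ways to split a
    # 70B model across 2 hosts. All numbers are assumptions, not
    # measurements.
    HIDDEN = 8192    # hidden size (assumed)
    LAYERS = 80      # transformer layers (assumed)
    BYTES = 2        # fp16
    HOSTS = 2

    # Pipeline parallel: one activation vector per boundary per token.
    pp_bytes = HIDDEN * BYTES * (HOSTS - 1)
    pp_syncs = HOSTS - 1

    # Tensor parallel: ~2 all-reduces per layer (after attention and
    # after the MLP), each on the order of one hidden vector per token.
    tp_bytes = 2 * LAYERS * HIDDEN * BYTES
    tp_syncs = 2 * LAYERS

    print(f"pipeline: {pp_bytes / 1024:>6.0f} KiB/token, {pp_syncs:>3} sync points")
    print(f"tensor:   {tp_bytes / 1024:>6.0f} KiB/token, {tp_syncs:>3} sync points")

On a home network with ~1 ms round trips, 160 sync points per token cap you at roughly 6 tok/s before any compute happens, which is presumably why these home-cluster projects favor layer-wise splits.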


