This bottleneck is exactly why open source has been handed a golden opportunity to lead the training of cutting-edge models.

Federated learning lowers the barrier to entry and expands the ecosystem, letting more participants share compute and/or datasets so that smaller players can train models.

DiLoCo, introduced by Douillard, minimizes communication overhead by having each worker train locally and only periodically averaging the weight updates. What the article misses, though, is that each GPU in the distributed cluster still needs enough VRAM to hold a full copy of the model during training. That's where DisTrO comes in: it reduces inter-GPU communication even further using a decoupling technique (DeMo) that shares only the fast-moving components of the optimizer state across the cluster.
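To make the communication pattern concrete, here's a toy single-process sketch of the DiLoCo idea (my own illustration, not the authors' code): every worker runs many local optimizer steps, and only the averaged weight deltas cross the network once per round. `make_loader` is a hypothetical stand-in for each worker's data, and the real DiLoCo outer step uses Nesterov momentum rather than the plain averaging shown here.

    import copy
    import torch
    import torch.nn as nn

    def diloco_round(global_model, make_loader, num_workers=4, local_steps=50, lr=1e-3):
        """One communication round: lots of local training, a single sync."""
        global_params = [p.detach().clone() for p in global_model.parameters()]
        pseudo_grad = [torch.zeros_like(p) for p in global_params]

        for w in range(num_workers):
            local = copy.deepcopy(global_model)   # each worker still holds a full copy
            opt = torch.optim.AdamW(local.parameters(), lr=lr)
            for _, (x, y) in zip(range(local_steps), make_loader(w)):
                opt.zero_grad()
                nn.functional.cross_entropy(local(x), y).backward()
                opt.step()
            # Only these weight deltas ("pseudo-gradients") cross the network.
            for g, p_local, p_global in zip(pseudo_grad, local.parameters(), global_params):
                g += (p_global - p_local.detach()) / num_workers

        # Outer update (plain averaging here; DiLoCo uses outer Nesterov momentum).
        with torch.no_grad():
            for p, p0, g in zip(global_model.parameters(), global_params, pseudo_grad):
                p.copy_(p0 - g)
        return global_model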

> And what if the costs could drop further still? The dream for developers pursuing truly decentralised ai is to drop the need for purpose-built training chips entirely. Measured in teraflops, a count of how many operations a chip can do in a second, one of Nvidia's most capable chips is roughly as powerful as 300 or so top-end iPhones. But there are a lot more iPhones in the world than gpus. What if they (and other consumer computers) could all be put to work, churning through training runs while their owners sleep?

This aligns with DisTrO: according to its authors, it could also let consumer devices like desktop gaming PCs join the compute cluster and share the workload. There's also an open-source project called exo that splits a model across idle local devices, but it's limited to inference.
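For the inference-splitting idea, here's a rough toy sketch of what a tool like exo does conceptually (this is not exo's actual API, just a simulation on one machine): each device holds only a slice of the layers and forwards activations to the next.

    import torch
    import torch.nn as nn

    layers = [nn.Linear(256, 256) for _ in range(8)] + [nn.Linear(256, 10)]

    def shard_layers(layers, num_devices):
        """Assign contiguous chunks of layers to hypothetical devices."""
        per = (len(layers) + num_devices - 1) // num_devices
        return [nn.Sequential(*layers[i:i + per]) for i in range(0, len(layers), per)]

    shards = shard_layers(layers, num_devices=3)   # e.g. phone, laptop, desktop PC

    @torch.no_grad()
    def pipelined_forward(x):
        # In a real cluster each hop is a network transfer of activations only;
        # no single device ever needs the full set of weights.
        for stage in shards:
            x = stage(x)
        return x

    print(pipelined_forward(torch.randn(1, 256)).shape)   # torch.Size([1, 10])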

It might still be relevant, though, since the article mentions that DiLoCo made the model respond better to instruction prompts and reasoning questions it had never encountered during pre-training. And Arthur seems to think test-time training will make his approach the norm.

Sources:
DisTrO: https://github.com/NousResearch/DisTrO
DeMo: https://arxiv.org/pdf/2411.19870
exo: https://github.com/exo-explore/exo




> What the article misses, though, is that each GPU in the distributed cluster still needs enough VRAM to hold a full copy of the model during training.

That's not exactly accurate. On the data-parallel side, Distributed Data Parallel (DDP) does require a full copy of the model on each GPU. However, Fully Sharded Data Parallel (FSDP) does not.
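A minimal PyTorch sketch of the difference, assuming the process group has already been initialized (e.g. launched with torchrun); the two wrappers are alternatives, not used together:

    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    def make_model():
        return nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096)).cuda()

    # DDP: every rank keeps a full replica of the weights; only gradients are
    # all-reduced each step, so per-GPU memory scales with the full model size.
    ddp_model = DDP(make_model())

    # FSDP: parameters, gradients and optimizer state are sharded across ranks;
    # full weights for a wrapped unit are gathered only around its forward/backward.
    fsdp_model = FSDP(make_model())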

Similarly, tensor parallelism (TP) splits the model across GPUs, to the point where a full layer never lives on a single GPU.
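And a toy illustration of the tensor-parallel idea for a single linear layer, simulated on one machine (each chunk stands in for what one GPU would store and compute):

    import torch

    torch.manual_seed(0)
    W = torch.randn(1024, 4096)          # full weight: out_features x in_features
    x = torch.randn(8, 4096)             # a batch of activations

    num_gpus = 4
    shards = W.chunk(num_gpus, dim=0)    # each "GPU" stores 256 of the 1024 output rows

    # Each device computes its slice of the output independently...
    partials = [x @ w_shard.T for w_shard in shards]
    # ...and the slices are concatenated (an all-gather in a real cluster).
    y_parallel = torch.cat(partials, dim=1)

    assert torch.allclose(y_parallel, x @ W.T, atol=1e-5)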

Combining several of these is how huge foundation models are trained. Meta used 4D parallelism (FSDP plus TP, pipeline and context parallelism) to train Llama 405B.


You're right, my caveat wasn't exactly accurate, but I wanted to point out where DisTrO comes in and why it's relevant here.

I mean, it reduces the communication overhead by orders of magnitude more than DiLoCo does.



