
> What this article misses though is that despite this, each GPU in the distributed cluster still needs to have enough VRAM to load the entire copy of the model to complete the training process.

That's not exactly accurate. On the data-parallel side of things, the Distributed Data Parallel (DDP) approach does require a full copy of the model on each GPU. However, there's also Fully Sharded Data Parallel (FSDP), which does not.
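
For anyone who hasn't used them, the difference shows up right at the wrap call. A minimal sketch, assuming PyTorch's distributed wrappers and a toy stand-in model (in practice you'd pick one wrapper, not both):

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    dist.init_process_group("nccl")     # rendezvous via torchrun env vars
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    def make_model():
        return torch.nn.Sequential(torch.nn.Linear(4096, 4096), torch.nn.ReLU()).cuda(rank)

    # DDP: each rank holds a full replica; only gradients are all-reduced.
    ddp_model = DDP(make_model(), device_ids=[rank])

    # FSDP: parameters, gradients, and optimizer state are sharded across ranks
    # and gathered layer-by-layer as needed, so no rank holds the whole model.
    fsdp_model = FSDP(make_model())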

Similarly, techniques like tensor parallelism (TP) split the model across GPUs, to the point where full layers never reside on a single GPU.
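
Megatron-style column parallelism is the clearest example: each rank holds a slice of a layer's weight matrix, and the full activation only exists after a collective. A rough, forward-only sketch using plain torch.distributed (the class name and sharding scheme are mine, not a real library API; assumes the process group is already initialized and out_features divides evenly):

    import torch
    import torch.distributed as dist

    class ColumnParallelLinear(torch.nn.Module):
        def __init__(self, in_features, out_features):
            super().__init__()
            self.world_size = dist.get_world_size()
            # Each rank stores only its slice of the weight's output columns.
            self.local = torch.nn.Linear(in_features, out_features // self.world_size)

        def forward(self, x):
            local_out = self.local(x)                   # [batch, out / world_size]
            shards = [torch.empty_like(local_out) for _ in range(self.world_size)]
            dist.all_gather(shards, local_out)          # fetch the other ranks' slices
            # The full activation exists only after the collective; the full
            # weight never does. (A real implementation also handles the
            # backward pass and bias sharding.)
            return torch.cat(shards, dim=-1)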

Combining several of these is how huge foundation models are trained. Meta used 4D parallelism (FSDP + TP plus pipeline/context parallelism) to train Llama 405B; see the toy sketch below for how the dimensions fit together.
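
One way to picture the combination: the cluster forms a grid with one axis per parallelism type, and each GPU gets a coordinate along every axis. A toy illustration (the dimension sizes are invented, not Meta's actual configuration):

    dp, pp, cp, tp = 4, 2, 2, 2             # data, pipeline, context, tensor parallel
    world_size = dp * pp * cp * tp          # 32 GPUs in this toy example

    def coords(rank):
        tp_idx, rest = rank % tp, rank // tp        # innermost axis varies fastest
        cp_idx, rest = rest % cp, rest // cp
        pp_idx, dp_idx = rest % pp, rest // pp
        return dict(dp=dp_idx, pp=pp_idx, cp=cp_idx, tp=tp_idx)

    print(coords(0))               # {'dp': 0, 'pp': 0, 'cp': 0, 'tp': 0}
    print(coords(world_size - 1))  # {'dp': 3, 'pp': 1, 'cp': 1, 'tp': 1}
    # Gradients are averaged within a dp group, activations flow across pp stages,
    # the sequence is split along cp, and weight shards live along tp.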




You're right, my wording wasn't exactly accurate, but I wanted to point out where DisTrO might come in and why it's relevant here.

I mean, it reduces the communication overhead by more orders of magnitude than DiLoCo does.



