This bottleneck is exactly why open source has been handed a golden opportunity to lead the training of cutting-edge models.

Federated learning lowers the barrier to entry and expands the ecosystem, letting more participants share compute and/or datasets so that smaller players can train models.

DiLoCo, introduced by Douillard, minimizes communication overhead by having each worker train locally and only periodically averaging the weight updates. What the article misses, though, is that each GPU in the distributed cluster still needs enough VRAM to hold a full copy of the model during training. That's where DisTrO comes in: it reduces inter-GPU communication even further using a decoupling technique (DeMo) that shares only the fast-moving components of the optimizer state across the cluster.
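To make the communication pattern concrete, here's a toy single-process sketch of the DiLoCo idea (my own illustration, not the authors' code): every worker runs many local optimizer steps, and only the averaged weight deltas cross the network once per round. `make_loader` is a hypothetical stand-in for each worker's data, and the real DiLoCo outer step uses Nesterov momentum rather than the plain averaging shown here.

    import copy
    import torch
    import torch.nn as nn

    def diloco_round(global_model, make_loader, num_workers=4, local_steps=50, lr=1e-3):
        """One communication round: lots of local training, a single sync."""
        global_params = [p.detach().clone() for p in global_model.parameters()]
        pseudo_grad = [torch.zeros_like(p) for p in global_params]

        for w in range(num_workers):
            local = copy.deepcopy(global_model)   # each worker still holds a full copy
            opt = torch.optim.AdamW(local.parameters(), lr=lr)
            for _, (x, y) in zip(range(local_steps), make_loader(w)):
                opt.zero_grad()
                nn.functional.cross_entropy(local(x), y).backward()
                opt.step()
            # Only these weight deltas ("pseudo-gradients") cross the network.
            for g, p_local, p_global in zip(pseudo_grad, local.parameters(), global_params):
                g += (p_global - p_local.detach()) / num_workers

        # Outer update (plain averaging here; DiLoCo uses outer Nesterov momentum).
        with torch.no_grad():
            for p, p0, g in zip(global_model.parameters(), global_params, pseudo_grad):
                p.copy_(p0 - g)
        return global_model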

> And what if the costs could drop further still? The dream for developers pursuing truly decentralised ai is to drop the need for purpose-built training chips entirely. Measured in teraflops, a count of how many operations a chip can do in a second, one of Nvidia's most capable chips is roughly as powerful as 300 or so top-end iPhones. But there are a lot more iPhones in the world than gpus. What if they (and other consumer computers) could all be put to work, churning through training runs while their owners sleep?

This aligns with DisTrO: according to its authors, it could also let consumer devices like desktop gaming PCs join the compute cluster and share the workload. There's also an open-source project called exo that splits a model across idle local devices, but it's limited to inference.
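For the inference-splitting idea, here's a rough toy sketch of what a tool like exo does conceptually (this is not exo's actual API, just a simulation on one machine): each device holds only a slice of the layers and forwards activations to the next.

    import torch
    import torch.nn as nn

    layers = [nn.Linear(256, 256) for _ in range(8)] + [nn.Linear(256, 10)]

    def shard_layers(layers, num_devices):
        """Assign contiguous chunks of layers to hypothetical devices."""
        per = (len(layers) + num_devices - 1) // num_devices
        return [nn.Sequential(*layers[i:i + per]) for i in range(0, len(layers), per)]

    shards = shard_layers(layers, num_devices=3)   # e.g. phone, laptop, desktop PC

    @torch.no_grad()
    def pipelined_forward(x):
        # In a real cluster each hop is a network transfer of activations only;
        # no single device ever needs the full set of weights.
        for stage in shards:
            x = stage(x)
        return x

    print(pipelined_forward(torch.randn(1, 256)).shape)   # torch.Size([1, 10])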

It might still be relevant, though, since the article mentions that DiLoCo made the model respond better to instruction prompts and reasoning questions it had never encountered during pre-training. And Arthur seems to think test-time training will make his approach the norm.

Sources:
DisTrO: https://github.com/NousResearch/DisTrO
DeMo: https://arxiv.org/pdf/2411.19870
exo: https://github.com/exo-explore/exo




> What the article misses, though, is that each GPU in the distributed cluster still needs enough VRAM to hold a full copy of the model during training.

That's not exactly accurate. On the data-parallel side, Distributed Data Parallel (DDP) does require a full copy of the model on each GPU. However, Fully Sharded Data Parallel (FSDP) does not.
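A minimal PyTorch sketch of the difference, assuming the process group has already been initialized (e.g. launched with torchrun); the two wrappers are alternatives, not used together:

    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    def make_model():
        return nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096)).cuda()

    # DDP: every rank keeps a full replica of the weights; only gradients are
    # all-reduced each step, so per-GPU memory scales with the full model size.
    ddp_model = DDP(make_model())

    # FSDP: parameters, gradients and optimizer state are sharded across ranks;
    # full weights for a wrapped unit are gathered only around its forward/backward.
    fsdp_model = FSDP(make_model())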

Similarly, tensor parallelism (TP) splits the model across GPUs, to the point where a full layer never lives on a single GPU.
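And a toy illustration of the tensor-parallel idea for a single linear layer, simulated on one machine (each chunk stands in for what one GPU would store and compute):

    import torch

    torch.manual_seed(0)
    W = torch.randn(1024, 4096)          # full weight: out_features x in_features
    x = torch.randn(8, 4096)             # a batch of activations

    num_gpus = 4
    shards = W.chunk(num_gpus, dim=0)    # each "GPU" stores 256 of the 1024 output rows

    # Each device computes its slice of the output independently...
    partials = [x @ w_shard.T for w_shard in shards]
    # ...and the slices are concatenated (an all-gather in a real cluster).
    y_parallel = torch.cat(partials, dim=1)

    assert torch.allclose(y_parallel, x @ W.T, atol=1e-5)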

Combining several of these is how huge foundation models are trained. Meta used 4D parallelism (FSDP plus TP, pipeline and context parallelism) to train Llama 405B.


You're right, my caveat wasn't exactly accurate, but I wanted to point out where DisTrO comes in and why it's relevant here.

I mean, it reduces the communication overhead by orders of magnitude more than DiLoCo does.



