> What this article misses though is that despite this, each GPU in the distributed cluster still needs to have enough VRAM to load the entire copy of the model to complete the training process.
That's not exactly accurate. On the data parallel side of techniques, the Distributed Data Parallel (DDP) approach does require a full copy of the model on each GPU. However, there's also Fully Sharded Data Parallel (FSDP), which does not.
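As a rough illustration (my own sketch, not from the article), here's how the two wrappers differ in PyTorch. It assumes the script is launched with `torchrun` so the rank environment variables are set, and the toy `make_model` function is just a stand-in:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

def make_model():
    # Stand-in model; in practice this would be a transformer.
    return torch.nn.Sequential(
        torch.nn.Linear(4096, 4096),
        torch.nn.ReLU(),
        torch.nn.Linear(4096, 4096),
    ).cuda()

# DDP: every rank keeps a full replica of the parameters;
# only gradients are all-reduced during the backward pass.
ddp_model = DDP(make_model(), device_ids=[local_rank])

# FSDP: parameters, gradients, and optimizer state are sharded
# across ranks, so no single GPU holds the whole model at once.
fsdp_model = FSDP(make_model())
```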
Similarly, techniques like tensor parallelism (TP) split the model across GPUs, to the point where a full layer is never on a single GPU anymore.
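To make that concrete, here's a toy single-process sketch (illustrative only) of column-wise splitting a single linear layer, the basic building block of TP. Each "shard" only ever holds a slice of the weight matrix, and concatenating the partial outputs reproduces the full layer:

```python
import torch

torch.manual_seed(0)
x = torch.randn(8, 1024)                  # batch of activations
full = torch.nn.Linear(1024, 4096, bias=False)

# Split the weight along the output dimension into two shards,
# as two GPUs would each hold in real tensor parallelism.
w0, w1 = full.weight.chunk(2, dim=0)      # each shard: (2048, 1024)

y_full = full(x)
y_sharded = torch.cat([x @ w0.t(), x @ w1.t()], dim=-1)

assert torch.allclose(y_full, y_sharded, atol=1e-5)
```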
Combining several of the above is how huge foundation models are trained. Meta used 4D parallelism (FSDP combined with TP and pipeline/context parallelism) to train Llama 405B.
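For a sense of how those axes get combined, PyTorch's DeviceMesh lets you name each parallelism dimension explicitly. The shape below (2-way on each axis, 16 GPUs total) is just a made-up example, not Meta's actual configuration:

```python
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

# Hypothetical 4D layout over 16 GPUs: 2-way data, pipeline,
# context, and tensor parallelism. Launch with torchrun.
dist.init_process_group("nccl")
mesh = init_device_mesh("cuda", (2, 2, 2, 2),
                        mesh_dim_names=("dp", "pp", "cp", "tp"))
print(mesh["tp"])  # sub-mesh used for tensor-parallel sharding
```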