
> Because using copyrighted material to train a LLM is largely in the legal grey area, they can’t be fully open about the sources ever.

I don’t think that’s true. For example, some open-source LLMs have their training data publicly available, and deliberately hiding evidence of something you think could be illegal sounds too risky for most big companies (obviously that happens sometimes, but I don’t think it would at that scale).




While there may be some, the most notable ones seem to hide behind the veil of “proprietary training data”. And even assuming the data is open, the method used to generate the model must also be reproducible, so the toolchain needs to be open too. I don’t think there is a lot of incentive to do this.


But GPU-based training of models is inherently non-deterministic


In what way?

If you keep your ordering consistent, and seed any random numbers you need, what's left to be a problem?


"Inherently" might be too strong of a word, but the default implementations of a lot of key operations are nondeterministic on GPU. With the parallel nature of GPU compute, you can often do things faster if you're willing to be a bit loosey-goosey. PyTorch and TF will typically provide deterministic alternatives, but those come at a cost of efficiency, and might be impractical for LLM training runs that are already massively expensive.

https://pytorch.org/docs/stable/notes/randomness.html
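
Roughly, opting in looks something like this (a minimal sketch based on those notes, not a complete recipe; the cuBLAS env var only matters for CUDA >= 10.2):

    import os
    import torch

    # cuBLAS needs this set before first use for deterministic matmuls (CUDA >= 10.2)
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

    torch.manual_seed(0)                      # seed the CPU and CUDA RNGs
    torch.use_deterministic_algorithms(True)  # pick deterministic kernels, or error if none exists
    torch.backends.cudnn.benchmark = False    # don't autotune conv algorithms per run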


I wonder what the actual speed difference is. I couldn't find any benchmarks.
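
It presumably depends a lot on the op and the shapes, but you can get a rough number yourself with something like this (a quick sketch, assuming a CUDA GPU; index_add_ is one of the ops that falls back to a slower deterministic implementation):

    import time
    import torch

    assert torch.cuda.is_available()
    x = torch.zeros(1_000_000, device="cuda")
    idx = torch.randint(0, 1_000_000, (10_000_000,), device="cuda")
    src = torch.ones(10_000_000, device="cuda")

    def bench():
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(20):
            x.index_add_(0, idx, src)   # scatter-style accumulate
        torch.cuda.synchronize()
        return time.perf_counter() - t0

    torch.use_deterministic_algorithms(False)
    bench()                              # warm-up
    print("nondeterministic:", bench())
    torch.use_deterministic_algorithms(True)
    print("deterministic:   ", bench())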


The inherent faults or even just speed differences [*] of hardware.

[*] In the real world, a lot of resources are oversubscribed and not deterministic. Just think about how scheduling and power management work in a processor. Large model training happens across thousands to millions of processors (think of all the threads in a GPU times the number of GPUs needed, plus the power throttling that modern hardware does to fit its power envelope at every level... and power is just one dimension; memory and network bandwidth are other sources of randomness too).

Making such training deterministic means going as slow as the slowest link in the chain, or having massive redundancies.

I suppose we might be able to solve this eventually, perhaps with innovations in the area of reversible computing (to cancel out nondeterminism post facto), but the current flavor of deep-learning training algorithms can't do that.


There's no reason that has to affect the determinism. When you're calculating thousands of nodes in a layer, each one is an independent series of multiplies and additions, and it doesn't matter what order the nodes get scheduled in. And each one will calculate in the order you coded it.

And if you want something finer, with smaller "slowest links", you can deterministically split each node into a couple dozen pieces that you then add together in a fixed order, and that would have negligible overhead.
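
Something like this toy version of a single node's output, just to illustrate the fixed-order reduction (NumPy plus threads standing in for the GPU; the names are hypothetical):

    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    def node_output(w, x, pieces=32):
        # Each piece of the dot product can be computed by any worker,
        # finishing in any order...
        chunks = np.array_split(np.arange(len(w)), pieces)
        with ThreadPoolExecutor() as pool:
            partials = list(pool.map(lambda c: float(w[c] @ x[c]), chunks))
        # ...but map() returns the partials in submission order, and we sum
        # them in that fixed order, so the floating-point result is identical
        # from run to run.
        total = 0.0
        for p in partials:
            total += p
        return total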


I am talking about training models on thousands of machines, each with thousands of GPU streaming processors.

For data parallelism, if you want deterministic results, you need to merge weights (AllReduce, in the general case) in a deterministic way. So either you need a way to wait until all workers catch up to the same progress (go as slow as the weakest link), or you need to fix the differences due to data skew afterward. AFAIK, no one has developed reversible computation for DL in a way that allows fixing that data skew post facto in the general case. (1)
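
For reference, the "go as slow as the weakest link" option is just synchronous data parallelism, roughly (a sketch with torch.distributed, assuming init_process_group has already been called; whether the reduction itself is bitwise reproducible still depends on the backend):

    import torch.distributed as dist

    def sync_grad_step(model, loss):
        loss.backward()
        for p in model.parameters():
            # Blocking collective: every rank waits for every other rank
            # before the merged gradient exists anywhere, so progress is
            # paced by the slowest worker.
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= dist.get_world_size()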

For model parallelism, you are bound by the other graph nodes that your computation depends on.

This problem can be seen in large-scale reinforcement learning, simulation, and other active-learning scenarios, where exploring the unknown environment/data at different speeds can skew the learning. A simple example: imagine a VR world where the pace at which you can generate experiences depends on the number of objects in the scene, and where some parts of the world are computationally expensive but provide few rewards to sustain exploration (deserts) before an agent can reach a reward-rich area. Without "countermeasures", agents are less likely to reach the reward-rich area if there are other avenues of exploration, even if the global optimum lies there.

(1) IMHO, finding a solution to this problem that doesn't depend on storing or recomputing gradients is equivalent to finding a training algorithm that works in the presence of skewed/inhomogeneous datasets for the forward-forward approach that Geoffrey Hinton proposed: https://www.cs.toronto.edu/~hinton/FFA13.pdf



