1. code versioning
2. data versioning
3. model versioning
Code versioning is dominated by GitHub, and the space is fairly saturated (Bitbucket, GitLab). Data versioning is either not happening or is being done ad hoc through regular data pulls, database snapshots, etc.; it is not well standardized or widely adopted. CometML is tackling model versioning.
It would be really nice to have a single solution for all three, but that seems unlikely. Hopefully new standards will evolve from this.
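To make the data-versioning gap concrete, here's roughly what the ad-hoc snapshot approach amounts to in practice: hash the dataset and file it away by content. A minimal sketch in Python; the function name and manifest layout are made up for illustration, not any particular tool's API:

    import hashlib
    import json
    import shutil
    from pathlib import Path

    def snapshot_dataset(src: str, store: str = "data_snapshots") -> str:
        """Copy a dataset file into a content-addressed store and
        record its hash in a manifest: a crude form of data versioning."""
        data = Path(src).read_bytes()
        digest = hashlib.sha256(data).hexdigest()
        store_dir = Path(store)
        store_dir.mkdir(exist_ok=True)
        dst = store_dir / digest
        if not dst.exists():  # identical data is stored only once
            shutil.copyfile(src, dst)
        manifest = store_dir / "manifest.json"
        entries = json.loads(manifest.read_text()) if manifest.exists() else {}
        entries[src] = digest
        manifest.write_text(json.dumps(entries, indent=2))
        return digest  # the hash doubles as the "data version"

This works, but there's no standard way to share, diff, or branch these snapshots, which is exactly the gap.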
Nice breakdown. I agree that data versioning is the one area with limited standardized options. I would add that in addition to versioning the data, there is also the related problem of integrating the 3 areas of versioning... tying the "data version" to the "model version" and the "code version". That seems to me like it might be a good place to start in tackling data versioning, or is that too trivial? Is there a product out there that already does this?
Pachyderm, a project I work on, is probably as close as you'll find to something that ties all three together. In my mind the major unsolved problem here was data versioning, so that's the first thing we tackled. Code versioning is already quite well solved, so we just integrate with existing tools for that. I'm not convinced that model versioning is actually distinct from data versioning; models are just data, after all. So, absent an established system for versioning models, the way Git + GitHub is for code, treating models as data and versioning them that way is good enough for government work.

From what I can tell, CometML isn't so much versioning models as tracking versions of models. It expects models to be stored and versioned elsewhere, but it gives you deeper insight into how those models are performing, how they're changing, the hyperparameters used to train them, etc. Tracking all of that is also a very important problem, and one CometML seems to solve quite elegantly.
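To make the "models are just data" point concrete, here's a minimal sketch of versioning a trained model through the same content-addressed mechanism you'd use for a dataset. This is an illustration, not Pachyderm's or CometML's API; save_model_version and the store layout are assumptions:

    import hashlib
    import pickle
    from pathlib import Path

    def save_model_version(model, store: str = "model_store") -> str:
        """Serialize a model and file it by content hash, i.e. version
        it exactly as if it were any other data artifact."""
        blob = pickle.dumps(model)
        digest = hashlib.sha256(blob).hexdigest()
        store_dir = Path(store)
        store_dir.mkdir(exist_ok=True)
        path = store_dir / f"{digest}.pkl"
        if not path.exists():  # an unchanged model produces no new version
            path.write_bytes(blob)
        return digest  # this hash is the "model version"

The returned hash can then be recorded next to the data version and commit that produced it, which is the tie-together the parent comment is asking about.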
Interesting. Can you point me to a deeper discussion of this division of "versioning"?
I'm inclined to think something like Django data migrations or EntityFramework Code First Migrations tackles what I immediately thought of as "model versioning", and to some degree "data versioning" (though incompletely, or perhaps impossibly, for some things).
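For what it's worth, this is the shape of the thing I mean: a Django migration versions the schema (Django's sense of "model") and can carry data migrations alongside it. The app, model, and field names below are hypothetical, and the file only runs inside a Django project:

    # A Django schema migration: the framework's notion of "model
    # versioning" is versioning the schema over time.
    from django.db import migrations, models

    class Migration(migrations.Migration):
        # assumes a hypothetical "experiments" app with an initial migration
        dependencies = [("experiments", "0001_initial")]
        operations = [
            migrations.AddField(
                model_name="run",
                name="val_accuracy",
                field=models.FloatField(null=True),
            ),
        ]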
We actually do code and model versioning (and simple data versioning). One thing to keep in mind is that code, results, and hyperparams must be coupled. If you have a git branch with some training code and you don't know what the hyperparams/results were, it's not very valuable.
I was mostly referring to coupling code with results. For example, you have code that loads a dataset from S3 and then trains a neural network. If you only use git, you're likely to lose the hyperparams info (which is often passed as command-line arguments) and your metrics/results.
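A minimal sketch of that coupling, assuming nothing beyond git being on the PATH: capture the commit SHA at training time and append it with the hyperparams and metrics to a run log. The log format and function name are made up for illustration:

    import json
    import subprocess
    import time

    def log_run(hyperparams: dict, metrics: dict, path: str = "runs.jsonl") -> None:
        """Append one training run, tying the code version (git SHA)
        to the hyperparams and results it produced."""
        sha = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip()
        record = {
            "timestamp": time.time(),
            "commit": sha,
            "hyperparams": hyperparams,
            "metrics": metrics,
        }
        with open(path, "a") as f:
            f.write(json.dumps(record) + "\n")

    # e.g. at the end of a training script:
    # log_run({"lr": 1e-3, "batch_size": 64}, {"val_acc": 0.91})

With that, every line in the run log points back at an exact commit, so checking out that commit plus the logged hyperparams reproduces the run.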