1. code versioning
2. data versioning
3. model versioning
Code versioning is dominated by GitHub, and the space is fairly saturated (Bitbucket, GitLab). Data versioning is either not happening or is being done ad hoc through regular data pulls, database snapshots, etc.; it is not well standardized or widely adopted. CometML is tackling model versioning.
It would be really nice to have a single solution for all three, but that seems unlikely. Hopefully new standards will evolve from this.
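To make the data-versioning gap concrete, here's roughly what the ad-hoc snapshot approach amounts to in practice: hash the dataset and file it away by content. A minimal sketch in Python; the function name and manifest layout are made up for illustration, not any particular tool's API:

    import hashlib
    import json
    import shutil
    from pathlib import Path

    def snapshot_dataset(src: str, store: str = "data_snapshots") -> str:
        """Copy a dataset file into a content-addressed store and
        record its hash in a manifest: a crude form of data versioning."""
        data = Path(src).read_bytes()
        digest = hashlib.sha256(data).hexdigest()
        store_dir = Path(store)
        store_dir.mkdir(exist_ok=True)
        dst = store_dir / digest
        if not dst.exists():  # identical data is stored only once
            shutil.copyfile(src, dst)
        manifest = store_dir / "manifest.json"
        entries = json.loads(manifest.read_text()) if manifest.exists() else {}
        entries[src] = digest
        manifest.write_text(json.dumps(entries, indent=2))
        return digest  # the hash doubles as the "data version"

This works, but there's no standard way to share, diff, or branch these snapshots, which is exactly the gap.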
Nice breakdown. I agree that data versioning is the one area with limited standardized options. I would add that in addition to versioning the data, there is also the related problem of integrating the 3 areas of versioning... tying the "data version" to the "model version" and the "code version". That seems to me like it might be a good place to start in tackling data versioning, or is that too trivial? Is there a product out there that already does this?
Pachyderm, a project I work on, is probably as close as you'll find to something that ties all three together. In my mind the major unsolved problem here was data versioning, so that's the first thing we tackled. Code versioning is already quite well solved, so we just integrate with existing tools for that. I'm not convinced that model versioning is actually distinct from data versioning; models are just data, after all. So, absent an established system for versioning models, the way Git + GitHub is for code, treating models as data and versioning them that way is good enough for government work.

From what I can tell, CometML isn't so much versioning models as tracking versions of models. It expects models to be stored and versioned elsewhere, but it gives you deeper insight into how those models are performing, how they're changing, the hyperparameters used to train them, etc. Tracking all of that is also a very important problem, and one CometML seems to solve quite elegantly.
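To make the "models are just data" point concrete, here's a minimal sketch of versioning a trained model through the same content-addressed mechanism you'd use for a dataset. This is an illustration, not Pachyderm's or CometML's API; save_model_version and the store layout are assumptions:

    import hashlib
    import pickle
    from pathlib import Path

    def save_model_version(model, store: str = "model_store") -> str:
        """Serialize a model and file it by content hash, i.e. version
        it exactly as if it were any other data artifact."""
        blob = pickle.dumps(model)
        digest = hashlib.sha256(blob).hexdigest()
        store_dir = Path(store)
        store_dir.mkdir(exist_ok=True)
        path = store_dir / f"{digest}.pkl"
        if not path.exists():  # an unchanged model produces no new version
            path.write_bytes(blob)
        return digest  # this hash is the "model version"

The returned hash can then be recorded next to the data version and commit that produced it, which is the tie-together the parent comment is asking about.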
Interesting. Can you point me to a deeper discussion of this division of "versioning"?
I'm inclined to think something like Django data migrations or EntityFramework Code First Migrations tackles what I immediately thought of as "model versioning", and to some degree "data versioning" (though incompletely, or perhaps impossibly, for some things).
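For what it's worth, this is the shape of the thing I mean: a Django migration versions the schema (Django's sense of "model") and can carry data migrations alongside it. The app, model, and field names below are hypothetical, and the file only runs inside a Django project:

    # A Django schema migration: the framework's notion of "model
    # versioning" is versioning the schema over time.
    from django.db import migrations, models

    class Migration(migrations.Migration):
        # assumes a hypothetical "experiments" app with an initial migration
        dependencies = [("experiments", "0001_initial")]
        operations = [
            migrations.AddField(
                model_name="run",
                name="val_accuracy",
                field=models.FloatField(null=True),
            ),
        ]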
We actually do code and model versioning (and simple data versioning). One thing to keep in mind is that code, results, and hyperparams must be coupled. If you have a git branch with some training code and you don't know what the hyperparams/results were, it's not very valuable.
I was mostly referring to coupling code with results. For example, you have code that loads a dataset from S3 and then trains a neural network. If you only use git, you're likely to lose the hyperparams info (which is often passed as command-line arguments) and your metrics/results.
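A minimal sketch of that coupling, assuming nothing beyond git being on the PATH: capture the commit SHA at training time and append it with the hyperparams and metrics to a run log. The log format and function name are made up for illustration:

    import json
    import subprocess
    import time

    def log_run(hyperparams: dict, metrics: dict, path: str = "runs.jsonl") -> None:
        """Append one training run, tying the code version (git SHA)
        to the hyperparams and results it produced."""
        sha = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip()
        record = {
            "timestamp": time.time(),
            "commit": sha,
            "hyperparams": hyperparams,
            "metrics": metrics,
        }
        with open(path, "a") as f:
            f.write(json.dumps(record) + "\n")

    # e.g. at the end of a training script:
    # log_run({"lr": 1e-3, "batch_size": 64}, {"val_acc": 0.91})

With that, every line in the run log points back at an exact commit, so checking out that commit plus the logged hyperparams reproduces the run.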