Hacker News

Nice breakdown. I agree that data versioning is the one area with limited standardized options. I would add that, in addition to versioning the data, there is the related problem of integrating the three areas of versioning: tying the "data version" to the "model version" and the "code version". That seems to me like a good place to start in tackling data versioning, or is that too trivial? Is there a product out there that already does this?



Pachyderm, a project I work on, is probably as close as you'll find to something that ties all three together. In my mind the major unsolved problem here was data versioning, so that's the first thing we tackled. Code versioning is already quite well solved, so we just integrate with existing tools for that.

I'm not convinced that model versioning is actually distinct from data versioning; after all, models are just data. So without an established system for versioning models, the way Git + GitHub is for code, treating models as data and versioning them that way is good enough for government work.

From what I can tell, CometML isn't quite versioning models so much as tracking versions of models. It expects models to be stored and versioned elsewhere, but it gives you a way to get deeper insight into how those models are performing, how they're changing, the hyperparameters used to train them, etc. Tracking this is also a very important problem, and CometML seems to solve it quite elegantly.
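To make the "treat models as data" idea concrete, here's a minimal sketch of tying the three versions together in one manifest: the code version from the current git commit, and the data and model versions as content hashes of their files. This is a hypothetical illustration, not Pachyderm's or CometML's actual mechanism; the function and field names are made up.

```python
import hashlib
import json
import subprocess


def sha256_file(path):
    """Content-address a file: the hash serves as its version."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()


def make_manifest(data_path, model_path):
    """Tie the code, data, and model versions together in one record.

    Assumes this runs inside a git repository; the model is versioned
    exactly like the data, as a content hash of its serialized file.
    """
    commit = subprocess.check_output(
        ["git", "rev-parse", "HEAD"]
    ).decode().strip()
    return {
        "code": commit,
        "data": sha256_file(data_path),
        "model": sha256_file(model_path),
    }


if __name__ == "__main__":
    # Hypothetical paths; in practice these would be your dataset
    # snapshot and the serialized model artifact.
    manifest = make_manifest("train.csv", "model.pkl")
    print(json.dumps(manifest, indent=2))
```

Committing a manifest like this alongside each training run gives you a reproducible triple: check out the commit, fetch the data and model blobs by hash, and you can verify exactly what produced what.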





