
I'm actually a bit surprised at people using PMML and this architecture. Clearly, the attempt here is to isolate model generation from runtime prediction, but doing this also confines you to the least common denominator: you can't generate any model that Openscoring can't handle. If you think about it, there is no real need for Openscoring. You can whip up a REST service very easily that wraps a scikit-learn predictor, and I would bet it's actually much easier to do than writing PMML exporters. Then you can use all the goodness of top-of-the-line models while your service interface still remains the same. An architecture that forces you into the lowest common denominator just for abstraction purposes is poor design, IMO.
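For what it's worth, the thin wrapper being suggested is only a few lines. This is a minimal sketch, not a production setup; the `/predict` route, the `features` payload field, and the choice of model are all my own invention:

```python
# Minimal sketch: a REST service wrapping an arbitrary scikit-learn
# predictor. Swap in any estimator; the HTTP interface stays the same.
from flask import Flask, request, jsonify
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Train whatever model you like (here: logistic regression on iris).
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    # Expects JSON like {"features": [[5.1, 3.5, 1.4, 0.2]]}
    features = request.get_json()["features"]
    return jsonify(predictions=model.predict(features).tolist())

# app.run(port=5000) would serve this at http://localhost:5000/predict
```

The point being that the service interface is just `POST /predict` with a JSON body, regardless of which estimator sits behind it, so upgrading to a fancier model never touches the consumers.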



PMML allows you to use models generated by a variety of tools and systems, including external vendors. While you are right that making it core does create some constraints, it also lets them swap out portions of the model-building pipeline with relative ease.

Yes, PMML isn't perfect (being kind), but it continues to be extended and is the one shared lingua franca we have across model creation systems, short of (sigh) SAS code and "recode the model in generic C", both of which I see too often.

I suspect in the future we'll see a "standard" architecture: pipelines with multiple parallel feeds and runtime engines feeding into ensembles, each of which allows various model types in "native" format (sklearn and other pythonics, R, Java, etc.). That would be interesting, instead of having to cram everything into PMML. Just a thought.


Agreed, this describes our reasoning pretty well. Additionally, as we mentioned towards the end of the post, the incremental benefit of a model not supported by PMML is unlikely to be more significant than the incremental improvements we see from investing in improving our features and ground truth. As such, our lowest common denominator isn't actually the model, but rather the features and data pipelines. The suggestion sytelus makes is a perfectly good way of doing it though, and we will likely change our approach when we find that our models do become the lowest common denominator, or have a higher relative ROI for the time we invest in improving them.


> You can whip up a REST service very easily that wraps a scikit-learn predictor, and I would bet it's actually much easier to do than writing PMML exporters.

So as it turns out, I spend my days building the very product you're describing (yhathq.com; a REST API-ifier for R and Python). The scikit-learn community alone is a wonderful group that does a hell of a job. It's kinda crazy that most products won't let you use that awesomeness and instead choose to build out their own machine learning libraries to work within their systems.

This article got passed around the office this morning and it seems to encompass the general theme of most ML tools. They empower you to do cool things with machine learning/general data analysis, but at the expense of being able to use the libraries that most people use to do machine learning/general data analysis. Don't know if I'd consider that poor design, but yeah, it's definitely a tradeoff.

Hmm, maybe I should be reaching out to airbnb's data science team?




