I'm actually a bit surprised at people using PMML and this architecture. Clearly, the attempt here is to isolate model generation from runtime prediction, but doing this also confines you to the least common denominator: you can't generate any model that Openscoring can't handle. If you think about it, there is no real need for Openscoring. You can whip up a REST service very easily that wraps an sk-learn predictor, and I would bet it's actually much easier to do than writing PMML exporters. Then you can use all the goodness of top-of-the-line models while your service interface remains the same. An architecture that forces you to the lowest common denominator just for abstraction purposes is poor design, IMO.
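For what it's worth, the kind of wrapper this comment describes is genuinely small. Here's a toy sketch using Flask around a scikit-learn classifier; the endpoint name, payload shape, and training data are all invented for illustration, and in practice you'd load a model trained offline rather than fitting one at startup:

```python
# Minimal sketch of a REST service wrapping a scikit-learn predictor.
# Endpoint, payload shape, and toy data are assumptions, not Airbnb's API.
from flask import Flask, jsonify, request
from sklearn.ensemble import RandomForestClassifier

app = Flask(__name__)

# Toy model; a real service would load a model fitted offline.
model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit([[0, 0], [1, 1], [0, 1], [1, 0]], [0, 1, 1, 0])

@app.route("/predict", methods=["POST"])
def predict():
    # Expects e.g. {"features": [1, 1]}
    features = request.get_json()["features"]
    proba = model.predict_proba([features])[0].tolist()
    return jsonify({"probabilities": proba})

if __name__ == "__main__":
    app.run(port=8000)
```

The service interface stays fixed while the model behind it can be swapped for anything scikit-learn supports, which is the point the comment is making.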
PMML allows you to use models generated by a variety of tools and systems, including those from external vendors. While you're right that making it core does create some constraints, it also lets them replace portions of the model-building pipeline with relative ease.
Yes, PMML isn't perfect (being kind), but it continues to be extended and is the one shared lingua franca we have across model creation systems, short of (sigh) SAS code and "recode the model in generic C", both of which I see too often.
I suspect in the future we'll see "standard" architecture with pipelines with multiple parallel feeds and runtime engines into ensembles, each of which allows various model types in "native" format (sklearn and other pythonics, R, java, etc.) which would be interesting, instead of having to cram all into PMML. Just a thought.
Agreed, this describes our reasoning pretty well. Additionally, as we mentioned towards the end of the post, the incremental benefit of a model not supported by PMML is unlikely to be more significant than the incremental improvements we see from investing in improving our features and ground truth. As such, our lowest common denominator isn't actually the model, but rather the features and data pipelines.
The suggestion sytelus makes is a perfectly good way of doing it though, and we will likely change our approach when we find that our models do become the lowest common denominator, or have a higher relative ROI for the time we invest in improving them.
> You can whip up a REST service very easily that wraps an sk-learn predictor and I would bet it's actually much easier to do than writing PMML exporters.
So as it turns out, I spend my days building the very product you're describing (yhathq.com; a REST API-ifier for R and Python). The scikit-learn community alone is a wonderful group that does a hell of a job. It's kinda crazy that most products won't let you use that awesomeness and instead choose to build out their own machine learning libraries to work within their system.
This article got passed around the office this morning and it seems to encompass the general theme of most ML tools. They empower you to do cool things with machine learning/general data analysis, but at the expense of being able to use the libraries that most people use to do machine learning/general data analysis. Don't know if I'd consider that poor design, but yeah, it's definitely a tradeoff.
Hmm, maybe I should be reaching out to airbnb's data science team?
I've just built a system very much like this for a large customer. Extremely interesting and I learned a lot while doing it. Funny to see companies operating at a similar scale running into similar problems and solving them in roughly similar ways.
Looks like a cool project... but I hope they plan to open source their library for exporting Scikit-Learn classifiers to PMML! This would be a great way for them to give back to the open source community.
We'd love to do this. As you can imagine, however, it's a very time consuming task, and there are a lot of competing priorities (including other projects we've open sourced) and we therefore can't make any guarantees about if/when we'll be able to do it.
Nice writeup. It seems like a supervised learning approach to fraud detection. I have a question: where does the is_fraud variable come from? Is it set by humans?
(I don't work for Airbnb, but work in a similar space)
Yes, this variable is usually set after the fact. For instance, a given transaction may have led to a chargeback, or may be done by a known fraudster. These models are usually trained on historical data, so we can know with some certainty which transactions are fraud.
It could be that a transaction was fraudulent but has not yet led to a chargeback (maybe the real cardholder hasn't yet seen their statement?), so there's still some uncertainty, but hopefully that approaches a minimum after some time passes.
I don't want to hijack this thread too much from the original post, but we use some of the same software as Airbnb (scikit-learn, randomforest models, etc.) as well as some stuff developed in-house. Credit card fraud has been one of our biggest issues, and we've developed some pretty robust systems to fight it. Contact me privately and I'd be happy to talk about it -- this stuff is what I do for a living.
Thanks! It varies depending on the model, but in most cases, yes, the confirmation of fraudulent activity is manual, since having correct ground truth is critical.
OK good to know. :) Would you be writing another blog post on what features you use for fraudulent activity, or do you consider this a business secret?
No, features are definitely a business secret. My goal was to share as much as possible that might be helpful to others combating fraud, without giving anything away that would hinder the effectiveness of our systems.
Don't mean to hijack the comment thread, but can anyone recommend any good videos or courses that introduce machine learning? I studied computer science but wasn't able to take any classes on the subject. I found the Stanford one; does anyone have experience with it?
I've heard nothing but good reviews of the online version of the class. I took the class at Stanford (and actually worked on the system mentioned in the article) and I found its content to be useful. I believe that the online version contains less theory but this is not necessarily a bad thing if all you want is an introduction.
I've also found reading papers to be illuminating: often the first article about a given classifier is fairly well written and accessible if you have a strong background in math.
I really liked the Caltech course. I didn't spend too much time trying the Stanford one, but I got the sense that the Caltech course had more of an overarching theoretical framework, in terms of which every new concept was explained.
I was thinking about the sorts of fraud categories AirBnB likely experiences. Most fraudsters want cash or cash equivalents, and the use of lodging on a particular night is nearly as illiquid as stolen fine art. So, those seeking stuff to resell will choose to defraud one of the zillion online marketers who ship stuff to doorsteps. A buyer who actually used the space he reserved could initiate a chargeback later claiming that the service promised via AirBnB wasn't provided -- couldn't access apartment, wasn't as described, etc. However, space providers likely will cooperate with AirBnB and provide evidence in their defense. Better to attempt a chargeback elsewhere if one is short on money. It seems that using AirBnB as a platform for crimes between buyer and space provider is possible, and there certainly has been at least one heavily publicized case, but we would hear a lot more about these events if they were happening often.
So, what's left? Collusion between buyer and space provider -- in all likelihood, they are one and the same, or identities have been stolen. For example, I list my condo on AirBnB for $100/night. Someone books it for the weekend, and then doesn't show up. AirBnB owes me $200 -- after all, I gave up other options to profit from its use. An honest buyer pays up. But, maybe the buyer is dishonest -- he used a stolen credit card, etc. In this case, AirBnB eats the loss and pays me as the space provider. Now, wouldn't it be convenient if I was also the buyer? Cash from stolen credit cards, funneled through AirBnB (much akin to the way online poker sites were used to transfer stolen money via bad heads-up play). This would work until AirBnB noticed that my listing seems to have a suspicious propensity to attract fraudulent buyers. Then, they'll shut me down. So, I'll pop up elsewhere. After all, no need to actually have a space because no one I accept will ever show up!
I bet the usage patterns of the party/parties involved in this fraud are drastically different than those of legitimate market participants. Someone with a fraudulent listing could out himself by rejecting a bunch of legitimate AirBnB buyers, and this behavior would stand out as it's the opposite of the behavior expected of an honest seller. So, he must protect against this risk by making his listing unappealing (high price, bad photos/description, unpopular ___location, etc.). The behavior of users browsing AirBnB when viewing this property could identify its relative undesirability (few clicks, etc.), and price outliers could be identified by comparing similar offerings by date/___location/type. The click stream of the "buyer" likely is most revealing. Someone selecting an unappealing property without doing much comparison shopping likely isn't a legit buyer.
What other stuff might predict fraud? Vague descriptions might indicate a fraudulent listing. Most space providers love to tell buyers what's special about their offering. Could some scoring of a listing's prose prove a strong predictor? I've never listed with AirBnB. What do they do to verify listings? As a buyer, they verified my identity. Could this serve multiple purposes? Certainly, I'd feel better listing my guest room if I know that AirBnB will know the identity of the guy who rented the room and then stabbed me at 3AM. But, in addition, does identifying market participants in strong ways help keep fraudsters from repeating their crimes by setting up multiple accounts? Obviously, newer market participants are more risky than established ones, especially those who have interacted with known legit, long-time users. The social graph comes to the rescue here. Even astroturfing ought to show up as a small, disconnected graph unless legit users' identities are stolen.
Of course, this comment is all just conjecture. Obviously, AirBnB can't tell the public about specific fraud methods or how they identify suspicious activity. However, I like the concreteness of considering actual fraud scenarios, so I decided to put forth some ideas for discussion.
I didn't quite understand the need for openscoring and pmml. If it's just a question of using a sklearn model to predict an outcome, why not just build it into a simple json-rpc with Tornado, Gevent or whatever the rage is, currently?
From what I understand they want to persist the trained model in a language independent way, so you can train the models with whatever language or framework you wish and then save it to a format that can be used by any other language or framework to classify unseen instances.
As I'm working on a very similar problem right now: the difficulty is that to save the fitted sklearn model you have to pickle it (a pickled decent-sized random forest is several megabytes). Then, at classification time, you have to import pickle, sklearn (and numpy), depickle the object, run the example through the classifier, and extract the output. Perhaps the Openscoring model is more efficient?
You can use `all_model_filenames = joblib.dump(model, filename)` after fitting on your dev environment. joblib will store each numpy array in the model data structure as an independent file, and `all_model_filenames[0] == filename` refers to the file holding the main pickle structure.
Then, on your prediction servers, ensure that you have a copy of all the files in `all_model_filenames` in the same folder. You can then load the model with `model = joblib.load(all_model_filenames[0], mmap_mode='r')`. This makes it possible to use shared memory (memory mapping) for the parameters of a large random forest, so that all the Gunicorn, Celery, or Storm worker processes running on the same server use the same memory pages, making it a very efficient way to deploy large models on RAM-constrained servers.
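A minimal end-to-end sketch of that workflow. The file name and toy model are illustrative; in real use the dump would happen on the training box and the files would be shipped to the prediction servers. Note that older joblib versions wrote one file per large numpy array, while newer versions typically bundle everything into a single file, but `dump` still returns the list of files written:

```python
# Sketch of joblib persistence plus memory-mapped loading, as described above.
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier

# "Training box": fit a toy model (label equals the second feature here).
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit([[0, 0], [1, 1], [0, 1], [1, 0]], [0, 1, 1, 0])

# Dump returns the list of files it wrote; the first is the main pickle.
model_dir = tempfile.mkdtemp()
main_file = os.path.join(model_dir, "model.joblib")
all_model_filenames = joblib.dump(model, main_file)

# "Prediction server": load with mmap_mode='r' so the parameter arrays are
# memory-mapped read-only and shared across worker processes via the OS
# page cache instead of being copied into each process.
shared_model = joblib.load(all_model_filenames[0], mmap_mode="r")
print(shared_model.predict([[1, 1]]))
```

The key design point is that `mmap_mode='r'` makes the large arrays read-only views over the on-disk files, so N workers cost roughly one copy of the model in RAM rather than N.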
You can even use docker to ship the model as part of a container and treat the model as binary software configuration.
As I said, run a separate service for this. That way you only have to load the model (or even train it) once per service process. That is one thing the Openscoring service also does...
If you are more familiar with Python than Java, like me, then that would be a more attractive option.
At <west coast startup, essentially a copy of earlier successful European businesses such as HouseTrip, but with access to stupid amounts of US capital and therefore more profitable>, we <make superfluous, keyword-laden, unverifiable claim about ourselves in the future>. We <continue to integrate feel-good community pronouns>. We <here discuss something only tangential to our core business and assert that we have allocated at least two people to this area>. We <have nothing better to do than write it up, because quite frankly, there's nothing more pressing for us to work on in an already automated business of relative simplicity>.
OK, so that's a bit harsh, but there are some points of reality in there. Sorry, as someone who used to run a complex travel industry business (3200+ hotel contracts... all of them in Chinese, all business by digital fax (no convenience here!), constant rate changes, in 6 human languages and multiple currencies with a real time call center) and who co-pitched for VC with HouseTrip's management in London in 2009, I just have very little respect for AirBNB.
Care to do a write up of what you're doing as far as fraud prevention goes? However you feel about Airbnb, this was an interesting post. It's not ground breaking or earth shattering, but it shows the tech stack that a large company is using to solve a real problem, and that's useful. They even did a fairly good job of explaining the why.
> Care to do a write up of what you're doing as far as fraud prevention goes?
Sure, but only high level. I would hazard a guess that fraud prevention is a lot more complex for us at https://www.kraken.com/ ... dealing with many cryptographic currencies and conventional currencies spread across probably over a hundred legal jurisdictions is not easy. We likely have to consider far more factors than these guys. We have recently added two more quants from programs highly regarded in the conventional finance industry to our team, plus we have over seven figures of investment in legal and training programs in the area. We also use R.
Basically, it's inputs (behavior), processing (metric extraction, risk model), and output (boolean choices, statistical cluster membership, etc.), where a series of such outputs may feed into a hierarchy of scores for different elements within a system. Some applications may be real time, others after the fact.
At a high level, which is mostly where my involvement is in hiring people, fraud prevention is not dissimilar to spam or intrusion detection: you can basically use a combined, constantly tweaked set of inputs to a Bayesian-style scoring algorithm. Inputs include both static rules and statistical anomaly detection.
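To make the "combined inputs to a Bayesian-style scoring algorithm" idea concrete, here is a toy sketch. The rule names, log-likelihood-ratio weights, prior, and anomaly weighting are all invented for illustration; the only point is the shape of the computation, i.e. accumulating independent pieces of evidence in log-odds space and converting back to a probability:

```python
# Toy Bayesian-style fraud score combining static rules and an anomaly
# detector's output. All names and weights are made-up illustrations.
import math

# Each static rule that fires contributes a fixed log-likelihood ratio.
RULE_LLR = {
    "new_account": 0.9,
    "mismatched_country": 1.4,
    "velocity": 2.0,  # e.g. many transactions in a short window
}

def fraud_score(rule_hits, anomaly_score, prior=0.01):
    """Combine evidence in log-odds space, return a fraud probability."""
    log_odds = math.log(prior / (1 - prior))  # start from the base rate
    for rule in rule_hits:
        log_odds += RULE_LLR.get(rule, 0.0)
    # The statistical anomaly detector contributes a continuous term.
    log_odds += 3.0 * anomaly_score
    return 1 / (1 + math.exp(-log_odds))  # sigmoid back to a probability

score = fraud_score(["velocity", "mismatched_country"], anomaly_score=0.8)
```

The "constantly tweaked" part of the comment corresponds to refitting the weights and the prior against labeled outcomes, exactly as one would retrain a spam filter.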