The service is pretty cool. Training works by uploading your CSV-formatted training data (with labels) to Google Storage and then making a call that tells the service to train on it. Google has not said much about which algorithms run behind the scenes, beyond saying that it uses a combination of proprietary and open-source ML algorithms. The service trains a variety of different models and then uses a voting scheme to decide which ones are optimal.
A few problems I see (or saw; I haven't used it in a few months) with the service:
1. Currently, there is no way to pick your cross-validation folds. This can lead to severe overfitting if your data is not i.i.d., because related samples can end up in both the training and the validation folds.
2. They provide a numerical (double) accuracy number which corresponds to the accuracy estimated from training. How is this number calculated (area under the ROC curve, etc.)? They do not say.
3. Security issues: read the fine print about what happens to your data once it is uploaded to Google Storage. It could be a cause for concern.
4. You are competing for resources. When I was testing the API, I would train two successive models on the same amount of data, and one call would complete (asynchronously) after 10 seconds while the next would take 10 minutes.
5. Currently there is no way to inject prior knowledge into your models. If you know your data is Gaussian, you might want to use an RBF kernel, but with this API you cannot, because it might pick the Naive Bayes classifier and not the SVM, etc.
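To illustrate point 1: when samples are not i.i.d. (say, many emails from the same sender), you would want to choose the folds yourself so that related samples never sit on both sides of a split. Here is a minimal sketch in plain Python of what that could look like if you ran things in house. It is not something the API exposes, and `grouped_folds` and the sender names are made up for illustration.

```python
from collections import defaultdict


def grouped_folds(groups, n_folds):
    """Assign sample indices to folds so that no group is split across folds."""
    by_group = defaultdict(list)
    for idx, g in enumerate(groups):
        by_group[g].append(idx)

    folds = [[] for _ in range(n_folds)]
    # Place the largest groups first, each into the currently smallest fold.
    for g in sorted(by_group, key=lambda g: (-len(by_group[g]), str(g))):
        smallest = min(range(n_folds), key=lambda i: len(folds[i]))
        folds[smallest].extend(by_group[g])
    return folds


# Hypothetical data: six emails written by three senders.
groups = ["alice", "alice", "bob", "bob", "bob", "carol"]
folds = grouped_folds(groups, n_folds=2)

for k, test_idx in enumerate(folds):
    train_idx = [i for i in range(len(groups)) if i not in test_idx]
    print("fold", k, "train:", train_idx, "test:", test_idx)
```

With random folds, one of bob's emails could land in training and another in validation, which makes the validation score look better than it would on a sender the model has never seen.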
In general, this service will probably work for the average spam detection problem, but if you really want a great system, you probably need to keep everything in-house.
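As a footnote to point 2, here is why it matters which metric that single number is. The sketch below computes plain accuracy and area under the ROC curve for the same predictions. The helper functions and the scores are made up for illustration, and nothing here reflects how Google actually computes its number.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples classified correctly at the given score threshold."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up scores for an imbalanced problem: 8 ham (label 0), 2 spam (label 1).
labels = [0] * 8 + [1] * 2
scores = [0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.4, 0.45, 0.15, 0.25]

acc = accuracy(labels, scores)
auc = roc_auc(labels, scores)
print("accuracy:", acc)  # 0.8
print("ROC AUC:", auc)   # 0.25
```

The same model scores 0.8 on accuracy simply by never flagging spam, while its ranking of spam against ham is worse than chance. A bare double with no stated definition cannot tell you which of these situations you are in.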