Deep Learning with Spark and TensorFlow (databricks.com)
228 points by mateiz on Jan 25, 2016 | 30 comments



So the cool thing here is that you can use Spark and TF to find the best model like Microsoft Research did with Resnets.

http://www.wired.com/2016/01/microsoft-neural-net-shows-deep...

They're showing you how to train different architectures simultaneously, and then compare their results in order to select the best one. That's great as far as it goes.

The drawback is that with this scheme, you can't actually train a given network faster, which is what you really want Spark for. What is the role of a distributed runtime in training artificial neural networks? It's simple: NNs are computationally intensive, so you want to spread the work over many machines.

Spark can help you orchestrate that through data parallelism, parameter averaging and iterative reduce, which we do with Deeplearning4j.

http://deeplearning4j.org/spark

https://github.com/deeplearning4j/dl4j-spark-cdh5-examples

Data parallelism is an approach Google uses to train neural networks on tons of data quickly. The idea is that you shard your data across a lot of equivalent models, have each model train on its own machine, and then average their parameters. That works, it's fast, and it's how Spark can help you do deep learning better.
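
Roughly, a minimal sketch of that parameter-averaging loop in PySpark (an illustration only, not DL4J's actual API; train_on_shard is a hypothetical function that runs a few local SGD steps on one shard and returns the updated weights):

  import numpy as np

  def train_epoch(sc, data_rdd, weights, train_on_shard):
      # Ship the current global parameters to every worker.
      broadcast_weights = sc.broadcast(weights)

      def fit_partition(examples):
          shard = list(examples)
          if not shard:
              return []
          # Each worker starts from the same parameters and trains on its own shard.
          return [train_on_shard(broadcast_weights.value, shard)]

      # Train one model per partition, then average the parameters.
      local_models = data_rdd.mapPartitions(fit_partition).collect()
      return np.mean(local_models, axis=0)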


Impressive, but it seems like an inversion of paradigms. A small data-to-compute ratio is usually associated with high-performance computing (HPC). Why use Spark when the data is small and is broadcast to each worker? You have to pay the serialization/deserialization penalty of moving the data from Python to the JVM and back again. In fact, the JVM isn't really needed here at all, since all the computation is done in the pure-Python workers in an embarrassingly parallel way. It seems to me that you could just move onto an HPC cluster, use TensorFlow within an IPython.parallel paradigm, and be done much sooner.
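
For illustration, a hedged sketch of that alternative -- treating the sweep as an embarrassingly parallel map over an IPython.parallel (ipyparallel) cluster, with no JVM in the loop. train_and_evaluate and hyperparameter_grid are assumed names, not from the article:

  from ipyparallel import Client

  rc = Client()                   # connect to the running IPython.parallel cluster
  view = rc.load_balanced_view()  # schedule tasks wherever a worker is free

  # Each task trains one TensorFlow configuration and returns its accuracy.
  results = view.map_sync(train_and_evaluate, hyperparameter_grid)
  best = hyperparameter_grid[results.index(max(results))]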


The "broadcast" is pretty cheap because often you already have the data in some distributed file system, or if on a single node the network bandwidth is pretty high. The problem with a lot of the deep learning workloads is that it is very compute intensive and as a result takes a long time to run. For example, it is not uncommon to take a week to train some models.


Deep learning workloads are typically compute-intensive, but they also tend to be extremely I/O-intensive, and convergence may depend on a synchronous step where all the nodes must finish making their contribution to the model before any of them can continue. (This is not strictly required -- see Google's DistBelief paper on asynchronous training -- but most frameworks work this way.) Often, adding more machines to a cluster can actually make training slower, not faster.


Did you actually read the article? It was using Spark to parallelize hyperparameter tuning, which is embarrassingly parallel.


Why not just use GNU Parallel (or something similar) instead of Spark?


I think this could have been done with GNU Parallel. One advantage I see with Spark is that it is easier to interact with Python; for example, these two lines are all that is needed to call the relevant Python function:

  urls = sc.parallelize(batched_data)          # distribute the batches of image URLs across the cluster
  labelled_images = urls.flatMap(apply_batch)  # apply the Python labelling function to each batch
So if you already have a cluster with Spark installed (like Databricks does), it takes less work to just call your Python code than to set up a GNU Parallel cluster and write a small wrapper script. Additionally, a standalone Python script would have to load/init the models on every call from Parallel; see the sketch below. I agree that this is not a great demonstration of Spark's main strengths.
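
A hypothetical sketch (not the blog post's actual apply_batch) of what such a function could look like; the point is that the expensive model load is paid once per batch of URLs, whereas a script launched per item by GNU Parallel would pay it on every invocation. load_graph_def and classify_image are assumed helpers:

  import tensorflow as tf

  def apply_batch(url_batch):
      graph = tf.Graph()
      with graph.as_default():
          # Load the trained model once for the whole batch.
          tf.import_graph_def(load_graph_def(), name='')
      with tf.Session(graph=graph) as sess:
          for url in url_batch:
              # Run inference per image and emit (url, predicted label).
              yield (url, classify_image(sess, url))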


I think one reason would be fault tolerance. Is there a fault tolerance layer in GNU Parallel? Last time I checked their homepage (a few minutes ago), there was no reference to fault tolerance.

Another reason is, perhaps, scheduling.


What fault tolerance does Spark give you in this scheme? It cannot look into TF's progress and checkpoint its state. Using Spark with TF seems like overkill -- you need to manage and install two frameworks for what should ideally be a 200-line Python wrapper or a small Mesos framework at most.


Does --retries count as fault tolerance?


Oh dear. You're right, sorry. Shouldn't have commented before actually reading the article...


I have a question about neural networks.

Say, you are training a NN to recognize handwritten characters 0 and 1, and you have 1000 training images for each character (so 2000 images in total). All images are bitmaps with 0 for black and 1 for white.

Now, by accident, all the "0" training-images have an even number of black pixels, and all the "1" training-images have an odd number of black pixels.

How do you know that the NN really learns to recognize 0's and 1's, as opposed to recognizing whether the number of pixels in an image is even or odd?


I would say that if you're using a single layer NN, the answer is "you don't really know". And that gets to a point about how we still don't entirely understand how neural networks work, even when they do work.

If you were using a deep network, though, and if the current theory is correct, it would be a slightly different story. The current thinking, as I understand it, is that with deep networks each layer learns representations of certain features (say "slashes", "edges", "right-slanted lines", "left-slanted lines", etc.), and progressively higher layers learn representations composed from those more primitive features. So if a deep net were recognizing your handwritten characters, you could probably reason that it isn't just considering whether the number of black pixels is even or odd.

Now in reality this is a pretty contrived, and probably unlikely, scenario. But it's a valid question, because there's a deeper point to all of this, which involves transference of learning. That is, how do you take the learning done by a neural network - trained to do one thing - and then leverage that learning in another application? We still don't exactly know how to do that, and that's in part because we don't entirely understand the nature of the representations the networks build up. So a very good answer to your question would arguably help us understand how to do transference, which would make NNs even more useful.


The net is just going to learn the representation "even/odd number of black pixels," if that's the easiest thing to learn.

It also goes without saying that 2k images is probably not going to be enough data to learn any meaningfully general feature representation.


They famously do things like that all the time - instead of recognizing a tank in the woods, they notice all the tanks-in-woods pictures were taken on a sunny day; they train to recognize a sunny day.


These kinds of things can happen surprisingly easily in the real world. The most common cause is "target leaks", which happen when the thing you are trying to predict accidentally ends up in your dataset, usually indirectly through some non-obvious process. Neural networks are especially good learners, and will perform suspiciously well in these situations.

There's often no way to know exactly what a neural network is doing, but sanity checks can catch most issues. Realistically, you wouldn't expect a neural network to perform with 100% accuracy, which would be a first clue in your example.


This is more or less a case of overfitting: the algorithm works on the training set but doesn't generalize well. Tweaking the model by decreasing the number of nodes in the hidden layer and doing cross-validation can usually help with this.
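
As an illustration only (scikit-learn isn't mentioned anywhere in the thread), one way to act on that advice is to cross-validate a few hidden-layer sizes and keep the smallest network that still generalizes:

  from sklearn.neural_network import MLPClassifier
  from sklearn.model_selection import cross_val_score

  def pick_hidden_size(X, y, candidate_sizes=(2, 5, 10, 25, 50)):
      # X: flattened bitmap images, y: their 0/1 labels.
      scores = {}
      for size in candidate_sizes:
          model = MLPClassifier(hidden_layer_sizes=(size,), max_iter=500)
          scores[size] = cross_val_score(model, X, y, cv=5).mean()
      # Prefer the best mean CV score; on ties, the smaller network wins.
      return max(sorted(candidate_sizes), key=lambda s: scores[s])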


You don't necessarily know, but two things make one reasonably confident: it's unlikely that the training images would line up this way by accident, and the structure of the neural net makes it difficult for the net to learn the evenness or oddness of the number of pixels.



There's actually a case in the early history of perceptrons that brings up this exact issue:

"There is a humorous story from the early days of machine learning about a network that was supposed to be trained to recognize tanks hidden in forest regions. The network was trained on a large set of photographs – some with tanks and some without tanks. After learning was complete the system appeared to work well when “shown” additional photographs from the original set. As a final test, a new group of photos were taken to see if the network could recognize tanks in a slightly different setting. The results were extremely disappointing. No one was sure why the network failed on this new group of photos. Eventually, someone noticed that in the original set of photos the network had been trained on, all of the photos with tanks had been taken on a cloudy day, while all of the photos without tanks were taken on a sunny day. The network had not learned to detect the difference between scenes with tanks and without tanks, it had instead learned to distinguish photos taken on cloudy days from photos taken on sunny days!"[0]

The pragmatic answer is that this is why you have two hold-out sets: a cross-validation/dev set and a test set. Typically you keep 70% of the data for training, 15% for CV and 15% for test. Ideally you should shuffle the data enough that there isn't any bias from the natural order of the data.

You train the model on the training data, and estimate how well the model actually performs on the CV set, which the model did not see in training. You continue to use the CV set while you tweak parameters, try out new models, etc. At this point you may have "cheated" a bit, because you only kept things that worked well on your CV data. Finally, when you say "this is done!", you try out your model on the test data set.
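
A minimal sketch of that shuffled 70/15/15 split, assuming the images and labels are NumPy arrays of the same length:

  import numpy as np

  def train_cv_test_split(images, labels, seed=0):
      rng = np.random.RandomState(seed)
      order = rng.permutation(len(images))  # shuffle away any ordering bias
      n_train = int(0.70 * len(images))
      n_cv = int(0.15 * len(images))
      train, cv, test = np.split(order, [n_train, n_train + n_cv])
      return ((images[train], labels[train]),
              (images[cv], labels[cv]),
              (images[test], labels[test]))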

Of course it's still possible that you would have the even/odd issue, and the answer to this whole set of issues is "healthy skepticism", and checking for these types of errors.

Take for example this Sentence Completion Challenge from Microsoft Research [1]

They claim some astounding results on correctly answering GRE-type questions using a very simple model (LSA, for those who care). These results seemed impossible! But it turns out they cheated by training the model only on the possible answers (which is akin to studying for the actual GRE by only reviewing the possible answers that will be on the exam).

We tend to obsess over p-values and test validation scores as a substitute for reasoning. But all research papers should be read as an argument a friend is making to you, "I've done this incredible thing...", and no single number should replace reasoned inquiry into possible errors.

[0] http://watson.latech.edu/WatsonRebootTest/ch14s2p4.html

[1] http://research.microsoft.com/apps/pubs/?id=157031


The tank anecdote is also famously apocryphal. Here's a good analysis of the origin of that story: http://www.jefftk.com/p/detecting-tanks


I'm pretty sure that story is an urban legend. Nobody can find the original source.


Because you can simulate from the network. If you were right, this couldn't happen:

http://www.cs.toronto.edu/~hinton/adi/index.htm


0.1% accuracy increments correspond to 10 images in the testing set; they should be reporting standard error bars with those numbers.
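
A back-of-the-envelope check of that point, assuming a 10,000-image test set (which is what makes 0.1% equal 10 images): near 99% accuracy, the standard error of the accuracy estimate is itself about 0.1%, the same size as the increments being reported.

  import math

  n = 10000  # assumed test-set size: 10,000 images, so 0.1% = 10 images
  p = 0.99   # example accuracy in the range being compared
  std_err = math.sqrt(p * (1 - p) / n)
  print("standard error: %.4f (%.2f%%)" % (std_err, 100 * std_err))  # ~0.10%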


That article reminded me of this: http://i.imgur.com/XQJ3ACO.jpg



There's actually a little bit more info out there for would-be "Watson builders".

https://www.ibm.com/developerworks/community/blogs/InsideSys...

http://www.theregister.co.uk/2011/02/21/ibm_watson_qa_system...

http://learning.acm.org/webinar/lally.cfm

http://www.cs.nmsu.edu/ALP/2011/03/natural-language-processi...

Of course, there's still a big gap between "Download some stuff" and "Build Watson", but at least there's a trickle of details on what happens in the "a miracle happens here" step. :-)


Yup - and very grateful for those.

To me, recently, the linked graphic represents pretty well what I'm faced with on a daily basis. People seem to think that because a hammer exists, it's easy to build a house.


The blog post actually provides code to reproduce all the steps and the chart. See

http://go.databricks.com/hubfs/notebooks/TensorFlow/Distribu...

http://go.databricks.com/hubfs/notebooks/TensorFlow/Test_dis...

You might've missed the section "How do I use it?" Maybe we should've made that section more obvious.


I am still laughing from that graphic. So simple, but, you know what? So true, too.



