I think the two can complement each other very well.
GPUs are flexible and scalable when you don't yet know what the large-scale parameters of the network you want to build will look like and you need a lot of compute to do training. Let a fleet of cloud-based GPUs do the heavy lifting of training and learning.
But then once training is over, an FPGA or even an ASIC could implement the trained model and run it at a crazy-fast speed with low power. A piece of hardware like that could potentially handle things like real-time video processing through a DNN. Very handy for things like self-driving vehicles.
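Roughly what I have in mind, as a minimal numpy sketch (the file name and the tiny two-layer network are just placeholders, not anyone's actual pipeline): training produces a frozen set of weights, and the inference side only ever reads them, which is what makes fixed low-power hardware an option.

    import numpy as np

    # Training side (the GPU fleet's job): ends with a frozen set of weights.
    rng = np.random.default_rng(0)
    w1 = rng.standard_normal((16, 32))
    w2 = rng.standard_normal((32, 4))
    np.savez("model.npz", w1=w1, w2=w2)   # hypothetical export step

    # Inference side (what the FPGA/ASIC would implement): the weights never
    # change, so the forward pass can be baked into fixed, low-power hardware.
    frozen = np.load("model.npz")

    def infer(x):
        hidden = np.maximum(x @ frozen["w1"], 0.0)   # ReLU
        return hidden @ frozen["w2"]

    print(infer(rng.standard_normal((1, 16))).shape)   # (1, 4)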
If you're dealing at scales where you can use the word "fleet" then it will usually make sense to just build an ASIC on a trailing process node rather than go for FPGAs. They'll be cheaper in bulk and more performant even with a large process disadvantage.
ADDENDUM: But fundamentally, in spaces like this, the underlying algorithms that can be accelerated are fairly simple. In most cutting-edge AI these days the heavy lifting is performed by convolutional neural networks, and the specialized silicon that speeds up one set of convolutional neural network operations will speed up another just as well. Baking the network itself into the hardware shouldn't tend to be any better than loading it into specialized memory pools, unless you get really exotic and do your neural network in analog electronics.
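To make that concrete, here's a toy sketch in plain numpy (not taken from any real accelerator) of the one convolution primitive such silicon has to make fast. Nothing in it depends on which network the kernel weights came from, which is why the weights can sit in memory pools rather than in the silicon itself.

    import numpy as np

    def conv2d(image, kernel):
        # The single multiply-accumulate pattern the accelerator has to speed up.
        kh, kw = kernel.shape
        oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
        out = np.zeros((oh, ow))
        for i in range(oh):
            for j in range(ow):
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
        return out

    # Two different "networks" are just two different weight tensors pushed
    # through the exact same operation.
    frame = np.random.rand(32, 32)
    edge_kernel = np.array([[-1., 0., 1.], [-1., 0., 1.], [-1., 0., 1.]])
    blur_kernel = np.ones((3, 3)) / 9.0
    conv2d(frame, edge_kernel)
    conv2d(frame, blur_kernel)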
I think there is a big enough space between GPU and ASIC technology for FPGAs. The main reason is the lifetime of the models. The shorter that lifetime, the more often you have to swap out the ASICs and the more expensive that gets. At the very least you'd have to produce new ASICs every few months and either replace them in special sockets or reflow/solder them onto new cards.
My assumption is that the ASIC is executing code that changes every month, but that it's using instructions and a memory hierarchy geared towards convolutional neural networks. If that stops being true then of course you'd need a different ASIC, but then again if that stops being true there's no guarantee that a GPU or ASIC will do any better than a CPU. You could end up with something like alpha-beta pruning, where parallelism doesn't make much of a difference. A reasonable chip won't be able to contain enough transistors to have separate execution resources for each layer. It's going to have to work by loading a layer, convolving it, loading the next layer, convolving it, and so on. So you'll be able to change your network without changing the ASIC you're running it on, while still taking advantage of your dedicated ganged operations. The FPGA version can be optimized for the exact sizes of the network layers in a way the more flexible ASIC version can't. But I expect that benefit to be much smaller than the gain from moving to an ASIC in the first place.
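The loop I'm picturing, as a toy sketch (dense matrix multiplies stand in for the convolutions to keep it short, and the layer sizes are made up): the chip only ever holds one layer's weights at a time, so a retrained network is just a new set of arrays, not a new piece of silicon.

    import numpy as np

    def run_network(activations, layer_weights):
        # Layer-at-a-time execution: load a layer's weights, apply it, move on.
        for w in layer_weights:
            activations = np.maximum(activations @ w, 0.0)   # ReLU after each layer
        return activations

    # Swapping the model means swapping this data, not the hardware it runs on.
    layers = [np.random.randn(64, 128),
              np.random.randn(128, 32),
              np.random.randn(32, 10)]
    x = np.random.randn(1, 64)
    run_network(x, layers)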
From the papers, I believe they are hardcoding the layer weights into the hardware definition of the FPGA. These FPGAs also have no significant on-chip RAM, but the Intel FPGAs they use do seem to have an even larger number of LUTs than the usual embedded FPGAs, and even dedicated floating-point units.
At the very least they talk about omitting weights which are 0 in the synthesis.
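Something like this toy sketch, I'd guess (ordinary Python, not what the paper's toolchain actually does): only the nonzero weights and their positions are kept, so the multiply-accumulates for the zeros simply never happen, which is roughly what dropping them at synthesis time buys you.

    import numpy as np

    def prune(weights):
        # Keep only the nonzero entries along with their positions.
        rows, cols = np.nonzero(weights)
        return list(zip(rows, cols, weights[rows, cols]))

    def sparse_matvec(pruned, x, out_dim):
        # One multiply-accumulate per surviving weight; zeros cost nothing.
        out = np.zeros(out_dim)
        for i, j, w in pruned:
            out[i] += w * x[j]
        return out

    w = np.array([[0.0, 1.5, 0.0],
                  [0.2, 0.0, 0.0]])
    x = np.array([1.0, 2.0, 3.0])
    sparse_matvec(prune(w), x, out_dim=2)   # same result as w @ x, fewer MACs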
TrueNorth was built to run spiking neural networks, which have little to do with deep learning (even though they managed to get it to run a small convolutional NN), and Nervana has never actually built any hardware.
Yes, there are at least a dozen companies with specialized hardware accelerators in some stage of development. For smaller parts, some of the existing DSP companies like CEVA and Cadence Tensilica are also adapting their architectures for deep neural net workloads.
Yet it's still not clear whether building a custom chip makes sense, because the next Nvidia chip might make it obsolete. Or the one after that, which would still arrive too soon.