IBM Chip Processes Data Similar to the Way Your Brain Does (technologyreview.com)
169 points by finisterre on Aug 7, 2014 | 57 comments



Yann LeCun (neural net pioneer and Facebook AI head) has a somewhat-skeptical post about this chip: https://www.facebook.com/yann.lecun/posts/10152184295832143. His essential points:

1. Building special-purpose hardware for neural nets is a good idea and potentially very useful.

2. The architecture implemented by this IBM chip, spike-and-fire, is not the architecture used by the state-of-the-art convolutional networks, engineered by Alex Krizhevsky and others, that have recently been smashing computer vision benchmarks. Those networks allow neuron outputs to assume continuous values, not just binary on-or-off.

3. It would be possible, though more expensive, to implement a state-of-the-art convnet in hardware similar to what IBM has done here.

Of course, just because no one has shown state-of-the-art results with spike-and-fire neurons doesn't mean that it's impossible! Real biological neurons are spike-and-fire, though this doesn't mean the behavior of a computational spike-and-fire 'neuron' is a reasonable approximation to that of a biological neuron. And even if spike-and-fire networks are definitely worse, maybe there are applications in which the power/budget/required accuracy tradeoffs favor a hardware spike-and-fire network over a continuous convnet. But it would be nice for IBM to provide benchmarks of their system on standard vision tasks, e.g., ImageNet, to clarify what those tradeoffs are.
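
For intuition, here is a minimal sketch (plain NumPy, purely illustrative; not IBM's or LeCun's code) contrasting the two neuron models: a continuous-valued convnet-style unit versus a simple integrate-and-fire unit that only emits binary spikes.

    import numpy as np

    def continuous_neuron(x, w, b):
        # Convnet-style unit: real-valued output (ReLU of a weighted sum).
        return max(0.0, float(np.dot(w, x) + b))

    def integrate_and_fire(inputs, w, threshold=1.0, v=0.0):
        # Spiking unit: accumulate weighted input over time steps, emit a
        # binary spike and reset when the potential crosses the threshold.
        spikes = []
        for x in inputs:
            v += float(np.dot(w, x))
            if v >= threshold:
                spikes.append(1)
                v = 0.0
            else:
                spikes.append(0)
        return spikes, v

    w = np.array([0.4, 0.3])
    print(continuous_neuron(np.array([1.0, 2.0]), w, b=-0.5))  # 0.5
    print(integrate_and_fire([np.array([0.5, 0.5])] * 4, w))   # ([0, 0, 1, 0], ~0.35)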


I find it interesting that no group (to my knowledge) has tried something similar to [Do Deep Networks Need to Be Deep?](http://arxiv.org/abs/1312.6184) for ImageNet-scale networks. There have been several results showing that the knowledge learned in larger networks can be compressed and approximated using small or even single-layer nets. Extreme learning machines (ELMs) can be seen as another aspect of this. There have also been interesting results on the "kernelization" of convnets [from Julian Mairal and co.](http://arxiv.org/abs/1406.3332) that, combined with the strong crossover between Gaussian processes and neural networks from back in the late '90s, point to the possibility of needing different "representation power" for learning vs. predicting, which may lead to the ability to kernelize the knowledge of a trained net, ideally in closed form.
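
To make the compression idea concrete, here is a tiny sketch (NumPy only; the "teacher" below is a made-up stand-in function, not an actual trained net): a simple student with fixed random features and closed-form output weights, in the spirit of an ELM, is fit to the teacher's real-valued outputs rather than to hard labels.

    import numpy as np

    rng = np.random.default_rng(0)

    def teacher(X):
        # Stand-in for a large trained net: some fixed nonlinear function.
        return np.tanh(X @ np.array([1.5, -2.0])) + 0.1 * X[:, 0] * X[:, 1]

    # "Compress" the teacher: fit a much simpler student to the teacher's
    # real-valued outputs (soft targets) instead of the original labels.
    X = rng.normal(size=(5000, 2))
    y_soft = teacher(X)

    # Student: one layer of fixed random features plus output weights solved
    # in closed form (ridge regression) -- the same spirit as an ELM.
    P = rng.normal(size=(2, 50))                  # random input weights
    H = np.tanh(X @ P)                            # hidden activations
    W = np.linalg.solve(H.T @ H + 1e-3 * np.eye(50), H.T @ y_soft)

    X_new = rng.normal(size=(1000, 2))
    err = np.mean((np.tanh(X_new @ P) @ W - teacher(X_new)) ** 2)
    print("student's mean squared error vs. teacher:", err)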

I am doing some experiments in this area, and would encourage anyone thinking of doing hardware to look at this aspect before investing the R&D to do hardware! If this knowledge can really be compressed it could be a massive reduction in complexity to implement in hardware...

I am a bit biased on this topic (finishing a talk about this exact topic for EuroScipy now) but I find the connections interesting at least.


But over a specific time period, doesn't spike-and-fire integrate signals, so that effectively you're operating with real-valued quantities? Isn't this the brain's way of using digital signals (more robust, lower power) rather than analogue values over the neural wires?
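
In other words, counting spikes over a window recovers a real-valued rate. A minimal sketch of that intuition (illustrative only; not how the IBM chip actually encodes values):

    import numpy as np

    rng = np.random.default_rng(1)

    def rate_code(value, window=1000):
        # Encode a real value in [0, 1] as a binary spike train: each time
        # step fires with probability equal to the value.
        return rng.random(window) < value

    def decode(spikes):
        # Recover an approximate real value by counting spikes in the window.
        return spikes.mean()

    print(decode(rate_code(0.37)))  # roughly 0.37; noisier for short windows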


For a technical article about the architecture, see:

http://www.research.ibm.com/software/IBMResearch/multimedia/...


I'm very excited about this, as it's at least 2 decades overdue. When Pentiums were getting popular in the mid 90s, I remember thinking that their deep pipelines for branch prediction and large on-chip caches meant that fabs were encountering difficulties with Moore's law and it was time to move to multicore.

At the time, functional programming was not exactly mainstream and many of the concurrency concepts we take for granted today from web programming were just research. So of course nobody listened to ranters like me and the world plowed its resources into GPUs and other limited use cases.

My take is that artificial general intelligence (AGI) has always been a hardware problem (which really means a cost problem) because the enormous wastefulness of chips today can’t be overcome with more-of-the-same thinking. Somewhere we forgot that, no, it doesn’t take a billion transistors to make an ALU, and no matter how many billion more you add, it’s just not going to go any faster. Why are we doing this to ourselves when we have SO much chip area available now and could scale performance linearly with cost? A picture is worth a thousand words:

http://www.extremetech.com/wp-content/uploads/2014/08/IBM_Sy...

I can understand how skeptics might think this will be difficult to program, etc., but what these new designs are really offering is reprogrammable hardware. Sure, we only have ideas now about what network topologies could saturate a chip like this, but just watch: very soon we'll see some whiz-bang stuff that throws the network out altogether and uses content-addressable storage or some other hash-based scheme so we can get back to thinking about data, relationships and transformations.

What's really exciting to me is that this chip will eventually become a coprocessor, and networks of these will be connected very cheaply, each specializing in what are often thought of as difficult tasks. Computers are about to become orders of magnitude smarter because we can begin throwing big, dumb programs like genetic algorithms at them and study the way that solutions evolve. Whole swaths of computer science have been ignored simply due to their inefficiencies, but soon that just won't matter anymore.


> I remember thinking that their deep pipelines for branch prediction and large on-chip caches meant that fabs were encountering difficulties with Moore's law

It's really a combination of memory latency and pipelining.

Memory latency is absolutely terrible compared to processor speed, and that has nothing to do with Moore's law. It's 60ns to access main memory, which is ballpark 150 cycles. If you have no caches, your 2.5GHz processor is basically throttled to 16MHz. You can buy some back with high memory bandwidth and a buffer (read many instructions at a time). But if you have no predictor, every taken branch flushes the buffer and costs an extra 150 cycles; in heavily branched code your performance approaches 8MHz.
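
The back-of-the-envelope arithmetic behind those numbers, as a rough sketch:

    clock_hz = 2.5e9     # 2.5GHz core
    mem_ns = 60          # main-memory access latency
    cycles_per_access = mem_ns * 1e-9 * clock_hz     # ~150 cycles

    # If every instruction had to wait on main memory (no caches):
    print(clock_hz / cycles_per_access / 1e6)        # ~16.7 "effective MHz"

    # If a taken branch also flushes the buffer and costs another round trip:
    print(clock_hz / (2 * cycles_per_access) / 1e6)  # ~8.3 "effective MHz"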

Then think about pipelining. We don't pipeline because Moore's law has ended. We pipeline because a two-stage pipeline is 200% as fast as an otherwise identical unpipelined chip. A sixteen-stage pipeline is 1600% as fast. Why the hell wouldn't you pipeline? Now, of course, in the real world branched code can tank a deep pipeline. Which is where the branch predictor comes in, buying back performance.

http://stackoverflow.com/questions/4087280/approximate-cost-...


>>> If you have no caches, your 2.5GHz processor is basically throttled to 16MHz.

No. This is only true if every instruction tries to access memory.

>>> We pipeline because a two-stage pipeline is 200% as fast as an otherwise identical unpipelined chip. A sixteen-stage pipeline is 1600% as fast.

No. First of all, the cycle time is set by the slowest stage, not the average one. Second, there is significant overhead from passing data through pipeline registers, and from the control logic for those registers.
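
A small sketch of that point, with made-up stage delays and register overhead (ignoring hazards and stalls entirely), showing why an N-stage pipeline gives less than an N-fold throughput gain:

    def pipelined_speedup(stage_delays_ns, register_overhead_ns=0.2):
        # Unpipelined: one long combinational path per instruction.
        unpipelined_period = sum(stage_delays_ns)
        # Pipelined: the clock period is the slowest stage plus latch overhead.
        pipelined_period = max(stage_delays_ns) + register_overhead_ns
        return unpipelined_period / pipelined_period

    # Perfectly balanced 16 stages: close to, but below, 16x.
    print(pipelined_speedup([1.0] * 16))          # ~13.3x

    # One slow stage drags the whole pipeline down to its pace.
    print(pipelined_speedup([1.0] * 15 + [2.0]))  # ~7.7x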

The reason we saw 32-stage pipelines in the P4 was mostly marketing: the "megahertz race" between AMD and Intel.


You are right, there is appreciable overhead in pipelining and the benefit is not quite as powerful as I claimed. I am guilty of an age-old crime, simplifying a complex subject for the layman and skipping real details in the process.

But you can be certain that AMD and Intel do not design 20+ stage pipelines for some measly 10% performance uplift. The overhead of the pipeline infrastructure is nowhere near the performance gain. Consider that Haswell sustains an IPC of around 2 instructions per cycle. With a ~20-stage pipeline, they are indeed far outstripping the performance of "Haswell minus pipelining".

As for the super-deep pipeline in the P4, the consensus I hear is that Intel expected frequency to keep scaling, and as such the P4 was a future-looking architecture designed to scale to 10GHz and beyond.


>>> No. This is only true if every instruction tries to access memory.

Every instruction must be loaded from memory in order to execute. Hence instruction caches.


Yes, you're right, I missed that.


I remember the first time the von Neumann architecture was laid out for me, thinking "whoa, that's bottlenecked" and immediately thinking it would make more sense to do the computation where the memory was, or replace "memory" with just a huge pile of registers or something other than what I was looking at.

This is really exciting stuff, I can't help but think a marriage of this approach with HP's memristor technology would bring us screaming along an amazing architecture path for the next several decades.

But then again, I'm concerned that the limited use cases for this being presented are basically already performed by various custom (and cheap and power efficient) DSPs. Is all that's really being envisioned here just a lower power alternative to DSPs? I think the vision can be much bolder.


Of course it would make sense to do the computation where the memory is. Trouble is the memory area is dramatically larger than the computation area.

Imagine if you were a reference librarian, asked for facts like some kind of ancient Google. Suppose your library was the size of your bedroom- you could very quickly find facts. You only have to cross the room. Now suppose you are right in the middle of the Library of Congress. You are smack dab in the middle of it- you are where the memory is. But you're still going to spend half your time just running about the building due to its sheer size!

The only ways to solve that problem are:

- Make memory smaller. Engineers have been hard at work at this for decades.

- Use less memory. This is slower.

- Use a memory hierarchy. This is what we do today, and is analogous to you sitting in a bedroom-sized library with the Library of Congress just down the street, and a young courier who fetches you books from it.
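
A rough sketch of why the courier arrangement works so well, with made-up but typical-order-of-magnitude latencies (a simple one-level average-access-time model):

    def avg_access_cycles(hit_rate, cache_cycles=4, dram_cycles=150):
        # Average memory access time with a single cache level.
        return hit_rate * cache_cycles + (1.0 - hit_rate) * dram_cycles

    for hr in (0.90, 0.99):
        print(hr, "->", avg_access_cycles(hr), "cycles on average")
    # 90% hits: ~18.6 cycles; 99% hits: ~5.5 cycles -- vs. ~150 with no cache.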

The other challenge is speed. We can't have a huge pile of registers because fast memory is less dense than slow memory. So 1KB of CPU registers occupies a lot more space than 1KB of DRAM, but DRAM is a poor choice for registers because of how slow it is.


You are forgetting

- use more CPUs

With a billion perfectly cooperating (that's the research problem) librarians, searching the Library of Congress is way faster.


>>> Why are we doing this to ourselves when we have SO much chip area available now and could scale performance linearly with cost?

Multicore performance doesn't scale linearly because 1) adding more cores has rapidly diminishing returns on performance for most problems (http://en.wikipedia.org/wiki/Amdahl's_law) and 2) the cost of cache coherency grows superlinearly with the number of cores.
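
For a sense of how quickly those returns diminish, a quick Amdahl's-law sketch (the 95% parallel fraction is just an illustrative number):

    def amdahl_speedup(parallel_fraction, cores):
        # Amdahl's law: the serial fraction caps the achievable speedup.
        return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)

    for cores in (2, 8, 64, 1024):
        print(cores, "cores ->", round(amdahl_speedup(0.95, cores), 1), "x")
    # Even with 95% of the work parallelizable, 1024 cores give only ~19.6x.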


I wonder whether you're dismissing today's GPUs too quickly. The way they work (and are programmed) today is about halfway between CPUs and the linked architecture. They're generally applicable from on the order of 10^3 parallel computations, and they can be saturated with 10^5-10^6 threads. Whether you have compute-bound or memory-bandwidth-bound algorithms doesn't really matter (both are faster on a GPU); what matters is a sufficiently long runtime for the parallelizable part of an application, as well as not too much branching for compute-bound kernels (there is a point where CPUs become faster when there are too many branches, since, for example, on the Kepler architecture the neighbouring 192 cores are locked together for each branch).

The philosophy of today's GPU architecture is basically quite simple: maximize memory throughput by using the fastest RAM that's still cheap enough for consumers, then maximize die space for the ALUs by letting bundles of them share a scheduler, register blocks and cache. I was very skeptical about this at first too, but in my experience it has proven quite effective: even parallel algorithms that are not ideal for this architecture still profit from the raw power, and they continue getting benefits when you buy new cards, in a fashion that's much closer to Moore's law than how CPUs develop.

The architecture certainly isn't ideal, and its shortcomings would be addressed by an architecture like the one in your link (to which Parallella also comes quite close, btw), and I can well imagine that this is where we're heading given another 5-10 years (see Parallella, and to some extent Knights Landing). However, it's also feasible that the GPU's ALU-maximisation game will win out, especially once 3D-stacked memory comes into play.

Since 2008 there have been many papers about NNs implemented on GPUs, and I'd love to know what the current status is there, especially compared to the very powerful Power8 architecture.


While the efficiency gains are nice and definitely welcome, it would be interesting to see what the performance gains are over a GPU. The article makes the chip sound somehow superior to existing implementations but really this is just running the same neural network algorithms we know and love on top of a more optimized hardware architecture.

Meaning I have no idea how this signals the beginning of a new era of more intelligent computers as the chip provides nothing to advance the state of the art on this front. Unless I am missing something?


I think the more optimized hardware makes a big difference, mainly in power consumption. We're talking power savings of like 99.9 percent here, which makes embedded stuff way more powerful and reduces the need for calling out to the cloud for processing, thin-client device OSes, etc.


I too would like to see a decent comparison. Also what's ultimately going to matter is up-front cost per synapse and ongoing cost per synapse-second (from power consumption). That's really all that matters if you are planning to make a cluster out of them.

Of course, there could be some new and interesting uses in embedded devices where sheer throughput doesn't matter so much as total power usage, for moderate processing power. For example, the AI in Roomba and similar robot vacuums is pretty rudimentary, so appliances like that could maybe get a boost from this.


From the article: "When running the traffic video recognition demo, it consumed just 63 milliwatts of power. Server chips with similar numbers of transistors consume tens of watts of power" and "laptop that had been programmed to do the same task processed the footage 100 times slower than real time, and it consumed 100,000 times as much power as the IBM chip". So if those statements are true, I would say it is about 10,000 to 100,000+ times more energy efficient. That is a rather large claim, so we would need to see more proof...
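
A rough sketch of the arithmetic that makes the 100,000x figure plausible as energy per task (the laptop wattage below is an assumption, not from the article):

    chip_watts = 0.063      # 63 milliwatts, per the article
    laptop_watts = 60.0     # assumed ballpark for a laptop under load
    slowdown = 100          # article: the laptop ran the task 100x slower

    power_ratio = laptop_watts / chip_watts      # ~950x less power
    energy_per_task = power_ratio * slowdown     # ~95,000x less energy per task
    print(round(power_ratio), round(energy_per_task))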


A difference is that a GPU uses a lot of power and takes up a lot of space. I can imagine an optimized, energy-efficient chip would be useful in embedded systems. Something like a Raspberry Pi for image processing maybe?


Like Tegra K1? GPUs are more energy efficient than normal CPUs for some tasks, so getting lower absolute power consumption is just a matter of using fewer cores.


I think you're thinking of a Graphics Card, a GPU is comparable in physical dimensions to a CPU.


I wonder what the possibilities are for adding a neuromorphic chip to a normal stack for specialized tasks such as image/video recognition (CPU, GPU, NPU). Compared to CPUs, GPUs are very similar in their need for specialized code.

Just an uneducated wild-thought.


This is something I'm interested in discovering as well. I view most of these developments as modular components that could be used in conjunction with existing processor pipelines. For instance, with these 'neural' chips, I could imagine an existing processor querying the neural chip to look for particular activation patterns. Though I'm not too sure on the language one would use to specify which patterns to look for... Perhaps you could extract the parameters from the neural chip itself through a learning process, which you'd then use to bootstrap the process a bit and know what to look for? I'd imagine a lot of formal research is still needed here.

Neat developments, excited to see how they shake out.


One possibility is to use the neuromorphic chips as souped-up branch predictors -- instead of predicting one bit, as in a branch predictor, predict all bits relevant for speculative execution. This can effect large-scale automatic parallelization.

See this paper at ASPLOS '14 for details:

http://hips.seas.harvard.edu/content/asc-automatically-scala...


That definitely seems plausible. From what I've seen, there's a lot of work at the moment on writing languages and toolkits to automatically target heterogeneous platforms, which this could be slotted into nicely.


The interesting thing about this project is that they're using transistors to physically simulate synapses and neurons, which is quite an inefficient method. Transistors are expensive, and your brain has about 100 billion neurons, and trillions of synapses.

A recent discovery by Leon Chua has shown that synapses and neurons can be directly replicated using memristors [1]. Memristors are passive devices which may be much simpler to build at the scale of neurons compared to transistors.

1. http://iopscience.iop.org/0022-3727/46/9/093001/


Actually this chip is not _that_ far off. It has 5 billion transistors, so with a process shrink and a board with 10-20 of these chips it should be roughly equivalent to the number of neurons in a human brain. Now think of your home with about 100 of these things connected to your network and your house will be pretty darn smart!


You are quite off here. A transistor is simply not as powerful as a neuron. The article itself notes that the chip is capable of simulating "just over one million 'neurons'".
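
To put rough numbers on that gap (neuron counts only; this ignores synapse counts and connectivity entirely):

    neurons_per_chip = 1e6    # "just over one million", per the article
    neurons_in_brain = 1e11   # ~100 billion, as cited upthread
    print(neurons_in_brain / neurons_per_chip)  # ~100,000 chips just to match neuron count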


Lots of problems with the way this is presented in the article. Though the chip is patterned after a naive model of the human brain, the headline assertion is far too bold. Additionally, while the Von Neumann architecture can be characterized as bottlenecked and inefficient, it has also allowed for extremely cheap computing. A processor with all of its memory on the chip would not be inexpensive. Note this article never mentions the cost of the chip nor its memory capacity.

The comparison of this chip's performance with that of a nearby traditionally-chipped laptop is questionable. A couple of paragraphs later it says that the chip is programmed using a simulator that runs on a traditional PC. So I'm guessing the 100x slowdown is because the traditional PC is simulating the neural-net hardware, rather than using optimized software of its own.

Yes, this is important research, but engineer-speak piped through hype journalists will always paint an entirely unrealistic and overoptimistic picture of what's really going on.


What percentage of readers know you could fill a football stadium with these chips and, for many tasks, it wouldn't come close to a human brain with today's knowledge of software? I love news like this; it just feels like analogies using brains are easy to overhype.


I think work like this is very important. In the 1940s you could fill a football stadium with about 50 ENIAC computers and you wouldn't have 1/1000th the processing power of an iPhone. Your statement gives useful perspective in one direction, but exponential improvement cannot be ignored. There can't be any doubt that neuromorphic chips have a lot of wiggle room to explode in capability in the coming decades.


> have a lot of wiggle room to explode in capability in the coming decades.

Are you sure about that? CPU speeds have not improved in years. We appear to have hit a maximum, at least for now. (Of course I can't predict the future, but it's been years now and no change.)


Clock Frequency != Speed.

Otherwise we would still be using (very cheap) Pentium 4s.

In a way, it's a testament to human ingenuity that CPUs have kept improving the way they have when the brute-force way of increasing performance was not as viable as before.


Well, we know for a fact that a carefully designed system about 5cm wide and 5cm tall can have amazing capacity. Now, precisely because CPU speeds are stagnant, we're way overdue in exploring new architectural paths towards this capacity. Although it's hard to tell if it's even achievable in silicon.

Personally I believe digital computation has only a niche applicability in the limit, the degrees of freedom from analog processing are just so much higher, even in the presence of noise.


FLOPS/$ has not plateaued, that I'm aware of.


I think even fewer readers understand that we really, really have very few clues about how the overall data processing of the brain happens.

Indeed, I could even claim that, given how little we know, the actual "real processing" happening in the brain may wind up being much less than it seems. But yes, it appears that whatever the brain does is fantabulously more complex than any chip that's even being sketched today.

What this looks like is a chip that does some canned machine learning routines. It seems sad to have to hype a parallel chip of this sort this way. But it would be sad if the chip itself is hard coded for just whatever fake-brain computations its creators thought were right (I've scanned several pages deep for real information on the chip, but it all comes back hype and more hype). The thing is, it's actually possible to build a more general kind of parallel chip - a cellular automaton on a chip such as Micron is doing, see: http://www.micron.com/about/innovations/automata-processing.

Also, the Wikipedia page gives the impression this is mostly an exercise in seeing if they can scale a chip to neural scale. http://en.wikipedia.org/wiki/SyNAPSE


I totally agree. Neural networks, and neuromorphic computing/hardware neural networks are fascinating topics, and it's great to see a new interest in them. However, the big issue is that overhype is what caused the lull in neural network research until back propagation (~1986), and then again after that until deep learning (2006).

Thus, while the topics are fascinating and these appear to be impressive strides, journalists need to be careful not to hyperbolize.


Although IBM's hardware implementation does not support the current hotness in neural models, I still think that this is a big deal, both for applications of the current chip and for future improvements: even lower energy requirements and smaller, denser chips.

I was on a DARPA neural network tools advisory panel for a year in the 1980s, developed two commercial neural network products, and used them in several interesting applications. I more or less left the field in the 1990s but I did take Hinton's Coursera class two years ago and it is fun to keep up.


Anyone got something more technical? I googled a bit and I can't seem to find anything beyond marketoid handwaving


Look for papers by Carver Mead from Caltech in the '80s; these are all based on those concepts, I think.


> Anyone got something more technical?

Well, I found this little sound bite from a link in the original article. I didn't find it particularly original.

From [0]: > “Programs” are written using special blueprints called corelets. Each corelet specifies the basic functioning of a network of neurosynaptic cores. Individual corelets can be linked into more and more complex structures—nested, Modha says, “like Russian dolls.”

The term 'Russian doll' evoked recursion (and distant memories of my late grandfather), a concept that was very common even 50-odd years ago.

[0] http://www.technologyreview.com/news/517876/ibm-scientists-s...


The published paper is in Science, here: http://www.sciencemag.org/content/345/6197/668


This has a bit more detail: http://www.darpa.mil/NewsEvents/Releases/2014/08/07.aspx also links to Science.


Wonder if they are then going into direct competition with Qualcomm and Samsung; all these companies have quite active neuromorphic chip research groups going.


They did it using a Samsung die manufacturing process if I'm not mistaken.



If you are a scientist, here is the Epistemio page for rating and reviewing the scientific publication discussed here: http://www.epistemio.com/p/AJ09k7Yx


Is it a sort of general-purpose neural-network hardware?


'IBM Chip Processes Data Similar to the Way Your Brain Does'

Interesting, I did not know that we already know how the brain 'processes data'.


They are referring to the fact it uses a connectionist architecture rather than a von Neumann one.

https://en.wikipedia.org/wiki/Connectionism


That says almost nothing. The brain uses various forms of neural networks whose data processing we understand relatively little. The IBM chip is at best somehow 'inspired' by the brain. It's far from working like it.


More specifically, it's also a spiking neural network. You could probably program it to efficiently run algorithms very similar to those of human neurons.


Interesting, I did not know that you did not know that.


Heck, neither did I!


vonsydov's link is not dead, and I don't know why his comment was downvoted or why I can't reply to it. There's nothing wrong with his link, though this one may have been slightly better:

http://dx.doi.org/10.1126/science.1254642

More broadly, I don't understand why HN seems to prefer press pieces (so often containing more inaccuracies than useful information) to the papers on which they're based.

In this case, even if you can't access the full text, the single-paragraph abstract contains all of the new information in the 12-paragraph Tech Review story.


vonsydov's account has been hellbanned for 52 days.



