Hacker News
Hybrid Memory Cube receives its finished spec, promises up to 320GB per second (engadget.com)
28 points by ozantunca on April 3, 2013 | 22 comments



One should note that we already get 250GB/s peak from the GDDR5 used in NVIDIA's Tesla K20X, and Intel claims 320GB/s peak for the fastest MIC, so what this article claims is not that new. From experience with Tesla, you can usually expect about 70% of the peak bandwidth (with conventional x86 codebases on Intel MIC it tends to be less, but that's second-hand knowledge).
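As a back-of-envelope check on those figures (the 70% sustained fraction is the commenter's rule of thumb, not a published number):

```python
# Rough peak-vs-sustained bandwidth comparison. The 70% efficiency
# figure is the rule of thumb from the comment, not a measured value.
def sustained_gb_s(peak_gb_s, efficiency):
    """Estimate sustained bandwidth as a fraction of peak."""
    return peak_gb_s * efficiency

k20x_peak = 250  # GB/s, GDDR5 peak on NVIDIA Tesla K20X (per the comment)
hmc_peak = 320   # GB/s, claimed HMC peak

print(sustained_gb_s(k20x_peak, 0.70))  # ~175 GB/s expected in practice
print(sustained_gb_s(hmc_peak, 0.70))   # ~224 GB/s, if the same ratio held
```

So even if HMC only manages a GPU-like sustained fraction, it would still beat today's K20X peak in practice.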


The problem is that you need to transfer data out to use CUDA. I'm assuming that with HMC the CPU can access data at the promised 320GB/s?


That's true, yes, which is why the HPC community is moving towards keeping / leaving everything in GPU memory (except for the odd monitoring peek, initialization and gathering of output, of course). GPU architectures nowadays are general purpose enough that you can make this work - the tooling is still a work in progress though[1]. There's even some work going on for using GPUs for database purposes.

When it comes to bandwidth, the newest generation Teslas / GeForces tend to have a factor of 5 more than the latest Xeon running on all cores. I've talked to people at IBM who were able to get this speedup for random access as well - so it usually scales at 5x, no matter whether you look at peak, sustained or random access bandwidth.

[1] My current project is actually mainly done for this reason: https://github.com/muellermichel/Hybrid-Fortran


Thanks for the insight - interesting project. Would you mind sharing some links about GPUs in / as databases? I'm very interested in the idea, yet usually find only marketing babble ("yes, we're working on it, it's promising" kind of thing). I'd love to dive into some research papers on the topic.


I'm not an expert when it comes to GPGPU for DBs (heck, I'd like to even find one ;) ), but here's what I could come up with.

This is a presentation from a member of the IBM Almaden research group I've been talking to at GTC. I can't find much else online - their work seems to be unpublished - but maybe if you ask nicely they could give you a few pointers: http://cis565-spring-2012.github.com/lectures/02-13-Search-T...

http://www.cs.virginia.edu/~skadron/Papers/bakkum_sqlite_tr....

http://gpuscience.com/software/postgresql-gpu-pgstrom/

Edit: Tbh. this still seems to be a very young field of research - which is good if you're a graduate student or a member of a private research institute, but as a startup I'd think twice before spending a few man-years making this work.


That's 250 GigaBit/s, this is 320 GigaByte/s. No?


From the GDDR5 Wikipedia entry (http://en.wikipedia.org/wiki/GDDR5):

The newly developed GDDR5 is the fastest and highest density graphics memory available in the market. It operates at 7 GHz effective clock-speed and processes up to 28 GB/s with a 32-bit I/O.[4] 2 Gbit GDDR5 memory chips will enable graphics cards with 2 GiB or more of onboard memory with 224 GB/s or higher peak bandwidth.

So, no, it seems that's bytes.


Wow. That does seem to require a crazy 512-bit-wide bus though. The info I can find on HMC suggests it achieves these speeds on a 16-bit bus, and with much lower power requirements than even DDR3.


Ok, now that's new to me. If you can get 320GB/s on a measly 16-bit bus, then it's really feasible to have this as your main memory. On the other hand, imagine what you could do with this memory on a GPU - you'd probably get over 1TB/s there. Actually, that's something NVIDIA already has on their roadmap, as far as I remember seeing at GTC.


In the spec they talk about 8 links, each 16 bits wide (or "lanes" in their jargon), which comes out to 128 bits (or 256 bits - in some configurations input and output bits are on different pins).


It's gigabytes, not gigabits.


From the spec:

• External interface is multiple 10-15 Gbps SerDes links, each with 16 full duplex lanes. (The 320GB/s number comes from 8 links at 10Gbps; the 4 link device runs a higher clock rate and reaches 240GB/s.)

• Internal ECC for memory, packet based interface with CRC and retry.

• Built in self test, there can be spare resources which allow it to replace failed sections.

Envision a city on a grid filled with skyscrapers. The ground floor of each skyscraper is the logic, called a "vault controller"; each floor above is DRAM storage. The city is constructed by laminating chips, one per layer, and the skyscrapers form their connections vertically through the chips.

There is a switching fabric that connects N serial links to M vault controllers.

• 16 vaults in the 4 link version, 4GB. 32 vaults in the 8 link version, 8GB.

• A single vault controller can service many serial links simultaneously, and it can prioritize. Within a single link, requests always happen in order.

• There is a router system which allows up to 8 cubes to share the same host link, to increase storage per host link. Link length is limited, and power demands are higher for longer links; I think the router will allow shorter links to be used, especially in multi-cube modules.

• Atomic bit write and atomic add transactions. New options for the lock free algorithm folk.

• 31mm^2 BGA, 4mm tall, for the 4 link device. About 900 pins: about half are grounds, a quarter of the remainder are powers, the rest signals.

• 7 different power supplies at 4 different voltages required. Get to work board designers!

• READs and WRITEs are from 16 to 128 bytes wide.

• The 4 link device can have up to 4GB, the 8 link up to 8GB. (This seems small to me, but I suppose it comes down to storage/bandwidth balancing, and you can have 8 devices on the same link.) Oh, they see the problem too: they are considering using the currently ignored low-order bits of blocks to expand the addressing, and there are two bits reserved just above the address. Quick, someone get the time machine and take them to visit the IDE disk block addressing planners.

• The refresh logic checks ECC and rewrites if a soft error is found. Take that cosmic rays!
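The headline bandwidth numbers check out arithmetically from the link figures above; a quick sketch (counting full duplex traffic in both directions, as the aggregate figures do):

```python
# HMC aggregate link bandwidth: links x lanes x lane rate, full duplex.
# Lane counts and rates come from the spec summary above; "full duplex"
# means each lane carries the stated rate in both directions at once.
def hmc_bandwidth_gb_s(links, lanes_per_link, gbps_per_lane):
    """Aggregate bidirectional bandwidth in GB/s."""
    gbits_per_s = links * lanes_per_link * gbps_per_lane * 2  # both directions
    return gbits_per_s / 8  # Gbit/s -> GByte/s

print(hmc_bandwidth_gb_s(8, 16, 10))  # 8-link device at 10 Gbps -> 320.0
print(hmc_bandwidth_gb_s(4, 16, 15))  # 4-link device at 15 Gbps -> 240.0
```

Which matches the 320GB/s and 240GB/s figures in the bullets above exactly.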


Blogspam is annoying. We're supposed to submit original sources whenever possible (according to the HN guidelines).

Original hybridmemorycube.org press release:

https://news.ycombinator.com/item?id=5485833

Original computerworld.com article:

https://news.ycombinator.com/item?id=5485823


But what's the latency? I'm inclined to point out that a truck full of tapes hurtling down a freeway also has 'high bandwidth'.
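The truck analogy can be made concrete; a sketch with illustrative numbers (tape capacity, truck load, and trip time are all assumptions, not data from the article):

```python
# "Never underestimate the bandwidth of a truck full of tapes":
# the bandwidth is enormous, but the latency is the whole trip.
tape_capacity_tb = 2.5   # assumed LTO-class tape capacity, TB each
tapes_in_truck = 10_000  # assumed truck load
trip_hours = 24          # assumed cross-country drive

total_bytes = tape_capacity_tb * 1e12 * tapes_in_truck
latency_s = trip_hours * 3600
bandwidth_gb_s = total_bytes / latency_s / 1e9

print(f"bandwidth ~{bandwidth_gb_s:.0f} GB/s, latency {latency_s} s")
# ~289 GB/s of "bandwidth", at 86400 seconds of latency
```

Comparable raw bandwidth to HMC, with latency roughly fourteen orders of magnitude worse - which is the point of the question.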


I tried searching the spec itself (http://hybridmemorycube.org/files/SiteDownloads/HMC_Specific...) but it doesn't seem to contain any specifications about the latency.

There's lots of talk about latency minimization, but it seems this is basically a packet-oriented interface (with CRC on packets, retries and such), so I'd guess latency will be higher than with today's DDR interfaces.

Perhaps computer systems will have both DDR memory and HMC, letting the OS and/or applications decide how to distribute access for maximum performance.
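The packet-overhead intuition can be sketched with a toy model: every transfer pays serialization plus a CRC check, and a corrupted packet pays the whole trip again (this is a generic model of a packetized link, not the actual HMC protocol):

```python
# Toy latency model for a packetized memory link with CRC and retry.
# All numbers are illustrative assumptions, not HMC spec values.
def packet_latency_ns(payload_bytes, link_gb_s, crc_check_ns, retries=0):
    """One transfer: serialization time plus CRC check, repeated per retry."""
    serialize_ns = payload_bytes / link_gb_s  # bytes / (GB/s) gives ns
    one_trip_ns = serialize_ns + crc_check_ns
    return one_trip_ns * (1 + retries)  # a retry repeats the whole trip

# A 128-byte read response over a 20 GB/s link with a 5 ns CRC check:
print(packet_latency_ns(128, 20, 5))             # clean transfer
print(packet_latency_ns(128, 20, 5, retries=1))  # one CRC failure doubles it
```

The takeaway is that CRC/retry machinery adds a fixed per-packet cost that a raw DDR bus doesn't pay, which supports the guess above - even though it buys reliability in return.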


I haven't read the 1.0 spec yet, but if the marketing in their FAQ is to be believed, it "will provide a substantial system latency reduction":

http://hybridmemorycube.org/faq.html


The 1.0 spec is at:

hybridmemorycube.org/files/SiteDownloads/HMC_Specification_1_0.pdf


This is fun stuff. That it can achieve these bandwidths without requiring an extra-wide bus is also pretty impressive. Running a quad SerDes into an FPGA or 64-bit CPU should be pretty straightforward (as opposed to an unwieldy 256-bit-wide GDDR5 bus). I do wonder what the power dissipation is like though. The 10Gbit SerDes ports on my switches get pretty warm (there is a XAUI PHY connected to the 10Gbit ports); having 8 of them sitting under a chip seems like a recipe for a hotplate.


I must say I had never heard of this before and it looks like one of those "too good to be true" things. I'll believe it when it's a consumer product.


Wow, they just skip all the pretense and add a "show full pr-text" button.


I think that's meant to mean "press release text", but my first thought was definitely "public relations text".


Printer view maybe?



