
Being up front: this is what I work on for IBM Systems. A buddy wrote this blog (https://www.ibm.com/developerworks/community/blogs/fe313521-...) with a little more info.

What we have is an IO offload accelerator that knows how to drive high bandwidth IOs to some external storage device. A user app doesn't interact with the device - they make shared library calls to read or write data from a particular buffer, and the accelerator (because it's cache coherent) can read / write from the virtual address space of the user space program to satisfy the request as needed. This means that the IOs bypass the entire OS driver stack, since everything is a shared library call from user space.
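For anyone trying to picture that flow, here is a toy model in Python. It is only a sketch: `ToyAccelerator`, `read_block`, `STORAGE`, and `BLOCK_SIZE` are hypothetical names made up for illustration, not the capiflash API, and a Python thread stands in for the cache-coherent hardware. The point it shows is that the caller hands over a buffer in its own address space and the "device" fills that buffer in place.

```python
import queue
import threading

# Hypothetical stand-in for the external storage device: two 4 KiB blocks.
BLOCK_SIZE = 4096
STORAGE = {0: b"A" * BLOCK_SIZE, 1: b"B" * BLOCK_SIZE}


class ToyAccelerator:
    """Toy model: services requests by writing straight into caller buffers."""

    def __init__(self):
        self.requests = queue.Queue()
        # A daemon thread plays the role of the accelerator hardware.
        self.worker = threading.Thread(target=self._run, daemon=True)
        self.worker.start()

    def _run(self):
        while True:
            block, view, done = self.requests.get()
            # Fill the caller's memory in place through the shared view.
            view[:BLOCK_SIZE] = STORAGE[block]
            done.set()

    def read_block(self, block, buf):
        """The 'shared library call': queue a request, wait for completion."""
        done = threading.Event()
        self.requests.put((block, memoryview(buf), done))
        done.wait()


accel = ToyAccelerator()
buf = bytearray(BLOCK_SIZE)   # buffer owned by the "user app"
accel.read_block(1, buf)      # the accelerator writes into buf directly
print(buf[:4])
```

In the real system the request would go to hardware rather than a thread, so treat this purely as a picture of who owns the buffer and who writes into it.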

So yep! That exists. :-) There are other classes of accelerators out there too (and more coming in the future). Adding functions like compression or some form of indexing or search is stuff that we've talked about.

(edit) - https://github.com/open-power/capiflash has the code for the shared libs, the APIs, and some examples.




That is really interesting, but I see this as a short-term solution to a new and amazing world. What we are doing is trying to hammer something with the potential to change most of CS into the shape of our current reality, which is understandable given the commercial nature of these solutions.

But the CS community should think about this with a fresh point of view, maybe get back to the origins and start over with this kind of technology. Or maybe we do this already and I just do not know?

For myself, since I left university, I've always thought about how things would be different if we didn't have disk+RAM, but one storage that replaced the two with the best of each (top speed and large, cheap capacity) - extrapolate this to a kind of SoC with 40+ cores and 40TB+ LD1 cache - and I tried to imagine what would be needed in terms of a new OS made from scratch for this thing. This still keeps me thinking about new designs, new algorithms, etc. Sadly, I've never tried or even theorized anything interesting apart from entirely killing the file-system concept and having _always loaded applications_ running (equivalent to processes) or suspended (equivalent to app binary files)... :)


Even more interesting with NUMA: imagine 10K slow/cheap cores, each with their own non-shared bit of NVMe (~10MB would do) for their heaps to live on. Perfect for running Erlang.

There might not even be a point in a "classical" CPU cache hierarchy in such a system, if the NVMe is fast enough, and has its own "internal" writeback cache (e.g. some volatile battery-backed memory) protecting it, so that cycling a bit at 3GHz doesn't burn it out. At that point you may as well say you have a CPU with ten million nonvolatile registers.


NVM as fast as registers: that is a bold ~1 cycle per read/write! But it would be really awesome.


Interesting!

Incidentally (since this may be somewhat related), I'm wondering, what are your thoughts on the Persistent Memory Manager approach, as in the following:

Justin Meza, Yixin Luo, Samira Khan, Jishen Zhao, Yuan Xie, and Onur Mutlu: "A Case for Efficient Hardware/Software Cooperative Management of Storage and Memory." Workshop on Energy-Efficient Design, 2013.

Context: "emerging high-performance NVM technologies enable a renewed focus on the unification of storage and memory: a hardware-accelerated single-level store, or persistent memory, which exposes a large, persistent virtual address space supported by hardware-accelerated management of heterogeneous storage and memory devices. The implications of such an interface for system efficiency are immense: A persistent memory can provide a unified load/store-like interface to access all data in a system without the overhead of software-managed metadata storage and retrieval and with hardware-assisted data persistence guarantees."

The stated goals/benefits include eliminating operating system calls for file operations, eliminating file system operations, and efficient data mapping.

Paper: http://justinmeza.com/bin/meza_weed13.pdf

Presentation: https://users.ece.cmu.edu/~omutlu/pub/mutlu_weed13_talk.pdf


One giant flat address space is not the answer. Hardware people tend to come up with approaches like that because flat address spaces and caching are well understood hardware. It's the same thinking that leads to "storing into device registers" as an approach to I/O control, even when the interface is really packets over a serial cable as in FireWire or USB or PCI Express.

File systems and databases are useful abstractions, from an ease of use, security, and robustness perspective. The challenge is to make them go faster. Pushing the machinery behind them out to special-purpose hardware can do that.

The straightforward thing to do first is to take some FPGA part and use it to implement a large key/value store using non-volatile solid state memory. That's been done at Stanford[1], Berkeley[2], and MIT[3], and was suggested on YC about six years ago.[4] One could go further, and implement more of an SQL database back end. It's an interesting data structure problem; the optimal data structures are different when you don't have to wait for disk rotation, but do need persistence and reliability.

[1] http://csl.stanford.edu/~christos/publications/2014.hwkvs.nv... [2] https://www.cs.berkeley.edu/~kubitron/courses/cs262a-F14/pro... [3] https://dspace.mit.edu/handle/1721.1/91829 [4] https://news.ycombinator.com/item?id=1628550
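To make the data structure point a bit more concrete, here is a minimal sketch of one shape such a store can take when seeks are cheap: an append-only log plus an in-memory index. This is an assumption chosen for illustration, not what the projects cited above implement, and `LogKV` and its methods are hypothetical names. It does not handle torn writes or compaction.

```python
import os
import struct
import tempfile


class LogKV:
    """Toy append-only key/value log with an in-memory index."""

    HEADER = struct.Struct("<II")  # key length, value length

    def __init__(self, path):
        self.index = {}                # key -> (value offset, value length)
        self.f = open(path, "a+b")     # create if missing; writes append
        self._rebuild_index()

    def _rebuild_index(self):
        # Scan the log from the start; later records shadow earlier ones.
        self.f.seek(0)
        offset = 0
        while True:
            header = self.f.read(self.HEADER.size)
            if len(header) < self.HEADER.size:
                break
            klen, vlen = self.HEADER.unpack(header)
            key = self.f.read(klen)
            value_offset = offset + self.HEADER.size + klen
            self.index[key] = (value_offset, vlen)
            offset = value_offset + vlen
            self.f.seek(offset)

    def put(self, key, value):
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()
        self.f.write(self.HEADER.pack(len(key), len(value)) + key + value)
        self.f.flush()
        os.fsync(self.f.fileno())      # persist before acknowledging
        self.index[key] = (offset + self.HEADER.size + len(key), len(value))

    def get(self, key):
        value_offset, vlen = self.index[key]
        self.f.seek(value_offset)      # random read: cheap without rotation
        return self.f.read(vlen)

    def close(self):
        self.f.close()


path = os.path.join(tempfile.mkdtemp(), "kv.log")
store = LogKV(path)
store.put(b"user:1", b"alice")
store.put(b"user:1", b"bob")           # newer record shadows the older one
store.close()

reopened = LogKV(path)                 # index is rebuilt by scanning the log
latest = reopened.get(b"user:1")
reopened.close()
print(latest)
```

Writes are sequential appends and reads are single random accesses, which is the trade that stops hurting once there is no disk rotation to wait for.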


OK I find it easier to follow these ideas when thinking about how loads/stores to volatile memory are organized. Memory is not accessed via a syscall. Instead the OS sets up some data structures in the MMU and lets the application run. Some kind of fault happens when control must be transferred back to the OS.
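A small way to see this from user space is a memory-mapped file. The sketch below uses Python's standard `mmap` module with a scratch file as a hypothetical stand-in for persistent memory: the mapping is set up once, and after that reads and writes are plain slice operations on mapped memory rather than a `read()`/`write()` call per access.

```python
import mmap
import os
import tempfile

# Scratch file standing in for a persistent region (an assumption for the demo).
fd, path = tempfile.mkstemp()
os.ftruncate(fd, mmap.PAGESIZE)

# One-time setup: the OS installs the mapping for this process.
mm = mmap.mmap(fd, mmap.PAGESIZE)

# "Store" and "load" are now slice operations on the mapped memory.
mm[0:5] = b"hello"
loaded = bytes(mm[0:5])

# Ask the OS to write dirty pages back, then drop the mapping.
mm.flush()
mm.close()

# The same bytes are visible through the ordinary file interface.
with os.fdopen(fd, "rb") as f:
    on_disk = f.read(5)
os.remove(path)

print(loaded, on_disk)
```

The OS still gets involved on page faults and on the explicit `flush()`, which matches the description above: it sets things up, then steps back until something needs its attention.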

Going back to non-volatile memory, the question is what kind of abstraction should be implemented in hardware? Presumably something simple that the OS and applications can then use to implement higher level abstractions like file systems and databases. Pushing parts of a SQL database engine into the hardware does not intuitively seem like the right solution.


Thanks, that is something that I was looking for! Nice to see it really happening!


This is pretty interesting.

It's possible that people might confuse this bit...

> The performance of SCMs means that systems must no longer "hide" them via caching and data reduction in order to achieve high throughput.

...in the original article with your mention of caching; by "cache coherency," I assume you mean that your addon (card?) can introspect into the CPU cache? That's pretty awesome if that's what's happening.

Some hopefully relevant questions from someone totally unfamiliar with this particular area:

- The original article mentioned "RAM emulation" (to put it crudely) as "unstable." Do you have any comment on this?

- Do you happen to have any performance figures you can release?

- From the blog article and video I get the idea that this is POWER-specific. :) Are you aware of any alternative offerings for x86 that offer similar performance?

- What does this thing (I have no idea if it's a card, a module...) look like? Being able to see "the thing" is generally really cool :)

My last question about POWER8 in general is arguably both on- and off-topic and might be a question for a different team, but do you know...

a) if/when POWER8 will manage to escape from the datacenter and become accessible to developers in the hobbyist/student sector? My understanding is that the architecture as it stands at the moment requires lots of different components that unavoidably require a lot of space; are you aware of any scaling-down efforts to produce (even (E)ATX-sized) POWER8 SBCs people can play with?

b) if/when full-scale POWER8 systems will be available in the style of Heroku/OpenShift, both of which have free tiers that allow for entry-level poking? I understand that RunAbove provided something along these lines with (1-?) POWER system(s), but that dried up some time ago, and I'm not aware of any replacements.

All in all, this Flash system looks pretty cool, and I can definitely say I wouldn't mind being a fly on the wall for a day in your office, what with getting to play with 40TB of Flash (SSDs...?) - wow. :D


> I assume you're referring that your addon (card?) can introspect into the CPU cache? That's pretty awesome if that's what's happening.

Y - see page 5, section 3.1.1 of http://www-304.ibm.com/webapp/set2/sas/f/capi/CAPI_POWER8.pd... for some info. Also - it doesn't have to be a card. ;-) It just is that today...

> RAM Emulation

This is hard. Telling a program (and the OS) that different pages are fundamentally different will require some pretty drastic changes. For example, how does one malloc memory from an NVDIMM vs a regular DIMM, and differentiate between the two?

> Performance

Yes - as an example we can show that it takes ~26 threads on the CPU to drive ~450k IOPs to some external storage. Doing the same thing with the accelerated IO path requires about 4 HW threads on the main CPU. This kind of lines up with the point of the article.
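Spelling out the arithmetic behind those numbers (only the ~450k IOPS, ~26 threads, and ~4 threads come from the figures above; the variable names are just for illustration and the results are as approximate as the inputs):

```python
# Approximate figures quoted above.
TARGET_IOPS = 450_000
THREADS_SOFTWARE_PATH = 26      # CPU threads through the normal IO stack
THREADS_ACCELERATED_PATH = 4    # HW threads with the accelerated IO path

iops_per_thread_software = TARGET_IOPS / THREADS_SOFTWARE_PATH
iops_per_thread_accelerated = TARGET_IOPS / THREADS_ACCELERATED_PATH
thread_ratio = THREADS_SOFTWARE_PATH / THREADS_ACCELERATED_PATH
threads_freed = THREADS_SOFTWARE_PATH - THREADS_ACCELERATED_PATH

print(f"software path:    ~{iops_per_thread_software:,.0f} IOPS per thread")
print(f"accelerated path: ~{iops_per_thread_accelerated:,.0f} IOPS per thread")
print(f"thread ratio:     {thread_ratio:.1f}x, {threads_freed} threads freed")
```

So each CPU thread drives roughly 6.5x more IOs on the accelerated path, and about 22 threads go back to the application.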

> Are you aware of any alternative offerings for x86 that offer similar performance?

To my knowledge no one else has a similar architecture that's shipping today.

>- What does this thing (I have no idea if it's a card, a module...) look like? Being able to see "the thing" is generally really cool :)

http://www.nallatech.com/solutions/openpower-capi-developer-... or http://www.alpha-data.com/dcp/capi.php are your choices for Altera or Xilinx FPGA support (as of today).

> a) re: ATX

see http://www.enterprisetech.com/2014/10/08/tyan-ships-first-no... from last year. Go talk to Tyan if you want to buy one.

> b) if/when full-scale POWER8 systems will be available in the style of Heroku/OpenShift.

https://ptopenlab.com/cloudlabconsole/index.html has some boxes with CAPI cards...


Wow, thanks for taking the time to respond! :)

And now I get it: RAM emulation is not 100% stable due to the fact that application architecture simply isn't optimized at all to handle the interfaces yet, as opposed to flaky hardware (my initial arguably logical assumption). The article could have made that a little plainer, thanks for clearing that up.

And thanks for dropping those performance figures; if there was an ELI5-sized soundbite explaining the rationale behind this card, that would be it.

( http://reddit.com/r/ExplainLikeImFive (ELI5) explains complex subjects using accessible, respectful simplifications. If I may say so, your explanation fits precisely into that category. :P)

It's sad there's nothing like this for x86, but I wouldn't be too surprised if it were deemed too difficult to support an I/O path as performant as this without uncomfortable architectural changes. On that note, POWER8 is still at the point where it has the chance to lock in a future-proof architectural design, and hopefully it takes full advantage of that.

My mention of ATX was simply a reference to "it doesn't need to be tiny or cool, it just needs to exist," but it appears the board you linked is currently the only product with any sort of vague open market presence. I definitely look forward to more accessible POWER architecture products in the future. :D

Finally, thanks heaps for the PTOpenLab link! I'm still figuring out their points system and how that translates to daily usage allowance, but this looks incredibly cool. It's places like this that are laying the groundwork :)


Thanks for the links! ptopenlab seems very interesting! Question, just for curiosity, how much does it cost to have something like this?


These are my own hazy opinions, but I suspect it starts at "Wallet vaporizes from shock" and goes up from there.

I remember reading about how old IBM mainframes used to have a couple ThinkPads (literally two, for redundancy) bolted just inside the cabinet door, just to change low-level configuration settings. It seems to me that these POWER8 boxen are aimed toward that end of the market.

YouTube's history interface is terrible, but I managed to dig this out - https://www.youtube.com/watch?v=jOzPTopt7HE - which shows the different discrete components in a POWER8 system and how they're put together. I should probably do a bit more research on this, that video is quite basic (and 2 years old now).

POWER8 systems seem to necessarily take up a lot of space, and not only does this contribute to the raw material cost, it's also a factor in renting, considering that you can pack a basic but decent punch with 1U or 2U of x86. My guess is that IBM isn't trying to be competitive here, but is aiming for a specific market. That'll influence the price too.

It's thanks to market factors and the state of education (which sometimes produces wins like these!) that places like PTOpenLab exist, I think (again, this is an [un]educated guess), and I'm super appreciative that they do. I haven't figured out how the "blue points" system works yet though (you get 500, and use 10/day for running a VM); I can at least say that the number doesn't increase each day. I vaguely recall reading something to the effect of creating HDD images for the platform would give you points based on how many other people downloaded them (there's somewhere you can upload to), but I can't find that documentation now.

Another fun tidbit: the dashboard UI is based on SmartAdmin (a premium jQuery plugin, apparently), which comes with Chrome-compatible voice control (note the mic button at the top-right). The voice command list doesn't show because of a 404, but you can find the list in app.config.js (F12 -> Network -> reload page) - scroll to the pile of "show"s. Useless, and horribly flaky, but extremely cool. :D


> "wallet vaporizes"

The SuperVessel lab's free, and there are several resources to rent time on a P8 VM if SuperVessel isn't appropriate.

Also, the video you found shows the E8xx product line, which is at the high end of the enterprise / scale-up range, and (incidentally) different from mainframes.

Here are some links about the 2U / 2-socket boxes if the space each node takes is important:

S822LC - https://www.youtube.com/watch?v=OdlLszagnos

S822L - https://www.youtube.com/watch?v=xF_fw_NJ5nI (not IBM's, but a reasonable unboxing video)

And if you're interested in videos, check out the IBM Power Systems youtube channel: https://www.youtube.com/user/ibmpowersystems/videos


Oh, TIL; I had no idea it was actually free. I understood that each user got 500 points and that you use 10/day, which gives you 50 days of usage... aaand then I'm not sure. I'm not dissing it, I just don't understand (and there's zero documentation).

And thanks for the video links! I'll definitely check out the YouTube channel.

PS. I'm getting multiple errors in the dashboard when I try to switch to the NewYork1 zone. Where would be a good spot to mention this?




