Non-volatile Storage: CPUs no longer more performant than I/O devices

Animats · on Jan 6, 2016

I've made that point before on YC.[1] We need to view fast storage as something other than a disk accessed through the OS, and other than slow RAM accessed as raw memory. Access through the OS is too slow, and access as raw memory is too risky. What's probably needed is something like a GPU sitting between the CPU and the fast persistent storage. Call this an SPU, or "storage processing unit."

What would such a device do? Manage indices, do data transformations, and protect data. Database-type indices would be maintained by the SPU, so applications couldn't mess up the database structure. The SPU would manage locking, so that many non-conflicting requests could be serviced simultaneously. The SPU would have tools for doing searches. Regular expression hardware (this exists) would be useful. Record protection management (app can read/write part but not all of a record) would allow implementation of database type data access rules. Encryption and compression might be provided in the SPU.

There have been smart disk controllers before, but they haven't been that useful, since they couldn't make the disk go any faster. Now, it's time to look at that layer again. Some of the technology can be borrowed from GPUs, but existing GPU architecture isn't quite right for the job. An SPU will be doing many unrelated tasks simultaneously. GPUs usually aren't used that way.

[1] https://news.ycombinator.com/item?id=9964319

mikehollinger · on Jan 6, 2016

Being up front: this is what I work on for IBM Systems. A buddy wrote this blog (https://www.ibm.com/developerworks/community/blogs/fe313521-...) with a little more info.

What we have is an IO offload accelerator that knows how to drive high bandwidth IOs to some external storage device. A user app doesn't interact with the device - they make shared library calls to read or write data from a particular buffer, and the accelerator (because it's cache coherent) can read / write from the virtual address space of the user space program to satisfy the request as needed. This means that the IOs bypass the entire OS driver stack, since everything is a shared library call from user space.

So yep! That exists. :-) There's other classes of accelerators out there too (and coming in the future as well). Adding additional function like compression or some form of indexing or search is stuff that we've talked about.

(edit) - https://github.com/open-power/capiflash has the code for the shared libs, the APIs, and some examples.

fungos · on Jan 6, 2016

That is really interesting, but I see this as a short term solution to a new and amazing world. What we are doing is trying to hammer something with potential to change most of CS to the shape of our current reality - what is understandable due to the commercial nature of these solutions.

But the CS community should think about this with a fresh point of view, maybe get back to the origins and start over with this kind of technology. Or maybe we do this already and I just do not know?

For myself, since I've got out of university, I've always thought about how things would be different if we hadn't disk+ram, but a storage that solved the two with the best of each (top speed and large and cheap capacity) - extrapolate this thing to a kind of SoC with 40+ cores and 40TB+ LD1 cache - and I tried to imagine what would be needed in terms of a new OS made from scratch for this thing. This still keeps me thinking on new designs, new algorithms, etc. Sadly, I've never tried or even theorized anything interesting apart of entirely killing the file-system concept and having _always loaded applications_ running (equivalent to processes) or suspended (equivalent to app binary files)... :)

derefr · on Jan 7, 2016

Even more interesting with NUMA: imagine 10K slow/cheap cores, each with their own non-shared bit of NVMe (~10MB would do) for their heaps to live on. Perfect for running Erlang.

There might not even be a point in a "classical" CPU cache hierarchy in such a system, if the NVMe is fast enough, and has its own "internal" writeback cache (e.g. some volatile battery-backed memory) protecting it, so that cycling a bit at 3GHz doesn't burn it out. At that point you may as well say you have a CPU with ten million nonvolatile registers.

jdright · on Jan 7, 2016

NVM as fast as registers, that is a bold ~1cycle per write/read! But would be really awesome.

matt_d · on Jan 6, 2016

Interesting!

Incidentally (since this may be somewhat related), I'm wondering, what are your thoughts on the Persistent Memory Manager approach, as in the following:

Justin Meza, Yixin Luo, Samira Khan, Jishen Zhao, Yuan Xie, and Onur Mutlu: "A Case for Efficient Hardware/Software Cooperative Management of Storage and Memory." Workshop on Energy-Efficient Design, 2013.

Context: "emerging high-performance NVM technologies enable a renewed focus on the unification of storage and memory: a hardware-accelerated single-level store, or persistent memory, which exposes a large, persistent virtual address space supported by hardware-accelerated management of heterogeneous storage and memory devices. The implications of such an interface for system efficiency are immense: A persistent memory can provide a unified load/store-like interface to access all data in a system without the overhead of software-managed metadata storage and retrieval and with hardware-assisted data persistence guarantees."

The stated goals/benefits include eliminating operating system calls for file operations, eliminating file system operations, and efficient data mapping.

Paper: http://justinmeza.com/bin/meza_weed13.pdf

Presentation: https://users.ece.cmu.edu/~omutlu/pub/mutlu_weed13_talk.pdf

Animats · on Jan 6, 2016

One giant flat address space is not the answer. Hardware people tend to come up with approaches like that because flat address spaces and caching are well understood hardware. It's the same thinking that leads to "storing into device registers" as an approach to I/O control, even when the interface is really packets over a serial cable as in FireWire or USB or PCI Express.

File systems and databases are useful abstractions, from an ease of use, security, and robustness perspective. The challenge is to make them go faster. Pushing the machinery behind them out to special-purpose hardware can do that.

The straightforward thing to do first is to to take some FPGA part and use it to implement a large key/value store using non-volatile solid state memory. That's been done at Stanford[1], Berkeley[2], and MIT[3], and was suggested on YC about six years ago.[4] One could go further, and implement more of an SQL database back end. It's an interesting data structure problem; the optimal data structures are different when you don't have to wait for disk rotation, but do need persistence and reliability.

[1] http://csl.stanford.edu/~christos/publications/2014.hwkvs.nv... [2] https://www.cs.berkeley.edu/~kubitron/courses/cs262a-F14/pro... [3] https://dspace.mit.edu/handle/1721.1/91829 [4] https://news.ycombinator.com/item?id=1628550

dunkelheit · on Jan 6, 2016

OK I find it easier to follow these ideas when thinking about how loads/stores to volatile memory are organized. Memory is not accessed via a syscall. Instead the OS sets up some data structures in the MMU and lets the application run. Some kind of fault happens when control must be transferred back to the OS.

Going back to non-volatile memory the question is what kind of abstraction should be implemented in hardware? Presumably something simple that the OS and applications can then use to implement higher level abstractions like file systems and databases. Pushing parts of a SQL database engine into the hardware does not intuitively seem like a right solution.

fungos · on Jan 6, 2016

Thanks, that is something that I was looking for! Nice to see it really happening!

i336_ · on Jan 6, 2016

This is pretty interesting.

It's possible that people might confuse this bit...

> The performance of SCMs means that systems must no longer "hide" them via caching and data reduction in order to achieve high throughput.

...in the original article with your mention of caching; by "cache coherency," I assume you're referring that your addon (card?) can introspect into the CPU cache? That's pretty awesome if that's what's happening.

Some hopefully relevant questions from someone totally unfamiliar with this particular area:

- The original article mentioned "RAM emulation" (to put it crudely) as "unstable." Do you have any comment on this?

- Do you happen to have any performance figures you can release?

- From the blog article and video I get the idea that this is POWER-specific. :) Are you aware of any alternative offerings for x86 that offer similar performance?

- What does this thing (I have no idea if it's a card, a module...) look like? Being able to see "the thing" is generally really cool :)

My last question about POWER8 in general is arguably both on- and off-topic and might be a question for a different team, but do you know...

a) if/when POWER8 will manage to escape from the datacenter and become accessible to developers in the hobbyist/student sector? My understanding is that the architecture as it stands at the moment requires lots of different components that unavoidably require a lot of space; are you aware of any scaling-down efforts to produce (even (E)ATX-sized) POWER8 SBCs people can play with?

b) if/when full-scale POWER8 systems will be available in the style of Heroku/OpenShift, both of which have free tiers that allow for entry-level poking? I understand that RunAbove provided something along these lines with (1-?) POWER system(s), but that dried up some time ago, and I'm not aware of any replacements.

All in all, this Flash system looks pretty cool, and I can definitely say I wouldn't mind being a fly on the wall for a day in your office, what with getting to play with 40TB of Flash (SSDs...?) - wow. :D

mikehollinger · on Jan 6, 2016

> I assume you're referring that your addon (card?) can introspect into the CPU cache? That's pretty awesome if that's what's happening.

Y - see page 5, section 3.1.1 of http://www-304.ibm.com/webapp/set2/sas/f/capi/CAPI_POWER8.pd... for some info. Also - it doesn't have to be a card. ;-) It just is that today...

> RAM Emulation

This is hard. Telling a program (and the OS) that different pages are fundamentally different will require some pretty drastic changes. For example- how does one malloc memory from an NVDIMM vs a regular DIMM, and differentiate between the two?

> Performance

Yes - as an example we can show that it takes ~26 threads on the CPU to drive ~450k IOPs to some external storage. Doing the same thing with the accelerated IO path requires about 4 HW threads on the main CPU. This kind of lines up with the point of the article.

> Are you aware of any alternative offerings for x86 that offer similar performance?

To my knowledge no one else has a similar architecture that's shipping today.

>- What does this thing (I have no idea if it's a card, a module...) look like? Being able to see "the thing" is generally really cool :)

http://www.nallatech.com/solutions/openpower-capi-developer-... or http://www.alpha-data.com/dcp/capi.php are your choices for Altera or Xilinx FPGA support (as of today).

> a) re: ATX

see http://www.enterprisetech.com/2014/10/08/tyan-ships-first-no... from last year. Go talk to Tyan if you want to buy one.

> b) if/when full-scale POWER8 systems will be available in the style of Heroku/OpenShift.

https://ptopenlab.com/cloudlabconsole/index.html has some boxes with CAPI cards...

i336_ · on Jan 7, 2016

Wow, thanks for taking the time to respond! :)

And now I get it: RAM emulation is not 100% stable due to the fact that application architecture simply isn't optimized at all to handle the interfaces yet, as opposed to flaky hardware (my initial arguably logical assumption). The article could have made that a little plainer, thanks for clearing that up.

And thanks for dropping those performance figures; if there was an ELI5-sized soundbite explaining the rationale behind this card, that would be it.

( http://reddit.com/r/ExplainLikeImFive (ELI5) explains complex subjects using accessible, respectful simplifications. If I may say so, your explanation fits precisely into that category. :P)

It's sad there's nothing like this for x86, but I wouldn't be too surprised if it were deemed too difficult to support an I/O path as performant as this without uncomfortable architectural changes. On that note, POWER8 is still at the point where it has the chance to lock in a future-proof architectural design, and hopefully it takes full advantage of that.

My mention of ATX was simply a reference to "it doesn't need to be tiny or cool, it just needs to exist," but it appears the board you linked is currently the only product with any sort of vague open market presence. I definitely look forward to more accessible POWER architecture products in the future. :D

Finally, thanks heaps for the PTOpenLab link! I'm still figuring out their points system and how that translates to daily usage allowance, but this looks incredibly cool. It's places like this that are laying the groundwork :)

fungos · on Jan 7, 2016

Thanks for the links! ptopenlab seems very interesting! Question, just for curiosity, how much does it cost to have something like this?

i336_ · on Jan 8, 2016

These are my own hazy opinions, but I suspect it starts at "Wallet vaporizes from shock" and goes up from there.

I remember reading about how old IBM mainframes used to have a couple ThinkPads (literally two, for redundancy) bolted just inside the cabinet door, just to change low-level configuration settings. It seems to me that these POWER8 boxen are aimed toward that end of the market.

YouTube's history interface is terrible, but I managed to dig this out - https://www.youtube.com/watch?v=jOzPTopt7HE - which shows the different discrete components in a POWER8 system and how they're put together. I should probably do a bit more research on this, that video is quite basic (and 2 years old now).

POWER8 systems seem to necessarily take up a lot of space, and not does this contributes to the raw material cost, it's also a factor in renting, considering that you can pack a basic but decent punch with 1U or 2U of x86. My guess is that IBM isn't trying to be competitive here, but aim for a specific market. That'll influence the price too.

It's thanks to market factors and the state of education (which sometimes produces wins like these!) that places like PTOpenLab exist, I think (again, this is an [un]educated guess), and I'm super appreciative that they do. I haven't figured out how the "blue points" system works yet though (you get 500, and use 10/day for running a VM); I can at least say that the number doesn't increase each day. I vaguely recall reading something to the effect of creating HDD images for the platform would give you points based on how many other people downloaded them (there's somewhere you can upload to), but I can't find that documentation now.

Another fun tidbit: the dashboard UI is based on SmartAdmin (a premium jQuery plugin, apparently), which comes with Chrome-compatible voice control (note the mic button at the top-right). The voice command list doesn't show because of a 404, but you can find the list in app.config.js (F12 -> Network -> reload page) - scroll to the pile of "show"s. Useless, and horribly flaky, but extremely cool. :D

mikehollinger · on Jan 8, 2016

> "wallet vaporizes"

The SuperVessel lab's free, and there are several resources to rent time on a P8 VM if SuperVessel isn't appropriate.

Also, the video you found is the E8xx product line, which are at the high end of the enterprise / scale-up product line, and (incidentally) different from Mainframes.

Here's some links about the 2U / 2socket boxes if space of each node is important:

S822LC - https://www.youtube.com/watch?v=OdlLszagnos

S822L - https://www.youtube.com/watch?v=xF_fw_NJ5nI (not IBM's, but a reasonable unboxing video)

And if you're interested in videos, check out the IBM Power Systems youtube channel: https://www.youtube.com/user/ibmpowersystems/videos

i336_ · on Jan 9, 2016

Oh, TIL; I had no idea it was actually free. I understood that each user got 500 points and that you use 10/day, which gives you 50 days of usage... aaand then I'm not sure. I'm not dissing it, I just don't understand (and there's zero documentation).

And thanks for the video links! I'll definitely check out the YouTube channel.

PS. I'm getting multiple errors in the dashboard when I try to switch to NewYork1 zone. Where would be a good spot to mention this?

mikehollinger · on Jan 8, 2016

SuperVessel's free! http://www.zdnet.com/article/ibm-launches-supervessel-a-free...

hadagribble · on Jan 7, 2016

(Disclosure: one of the authors here)

Wrapping a bunch of replies into one: Absolutely agree that we need to view fast storage as something other than disk behind a block interface and slow memory, especially with all the different flavours of fast persistent storage that seem to be on the horizon. For the one's that attach to the memory bus, the PMFS-style [1] approach of treating them like a file-system for discoverability and then mmaping to allow them to be accessed as memory is pretty attractive.

I'm not sure a dedicated storage processing unit is the way to go though; I think we could equally well see bits of functionality being offloaded to smarter controllers (kind of like checksum, VLANs, etc are on network adapters) while the CPU remains in charge of orchestrating the different bits.

Also agree on the fact that it is an interesting data structure problem -- a lot of the work we do involves examining what the right data structures are for things once seeks are free and cache locality is the dominant factor for operations.

[1] http://dl.acm.org/citation.cfm?id=2592814

shimon · on Jan 6, 2016

This sounds interesting, but why should this be a new piece of hardware as opposed to a new OS service? Are these functions simply so specialized that implementing them in the OS would be a bottleneck (even though the CPU has plenty of free cycles)?

jerf · on Jan 6, 2016

"This sounds interesting, but why should this be a new piece of hardware as opposed to a new OS service?"

Because the entire point is that CPUs are too slow by themselves, even without the OS, let alone with it. While you were context-switching into this OS server you missed the chance to do 10 IOPs give or take an order of magnitude.

Yet the OS really can't go anywhere. We can sometimes poke a hole here and sometimes poke a hole there but in general they're there for good reasons and not going anywhere, just as no matter what crazy things we bodge in to our computers "things like CPUs" aren't going anywhere either, and my guess is they're likely to stay pretty "central", too.

amluto · on Jan 6, 2016

> While you were context-switching into this OS server you missed the chance to do 10 IOPs give or take an order of magnitude.

I'll believe that when I see real numbers.

A system call takes something like 54 ns on my laptop. With pwritev or similar, you can do quite a few IOs in a system call. (Of course, pwritev is slower than 54 ns, but that's not a fundamental constraint.)

An IO requires making the IO durable if you want it to be reliably persistent. So you have to do CLWB; SFENCE; PCOMMIT; SFENCE or whatever magic sequence you're using (depends on IO type and use of nontemporal instructions, (and you have to have hardware that supports that). If you're using NVMe instead of NVDIMMs, then you have to do an IO to sync with the controller, and that IO will be uncached.

Uncached IO is slow. PCOMMIT has unknown performance since no one has the hardware yet. System calls are fast.

slashdev · on Jan 6, 2016

The syscall overhead isn't the problem, that's dirt cheap as you say. The problem is the context-switch overhead. Calling into the OS flushes a lot of data and instructions from the cache, and that lost performance after returning can easily add up to around 30µs.[1]

[1] http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-ma...

CamperBob2 · on Jan 6, 2016

Sounds like multiple caches are an obvious solution. No code in the OS needs access to the user code or data cache, and vice versa.

Smart cache management might be one of Intel's goals for the Altera buyout.

xxs · on Jan 7, 2016

>> No code in the OS needs access to the user code or data cache

This is not true for the data. How do you pass any data structures outside CPU registers then, say the path to a file to open. Normally it's a char*[0] (indeed passed in a register) but then the OS actually reads the data off the process memory (L1 data cache usually)

[0]: http://linux.die.net/man/3/open

CamperBob2 · on Jan 7, 2016

I'd say a cache isn't the right structure for passing around data (or references to data) that you know will be accessed very soon by completely different code.

As userland code, you'd like to grant the OS access to a particular subset of lines in D$ while keeping it out of your C$ altogether. Traditional implementations fail in both respects.... and at the same time, the OS probably can't take advantage of its own historical locality because userland has evicted it since the last call.

From what other people are saying it sounds like these problems are being worked on, though.

slashdev · on Jan 6, 2016

It's already here[1], but I don't think it can be used for cache partitioning between the kernel and userspace (but I could be wrong!)

[1] http://danluu.com/intel-cat/

hadagribble · on Jan 7, 2016

(Disclosure: one of the authors here)

I don't think CAT can be used to partition kernel and userspace -- I'm not even sure how you'd go about doing that given that you can (and do) have shared pages between them.

That being said, from our experiments, if you're using userspace network and NVMe drivers the context switch and associated cache pollution is not a problem, since it is happening pretty infrequently (primarily just timer interrupts, and those can be turned off, but we haven't needed to).

uxcn · on Jan 7, 2016

I think Altera had more to do with generic user modifiable compute. They have been working with FPGAs for a while now (http://www.enterprisetech.com/2014/06/18/intel-mates-fpga-fu...).

ori_b · on Jan 7, 2016

> OS flushes a lot of data and instructions from the cache

Only if you actually switch contexts. For I/O, this isn't necessary -- you don't need to touch a thing in the TLB.

uxcn · on Jan 6, 2016

One of the other constraints is the actual data copy. I don't have any benchmarks on hand, but you pay the penalty for the copy, potentially cache misses, and the potential TLB miss. Obviously, there are ways to avoid it without resorting to bypassing the kernel, but there's still a non-negligible cost.

Maybe it would be beneficial to have a coherent interface as well, considering NVMe.

xxs · on Jan 6, 2016

You do not context switch to the OS, it's a mode switch x2 but it's cheaper than a context switch.

woodchuck64 · on Jan 6, 2016

Well the article does say: "Our own experience has been that efforts to saturate PCIe flash devices often require optimizations to existing storage subsystems, and then consume large amounts of CPU cycles. "

ricw · on Jan 6, 2016

Just to clarify: this is to enable programs to have more a more direct and faster interface to data, while retaining data consistency and safety that typically would be managed by the OS?!

Is this not just another form of DMA (direct memory access)? And if so, how would it differ from current implementations? Sometimes DMA only refers to ram, though on many systems this is fluid between different data storage types (AMDs direct compute comes to mind), exactly to enable this kind of data access.

Animats · on Jan 6, 2016

"this is to enable programs to have more a more direct and faster interface to data, while retaining data consistency and safety that typically would be managed by the OS?"

No, managed by the database engine. The idea is to put the data-intensive operations of a database engine into a highly parallel SPU. The application would see an interface much like an SQL or NoSQL database.

dunkelheit · on Jan 6, 2016

Maybe that's a dumb question but why should this chip be highly parallel? In the case of GPU computations are by their nature embarrassingly parallel but I don't see why this should be the case for SPU. Why not just add a few more CPU cores and let this specialized controller handle the problem of concurrent accesses?

AnimalMuppet · on Jan 6, 2016

But see, you're limiting your idea to databases. I don't think you should do that. The idea of a "storage accelerator" (by whatever name) could be used far more widely than that.

vetinari · on Jan 6, 2016

It sounds like Oracle Exadata Storage servers - however they used regular Xeons.

hobo_mark · on Jan 6, 2016

This looks like the perfect job for an FPGA (with a fast enough interconnect, poster above mentions CAPI which sits on top of PCIe but I have not had a chance to try it out yet).

mikehollinger · on Jan 6, 2016

I'm probably the poster above. ;-) Yes, we layer on top of PCIe for the physical transport, but once an adapter's in CAPI mode, it's able to do translations, participate in locks, and looks more or less like a slightly-strange other thread as far as code running on the main CPU is concerned.

Since the logic inside the accelerator can do pointer chasing, it can communicate directly with the application and bypass a lot of the stuff that happens when a normal IO occurs to other FPGAs today.

hobo_mark · on Jan 6, 2016

Yes indeed, I've read about redis acceleration, are relational dbs next? Is anyone working on that?

mikehollinger · on Jan 6, 2016

Some colleagues from our research group are working on proving the benefits for other types of workloads than in-memory DBs.

undersuit · on Jan 6, 2016

Sound similar to Channel I/O.

https://en.wikipedia.org/wiki/Channel_I/O.

Animats · on Jan 6, 2016

Much more than that. Channel I/O on IBM mainframes is mostly about watching the data go by as the disk rotates until some key matches. With a random-access storage device, that's unnecessary. The SPU concept is more about looking up things in index trees, and updating those trees safely.

DenisM · on Jan 7, 2016

If you're going to have database-like features, might as well just make the entire machine a dedicated database server.

Having a database engine accessing the entire storage device (in a shared-memory like setup) is no more dangerous than having a database engine accessing all of its own memory and disk.

So we can have one server for SQL-like things, one more server for NoSQL like things, and one more server for storing blobs. And that's that.

Avshalom · on Jan 6, 2016

isn't this basically what Channel IO is?

ChuckMcM · on Jan 6, 2016

This is so true, the world has changed greatly and not everyone has gotten the memo. I saw a really cool device made by Texas Memory systems which was a "ram disk" that was all ram with disk backing, and when you lost power it flushed to disk. I wanted something that worked better for a storage paradigm and designed/invented a network accessible memory appliance[1]. Basically using ethernet packets you could store 8K integrity protected chunks right there on the network. Initially I wanted to use a typical low power CPU with a bunch of DRAM attached but the CPU bottleneck got in the way, so we redesigned/rebuilt it out of FPGAs so that it had a couple of terabytes of RAID protected RAM in an appliance with a very simple network protocol for storing and fetching 8K blocks out of what was essentially a linear address space. Two of these on different power subsystems provided all of the fault tolerance you needed and you could have a terabyte of 'structured' data live from the moment your computer booted (made for very fast recovery from reboot).

[1] https://www.google.com/patents/US8316074

AndrewKemendo · on Jan 7, 2016

That is fascinating. Could this reliably reduce the hardware footprint on any device?

ChuckMcM · on Jan 7, 2016

If I understand the question, then yes. When you consider the amount of cache memory in clustered systems which is all holding the same stuff in every independent machine. Using it simply as a victim cache for a block storage device penciled out to a pretty significant improvement.

It gets even better with 64 bit address spaces and a bit of kernel code to 'fault in' from the device.

erichocean · on Jan 6, 2016

My sense is this is only true today because OS kernels are ridiculously slow relative to what the hardware can achieve.

Most of my recent designs treat RAM as if it were (what we used to considered to be) disks, i.e. all computation and in-process data is in cache exclusively, and "going to RAM" requires the use of a B-tree-like structure to amortize the cost.

For example, once you've opened a RAM page line on a normal four-channel Xeon server, you can read the entire 4KB page in about the same time it takes to read one byte, switch pages, and then read another byte. (Of course, you can't do that either since the entire cache line will be filled, but the overall point still stands.)

The situation we're in today with RAM is pretty much the identical situation with the disks of yore. Anyway…interesting article nonetheless.

jleahy · on Jan 8, 2016

Right, modern CPUs can do 50 gigaflops per core. There's absolutely no chance we're going to have non-volatile storage that can do hundreds of billions of IOPS any time soon (if only because you won't be able to get that much data over PCI-express).

Further given you can saturate 16 lanes of PCIe when talking to a GPU there's no reason you shouldn't be able to do the same for storage, it's just a matter of having the right abstractions and the right kind of thinking like you're saying.

It sounds more like storage and RAM are going to converge (and people are still learning to deal with how slow RAM is compared to the CPU these days).

thescriptkiddie · on Jan 6, 2016

Not that is some interesting thinking. You got a blog post elaborating on this?

hadagribble · on Jan 7, 2016

Not sure exactly what OP is referring to, but CSS-trees [1] are a classic example of cache-aware indexing structures that fetch entire pages into cache and arrange data so that most of the comparisons happen on cached data. In most cases, they significantly outperform binary trees. Masstree [2] is a more recent example of this.

[1] http://www.vldb.org/conf/1999/P7.pdf

[2] https://pdos.csail.mit.edu/papers/masstree:eurosys12.pdf

rdtsc · on Jan 6, 2016

Yeah per packet processing at 40Gbps and higher is problematic on regular kernels, OS stack and CPUs. A lot of really can be cache hits -- hundreds of nanoseconds. Article mentions that too:

--- To put these numbers in context, acquiring a single uncontested lock on today's systems takes approximately 20ns, while a non-blocking cache invalidation can cost up to 100ns, only 25x less than an I/O operation. ---

It also depends if workload is throughput sensitive or latency sensitive. If it is latency, can do things like tie processes and interrupts to cores, isolate those cores, etc. For throughput can processes more than one packet at a time perhaps.

Then there is dpdk and even unikernels.

> CPU has responsibilities beyond simply servicing a device—at the very least, it must process a request and act as either a source or a sink for the data linked to it. In the case of data parallel frameworks such as Hadoop and Spark,7,17 the CPU

That's why you get more CPUs and explicitly isolate them if you can. But now depending on how they share data with other CPUs there will be invalidated cache lines so will pay that way as well.

In general if you run on RHEL / CentOS ( a lot banks, military and enterprise deployments do ), there is this helpful guide as an overview:

https://access.redhat.com/documentation/en-US/Red_Hat_Enterp...

vvanders · on Jan 6, 2016

Cache miss latency was the first thing that popped into my mind as well when I saw the title.

It seems like they don't make a clear distinction between latency and bandwidth. From the little I know on SSDs(don't claim to be an expert here) the sequential reads are below or on-par with high spindle speed disks.

What seems to be a better take-away would be that sequencing of your reads isn't nearly as important as it used to be. Back in games we'd duplicate data across a DVD so that we could do "seekfree" loading where duplicating 5-10MB of data would mean just a single big call to read() and gain massive load time performance.

frankchn · on Jan 6, 2016

SSDs are still faster than hard disks even with sequential reads. 15,000 rpm enterprise spinning disks read at about 260 MB/s [1] while NVMe SSDs (like those in a 2015 MacBook Pro) reads at >1300 MB/s [2].

[1]: http://www.tomshardware.com/charts/enterprise-hdd-charts/-02...

[2]: http://www.computerworld.com/article/2900330/apple-mac/holy-...

rdtsc · on Jan 6, 2016

> NVMe SSDs (like those in a 2015 MacBook Pro) reads at >1300 MB/s [2].

No joke. Just got an new MBP for work. Before had a spinning disk (well hybrid). Was running some silly benchmarks that I ran before and clocked my disk throughput at about 100MB/s. On MBP got 800MB/s. I thought something broke (was hitting page cache or some trickery like that) or didn't compile things right. But no, tried other tools, looked online and it seemed correct. It really surprised me.

creshal · on Jan 6, 2016

> From the little I know on SSDs(don't claim to be an expert here) the sequential reads are below or on-par with high spindle speed disks.

Not quite, but they're "only" a 2-5 times faster, not orders of magnitude faster, as with random accesses.

sgt101 · on Jan 6, 2016

I have just bought a machine with one of these drives and a bunch of GPU's with the intent of running tensor-flow on it. Do you think that Fedora will allow for the kind of tuning you describe or should I stick with Centos? I was thinking that I would need Fedora because these new devices need new kernels and drivers and so on, but maybe I am just confused?

rdtsc · on Jan 6, 2016

I would guess give Fedora a try or whatever GPU drivers support better.

CentOS / RHEL as a rule is behind as far as package versions in order to be more stable. But they do bring in new drivers and back-port many fixes and packages.

Most of the stuff that applies to CentOS will apply to Fedora as well as a rule.

sgt101 · on Jan 6, 2016

thank you.

worried_citizen · on Jan 6, 2016

Fedora and centos are effectively the same OS with different release goals. You should be able to do anything on either unless you need truly cutting edge.

emcq · on Jan 6, 2016

I like the article, but not your title. It implies that this trend of I/O becoming highly performant has occurred recently, when in fact has been observed and studied for quite some time [0, 1]. Even before SSDs Gigabit ethernet was saturating CPUs needing to do more than DMA a packet, and I'm sure this trend continued for some time. The original title seems more accurate: "Implications of the Datacenter's Shifting Center", and references the existing trends in an insightful article.

[0] http://ucsdnews.ucsd.edu/archive/newsrel/supercomputer/11-09...

[1] http://nvsl.ucsd.edu/index.php?path=pubs

dogma1138 · on Jan 6, 2016

So NVDIMM's... Is any one actually making those except Viking and is anyone actually supporting them in servers except SuperMicro?

These are basically DDR3/DDR4 DIMM's with onboard flash and a supercap/battery pack to provide persistence incase of system reboots and power failures.

They are also a bit odd as they would ignore various system event calls from the BIOS/UEFI and then have to be specifically managed by various software hacks that create RAM drives and access the memory directly rather than working with OS virtual memory. Since NVDIMM's are basically treated as system memory by both the server and the OS they pretty much only work for very very boutique applications I'ts a bit odd that these are presented as the next step in storage evolution while being effectively an overpriced hack. I've only seen them actually been use in weird server setups like the overclocked watercooled servers that are used for HFT where they strip everything even the OS as possible and bypass anything that adds even a few NS of latency and don't mind running their own code for everything from a bastardized TCP stack that isn't even remotely compliant but works to their own in memory custom database.

wmf · on Jan 6, 2016

Intel will be pushing them hard starting with Skylake-EP, so some people are getting themselves ready.

dogma1138 · on Jan 7, 2016

Did Intel created a new interface for NVDIMM's? because to work with the ones Viking makes you pretty much need to hack your linux kernel to ensure that it doesn't access physical memory over a certain address range, and I don't even know if or how can you use them on Windows based applications.

wmf · on Jan 7, 2016

Yes, Intel basically controls UEFI/ACPI and has been submitting patches to Linux for a while. http://pmem.io/

baq · on Jan 7, 2016

see https://en.wikipedia.org/wiki/3D_XPoint

zinxq · on Jan 6, 2016

This trend has been clear for awhile. Interestingly, this will put performance pressure back on programming and languages as they become the "new" bottleneck.

I'd expect an implicit migration away from slower languages toward faster ones.

andrewvc · on Jan 6, 2016

To some extent, yes, but 'slow' languages usually delegate batch work over large datasets to optimized libraries.

Even more likely, as I see it, is this contributing to the increasing rise of tools like spark, hadoop, etc. Slow languages will continue to be popular as orchestration around these tools.

pmehra · on Jan 8, 2016

Both Tandem DP2 and IBM Coupling Facility on zSeries Sysplexes worked exactly like how you envisage an SPU would work. Therefore, when we developed RDMA-attached persistent memory at Tandem in 2002, we put it under the control of DP2/ADP process pairs. Later, we ported it to HP-UX and InfiniBand RDMA. There is one paper at IPDPS'04 and several published patents you can look up in my Google Scholar page. pmem.io crowd is reinventing some of this wheel. If any of you work at HPE, you can find much more detailed internal papers, source code, drivers, firmware, and other stuff that the outside world cannot get to.

marcosdumay · on Jan 6, 2016

> and the performance of an SCM (hundreds of thousands of I/O operations per second) is such that one or more entire many-core CPUs are required to saturate it.

So, we are getting a lot of data, but latency is still killing. (Even more taking into account that this thing has a few stages of pipeline inside.)

Anyway, our CPU is getting distributed nearer IO and memory. We are going to get NUMA machines, everything points at it.

CyberDildonics · on Jan 6, 2016

Knights Landing from Intel is already NUMA (I think). I'm not sure if it can be bought yet, but it should be very close to release.

thescriptkiddie · on Jan 6, 2016

Aren't AMD parts already (cache-coherent) NUMA?

marcosdumay · on Jan 6, 2016

Well, a bit, but not radical enough to be visible to its software.

rsp1984 · on Jan 6, 2016

Please excuse my ignorance on this matter but will this technology have any impact on the hierarchy levels below disk (i.e. RAM and CPU caches)? Compared to Register, L1 and L2 access RAM access is still really slow. Will non-volatile storage latencies rival or exceed those of standard RAM? From how I understand the article it's primarily disk IO speed that's affected, correct?

hadagribble · on Jan 7, 2016

Yes. Even the persistent memories that attach to the memory bus are currently quite a bit slower than DRAM (5-7x from estimates I've seen), while the difference with PCIe-attached ones is even more.

I'm not sure what the future holds in terms of latencies for non-volatile storage but sub-DRAM levels aren't within reach yet.

matt_d · on Jan 8, 2016

On a side note, it's interesting to me that emerging memory technologies currently seem to be mainly focused on addressing the "from-DRAM-to-disk" part of the memory hierarchy.

That is, as you mentioned, not directly competing with DRAM, and consistently on the same side of the 1 microsecond dividing line between memory and storage; as in:

http://www.rambusblog.com/2015/10/15/mid-when-memory-and-sto... (note SCM placed between DRAM and SSD)

http://semiengineering.com/the-memory-and-storage-hierarchy/

As far as the other side of the line is concerned, I think I've only seen proposals for hybrid-cache architectures (HCA) -- other than http://link.springer.com/chapter/10.1007%2F978-1-4419-9551-3... -- with a hybrid approach (e.g., combining SRAM/eDRAM/STT-RAM/PCRAM) probably making sense due to latency/endurance/bandwidth trade-offs.

If anything, there seems to be more development on the DRAM interface itself -- with multiple candidates for the (or a) DDR4's successor, so far involving Wide I/O (Samsung), Hybrid Memory Cube (Intel, Micron), High Bandwidth Memory (SK Hynix, AMD, Nvidia): http://www.extremetech.com/computing/197720-beyond-ddr4-unde...

(Latency and bandwidth improvements seem promising, http://semiengineering.com/which-memory-type-should-you-use/)

One interesting development I've seen involves reducing SRAM's footprint, by moving from 6T (6-transistors) cell to a 1T (one-transistor) one: http://www.eetimes.com/document.asp?doc_id=1328453

It's fairly recent development, though, and it remains to be seen how is it going to fare.

Other than the above, there doesn't really seem to be much progress around competing with/improving SRAM. However, this may become increasingly important, since some of the technological process scaling issues apply to SRAM, too.

teraflop · on Jan 6, 2016

For anyone else who was momentarily confused: on figure 2, the y-axis scale is incorrectly labeled "ns" when it should be "ms".

rasz_pl · on Jan 7, 2016

Anyone old enough will remember the time hard drivers were correctly marketed with access time in milliseconds as the main speed indicator. This ended around 1994() when pretty much all the drives reached ~10ms access time.
https://en.wikipedia.org/wiki/Hard_disk_drive_performance_ch...

I did a quick scan of old computer magazines (infoworld, pc mag etc).

peter303 · on Jan 6, 2016

HP was designed a "flat memory" OS based on vast amounts of cheap memrister memory http://www.technologyreview.com/featuredstory/536786/machine.... But when I googled gfor this article I saw the project has been delayed.

david927 · on Jan 7, 2016

I'm very interested in working on a "flat memory" OS that doesn't use any RAM or file system but simply registers and a distributed database of key/value stores.

If you're interested in talking about this more (especially if you're in the SF Bay Area), my email is in my profile.

mozumder · on Jan 6, 2016

I'm seeing this on a new Postgres database server: my cache miss and cache hit queries take virtually the same time!

This is with a Skylake Xeon E3-1275, 64GB ECC UDIMM, and Intel 750 PCIe SSD (probably the fastest setup you can get).

It looks like I have to figure out something that tunes Postgres to account for the fact that disk lookups are no-cost.

anarazel · on Jan 7, 2016

> I'm seeing this on a new Postgres database server: my cache miss and cache hit queries take virtually the same time!

That's likely because it's hitting the OS's page cache.

brudgers · on Jan 7, 2016

Serious but naive question: does a bottleneck curve trending toward CPU subsystems suggest micro kernel based approaches replacing spinning up virtual machines as a future trend due to the possibility of reduced overhead at the CPU?

tldr; Does increasing use of Storage Class Memory imply increasing use of microkernals?

jsolson · on Jan 6, 2016

The numbers in this are daunting, but I personally believe massively multi-core systems make the problem a lot less daunting than the article makes out. Core counts in big servers can get up over 100 per server for Intel (see Amazon's new EC2 offerings for public evidence of this). Intel's Xeon Phi series of processors offer core counts approaching ~300. Going to 300x takes the required latencies per request from microseconds up to the millisecond range. POWER systems can go even higher. Moreover, for many workloads that actually leverage this sort of compute you can do something horrifying with the new DRAM-addressable persistent storage: DMA directly from the NIC into block storage. Some (many?) high performance network adapters offer the ability to filter packets to distinct Rx queues; buffers can be posted with addresses in the storage mapped region allowing direct NIC->storage transfer. If you bake more intelligence into the NIC, you can even do things like Mellanox's NVMe Fabrics:

http://www.mellanox.com/blog/2015/04/mangstor-mellanox-show-...

This is particularly relevant to the JBOD example.

Now, there's the question of what you're actually going to do with all of that data, but in a lot of cases it's likely a durable read-mostly cache that's effectively a materialized view optimized of some (hopefully much slower write-rate) transactional store (say, product data on Amazon -- detail pages served up at some absurdly high rate, but a relatively low mutation rate).

Other workloads I can think of fall into a category I tend to think of as log processing -- a high-rate series of streaming writes which are slurped up and batch processed/reconciled to some (much smaller) state (which of course may then be exploded back out to large materialized views as above). In these scenarios, presuming the log entries have low contention over the underlying state, CPUs like those I called out above are more than up to the task of streaming over the input and optimistically updating the backing state.

Finally, in terms of real workloads, there is almost always going to be a bottleneck limiting your ability to fully utilize your resources. Either you're CPU bound and leaving network bandwidth on the table or you're network bound and are leaving CPUs/storage devices under-utilized. Massively improved storage performance local to a node is fantastic in terms of computation you can do locally, but if each network fabric upgrade costs you 10x what the previous one did to keep up with the storage/CPU available per-node, you're going to have a bad time. Amin Vahdat talked a bit about our (Google's) historical network fabric evolution: https://www.youtube.com/watch?v=FaAZAII2x0w

If I were betting on an annoying bottleneck to full resource utilization coming up in the near future, I'd put my money on network before CPU :)

crudbug · on Jan 7, 2016

Increasing industry trend to expose CPU / GPU / SPU / NPU directly to applications for more efficient data handling.

thrownaway2424 · on Jan 6, 2016

Performant: still not a word.

pklausler · on Jan 6, 2016

Seymour Cray never said "performant". Engineers say "fast" or "fast enough". Marketing types and nontechnical management seem to prefer this neologism. But it might also be a generational thing.

A new coinage that I noticed in the past year that also grates on my ears: "learning" as a substitute for "lesson", as in "what were your learnings from the hackathon?" Anyone else caught this one?

hadagribble · on Jan 7, 2016

(One of the authors here)

This is by far my favourite post in this entire thread -- I don't think I've ever seen Seymour Cray referenced as an authority in this manner before :)

Not specifically to OP, but for everyone unhappy about the use of performant: Pretty happy to concede that it isn't our finest bit of writing. Pretentiousness? Lack of technical nous? A bit of hurried/careless wording? I'll let you folks decide...

douche · on Jan 6, 2016

"Learnings" sounds like either a Hinglish expression, or maybe Microspeak to my ears.

Scramblejams · on Jan 6, 2016

Also noticed it several years ago. I may be wrong but to me it seems to have arrived from the same group that's imposed corporate-speak on boardrooms everywhere, including high-profile retreadings of words like synergy and paradigm.

Agathos · on Jan 6, 2016

Yeah, to my ears it sounds like someone is anxious that he won't be taken seriously if he uses a one-syllable word.

windowsworkstoo · on Jan 6, 2016

Past year? Past decade champ.

pshc · on Jan 6, 2016

Maybe in Microsoft-land or certain circles, but anecdotally I've only started hearing "learnings" this year too.

cjensen · on Jan 6, 2016

This is the worst use of the neologism I've yet seen:

(1) It's not perfectly clear what the definition is in this particular context. Usually the word is used to indicate "our stuff is rad fast bro" implying that speed is obtained through cleverness such as the use of efficient code or an efficiently scalable architecture, but the linked article violates this definition by comparing apples (CPUs) and oranges (storage). It's nonsensical in the manner of "my word processor is more performant than my fractal renderer".

(2) Normally, use of this neologism saves time by replacing a long phrase. This use in the linked article is backwards: the word "faster" could have been used in place of the longer "more performant."

ralusek · on Jan 6, 2016

How do you think new words came about?

Scramblejams · on Jan 6, 2016

By ignoring perfectly good legitimate alternatives and instead inventing words in an effort to make the speaker sound smarter than the listener, of course.

ape4 · on Jan 6, 2016

eg "faster"

mikeash · on Jan 6, 2016