This sounds interesting, but why should this be a new piece of hardware as opposed to a new OS service? Are these functions simply so specialized that implementing them in the OS would be a bottleneck (even though the CPU has plenty of free cycles)?
"This sounds interesting, but why should this be a new piece of hardware as opposed to a new OS service?"
Because the entire point is that CPUs are too slow by themselves, even without the OS, let alone with it. While you were context-switching into this OS server you missed the chance to do 10 IOs, give or take an order of magnitude.
Yet the OS really can't go anywhere. We can poke a hole here and poke a hole there, but in general OSes exist for good reasons and aren't going anywhere, just as, no matter what crazy things we bodge into our computers, "things like CPUs" aren't going anywhere either, and my guess is they're likely to stay pretty "central", too.
> While you were context-switching into this OS server you missed the chance to do 10 IOPs give or take an order of magnitude.
I'll believe that when I see real numbers.
A system call takes something like 54 ns on my laptop. With pwritev or similar, you can do quite a few IOs in a system call. (Of course, pwritev is slower than 54 ns, but that's not a fundamental constraint.)
Making an IO reliably persistent means making it durable. So you have to do CLWB; SFENCE; PCOMMIT; SFENCE or whatever magic sequence you're using (it depends on the IO type and on whether you use nontemporal instructions, and you have to have hardware that supports it). If you're using NVMe instead of NVDIMMs, then you have to do an IO to sync with the controller, and that IO will be uncached.
Uncached IO is slow. PCOMMIT has unknown performance since no one has the hardware yet. System calls are fast.
The syscall overhead isn't the problem; that's dirt cheap, as you say. The problem is the context-switch overhead. Calling into the OS flushes a lot of data and instructions from the cache, and the performance lost after returning can easily add up to around 30µs.[1]
>> No code in the OS needs access to the user code or data cache
This is not true for the data. How else would you pass any data structure that doesn't fit in CPU registers, say the path of a file to open? Normally it's a char* [0] (indeed passed in a register), but then the OS actually reads the data from the process's memory (usually the L1 data cache).
I'd say a cache isn't the right structure for passing around data (or references to data) that you know will be accessed very soon by completely different code.
As userland code, you'd like to grant the OS access to a particular subset of lines in your D$ while keeping it out of your C$ altogether. Traditional implementations fail in both respects... and at the same time, the OS probably can't take advantage of its own historical locality, because userland has evicted it since the last call.
From what other people are saying it sounds like these problems are being worked on, though.
I don't think CAT (Intel's Cache Allocation Technology) can be used to partition kernel and userspace. I'm not even sure how you'd go about doing that, given that you can (and do) have pages shared between them.
That being said, from our experiments, if you're using userspace network and NVMe drivers, the context switches and associated cache pollution are not a problem, since they happen pretty infrequently (primarily just timer interrupts, and those can be turned off, though we haven't needed to).
One of the other constraints is the actual data copy. I don't have any benchmarks on hand, but you pay for the copy itself, potential cache misses, and potential TLB misses. Obviously, there are ways to avoid it without resorting to bypassing the kernel, but there's still a non-negligible cost.
Maybe it would be beneficial to have a coherent interface as well, considering NVMe.
Well, the article does say: "Our own experience has been that efforts to saturate PCIe flash devices often require optimizations to existing storage subsystems, and then consume large amounts of CPU cycles."