The syscall overhead isn't the problem; that part is dirt cheap, as you say. The problem is the context-switch overhead: calling into the OS evicts a lot of your data and instructions from the caches, and the performance lost after returning can easily add up to around 30µs.[1]
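To put a number on the direct part in isolation, a rough Linux-only sketch like the one below times just the trap/return path for a cheap syscall; the indirect cost (which is what the ~30µs figure is about) shows up afterwards as cache and TLB misses in your own code, so it won't appear in this loop. Numbers will vary with the CPU and mitigations:

    /* Rough measurement of the direct syscall round-trip cost only.
     * Build: cc -O2 syscall_cost.c */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    int main(void) {
        const long iters = 1000000;
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < iters; i++)
            syscall(SYS_getpid);        /* cheap syscall; bypasses libc's pid cache */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("~%.0f ns per getpid() syscall\n", ns / iters);
        return 0;
    }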
>> No code in the OS needs access to the user code or data cache
This is not true for the data. How else would you pass any data structure that doesn't fit in CPU registers, say the path of a file to open? Normally it's a char*[0] (which is indeed passed in a register), but the OS then actually reads the string out of the process's memory (usually sitting in the L1 data cache).
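To make that concrete, here's the trivial case (nothing exotic, just open(2)): only the pointer travels in a register, and the kernel still has to read the path bytes out of your process's memory (on Linux via strncpy_from_user), touching whichever cache lines they occupy:

    #include <fcntl.h>
    #include <unistd.h>

    int main(void) {
        /* The string lives in this process's memory (likely hot in L1d). */
        const char *path = "/etc/hostname";

        /* Only the pointer is passed in a register (rdi on x86-64 Linux);
         * the kernel then copies the actual bytes out of user memory
         * before it can do the path lookup. */
        int fd = open(path, O_RDONLY);
        if (fd >= 0)
            close(fd);
        return 0;
    }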
I'd say a cache isn't the right structure for passing around data (or references to data) that you know will be accessed very soon by completely different code.
As userland code, you'd like to grant the OS access to a particular subset of lines in your D$ while keeping it out of your I$ (code cache) altogether. Traditional implementations fail in both respects, and at the same time the OS probably can't take advantage of its own historical locality, because userland has evicted it since the last call.
From what other people are saying it sounds like these problems are being worked on, though.
I don't think CAT can be used to partition the cache between kernel and userspace -- I'm not even sure how you'd go about doing that, given that you can (and do) have pages shared between them.
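For what it's worth, the Linux-side interface to CAT is resctrl, and it partitions by task or CPU group rather than by privilege level, which is part of why a clean kernel-vs-user split is awkward. A sketch of what the per-task version looks like (assumes a CAT-capable CPU, resctrl already mounted at /sys/fs/resctrl, and an illustrative L3 mask):

    /* Illustrative only: put the current process in a resctrl group limited
     * to a slice of L3. The kernel itself is not confined by this; the mask
     * applies to whatever task happens to be running. */
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static void write_str(const char *path, const char *s) {
        FILE *f = fopen(path, "w");
        if (!f) { perror(path); return; }
        fputs(s, f);
        fclose(f);
    }

    int main(void) {
        /* New allocation group. */
        mkdir("/sys/fs/resctrl/lowlat", 0755);

        /* Restrict the group to (say) the low 4 ways of L3 on cache id 0.
         * Valid mask values depend on the CPU; "f" is just an example. */
        write_str("/sys/fs/resctrl/lowlat/schemata", "L3:0=f\n");

        /* Move this process into the group. */
        char pid[32];
        snprintf(pid, sizeof pid, "%d\n", getpid());
        write_str("/sys/fs/resctrl/lowlat/tasks", pid);
        return 0;
    }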
That said, in our experiments, if you're using userspace network and NVMe drivers the context switches and the associated cache pollution aren't a problem, since they happen pretty infrequently (primarily just timer interrupts, and even those can be turned off, though we haven't needed to).
[1] http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-ma...