It's a bit old and doesn't include recent microarchitectural changes, but Section 2 of the FlexSC paper from 2010 (https://www.usenix.org/legacy/event/osdi10/tech/full_papers/...) has a detailed discussion of these indirect effects. I especially like how they quantify the indirect effects by measuring user code's IPC after the syscall.
I think those are the benchmarks I alluded to in the GP post.
Table 1 on page 3 is absolute gold: it quantifies the indirect costs by listing the number of cache lines and TLB entries evicted. The numbers are much larger than I remembered.
According to the table, the simplest syscall tested (stat) will evict 32 icache lines (L1), a few hundred dcache lines (L1), hundreds of L2 lines and thousands of L3 lines, and about twenty TLB entries.
After returning from said syscall, you'll pay a cache miss for every evicted line you touch again.
Also worth noting: inside the syscall, instructions per clock (IPC) is below 0.5. When the CPU is happy, you generally see IPC figures around 2 to 3.
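To put rough numbers on that, here is a back-of-envelope sketch in C. The function names and the 10-cycle miss penalty used in the usage note are my own assumptions for illustration, not figures from the paper. Only the IPC values (0.5 vs. 2) and the 32 icache lines come from the comments above.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles needed to retire `instructions` at a given IPC, rounded up.
 * IPC is passed as a ratio (ipc_num / ipc_den) to stay in integer math,
 * so IPC 0.5 is (1, 2) and IPC 2 is (2, 1). */
uint64_t cycles_at_ipc(uint64_t instructions, uint64_t ipc_num, uint64_t ipc_den)
{
    assert(ipc_num > 0 && ipc_den > 0);
    return (instructions * ipc_den + ipc_num - 1) / ipc_num;
}

/* Upper bound on the refill cost after a syscall. Assumes every evicted
 * line is touched again and that no misses overlap; `miss_cycles` is a
 * hypothetical per-miss penalty supplied by the caller. */
uint64_t refill_cycles_upper_bound(uint64_t lines_evicted, uint64_t miss_cycles)
{
    return lines_evicted * miss_cycles;
}
```

For 1000 instructions, cycles_at_ipc(1000, 1, 2) gives 2000 cycles at IPC 0.5 versus 500 from cycles_at_ipc(1000, 2, 1) at IPC 2, so the same work takes 4x as long. With an assumed 10-cycle penalty, refill_cycles_upper_bound(32, 10) gives 320 cycles for the icache lines alone, and real code will usually land below that bound since not every line is reused.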
Yeah, FlexSC / Soares is my favorite paper from OSDI 2010. The system call batching with "looped" multi-call they mention there relates to the roughly 30-line (not actually looping) assembly language in my other comment here (https://news.ycombinator.com/item?id=39189135) and in a few ways presaged the io_uring work.
Anyway, a 20-line example of a program written against said interpreter is https://github.com/c-blake/batch/blob/1201eefc92da9121405b79... but that only needs the wdcpy fake syscall, not the conditional jump forward (although that could/should be added if the open can succeed but the mmap can fail and you want the close clean-up also included in the batch, etc., etc.).
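For anyone who wants the flavor without reading the module, here is a toy user-space sketch of the two ideas: copying one call's result into a later call's argument, and jumping forward past calls when something fails. To be clear, this is NOT the interface of the batch module linked above. Every name here (batch_op, batch_run, fake_open, fake_map, BATCH_NOT_RUN) is hypothetical, the "syscalls" are plain function pointers so it runs without touching the kernel, and I fold the word copy into a field on each op rather than making it its own fake syscall.

```c
#include <assert.h>
#include <limits.h>

#define BATCH_MAX_ARGS 3
#define BATCH_NOT_RUN  LONG_MIN   /* marks ops that were jumped over */

/* One entry in a hypothetical batch "program". */
struct batch_op {
    long (*call)(const long *args); /* stand-in for a real syscall */
    long args[BATCH_MAX_ARGS];
    int  copy_to_op;   /* if > own index: copy the result into that later op... */
    int  copy_to_arg;  /* ...at this argument slot (the "word copy" idea) */
    int  skip_on_fail; /* number of ops to jump forward over if result < 0 */
};

/* Run the whole batch in one pass. results[i] gets op i's return value,
 * or BATCH_NOT_RUN if op i was skipped. Returns the number of ops executed. */
int batch_run(struct batch_op *ops, int n, long *results)
{
    int executed = 0;
    for (int i = 0; i < n; i++)
        results[i] = BATCH_NOT_RUN;
    for (int i = 0; i < n; i++) {
        long r = ops[i].call(ops[i].args);
        results[i] = r;
        executed++;
        if (r < 0) {
            i += ops[i].skip_on_fail;   /* forward-only conditional jump */
            continue;
        }
        if (ops[i].copy_to_op > i && ops[i].copy_to_op < n) {
            assert(ops[i].copy_to_arg >= 0 && ops[i].copy_to_arg < BATCH_MAX_ARGS);
            ops[ops[i].copy_to_op].args[ops[i].copy_to_arg] = r;
        }
    }
    return executed;
}

/* Hypothetical stand-ins: "open" returns whatever it is told to (an fd or
 * a negative error); "map" returns a value derived from the fd it receives. */
long fake_open(const long *a) { return a[0]; }
long fake_map(const long *a)  { return a[0] * 100; }

/* Two-op batch: open, then map using the fd produced by the open. */
int demo(long open_result, long results[2])
{
    struct batch_op ops[2] = {
        { .call = fake_open, .args = { open_result },
          .copy_to_op = 1, .copy_to_arg = 0, .skip_on_fail = 1 },
        { .call = fake_map, .args = { -1 }, .copy_to_op = -1 },
    };
    return batch_run(ops, 2, results);
}
```

Calling demo(7, res) runs both ops, and the second one sees the 7 that the first returned. Calling demo(-2, res) runs only the first op and jumps over the second. The point of doing this in the kernel is that the whole program costs one user/kernel crossing instead of one per call.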
I believe Cassyopia (also mentioned in Soares) hoped to be able to analyze code in user-space with compiler techniques to automagically generate such programs, but I don't know that the work ever got beyond a HotOS paper (i.e. the kinda hopes & dreams stage), and it was never clear how fancy the anticipated batches would be. The Xen/VMware multi-calls Soares2010 also mentions do not seem to have inline copy/jumps, though I'd be pretty surprised if that little kernel module is the only example of it.