I don't think any reverse debugging system can step the kernel backwards to this degree, unless they're doing something really clever (slow) with virtual machines and snapshots.
While it doesn't allow stepping in the kernel, a large part of rr is indeed intercepting everything the kernel may do and re-implementing its actions, writing down all the changes it makes to memory etc. (for Linux of course, not Windows). With that, an asynchronous kernel write would necessarily end up in the recording as an entry stating what the kernel wrote at a given point in time, which a debugger could reason about deterministically. (Of course this relies on the recording system reimplementing things accurately enough, but that's at least possible.)
You are correct. A time travel debugging solution that supports recording the relevant system call side effects would handle this. In fact, this system call is likely just rewriting the program counter register and maybe a few others, so it would likely be very easy to support if you could hook the relevant kernel operations, which may or may not be possible in Windows.
The replay system would also be unlikely to pose a problem. Replay systems usually just encode and replay the side effects, so there is no need to "reimplement" the operations. So, if you did some wacky system call, but all it did is write 0x2 to a memory location, M, you effectively just record: "at time T we issued a system call that wrote 0x2 to M". Then, when you get to simulated time T in the replay, you do not reissue the wacky system call, you just write 0x2 to M and call it a day.
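To make the "record the side effect, replay the write" idea concrete, here is a minimal sketch. Everything in it is made up for illustration: `SideEffect`, `record_wacky_syscall`, `replay`, the address 0x10 standing in for M, and the logical time 7 standing in for T. It is not how any particular tool stores its trace.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SideEffect:
    time: int      # logical time of the write (hypothetical unit, e.g. an event count)
    address: int   # where the kernel wrote
    data: bytes    # what it wrote


def record_wacky_syscall(memory: bytearray, log: list, time: int) -> None:
    """Stand-in for the real syscall: perform the write, then log its effect."""
    address, data = 0x10, b"\x02"   # hypothetical M and the value 0x2
    memory[address:address + len(data)] = data
    log.append(SideEffect(time, address, data))


def replay(memory: bytearray, log: list, until: int) -> None:
    """Apply logged writes up to and including `until`; no syscall is reissued."""
    for effect in log:
        if effect.time > until:
            break
        memory[effect.address:effect.address + len(effect.data)] = effect.data


# Recording run: the "syscall" happens at time T = 7.
recorded = bytearray(32)
log = []
record_wacky_syscall(recorded, log, time=7)

# Replay run: nothing changes before T, and at T the write simply reappears.
replayed = bytearray(32)
replay(replayed, log, until=6)
before_t = bytes(replayed)
replay(replayed, log, until=7)
```

The replay side never needs to know what the syscall was, only what it changed and when.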
This system call returned and then asynchronously wrote to memory some time later. How does the replay system even know the write happened, without scanning all memory? It can't generally. With knowledge of the specific call it could put just that address on a to-be-scanned list to wait for completion, but it still needs to periodically poll the memory. It is far more complicated to record than a synchronous syscall.
You hook the kernel write. That is why I said hook the relevant kernel operations.
The primary complexity is actually in creating a consistent timeline with respect to parallel asynchronous writes. Record-Replay systems like rr usually just serialize multithreaded execution during recording to avoid such problems. You could also do so by just serializing the executing thread and the parallel asynchronous write by stopping execution of the thread while the write occurs.
Again, not really sure if that would be possible in Windows, but there is nothing particularly mechanically hard about doing this. It is just a question of whether it matches the abstractions and hooks Windows uses and supports.
I don't think rr hooks actual kernel writes, but rather just has hard-coded information on each syscall of how to compute what regions of memory it may modify, reading those on recording and writing on replay.
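Roughly what I mean by hard-coded per-syscall information, as a toy sketch. This is not rr's actual code or data structures; `OUTPUT_REGIONS`, `record_syscall` and `replay_syscall` are names invented here, and memory is just a bytearray.

```python
def read_outputs(args, result):
    # read(fd, buf, count): on success the kernel filled `result` bytes at buf.
    _fd, buf, _count = args
    return [(buf, result)] if result > 0 else []


def pipe_outputs(args, result):
    # pipe(fds): on success the kernel wrote two 4-byte ints at fds.
    (fds_ptr,) = args
    return [(fds_ptr, 8)] if result == 0 else []


# Hypothetical table: syscall name -> how to compute the regions it may modify.
OUTPUT_REGIONS = {"read": read_outputs, "pipe": pipe_outputs}


def record_syscall(memory, name, args, result, log):
    """After the real syscall returns, save the contents of its output regions."""
    regions = OUTPUT_REGIONS[name](args, result)
    saved = [(addr, bytes(memory[addr:addr + size])) for addr, size in regions]
    log.append((name, result, saved))


def replay_syscall(memory, entry):
    """On replay, skip the syscall: write the saved bytes back and return the result."""
    _name, result, saved = entry
    for addr, data in saved:
        memory[addr:addr + len(data)] = data
    return result


# Recording: pretend the kernel just completed read(0, buf=8, count=16) == 3.
recording_memory = bytearray(64)
recording_memory[8:11] = b"abc"
syscall_log = []
record_syscall(recording_memory, "read", (0, 8, 16), 3, syscall_log)

# Replay into a fresh address space.
replay_memory = bytearray(64)
replayed_result = replay_syscall(replay_memory, syscall_log[0])
```

Note this only works because the write is known to be finished when the syscall returns, which is exactly what an asynchronous write breaks.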
As such, for an asynchronous kernel write you'd want to set up the kernel to never mutate recordee memory, instead having it modify recorder-local memory, which the recorder can then copy over to the real process at a time of its choosing, recording when that happens while at it (see https://github.com/rr-debugger/rr/issues/2613). But that can introduce large delays, changing execution characteristics (or, if done improperly, even making things appear to happen in a different order than they would under the kernel). And you still need the recording system to have accurately implemented the forwarding of whatever edge case of the asynchronous operation you hit.
And, if done as just that, you'd still hit the problem encountered in the article of it looking like unrelated code changes the memory (whereas with synchronous syscalls you'd at least see the mutation happening on a syscall instruction). So you'd want some extra injected recordee instruction(s) to separate recordee actions from asynchronous kernel ones. As a sibling comment notes, rr as-is doesn't handle any asynchronous kernel write cases (though it's certainly not impossible for it to).
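A toy model of the redirect-then-forward scheme, in case it helps. The `Recorder` class, its method names and the `"async_kernel_write"` tag are all hypothetical; a real recorder would be dealing with actual kernel buffers and a stopped tracee, not bytearrays.

```python
class Recorder:
    def __init__(self, recordee_memory: bytearray):
        self.recordee = recordee_memory
        self.pending = []   # completions sitting in recorder-local memory
        self.log = []       # (kind, time, address, data) trace entries
        self.clock = 0      # recordee logical time (hypothetical unit)

    def step_recordee(self) -> None:
        """The recordee executes one unit of work."""
        self.clock += 1

    def kernel_completion(self, address: int, data: bytes) -> None:
        """The kernel finishes an async operation. It only ever touches
        recorder-local memory; the recordee sees nothing yet."""
        self.pending.append((address, bytes(data)))

    def flush(self) -> None:
        """With the recordee stopped, forward pending writes and log each one
        as its own event, distinct from whatever the recordee was running."""
        for address, data in self.pending:
            self.recordee[address:address + len(data)] = data
            self.log.append(("async_kernel_write", self.clock, address, data))
        self.pending.clear()


recorder = Recorder(bytearray(16))
for _ in range(3):
    recorder.step_recordee()
recorder.kernel_completion(4, b"\x2a")      # completes around time 3
value_before_flush = recorder.recordee[4]   # recordee memory is still untouched
recorder.step_recordee()
recorder.flush()                            # becomes visible, and is logged, at time 4
```

The gap between the completion (time 3) and the flush (time 4) is the delay mentioned above, and the separate event kind is the part that would let a debugger attribute the write to the kernel rather than to whatever recordee instruction happened to be next.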
It looks like they didn't actually need to step the kernel in the end - it just helped understand the bug (which I'd say was in user space - injecting an exception into select(), which prevented it from exiting normally - even though a kernel behaviour was involved in how the bug manifested).
The time travel debugging available with WinDbg should be able to wind back to the point of corruption - that'd probably have taken a few days off the initial realisation that an async change to the stack was causing the problem.
There'd still be another reasoning step required to understand why that happened - but you would be able to step back in time e.g. to when this buffer was previously used on the stack to see how select() was submitting it to the kernel.
In fact, a data breakpoint / watchpoint could likely have taken you back from the corruption to the previous valid use, which may have been the missing piece.
When the assertion on the stack sentinel was reached, they could have watched the value and then reverse-continued, which in theory would reveal the APC causing the issue - or at least the instruction writing the value. Not sure how well reverse debugging works on Windows though, I'm only familiar with rr.
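Conceptually, watch plus reverse-continue is just "find the most recent recorded write to this address before now". A small sketch over a made-up trace; the `Write` tuple, the sources, addresses and times are all invented, and a real debugger works on a recorded execution rather than a list:

```python
from typing import NamedTuple, Optional


class Write(NamedTuple):
    time: int      # logical time of the write
    source: str    # who performed it (label for illustration only)
    address: int
    value: int


def reverse_continue(trace, watched: int, now: int) -> Optional[Write]:
    """Return the latest write to `watched` strictly before time `now`."""
    for write in reversed(trace):
        if write.time < now and write.address == watched:
            return write
    return None


# Hypothetical recording: the sentinel lives at 0x100.
trace = [
    Write(1, "select() setup", 0x100, 0xDEAD),
    Write(5, "unrelated user code", 0x200, 7),
    Write(9, "kernel APC", 0x100, 0x0),
]

# The assertion fires at time 12; going backwards lands on the corrupting write.
corrupting = reverse_continue(trace, watched=0x100, now=12)
# Going backwards again from there lands on the previous valid use.
previous_use = reverse_continue(trace, watched=0x100, now=corrupting.time)
```

The second hop is the one that takes you back to where the buffer was handed to the kernel.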