This looks like the perfect job for an FPGA, given a fast enough interconnect. (The poster above mentions CAPI, which sits on top of PCIe, but I have not had a chance to try it out yet.)
I'm probably the poster above. ;-) Yes, we layer on top of PCIe for the physical transport, but once an adapter is in CAPI mode, it can do address translations, participate in locks, and look more or less like a slightly strange other thread as far as code running on the main CPU is concerned.
Since the logic inside the accelerator can do pointer chasing, it can communicate directly with the application and bypass a lot of what happens when a normal I/O goes out to other FPGAs today.
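To make the "other thread" idea a bit more concrete, here is a rough software analogy in Python. It is not CAPI code and calls no real CAPI API: an ordinary thread stands in for the accelerator, and the names `WorkItem`, `accelerator`, `build_list`, and `run` are made up for illustration. The point is only that the worker follows the application's own pointers and takes the application's own lock, rather than having data copied into a separate buffer for it.

```python
import threading
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkItem:
    """One node of a linked list that lives in the application's memory."""
    value: int
    result: Optional[int] = None
    next: Optional["WorkItem"] = None


def accelerator(head, lock, done):
    """Hypothetical stand-in for the accelerator logic (plain software here)."""
    node = head
    while node is not None:
        with lock:  # same lock object the application uses
            node.result = node.value * node.value
        node = node.next  # pointer chase through the application's structure
    done.set()  # signal completion to the application


def build_list(values):
    """Build the linked list the application hands over."""
    head = None
    for v in reversed(values):
        head = WorkItem(value=v, next=head)
    return head


def run(values):
    """Application side: pass only the head pointer, then read results in place."""
    head = build_list(values)
    lock = threading.Lock()
    done = threading.Event()
    worker = threading.Thread(target=accelerator, args=(head, lock, done))
    worker.start()
    done.wait()
    worker.join()

    results = []
    node = head
    while node is not None:
        with lock:
            results.append(node.result)
        node = node.next
    return results


if __name__ == "__main__":
    print(run([1, 2, 3]))
```

Note that the application only hands over the head of the list. Everything else is reached by walking `next`, which is what the shared address translation is supposed to make possible on the real hardware.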