That's an interesting question that I haven't explored much. The network on the ZedRipper is a unidirectional synchronous ring operating at the full 140 MHz, with a round-trip latency of ~32 clocks, but the interface exposed to the Z80 is a sort of re-targetable serial port (you write an 8-bit 'destination' register, and then you push bytes to that node). The current buffer depth on the receive side is only a single byte, so the sender needs to wait until the destination node has read each byte and the credit has been returned before it can send the next one. Turbo Pascal uses the 'Real48' format for floating point - 6 bytes per number - and I believe floating point operations take several thousand clock cycles. So with a tight loop on both sides, you might transfer a floating point number to a neighboring node in ~500 cycles, a fraction of the cost of a single floating point operation.
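
To make the arithmetic behind that ~500 figure concrete, here is a rough back-of-envelope model in Python. It is only a sketch: the 32-clock round trip and the 6-byte Real48 come from above, but the per-byte software overheads on each side are made-up placeholders picked to land near 500, and the model lumps Z80 cycles and network clocks together as if they were one clock.

```python
# Back-of-envelope model of a credit-limited transfer, not measured numbers.
RING_ROUND_TRIP = 32     # clocks for a byte to go out and its credit to come back
REAL48_BYTES = 6         # Turbo Pascal Real48 is 6 bytes
SENDER_OVERHEAD = 25     # ASSUMPTION: cycles of Z80 loop work per byte sent
RECEIVER_OVERHEAD = 25   # ASSUMPTION: cycles of Z80 loop work per byte read


def real48_transfer_cycles(sender_overhead=SENDER_OVERHEAD,
                           receiver_overhead=RECEIVER_OVERHEAD,
                           ring_round_trip=RING_ROUND_TRIP,
                           payload_bytes=REAL48_BYTES):
    """Estimate cycles to move one value with a one-byte receive buffer.

    With a single-byte buffer nothing overlaps: the sender issues a byte,
    the receiver reads it, and the credit travels the rest of the ring
    before the next byte can go out.
    """
    per_byte = sender_overhead + ring_round_trip + receiver_overhead
    return payload_bytes * per_byte


print(real48_transfer_cycles())  # 492 with the placeholder overheads above
```

The point of the model is just that with a one-byte buffer the ring latency gets paid once per byte, so six bytes means six full round trips plus whatever the loops cost on both ends.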

If I improved the network a bit - deeper receive buffers at a minimum, maybe a simple DMA engine - you could probably get that down to <100 cycles to forward a Real48 to a neighbor. The performance of emulated floating point on an 8-bit CPU is sufficiently bad, and the network performance is sufficiently good, that you probably could get away with some very fine-grained parallelism that way! When I'm back to commuting, I should write an n-body gravity simulator or something for it so that there is lots of numerical work to spread around, and see how much of a speedup I can get.
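
As a rough idea of what that n-body experiment might show, here is a hypothetical estimate in Python. Everything in it is an assumption for illustration: 3000 cycles per floating point operation (standing in for "several thousand"), 20 operations per body pair, 3 coordinates per body, and a simple scheme where each node owns an equal share of the bodies and positions are passed around the ring once per step. Nothing here has been run on the actual hardware.

```python
def nbody_speedup(n_bodies, n_nodes,
                  fp_cycles=3000,       # ASSUMPTION: cycles per software FP op
                  transfer_cycles=500,  # cycles to forward one Real48 to a neighbor
                  flops_per_pair=20,    # ASSUMPTION: FP ops per pairwise interaction
                  coords=3):            # ASSUMPTION: 3D positions
    """Estimate the speedup of one all-pairs n-body step split across a ring."""
    # One node doing all n*n pairwise interactions by itself.
    serial = n_bodies * n_bodies * flops_per_pair * fp_cycles
    # Each node computes the interactions for its own share of the bodies.
    compute = serial / n_nodes
    # Each node must receive the positions owned by every other node;
    # on a ring these arrive one Real48 at a time from its upstream neighbor.
    values_received = n_bodies * (n_nodes - 1) / n_nodes * coords
    comm = values_received * transfer_cycles
    return serial / (compute + comm)


for transfer in (500, 100):
    s = nbody_speedup(n_bodies=16, n_nodes=16, transfer_cycles=transfer)
    print(f"{transfer} cycles per Real48: estimated speedup {s:.2f}x")
```

Under these assumptions the communication term is tiny next to the floating point work, which is the whole argument: when a single multiply costs thousands of cycles, spending a few hundred to ship the operands elsewhere is cheap. The model ignores that a node forwarding data is also busy receiving it, so real numbers would likely be somewhat worse.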