It's your project and you can call things whatever you like; still, the 8 free cores seem to me more like a "horde" than a "hoard".
Generally, things hoarded are sequestered on a shelf against some imagined future need instead of being put into service. At the same time, a "horde" usually numbers more than 8, except when treated ironically, after the manner of "three's a crowd". So I would suggest "horde", or perhaps "crowd" - though 8 is a bit thin even for a crowd, unless it concerns a certain president's inauguration attendance.
Oops, changed to 'horde'! The distinction is mostly architectural, in that the 'horde' is composed of an undifferentiated mass of CPUs (currently 8, could probably be expanded up to 128 with a cache-based memory hierarchy, which would certainly feel more 'horde-like'). The application processors can be accessed individually, whereas the horde can only be accessed as a group.
What's the minimum workload that can be transferred to another processor for a speed gain?
For instance, can you do little things like floating-point x*y + u*v by bumping the subexpressions to separate units, and have the parallelism outweigh the communication cost for a net gain?
That's an interesting question that I haven't explored much. The network on the ZedRipper is a unidirectional synchronous ring operating at the full 140 MHz, with a round-trip latency of ~32 clocks, but the interface exposed to the Z80 is a sort of re-targetable serial port (you write an 8-bit 'destination' register, and then you push bytes to that node). The current buffer depth on the receive side is only a single byte, so the sender needs to wait until the destination node has read the byte and the credit gets returned. Turbo Pascal uses the 'Real48' format for floating point (6 bytes per number), and I believe floating-point operations take several thousand clock cycles. So in a tight loop on both sides, you might transfer a floating-point number to a neighboring node in ~500 cycles.
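In rough Turbo Pascal terms, the send side looks something like the sketch below. The port addresses and status-bit layout are placeholders I made up for illustration, not the real ones, but the flow - set the destination register, then push bytes as credits come back - is the interface described above.

    const
      NetDest = $80;  { placeholder I/O address: destination-node register }
      NetData = $81;  { placeholder I/O address: outgoing data byte }
      NetStat = $82;  { placeholder I/O address: status, bit 0 = credit free }

    { Push one 6-byte Turbo Pascal Real (the 'Real48' format) to node Dest.
      With only a single-byte receive buffer, we have to wait for the
      credit to come back before every byte. }
    procedure SendReal(Dest: Byte; R: Real);
    var
      B: array[1..6] of Byte;
      I: Integer;
    begin
      Move(R, B, 6);                 { grab the raw 6 bytes of the Real }
      Port[NetDest] := Dest;         { retarget the 'serial port' }
      for I := 1 to 6 do
      begin
        while (Port[NetStat] and 1) = 0 do ;  { spin until credit returns }
        Port[NetData] := B[I];
      end;
    end;

Each byte pays a credit round-trip, which is roughly where that ~500-cycle figure for a whole Real48 comes from.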
Especially if I improved the network a bit - deeper receive buffers at a minimum, maybe a simple DMA engine - you could probably get it down to <100 cycles to forward a Real48 to a neighbor. The performance of emulated floating point on an 8-bit CPU is sufficiently bad, and the network performance is sufficiently good, that you probably could get away with some very fine-grained parallelism that way! When I'm back to commuting, I should write an n-body gravity simulator or something for it so that there is lots of numerical work to spread around, and see how much of a speedup I can get.
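For the x*y + u*v case from the question above, the pattern might look something like this - again with the placeholder ports, a RecvReal mirroring the SendReal sketch (I'm assuming bit 1 of the status port means 'byte waiting'), and purely illustrative node numbers:

    const
      Neighbor = 1;  { illustrative node IDs, not real addressing }
      HomeNode = 0;

    function RecvReal: Real;
    var
      B: array[1..6] of Byte;
      R: Real;
      I: Integer;
    begin
      for I := 1 to 6 do
      begin
        while (Port[NetStat] and 2) = 0 do ;  { wait for an incoming byte }
        B[I] := Port[NetData];
      end;
      Move(B, R, 6);
      RecvReal := R;
    end;

    { Home node: ship u*v to the neighbor, do x*y locally, sum on return. }
    function ParSum(X, Y, U, V: Real): Real;
    var
      T: Real;
    begin
      SendReal(Neighbor, U);
      SendReal(Neighbor, V);
      T := X * Y;              { local multiply runs while the worker does U*V }
      ParSum := T + RecvReal;
    end;

    { Worker node: multiply whatever pair arrives, send the product home. }
    procedure WorkerLoop;
    var
      A, B: Real;
    begin
      while True do
      begin
        A := RecvReal;
        B := RecvReal;
        SendReal(HomeNode, A * B);
      end;
    end;

Back of the envelope: two Reals out and one product back is ~1500 cycles of communication at today's ~500 per Real, against a multiply that takes several thousand cycles - a narrow win now, and a comfortable one at <100 cycles per Real.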
I have a real Kaypro 2 computer with a 4 MHz Z80 in it, which I also use Turbo Pascal on. On the Kaypro it's perfectly usable, but you get used to waiting a few seconds when you're loading files, compiling, etc. On the ZedRipper, when things are executing out of RAM, everything is instantaneous. I think the CPU core I'm using is close to cycle-accurate, so at 140 MHz versus the Kaypro's 4 MHz it's probably ~35x faster when executing actual code.
Thank you. I absolutely refuse to get into retro computing. I refuse. I’m not going to. So I don’t think your amazing work has me at all interested. I have enough hobbies, and my wife knows it.
The actual logic requirements are pretty minimal (~1000 LUTs, maybe?), so it mostly depends on what kind of memory hierarchy you want. The ZedRipper relies entirely on internal BlockRAMs, which are a much more precious resource - I think I'm using all of the largest BlockRAMs, but only ~3% of the logic resources in my chip. An easy solution to this is to replace the full 64KB of BlockRAM with a small cache, and then arbitrate for access to a large DRAM. So you wind up with a knob that you can turn to shift resources between more CPU cores with less non-cache memory bandwidth per core, or fewer CPU cores with more bandwidth. I went with the latter, but if I cranked the knob all the way in the other direction, I might be able to get somewhere in the range of 256-512 Z80 cores on this chip, but with only ~1-2KB of cache per core.
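As a sanity check on that range, with an assumed budget rather than the chip's real numbers - say the BlockRAM currently holding 8 full 64KB cores, i.e. 512KB:

    { Back-of-envelope for the cores-vs-cache knob. The 512KB budget
      is an assumption (8 cores x 64KB), not a measured figure. }
    const
      BudgetKB = 512;

    function CoresFor(KBPerCore: Integer): Integer;
    begin
      CoresFor := BudgetKB div KBPerCore;
    end;

    begin
      Writeln('64KB per core (no cache): ', CoresFor(64), ' cores');  { 8 }
      Writeln('2KB cache per core      : ', CoresFor(2), ' cores');   { 256 }
      Writeln('1KB cache per core      : ', CoresFor(1), ' cores');   { 512 }
    end.

That lines up with the 256-512 range above: 1-2KB caches pack 32-64x as many cores into the same BlockRAM.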
At that point, the performance will depend on the nature of your application - if each core's code and working set fit in a tiny amount of RAM, the caching scheme will give you great performance; if each core needs full-speed, random access to its entire 64KB of memory, keeping everything on-die is the way to go.