The actual logic requirements are pretty minimal (~1000 LUTs, maybe?), so it mostly depends on what kind of memory hierarchy you want. The ZedRipper relies entirely on internal BlockRAMs, which are a much more precious resource - I think I'm using all of the largest BlockRAMs, but only ~3% of the logic resources in my chip. An easy solution to this is to replace the full 64KB of BlockRAM per core with a small cache, and then have the cores arbitrate for access to a large DRAM. So you wind up with a knob that you can turn to shift resources between more CPU cores with less non-cache memory bandwidth per core, or fewer CPU cores with more bandwidth. I went with the latter, but if I cranked the knob all the way in the other direction, I might be able to get somewhere in the range of 256-512 Z80 cores on this chip (but with only ~1-2KB of cache per core).
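To make the "knob" concrete, here is a rough back-of-the-envelope sketch in Python. None of the budget numbers come from a datasheet: `bram_budget_kb` and `lut_budget` are hypothetical placeholders (the 512KB figure is just what 256-512 cores at ~1-2KB each would add up to), and ~1000 LUTs per core is my reading of the estimate above. The function name is made up for illustration.

```python
def max_cores(bram_budget_kb, per_core_kb, lut_budget, luts_per_core=1000):
    """Estimate how many cores fit, limited by whichever resource runs out first.

    All inputs are assumptions for illustration, not measured figures.
    """
    by_memory = bram_budget_kb // per_core_kb   # cores the on-chip RAM can feed
    by_logic = lut_budget // luts_per_core      # cores the logic fabric can hold
    return min(by_memory, by_logic)


if __name__ == "__main__":
    BRAM_KB = 512        # hypothetical on-chip RAM budget
    LUTS = 1_000_000     # hypothetical, large enough that RAM is the limit
    for per_core_kb in (64, 2, 1):
        cores = max_cores(BRAM_KB, per_core_kb, LUTS)
        print(f"{per_core_kb:>2} KB per core -> {cores} cores")
```

With those placeholder budgets, memory is the binding constraint at every setting, which matches the point above: logic is cheap, BlockRAM is not.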
At that point, the performance will depend on the nature of your application - if its instructions and working set fit in a tiny amount of RAM, the caching scheme will give you great performance. If each core needs full-speed, random access to its entire 64KB of memory, keeping everything on-die is the way to go.
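A toy simulation shows how much the access pattern matters. This is only a sketch: it assumes a direct-mapped cache with 1KB capacity and 16-byte lines (both made-up parameters, not anything from the ZedRipper), and it counts hits rather than modelling real DRAM latency or arbitration.

```python
import random


def hit_rate(addresses, cache_bytes=1024, line_bytes=16):
    """Fraction of accesses that hit in a direct-mapped cache.

    cache_bytes and line_bytes are illustrative assumptions.
    """
    num_lines = cache_bytes // line_bytes
    slots = [None] * num_lines          # which memory line each slot holds
    hits = 0
    total = 0
    for addr in addresses:
        line = addr // line_bytes       # memory line containing this address
        index = line % num_lines        # the one slot that line may occupy
        if slots[index] == line:
            hits += 1
        else:
            slots[index] = line         # miss: fetch the line, evict the old one
        total += 1
    return hits / total if total else 0.0


# Small working set: a 256-byte loop executed 100 times.
tight_loop = [a for _ in range(100) for a in range(256)]

# Worst case: uniformly random accesses across the full 64KB address space.
rng = random.Random(0)
scattered = [rng.randrange(65536) for _ in range(len(tight_loop))]

if __name__ == "__main__":
    print(f"tight loop: {hit_rate(tight_loop):.3f}")
    print(f"scattered:  {hit_rate(scattered):.3f}")
```

The tight loop only misses on its first pass, so nearly every access is served from the cache. The scattered pattern almost always misses, so each of those accesses would have to go out to the shared DRAM.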