> The expected speedup is around 100x, or 1000x for numerical stuff or allocation-heavy work that can be done statically. Whenever you get 10,000x or above, it's because you've written a better algorithm.
Anecdotally, I recently rewrote a piece of Python code in Rust and got a ~300x speedup, but let's be conservative and call it 100x. Now let's extrapolate from that. In native code you can use SIMD, which can give you a 10x speedup, so now we're at 1000x. In native code you can also easily use multiple threads; assuming a machine with a reasonably high number of cores, say 32 (because that's what I had for the last 4 years), we're now at a 32,000x speedup. So to me those are very realistic numbers, assuming of course that the problem you're solving can be sped up with SIMD and multiple threads, which is not always the case. So you're probably mostly right.
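As a back-of-the-envelope sketch, the compounding above is just multiplication of independent factors. All three numbers are the assumptions from the comment, not measurements:

```python
# Back-of-envelope compounding of independent speedup factors.
# Every number here is an illustrative assumption, not a benchmark.
native_over_python = 100  # conservative native-rewrite speedup
simd_factor = 10          # assumes the workload vectorizes well
thread_factor = 32        # assumes near-linear scaling on 32 cores

total = native_over_python * simd_factor * thread_factor
print(total)  # 32000
```

Of course the factors only multiply like this when they're genuinely independent, i.e. the SIMD and threading gains don't overlap or hit a memory-bandwidth ceiling.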
Trivially parallelizable algorithms are definitely in the "not generally applicable" regime. But you're right, they're capable of hitting arbitrarily large, hardware-dependent speedups. And that's definitely something a sufficiently intelligent compiler should be able to capture through dependency analysis.
Note that I don't doubt the 35k speedup -- I've seen speedups into the millions -- I'm just saying there's no way that can be a representative speedup that users should expect to see.
There's a serialization overhead both on dispatch and return that makes multiprocessing in Python unsuitable for some problems that would otherwise be solved well with threads in other languages.
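To make that overhead concrete: multiprocessing pickles every argument on dispatch and every result on return, so you pay a full serialize/deserialize round trip per task even when the worker does almost no work. A minimal sketch (the payload size is arbitrary, just large enough to make the cost visible):

```python
import pickle
import time

# multiprocessing pickles arguments in the parent and unpickles them in
# the worker, then does the reverse for the result. This measures just
# that round trip, without spawning a process.
payload = list(range(1_000_000))  # hypothetical large work item

start = time.perf_counter()
blob = pickle.dumps(payload)   # cost paid on every dispatch
restored = pickle.loads(blob)  # cost paid again in the worker
elapsed = time.perf_counter() - start

print(f"serialized {len(blob)} bytes in {elapsed:.3f}s")
```

With real threads and shared memory, that cost is simply zero, which is why some workloads that thread well in other languages don't survive the move to Python's multiprocessing.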
The other languages are not taking and releasing a globally exclusive lock (the GIL) every time execution crosses an API boundary, so "shared nothing" in those languages is truly shared nothing. Additionally, Python's multiprocessing carries a lot of restrictions which make it hard to pass more complex messages.
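One such restriction, sketched below: because every message goes through pickle, many ordinary Python objects (lambdas, closures, open file handles, locks) simply can't be sent to a worker:

```python
import pickle

# multiprocessing serializes messages with pickle, and pickle can't
# handle lambdas (it pickles functions by qualified name, which a
# lambda doesn't have) -- so a lambda can't be passed as a message.
try:
    pickle.dumps(lambda x: x * 2)
    picklable = True
except Exception:  # pickle raises PicklingError/AttributeError here
    picklable = False

print(picklable)  # False
```

The usual workarounds (module-level functions only, or third-party serializers) are exactly the kind of friction the parent comment is describing.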