I've been rewriting Python->C for nearly 20 years now. The expected speedup is around 100x, or 1000x for numerical stuff or allocation-heavy work that can be done statically. Whenever you get 10,000x or above, it's because you've written a better algorithm. You can't generalize that. A 35k speedup is a cool demo but should be regarded as hype.
I wrote a simple Monte Carlo implementation in Python 3.11 and Rust. Python managed 10 million checks in a certain timeframe, while Rust could perform 9 billion checks in the same timeframe.
That's about a 900x speedup, if I'm not mistaken. I suspect Mojo's advertised speedup is through the same process, except on benchmarks that are not dominated by syscalls (RNG calls in this instance).
The one difference was that the Rust version used a parallel iterator (rayon, a one-line change), whereas I've found parallelism in Python to be more pain than it's worth for most use cases.
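For reference, here's a minimal sketch of the kind of pure-Python Monte Carlo loop I mean (not my exact benchmark; names and sizes are illustrative):

```python
# Hypothetical pure-Python baseline: estimate pi by sampling the unit square.
import random

def estimate_pi(samples: int) -> float:
    inside = 0
    for _ in range(samples):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / samples

print(estimate_pi(10_000_000))  # the 10-million-check scale mentioned above
```

The Rust version is essentially the same loop, with the sequential iterator swapped for rayon's parallel equivalent.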
You mean, what if you're only doing matrix stuff? Then it's probably easier to let numpy do the heavy lifting. You'll probably take less than a 5x performance hit, if you're doing numpy right. And if you're doing matrix multiplication, numpy will end up faster because it's backed by a BLAS, which mortals such as myself know better than to compete with.
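A rough sketch of what I mean by letting numpy do the heavy lifting (sizes are arbitrary):

```python
# If the workload is genuinely matrix-shaped, hand it to numpy and the BLAS
# it was built against instead of writing your own loops.
import numpy as np

a = np.random.rand(2000, 2000)
b = np.random.rand(2000, 2000)

c = a @ b                               # dispatched to the BLAS
col_norms = np.linalg.norm(c, axis=0)   # follow-up work stays vectorized in C
```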
> What's the biggest offender that you see?
Umm... every line of Python? Member access. Function calls. Dictionaries that can fundamentally be mapped to int-indexed arrays. Reference counting. Tuple allocation.
One fun exercise is to take your vanilla python code, compile it in Cython with the -a flag to produce an HTML annotation. Click on the yellowest lines, and it shows you the gory details of what Cython does to emulate CPython. It's not exactly what CPython is doing (for example, Cython elides the virtual machine), but it's close enough to see where time is spent. Put the same code through the python disassembler "dis" to see what virtual machine operations are emitted, and paw through the main evaluation loop [1]; or take a guided walkthrough at [2].
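To see the "every line is an offender" point concretely, a quick illustrative dis session:

```python
# What CPython actually executes for one innocuous-looking line.
import dis

def member_access(obj):
    return obj.value + 1

dis.dis(member_access)
# Prints LOAD_FAST / LOAD_ATTR / LOAD_CONST / BINARY_OP / RETURN_VALUE
# (exact opcode names vary by Python version); each opcode is another trip
# through the evaluation loop, plus refcount traffic on every object touched.
```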
Because C++ lets you fuse multiple operations (whereas in numpy you often have intermediate arrays), I routinely get 20x speedups when porting from numpy to C++. Good libraries like Eigen help a lot.
Most non-trivial numpy operations create temporaries, which means new allocations and copies. Eigen3's design lets you avoid these through clever compile-time tricks while remaining high-level.
Sometimes numpy can elide those (e.g. why a += b is faster than a = a + b), but this is not possible in general. Sometimes people use monstrosities like einsum... but I find it more intuitive to just write in C or C++...
In addition to the time spent in allocation / gc / needless copying, the memory footprint can be higher by a factor of a few (or more...).
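To make the elision point concrete, a small illustrative numpy sketch:

```python
# "a = a + b" materializes a brand-new array for the sum before rebinding the
# name; "a += b" writes into a's existing buffer.
import numpy as np

a = np.ones((4096, 4096))   # ~128 MiB of float64
b = np.ones((4096, 4096))

a = a + b   # allocates a second 128 MiB array
a += b      # in-place, no intermediate allocation

# Chained expressions like (a * b + 1.0) / 2.0 still allocate a full-size
# intermediate at each step; that is the fusion Eigen-style expression
# templates give you for free in C++.
```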
Yep, einsum is included in "doing numpy right." And for what it's worth, it's horrid to use and still won't get around cases like x -> cos(x). I haven't needed the power of eigen for a couple of years, but I appreciate the tip.
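And the cos(x) case, for anyone wondering what einsum can't express:

```python
# Elementwise transcendental functions fall outside einsum's contraction
# notation; each ufunc call still allocates a full-size output.
import numpy as np

x = np.random.rand(10_000_000)
y = np.cos(x) * 2.0 + 1.0   # a fresh ~80 MB array at each of the three steps
np.cos(x, out=x)            # out= avoids one allocation, at some cost in clarity
```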
> numpy will end up faster because it's backed by a BLAS, which mortals such as myself know better than to compete with.
I'd like to dig a little here, for my own curiosity. How is this possible? I.e., beating C or Rust code using... arcane magic. It reminds me of how React was touted as fast; I couldn't figure out how a JavaScript lib could be faster than JavaScript.
BLAS uses low level routines that are difficult to replicate in C. Some of the stuff is written in FORTRAN so as to avoid aliasing issues inherent to C arrays. Some implementations use direct assembly operations. It is heavily optimized by people who really know what they're doing when it comes to floating point operations.
BLAS are incredibly well optimized by people doing their life's work on just matrix multiplication, hand-tuning their assembly, benchmarking it per platform to optimize cache use, etc -- they are incredible feats of software engineering. For the multiplication of large matrices (cubic time), the performance gains can quickly overwhelm the quadratic-time overhead of the scripting language.
BLAS is a very well optimized library. I think a lot of it is in Fortran, which can be faster than C. It is very heavily used in scientific computing. BLAS also has methods that have been hand-tuned in assembly. It's not magic, but the amount of work that has gone into it is not something you would probably want to replicate.
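A toy comparison of the claim being made here, with an arbitrary matrix size:

```python
# The cubic work inside a matrix multiply swamps the interpreter overhead:
# numpy's @ hands the whole thing to the BLAS, the triple loop stays in Python.
import time
import numpy as np

n = 200
a = np.random.rand(n, n)
b = np.random.rand(n, n)

t0 = time.perf_counter()
c_blas = a @ b
t1 = time.perf_counter()

c_loop = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        s = 0.0
        for k in range(n):
            s += a[i, k] * b[k, j]
        c_loop[i, j] = s
t2 = time.perf_counter()

print(f"BLAS: {t1 - t0:.5f}s   pure-Python loops: {t2 - t1:.2f}s")
```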
> The expected speedup is around 100x, or 1000x for numerical stuff or allocation-heavy work that can be done statically. Whenever you get 10,000x or above, it's because you've written a better algorithm.
Anecdotally, I recently rewrote a piece of Python code in Rust and got a ~300x speedup, but let's be conservative and call it 100x. Now let's extrapolate from that. In native code you can use SIMD, and that can give you a 10x speedup, so now we're at 1000x. In native code you can also easily use multiple threads, so assuming a machine with a reasonably high number of cores, say 32 (because that's what I had for the last 4 years), we're now at a 32,000x speedup. So to me those are very realistic numbers, but of course that assumes the problem you're solving can be sped up with SIMD and multiple threads, which is not always the case. So you're probably mostly right.
Trivially parallelizable algorithms are definitely in the "not generally applicable" regime. But you're right, they're capable of hitting arbitrarily large, hardware-dependent speedups. And that's definitely something a sufficiently intelligent compiler should be able to capture through dependency analysis.
Note that I don't doubt the 35k speedup -- I've seen speedups into the millions -- I'm just saying there's no way that can be a representative speedup that users should expect to see.
There's a serialization overhead both on dispatch and return that makes multiprocessing in Python unsuitable for some problems that would otherwise be solved well with threads in other languages.
The other languages are not taking/releasing a globally mutually exclusive GIL every time execution crosses an API boundary, and thus "shared nothing" in those languages is truly shared nothing. Additionally, Python's multiprocessing carries a lot of restrictions which make it hard to pass more complex messages.
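A small illustrative sketch of the serialization cost being described:

```python
# Every argument sent to a worker and every result coming back crosses the
# process boundary via pickle, so large payloads are copied on both legs.
import multiprocessing as mp
import numpy as np

def double(chunk):
    # The worker operates on a pickled copy, never on the parent's memory.
    return chunk * 2

if __name__ == "__main__":
    big = np.random.rand(10_000_000)        # ~80 MB of float64
    chunks = np.array_split(big, 4)         # each chunk pickled on dispatch
    with mp.Pool(processes=4) as pool:
        doubled = pool.map(double, chunks)  # results pickled on the way back
```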