You mean, what if you're only doing matrix stuff? Then it's probably easier to let numpy do the heavy lifting. You'll probably take less than a 5x performance hit, if you're doing numpy right. And if you're doing matrix multiplication, numpy will end up faster because it's backed by a BLAS, which mortals such as myself know better than to compete with.
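Rough sketch of what I mean by "doing numpy right" (numbers are machine-dependent, and the array size is just something I picked):

    import numpy as np
    import timeit

    x = np.random.rand(1_000_000)

    def python_loop(x):
        # One interpreter round-trip per element: attribute lookups,
        # function calls, float boxing, on every single iteration.
        total = 0.0
        for v in x:
            total += v * v
        return total

    def vectorized(x):
        # One call; the loop runs in C inside numpy.
        return np.dot(x, x)

    print(timeit.timeit(lambda: python_loop(x), number=3))
    print(timeit.timeit(lambda: vectorized(x), number=3))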
> What's the biggest offender that you see?
Umm... every line of Python? Member access. Function calls. Dictionaries that can fundamentally be mapped to int-indexed arrays. Reference counting. Tuple allocation.
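If you want to see those costs in isolation, timeit works fine (exact numbers will vary by machine and CPython version):

    import timeit

    # Attribute access vs. a plain variable lookup
    print(timeit.timeit("p.x", setup="class P: x = 1.0\np = P()"))
    print(timeit.timeit("x", setup="x = 1.0"))

    # A Python-level function call vs. the bare expression
    print(timeit.timeit("f(1.0)", setup="def f(a): return a"))
    print(timeit.timeit("1.0"))

    # Dict lookup vs. an int-indexed list
    print(timeit.timeit("d['key']", setup="d = {'key': 1.0}"))
    print(timeit.timeit("a[0]", setup="a = [1.0]"))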
One fun exercise is to take your vanilla Python code and compile it in Cython with the -a flag to produce an HTML annotation. Click on the yellowest lines, and it shows you the gory details of what Cython does to emulate CPython. It's not exactly what CPython is doing (for example, Cython elides the virtual machine), but it's close enough to see where time is spent. Put the same code through the Python disassembler "dis" to see what virtual-machine operations are emitted, and paw through the main evaluation loop [1]; or take a guided walkthrough at [2].
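Concretely (the file name and function here are just placeholders; cython -a also accepts plain .py files):

    # cython -a mymodule.pyx   # writes mymodule.html with the annotation
    import dis

    def axpy(a, x, y):
        return a * x + y

    dis.dis(axpy)
    # Every LOAD_FAST / BINARY_OP (BINARY_MULTIPLY / BINARY_ADD on older
    # CPython) printed here is dispatched through that main evaluation loop.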
Because C++ lets you fuse multiple operations (whereas in numpy you often end up with intermediate arrays), I routinely get 20x speedups when porting from numpy to C++. Good libraries like Eigen help a lot.
Most non-trivial numpy operations require temporaries, which means new allocations and copies. Eigen3's design lets you avoid these through clever compile-time tricks (expression templates) while remaining high-level.
Sometimes numpy can elide those (e.g. why a += b is faster than a = a + b), but this is not possible in general. Sometimes people use monstrosities like einsum... but I find it more intuitive to just write in C or C++...
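To make the temporaries concrete (a sketch; sizes picked arbitrarily):

    import numpy as np

    a = np.random.rand(10_000_000)
    b = np.random.rand(10_000_000)
    c = np.random.rand(10_000_000)

    # Materializes a temporary for (a + b), then a second array for + c.
    d = a + b + c

    # Same result with no temporaries beyond d itself, via out=.
    d = a.copy()
    np.add(d, b, out=d)
    np.add(d, c, out=d)

    # The in-place form reuses a's buffer; that's why a += b beats a = a + b.
    a += b

    # einsum can fuse some reductions into one pass, e.g. trace(m @ n)
    # without ever forming the 500x500 product:
    m = np.random.rand(500, 500)
    n = np.random.rand(500, 500)
    t = np.einsum('ij,ji->', m, n)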
In addition to the time spent in allocation / gc / needless copying, the memory footprint can be higher by a factor of a few (or more...).
Yep, einsum is included in "doing numpy right." And for what it's worth, it's horrid to use and still won't get around cases like x -> cos(x). I haven't needed the power of Eigen for a couple of years, but I appreciate the tip.
> numpy will end up faster because it's backed by a BLAS, which mortals such as myself know better than to compete with.
I'd like to dig a little here, for my own curiosity. How is this possible? I.e., beating C or Rust code using... arcane magic. It reminds me of when React was touted as fast; I couldn't figure out how a JavaScript library could be faster than JavaScript.
BLAS uses low-level routines that are difficult to replicate in C. Some of the code is written in Fortran to avoid the aliasing issues inherent to C arrays. Some implementations use direct assembly. It is heavily optimized by people who really know what they're doing when it comes to floating-point operations.
BLAS implementations are incredibly well optimized by people doing their life's work on just matrix multiplication: hand-tuning assembly, benchmarking per platform to optimize cache use, and so on. They are incredible feats of software engineering. For the multiplication of large matrices (cubic time), the performance gains can quickly overwhelm the quadratic-time overhead of the scripting language.
BLAS is a very well-optimized library. I think a lot of it is in Fortran, which can be faster than C. It is very heavily used in scientific computing. BLAS also has methods that have been hand-tuned in assembly. It's not magic, but the amount of work that has gone into it is not something you would probably want to replicate.
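A toy benchmark to feel the difference (which BLAS you hit depends on how your numpy was built: OpenBLAS, MKL, Accelerate, ...):

    import time
    import numpy as np

    n = 200
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)

    def naive_matmul(a, b):
        # The textbook cubic triple loop, in pure Python.
        n = a.shape[0]
        out = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                s = 0.0
                for k in range(n):
                    s += a[i, k] * b[k, j]
                out[i, j] = s
        return out

    t0 = time.perf_counter()
    slow = naive_matmul(a, b)
    t1 = time.perf_counter()
    fast = a @ b  # dispatches to whatever BLAS numpy links against
    t2 = time.perf_counter()

    print(f"pure Python: {t1 - t0:.2f}s  BLAS: {t2 - t1:.5f}s")
    print(np.allclose(slow, fast))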