You mean, what if you're only doing matrix stuff? Then it's probably easier to let numpy do the heavy lifting. You'll probably take less than a 5x performance hit, if you're doing numpy right. And if you're doing matrix multiplication, numpy will end up faster because it's backed by a BLAS, which mortals such as myself know better than to compete with.
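Rough sketch of what I mean by "doing numpy right" (numbers are machine-dependent, and the array size is just something I picked):

    import numpy as np
    import timeit

    x = np.random.rand(1_000_000)

    def python_loop(x):
        # One interpreter round-trip per element: attribute lookups,
        # function calls, float boxing, on every single iteration.
        total = 0.0
        for v in x:
            total += v * v
        return total

    def vectorized(x):
        # One call; the loop runs in C inside numpy.
        return np.dot(x, x)

    print(timeit.timeit(lambda: python_loop(x), number=3))
    print(timeit.timeit(lambda: vectorized(x), number=3))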
> What's the biggest offender that you see?
Umm... every line of Python? Member access. Function calls. Dictionaries that can fundamentally be mapped to int-indexed arrays. Reference counting. Tuple allocation.
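If you want to see those costs in isolation, timeit works fine (exact numbers will vary by machine and CPython version):

    import timeit

    # Attribute access vs. a plain variable lookup
    print(timeit.timeit("p.x", setup="class P: x = 1.0\np = P()"))
    print(timeit.timeit("x", setup="x = 1.0"))

    # A Python-level function call vs. the bare expression
    print(timeit.timeit("f(1.0)", setup="def f(a): return a"))
    print(timeit.timeit("1.0"))

    # Dict lookup vs. an int-indexed list
    print(timeit.timeit("d['key']", setup="d = {'key': 1.0}"))
    print(timeit.timeit("a[0]", setup="a = [1.0]"))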
One fun exercise is to take your vanilla Python code and compile it in Cython with the -a flag to produce an HTML annotation. Click on the yellowest lines, and it shows you the gory details of what Cython does to emulate CPython. It's not exactly what CPython is doing (for example, Cython elides the virtual machine), but it's close enough to see where time is spent. Put the same code through the Python disassembler "dis" to see what virtual-machine operations are emitted, and paw through the main evaluation loop [1]; or take a guided walkthrough at [2].
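Concretely (the file name and function here are just placeholders; cython -a also accepts plain .py files):

    # cython -a mymodule.pyx   # writes mymodule.html with the annotation
    import dis

    def axpy(a, x, y):
        return a * x + y

    dis.dis(axpy)
    # Every LOAD_FAST / BINARY_OP (BINARY_MULTIPLY / BINARY_ADD on older
    # CPython) printed here is dispatched through that main evaluation loop.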
Because C++ lets you fuse multiple operations (whereas in numpy you often end up with intermediate arrays), I routinely get 20x speedups when porting from numpy to C++. Good libraries like Eigen help a lot.
Most non-trivial numpy operations require temporaries, which means new allocations and copies. Eigen3's design lets you avoid these through clever compile-time tricks (expression templates) while remaining high-level.
Sometimes numpy can elide those (e.g. why a += b is faster than a = a + b), but this is not possible in general. Sometimes people use monstrosities like einsum... but I find it more intuitive to just write in C or C++...
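To make the temporaries concrete (a sketch; sizes picked arbitrarily):

    import numpy as np

    a = np.random.rand(10_000_000)
    b = np.random.rand(10_000_000)
    c = np.random.rand(10_000_000)

    # Materializes a temporary for (a + b), then a second array for + c.
    d = a + b + c

    # Same result with no temporaries beyond d itself, via out=.
    d = a.copy()
    np.add(d, b, out=d)
    np.add(d, c, out=d)

    # The in-place form reuses a's buffer; that's why a += b beats a = a + b.
    a += b

    # einsum can fuse some reductions into one pass, e.g. trace(m @ n)
    # without ever forming the 500x500 product:
    m = np.random.rand(500, 500)
    n = np.random.rand(500, 500)
    t = np.einsum('ij,ji->', m, n)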
In addition to the time spent in allocation / gc / needless copying, the memory footprint can be higher by a factor of a few (or more...).
Yep, einsum is included in "doing numpy right." And for what it's worth, it's horrid to use and still won't get around cases like x -> cos(x). I haven't needed the power of Eigen for a couple of years, but I appreciate the tip.
> numpy will end up faster because it's backed by a BLAS, which mortals such as myself know better than to compete with.
I'd like to dig a little here, for my own curiosity. How is this possible? I.e., beating C or Rust code using... arcane magic. It reminds me of when React was touted as fast; I couldn't figure out how a JavaScript library could be faster than JavaScript.
BLAS uses low-level routines that are difficult to replicate in C. Some of the code is written in Fortran to avoid the aliasing issues inherent to C arrays. Some implementations use direct assembly. It is heavily optimized by people who really know what they're doing when it comes to floating-point operations.
BLAS implementations are incredibly well optimized by people doing their life's work on just matrix multiplication: hand-tuning assembly, benchmarking per platform to optimize cache use, and so on. They are incredible feats of software engineering. For the multiplication of large matrices (cubic time), the performance gains can quickly overwhelm the quadratic-time overhead of the scripting language.
BLAS is a very well-optimized library. I think a lot of it is in Fortran, which can be faster than C. It is very heavily used in scientific computing. BLAS also has methods that have been hand-tuned in assembly. It's not magic, but the amount of work that has gone into it is not something you would probably want to replicate.
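A toy benchmark to feel the difference (which BLAS you hit depends on how your numpy was built: OpenBLAS, MKL, Accelerate, ...):

    import time
    import numpy as np

    n = 200
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)

    def naive_matmul(a, b):
        # The textbook cubic triple loop, in pure Python.
        n = a.shape[0]
        out = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                s = 0.0
                for k in range(n):
                    s += a[i, k] * b[k, j]
                out[i, j] = s
        return out

    t0 = time.perf_counter()
    slow = naive_matmul(a, b)
    t1 = time.perf_counter()
    fast = a @ b  # dispatches to whatever BLAS numpy links against
    t2 = time.perf_counter()

    print(f"pure Python: {t1 - t0:.2f}s  BLAS: {t2 - t1:.5f}s")
    print(np.allclose(slow, fast))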