I've been rewriting Python->C for nearly 20 years now. The expected speedup is around 100x, or 1000x for numerical stuff or allocation-heavy work that can be done statically. Whenever you get 10,000x or above, it's because you've written a better algorithm. You can't generalize that. A 35k speedup is a cool demo but should be regarded as hype.
I wrote a simple Monte Carlo implementation in Python 3.11 and Rust. Python managed 10 million checks in a certain timeframe, while Rust could perform 9 billion checks in the same timeframe.
That's about a 900x speedup, if I'm not mistaken. I suspect Mojo's advertised speedup is through the same process, except on benchmarks that are not dominated by syscalls (RNG calls in this instance).
The one difference was that the Rust version used a parallel iterator (rayon plus a one-line change), whereas I have found parallelizing Python to be more pain than it's worth for most use cases.
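For reference, the Python side of this kind of benchmark would look something like the minimal sketch below (assuming pi estimation by counting unit-circle hits; the original code wasn't shared):

    import random

    def monte_carlo_pi(n):
        # Estimate pi by counting random points that land inside the unit circle.
        hits = 0
        for _ in range(n):
            x, y = random.random(), random.random()
            if x * x + y * y <= 1.0:
                hits += 1
        return 4.0 * hits / n

    print(monte_carlo_pi(10_000_000))  # roughly the 10-million-check workload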
You mean, what if you're only doing matrix stuff? Then it's probably easier to let numpy do the heavy lifting. You'll probably take less than a 5x performance hit, if you're doing numpy right. And if you're doing matrix multiplication, numpy will end up faster because it's backed by a BLAS, which mortals such as myself know better than to compete with.
> What's the biggest offender that you see?
Umm... every line of Python? Member access. Function calls. Dictionaries that can fundamentally be mapped to int-indexed arrays. Reference counting. Tuple allocation.
One fun exercise is to take your vanilla python code, compile it in Cython with the -a flag to produce an HTML annotation. Click on the yellowest lines, and it shows you the gory details of what Cython does to emulate CPython. It's not exactly what CPython is doing (for example, Cython elides the virtual machine), but it's close enough to see where time is spent. Put the same code through the python disassembler "dis" to see what virtual machine operations are emitted, and paw through the main evaluation loop [1]; or take a guided walkthrough at [2].
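Concretely, the two steps look like this (the module name is whatever you're profiling; the function is just a toy):

    # First: `cython -a yourmodule.pyx` writes yourmodule.html with the annotations.
    # Then the dis half of the exercise, on any small function:
    import dis

    def poly(x):
        return 3 * x * x + 2 * x + 1

    dis.dis(poly)  # prints the LOAD_FAST / BINARY_OP / RETURN_VALUE bytecode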
Because C++ lets you fuse multiple operations (whereas in numpy you often go through intermediate arrays), I routinely get 20x speedups when porting from numpy to C++. Good libraries like Eigen help a lot.
Most non-trivial numpy operations require temporaries, which in turn require new allocations and copies. Eigen3's design lets you avoid these through expression templates while remaining high-level.
Sometimes numpy can elide those (e.g. why a += b is faster than a = a + b), but this is not possible in general. Sometimes people use monstrosities like einsum... but I find it more intuitive to just write in C or C++...
In addition to the time spent in allocation / gc / needless copying, the memory footprint can be higher by a factor of a few (or more...).
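To make the temporaries point concrete:

    import numpy as np

    a = np.ones(10_000_000)   # ~80 MB of float64
    b = np.ones(10_000_000)

    c = a + b   # allocates a brand-new 80 MB array for the result
    a = a + b   # same: builds a temporary, then rebinds the name `a`
    a += b      # in-place: dispatches to np.add(a, b, out=a), no temporary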
Yep, einsum is included in "doing numpy right." And for what it's worth, it's horrid to use and still won't get around cases like x -> cos(x). I haven't needed the power of eigen for a couple of years, but I appreciate the tip.
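For readers who haven't met it, the einsum in question (matrix multiplication as an example) and the kind of case it can't help with:

    import numpy as np

    a = np.random.rand(100, 200)
    b = np.random.rand(200, 300)

    # einsum fuses the elementwise multiply and the sum over j in one pass,
    # instead of materializing a (100, 200, 300) intermediate:
    c = np.einsum('ij,jk->ik', a, b)

    # ...but an elementwise map still allocates and fills a whole new array:
    d = np.cos(c)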
> numpy will end up faster because it's backed by a BLAS, which mortals such as myself know better than to compete with.
I'd like to dig a little here, for my own curiosity. How is this possible? I.e., beating C or Rust code using... arcane magic? It reminds me of how React was touted as fast; I couldn't figure out how a JavaScript lib could be faster than JavaScript.
BLAS uses low level routines that are difficult to replicate in C. Some of the stuff is written in FORTRAN so as to avoid aliasing issues inherent to C arrays. Some implementations use direct assembly operations. It is heavily optimized by people who really know what they're doing when it comes to floating point operations.
BLAS are incredibly well optimized by people doing their life's work on just matrix multiplication, hand-tuning their assembly, benchmarking it per platform to optimize cache use, etc -- they are incredible feats of software engineering. For the multiplication of large matrices (cubic time), the performance gains can quickly overwhelm the quadratic-time overhead of the scripting language.
BLAS is a very well optimized library. I think a lot of it is in Fortran, which can be faster than C. It is very heavily used in scientific computing. BLAS also has routines that have been hand-tuned in assembly. It's not magic, but the amount of work that has gone into it is not something you would probably want to replicate.
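You can check from Python which BLAS you'd be competing against, and the dispatch itself is a single operator:

    import numpy as np

    np.show_config()  # reports the BLAS/LAPACK numpy was linked against

    a = np.random.rand(2000, 2000)
    b = np.random.rand(2000, 2000)

    # `@` dispatches to the BLAS dgemm routine: blocked, vectorized,
    # cache-tuned. A naive triple loop in C loses to it by a wide margin.
    c = a @ b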
> The expected speedup is around 100x, or 1000x for numerical stuff or allocation-heavy work that can be done statically. Whenever you get 10,000x or above, it's because you've written a better algorithm.
Anecdotally I recently rewrote a piece of Python code in Rust and got ~300x speedup, but let's be conservative and give it 100x. Now let's extrapolate from that. In native code you can use SIMD, and that can give you a 10x speedup, so now we're at 1000x. In native code you can also easily use multiple threads, so assuming a machine with a reasonably high number of cores, let's say 32 of them (because that's what I had for the last 4 years), we're now at 32000x speedup. So to me those are very realistic numbers, but of course assuming the problem you're solving can be sped up with SIMD and multiple threads, which is not always the case. So you're probably mostly right.
Trivially parallelizable algorithms are definitely in the "not generally applicable" regime. But you're right, they're capable of hitting arbitrarily large, hardware-dependent speedups. And that's definitely something a sufficiently intelligent compiler should be able to capture through dependency analysis.
Note that I don't doubt the 35k speedup -- I've seen speedups into the millions -- I'm just saying there's no way that can be a representative speedup that users should expect to see.
There's a serialization overhead both on dispatch and return that makes multiprocessing in Python unsuitable for some problems that would otherwise be solved well with threads in other languages.
The other languages are not taking and releasing a globally exclusive GIL every time execution crosses an API boundary, and thus "shared nothing" in those languages is truly shared nothing. Additionally, Python's multiprocessing carries a lot of restrictions that make it hard to pass more complex messages.
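A minimal sketch of where that serialization tax bites (the worker is deliberately trivial):

    import multiprocessing as mp

    def square(x):
        return x * x

    if __name__ == "__main__":
        data = list(range(1_000_000))
        with mp.Pool(4) as pool:
            # Every argument is pickled on dispatch and every result is
            # pickled again on return; for work this cheap, serialization
            # dominates and the pool loses to the serial version below.
            parallel = pool.map(square, data)
        serial = [x * x for x in data]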
Somehow I'm still skeptical! Feels like "strict superset of Python, all Python code works" and "several orders of magnitude faster" just sounds like you're trying to have your cake and eat it, too.
I doubt that the Mojo developers have some sort of 'secret sauce' or 'special trick' that will get them there. And even if they have something, I don't see why the Python devs wouldn't just implement the same approach, considering they're currently trying to make Python faster.
I assume that (as long as Mojo wants to stick to its goal of being a strict superset of Python), there will be a lot of things that just cannot be 'fixed'. For example, I'd be surprised if Python's Global Interpreter Lock isn't entangled with the language in a few nasty ways that'd make it really difficult to replace or improve upon (while retaining compatibility).
Then again, the developer of Swift is working on it, right? I guess he's got the experience, at least.
I am a little skeptical as well, but I do think there are a lot of areas for improvement beyond the changes going into mainline. Note that while Mojo intends to support all of Python, they never claimed it will still be fast if you make heavy use of all the dynamic features. The real limiting factors are how often those dynamic features are used, and how frequently the code needs to check whether they are being used.
The fast-CPython changes are still very much interpreter-centric, and they check whether assumptions have changed on every small operation. It seems to me that if you are able to JIT large chunks of code, and then push all the JIT invalidation checks into the dynamic features that break your assumptions rather than into the happy path that uses them, you ought to be able to get much closer to JavaScript levels of performance when dynamic features aren't being used.
Then support for those dynamic features becomes a fallback way of having a huge ecosystem from day one, even if it is only modestly faster than CPython.
IDK about 35,000x but Python really is outrageously slow relative to other popular languages, even other interpreted ones. It's more comparable to something like Ruby than to something like JavaScript.
From their FAQ: "How compatible is Mojo with Python really? Mojo already supports many core features of Python including async/await, error handling, variadics, etc, but… it is still very early and missing many features - so today it isn’t very compatible. Mojo doesn’t even support classes yet!".
Overall, pure Python seems to be about 100x slower than what you can reasonably get with a compiled language and some hard work. It's about 10x slower than what you can get from JITs like PyPy and JavaScript, when such comparisons make sense.
I agree that Mojo reminds me of Cython, but with more marketing and less compatibility with Python. Cython aspired to be nearly a 99% superset of Python; at least, that's exactly what I pushed my postdocs and graduate students to make it back in 2009 (e.g., there were weeks of Robert Bradshaw and Craig Citro pushing each other to get closures to fully work). Mojo seems to be a similar modern idea for doing the same sort of thing. It could be great for the ecosystem; time will tell. Cython could still also improve a lot too -- there is a major new 3.0 release just around the corner!: https://pypi.org/project/Cython/#history
Well, they do have MLIR, which is probably the closest thing to "secret sauce" they've got. I'm excited by some of the performance-oriented features like loop tiling, but they'll need the multitude of optimization hints that GCC and Clang have, too. They also have that parallel runtime, which seems similar to Cilk to me, and which Python fundamentally can never have as I understand it.
The way I read it, Python code as-is won't see a huge bump; it's like 10x or something.
You have to use the special syntax and refactor to get the 1000x speedups. The secret sauce is essentially a whole new language within the language, which likely helps skip the GIL issues.
I think the point is that it's really hard to get those gains from "being compiled" when every function or method call, every use of an operator, everything is subject to multiple levels of dynamic dispatch and overrideability, where every variable can be any type and its type can change at any time and so has to be constantly checked.
A language that's designed to be compiled doesn't have these issues, since you can move most of the checking and resolution logic to compile-time. But doing that requires the language's semantics to be aligned to that goal, which Python's most certainly are not.
The achievements of Pypy are pretty incredible in the JIT space, but the immense effort that has gone in there definitely is a good reason to be skeptical of Mojo's claims of being both compiled and also a "superset" of Python.
So you want a system that figures out that the vast majority of function and method calls are not overridden, that these variables and fields are in practice always integers, etc.; generates some nice compiled code, and adds some interlock so that, if these assumptions do change, the code gets invalidated and replaced with new code (and maybe some data structures get patched).
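A toy illustration of the kind of assumption such a system has to guard (the class and the patch are hypothetical, not how any particular engine represents this internally):

    class Point:
        def __init__(self, x):
            self.x = x

    def get_x(p):
        return p.x  # speculated fast path: p is a Point, .x is a plain slot

    p = Point(1)
    get_x(p)  # compiled code can run under those assumptions...

    # ...until something like this happens, at which point the engine must
    # invalidate the fast path and fall back to generic attribute lookup:
    Point.x = property(lambda self: 42)
    print(get_x(p))  # 42: access now goes through a data descriptor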
I think that this kind of thing is in fact done with the high-performing Javascript engines used in major browsers. I imagine PyPy, being a JIT, is in a position to do this kind of thing. Perhaps Python is more difficult than Javascript, and/or perhaps a lot more effort has been put into the Javascript engines than into PyPy?
Python is considerably more dynamic than JavaScript. And I'm not trying to drag pypy, just saying that that's already the state of the art and there are good reasons that the authors of pypy didn't set out to do what Mojo apparently claims to have done.
> Python is considerably more dynamic than JavaScript
This is sometimes repeated but I don’t believe that is why Python is slow (nor do I think for most measures of “dynamic” it is even true). Which aspect of “dynamism” in particular are you concerned about that JavaScript lacks? The primary hindrance is keeping CPython extensions working while making Python fast. Add to that the hundreds of millions that went into V8 and other JavaScript implementation efforts.
For starters, consider that almost every operator in Python is a dynamic method call under the hood. This includes value equality and hashing, which also affects e.g. dict keys. Oh, and all built-in types can be extended and their behavior overridden.
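A quick demonstration that even +, ==, and dict keys are dynamic dispatch:

    class Weird:
        def __add__(self, other):   # `+` is really a method call
            return "anything at all"

        def __eq__(self, other):    # `==` too
            return True

        def __hash__(self):
            return 0

    w = Weird()
    print(w + 1)           # 'anything at all'
    print(w == object())   # True
    d = {w: "value"}       # insertion calls hash(w), and == on collision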
Those are hardly unsolvable issues with a JIT as they are static for the vast majority of cases. Solutions have been proposed since the 90s with Self and Strongtalk[1].
Doing that, plus 100% compatibility with CPython extension API, while preserving some expectations of deterministic destruction in Python, are the challenges one would face.
The point is that the part of mojo that is new allows you to write early bindings that can't change, immutable variables, values that are accessed without indirection, types, ownership annotations, etc, and uses all of that to do precisely what you are asking for.
If it can't do much until those extra annotations are in place, then I would imagine it's unlikely it'll ever get support from numpy, django, and the like. Uptake for mypy type annotations alone has been pretty slow.
Mojo offers both Just-In-Time and Ahead-Of-Time models. Proper AOT compilation is going to offer a significant speedup, but probably not enough to get to their stated goals. And good luck carrying all of Python's dynamic features across the gap.
The concept of compiling python has been tried again and again. It has its moments, but anything remotely important is glued in from compiled code anyways.
From a quick reading of the Mojo website, it sounds like gluing in compiled code is exactly what they're doing, except this time the separate compiled language happens to look sort of like Python a bit. For the actual Python code it still uses CPython, so that part doesn't get any faster.
Cython works and is widely used. Problem is that there aren't too many people working on it (from what I can tell), so the language has some very rough edges. It seems like it has mostly filled the niche of "making good wrappers of C, C++, and Fortran libraries".
Cython has always been advertised as a potential alternative to writing straight Python, and there are probably a decent number of people who do this. I work in computational science and don't personally know anyone that does. I use it myself, but it's usually a big lift because of the rough edges of the language and the sparse and low quality documentation.
If Cython had 10x as many people working on it, it could be turned into something significantly more useful. I imagine there's a smarter way of approaching the problem these days rather than compiling to C. I hope the Mojo guys pull off "modern and actually good Cython"!
We use Cython a lot. Currently two biggest annoyances are the lack of tooling, in particular a language server and code formatter. Besides that, even though Cython looks a lot like Python, you need to have some familiarity with C or C++ to avoid shooting yourself in the foot and to check that the generated code is not suboptimal.
Cython's main benefit is very deep integration with Python (compared to eg. Rust and PyO3).
Cython was, of course, "modern and actually good Pyrex."
In the end, the way the industry works guarantees an endless stream of these before some combination of boredom and rentiership results in each getting abandoned. It's just a question of whether Mojo lasts 1, 3, or god willing 5 years on top.
Is it the way the industry works, or is the way open source works? Pyrex had basically one person behind it (again, from what I can tell), and Cython currently has ~1 major person behind it. Not enough manpower to support such large endeavors.
Ideally, government or industry would get behind these projects and back them up. Evidently Microsoft is doing that for Python. For whatever reason, Cython and Pyrex both failed to attract that kind of attention. Hopefully it will be different with Mojo.
Sure, but we're talking about Cython as a proof of concept for Mojo, for which substantial perf gains are supposed to come from compilation even before you do anything special.
It has mostly failed because, contrary to the other dynamic languages going all the way back to the early Lisp compilers, there is community resistance to JIT adoption and, most relevantly, to refactoring the C API of CPython.
In no way is Python more dynamic than Smalltalk, SELF, or Common Lisp, which can at any given time redefine any object across the whole execution image and were/are mostly bootstrapped environments.
There's Numba, CuPy, JAX, and torch.compile. Arguably they are more like DSLs that happen to integrate into Python than regular Python.
Of course I don't know what Mojo will actually bring to the table, since their documentation doesn't mention anything GPU-specific, but the idea isn't completely novel.
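For flavor, the Numba route looks like this (the distance function is just an illustrative example):

    import numpy as np
    from numba import njit

    @njit  # compiles the function body to native code on first call
    def l2_distance(a, b):
        total = 0.0
        for i in range(a.shape[0]):
            d = a[i] - b[i]
            total += d * d
        return total ** 0.5

    a = np.random.rand(1_000_000)
    b = np.random.rand(1_000_000)
    print(l2_distance(a, b))  # the explicit loop runs at native speed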
That comparison seems quite cherry-picked; it's unlikely to be generalizable. In my tests with Mojo, it doesn't seem to behave like Python at all so far. Once they add the necessary dynamism, we can see where they land. I'm still optimistic (mostly since Python has absolutely garbage performance and there's low-hanging fruit), but it's no panacea. I feel their bet is not to run Python as-is but to have Python developers travel some distance and adapt to some constraints. They just need enough momentum to make that transition happen, but it's a transition to a distinct language, not a Python implementation. Sort of what Hack is to PHP.
[1] https://speed.python.org/comparison/?exe=12%2BL%2B3.11%2C12%...