I've been rewriting Python->C for nearly 20 years now. The expected speedup is around 100x, or 1000x for numerical stuff or allocation-heavy work that can be done statically. Whenever you get 10,000x or above, it's because you've written a better algorithm. You can't generalize that. A 35k speedup is a cool demo but should be regarded as hype.
I wrote a simple Monte Carlo implementation in Python 3.11 and Rust. Python managed 10 million checks in a certain timeframe, while Rust could perform 9 billion checks in the same timeframe.
That's about a 900x speedup, if I'm not mistaken. I suspect Mojo's advertised speedup is through the same process, except on benchmarks that are not dominated by syscalls (RNG calls in this instance).
The one difference was that the Rust version used a parallel iterator (rayon plus a one-line change), whereas I have found parallelizing Python to be more pain than it's worth for most use cases.
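For reference, the Python side of this kind of benchmark would look something like the minimal sketch below (assuming pi estimation by counting unit-circle hits; the original code wasn't shared):

    import random

    def monte_carlo_pi(n):
        # Estimate pi by counting random points that land inside the unit circle.
        hits = 0
        for _ in range(n):
            x, y = random.random(), random.random()
            if x * x + y * y <= 1.0:
                hits += 1
        return 4.0 * hits / n

    print(monte_carlo_pi(10_000_000))  # roughly the 10-million-check workload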
You mean, what if you're only doing matrix stuff? Then it's probably easier to let numpy do the heavy lifting. You'll probably take less than a 5x performance hit, if you're doing numpy right. And if you're doing matrix multiplication, numpy will end up faster because it's backed by a BLAS, which mortals such as myself know better than to compete with.
> What's the biggest offender that you see?
Umm... every line of Python? Member access. Function calls. Dictionaries that can fundamentally be mapped to int-indexed arrays. Reference counting. Tuple allocation.
One fun exercise is to take your vanilla python code, compile it in Cython with the -a flag to produce an HTML annotation. Click on the yellowest lines, and it shows you the gory details of what Cython does to emulate CPython. It's not exactly what CPython is doing (for example, Cython elides the virtual machine), but it's close enough to see where time is spent. Put the same code through the python disassembler "dis" to see what virtual machine operations are emitted, and paw through the main evaluation loop [1]; or take a guided walkthrough at [2].
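Concretely, the two steps look like this (the module name is whatever you're profiling; the function is just a toy):

    # First: `cython -a yourmodule.pyx` writes yourmodule.html with the annotations.
    # Then the dis half of the exercise, on any small function:
    import dis

    def poly(x):
        return 3 * x * x + 2 * x + 1

    dis.dis(poly)  # prints the LOAD_FAST / BINARY_OP / RETURN_VALUE bytecode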
Because C++ lets you fuse multiple operations (whereas in numpy you often go through intermediate arrays), I routinely get 20x speedups when porting from numpy to C++. Good libraries like Eigen help a lot.
Most non-trivial numpy operations require temporaries, which in turn require new allocations and copies. Eigen3's design lets you avoid these through expression templates while remaining high-level.
Sometimes numpy can elide those (e.g. why a += b is faster than a = a + b), but this is not possible in general. Sometimes people use monstrosities like einsum... but I find it more intuitive to just write in C or C++...
In addition to the time spent in allocation / gc / needless copying, the memory footprint can be higher by a factor of a few (or more...).
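To make the temporaries point concrete:

    import numpy as np

    a = np.ones(10_000_000)   # ~80 MB of float64
    b = np.ones(10_000_000)

    c = a + b   # allocates a brand-new 80 MB array for the result
    a = a + b   # same: builds a temporary, then rebinds the name `a`
    a += b      # in-place: dispatches to np.add(a, b, out=a), no temporary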
Yep, einsum is included in "doing numpy right." And for what it's worth, it's horrid to use and still won't get around cases like x -> cos(x). I haven't needed the power of eigen for a couple of years, but I appreciate the tip.
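For readers who haven't met it, the einsum in question (matrix multiplication as an example) and the kind of case it can't help with:

    import numpy as np

    a = np.random.rand(100, 200)
    b = np.random.rand(200, 300)

    # einsum fuses the elementwise multiply and the sum over j in one pass,
    # instead of materializing a (100, 200, 300) intermediate:
    c = np.einsum('ij,jk->ik', a, b)

    # ...but an elementwise map still allocates and fills a whole new array:
    d = np.cos(c)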
> numpy will end up faster because it's backed by a BLAS, which mortals such as myself know better than to compete with.
I'd like to dig a little here, for my own curiosity. How is this possible? I.e., beating C or Rust code using... arcane magic? It reminds me of how React was touted as fast; I couldn't figure out how a JavaScript lib could be faster than JavaScript.
BLAS uses low level routines that are difficult to replicate in C. Some of the stuff is written in FORTRAN so as to avoid aliasing issues inherent to C arrays. Some implementations use direct assembly operations. It is heavily optimized by people who really know what they're doing when it comes to floating point operations.
BLAS are incredibly well optimized by people doing their life's work on just matrix multiplication, hand-tuning their assembly, benchmarking it per platform to optimize cache use, etc -- they are incredible feats of software engineering. For the multiplication of large matrices (cubic time), the performance gains can quickly overwhelm the quadratic-time overhead of the scripting language.
BLAS is a very well optimized library. I think a lot of it is in Fortran, which can be faster than C. It is very heavily used in scientific computing. BLAS also has routines that have been hand-tuned in assembly. It's not magic, but the amount of work that has gone into it is not something you would probably want to replicate.
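You can check from Python which BLAS you'd be competing against, and the dispatch itself is a single operator:

    import numpy as np

    np.show_config()  # reports the BLAS/LAPACK numpy was linked against

    a = np.random.rand(2000, 2000)
    b = np.random.rand(2000, 2000)

    # `@` dispatches to the BLAS dgemm routine: blocked, vectorized,
    # cache-tuned. A naive triple loop in C loses to it by a wide margin.
    c = a @ b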
> The expected speedup is around 100x, or 1000x for numerical stuff or allocation-heavy work that can be done statically. Whenever you get 10,000x or above, it's because you've written a better algorithm.
Anecdotally I recently rewrote a piece of Python code in Rust and got ~300x speedup, but let's be conservative and give it 100x. Now let's extrapolate from that. In native code you can use SIMD, and that can give you a 10x speedup, so now we're at 1000x. In native code you can also easily use multiple threads, so assuming a machine with a reasonably high number of cores, let's say 32 of them (because that's what I had for the last 4 years), we're now at 32000x speedup. So to me those are very realistic numbers, but of course assuming the problem you're solving can be sped up with SIMD and multiple threads, which is not always the case. So you're probably mostly right.
Trivially parallelizable algorithms are definitely in the "not generally applicable" regime. But you're right, they're capable of hitting arbitrarily large, hardware-dependent speedups. And that's definitely something a sufficiently intelligent compiler should be able to capture through dependency analysis.
Note that I don't doubt the 35k speedup -- I've seen speedups into the millions -- I'm just saying there's no way that can be a representative speedup that users should expect to see.
There's a serialization overhead both on dispatch and return that makes multiprocessing in Python unsuitable for some problems that would otherwise be solved well with threads in other languages.
The other languages are not taking and releasing a globally exclusive GIL every time execution crosses an API boundary, and thus "shared nothing" in those languages is truly shared nothing. Additionally, Python's multiprocessing carries a lot of restrictions that make it hard to pass more complex messages.
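A minimal sketch of where that serialization tax bites (the worker is deliberately trivial):

    import multiprocessing as mp

    def square(x):
        return x * x

    if __name__ == "__main__":
        data = list(range(1_000_000))
        with mp.Pool(4) as pool:
            # Every argument is pickled on dispatch and every result is
            # pickled again on return; for work this cheap, serialization
            # dominates and the pool loses to the serial version below.
            parallel = pool.map(square, data)
        serial = [x * x for x in data]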
Somehow I'm still skeptical! Feels like "strict superset of Python, all Python code works" and "several orders of magnitude faster" just sounds like you're trying to have your cake and eat it, too.
I doubt that the Mojo developers have some sort of 'secret sauce' or 'special trick' that will get them there. And even if they have something, I don't see why the Python devs wouldn't just implement the same approach, considering they're currently trying to make Python faster.
I assume that (as long as Mojo wants to stick to its goal of being a strict superset of Python), there will be a lot of things that just cannot be 'fixed'. For example, I'd be surprised if Python's Global Interpreter Lock isn't entangled with the language in a few nasty ways that'd make it really difficult to replace or improve upon (while retaining compatibility).
Then again, the developer of Swift is working on it, right? I guess he's got the experience, at least.
I am a little skeptical as well, but I do think there are a lot of areas for improvement beyond the changes going into mainline. Note that while Mojo intends to support all of Python, they never claimed it will still be fast if you make heavy use of all the dynamic features. The real limiting factors are how often those dynamic features are used, and how frequently the code needs to check whether they are being used.
The fast-CPython changes are still very much interpreter-centric, and they check whether assumptions have changed on every small operation. It seems to me that if you are able to JIT large chunks of code, and then push all the JIT invalidation checks into the dynamic features that break your assumptions rather than into the happy path that uses them, you ought to be able to get much closer to JavaScript levels of performance when dynamic features aren't being used.
Then support for those dynamic features becomes a fallback way of having a huge ecosystem from day one, even if it is only modestly faster than CPython.
IDK about 35,000x but Python really is outrageously slow relative to other popular languages, even other interpreted ones. It's more comparable to something like Ruby than to something like JavaScript.
From their FAQ: "How compatible is Mojo with Python really? Mojo already supports many core features of Python including async/await, error handling, variadics, etc, but… it is still very early and missing many features - so today it isn’t very compatible. Mojo doesn’t even support classes yet!".
Overall, pure Python seems to be about 100x slower than what you can reasonably get with a compiled language and some hard work. It's about 10x slower than what you can get from JITs like PyPy and JavaScript, when such comparisons make sense.
I agree that Mojo reminds me of Cython, but with more marketing and less compatibility with Python. Cython aspired to be nearly a 99% superset of Python; at least, that's exactly what I pushed my postdocs and graduate students to make it back in 2009 (e.g., there were weeks of Robert Bradshaw and Craig Citro pushing each other to get closures to fully work). Mojo seems to be a similar modern idea for doing the same sort of thing. It could be great for the ecosystem; time will tell. Cython could still also improve a lot too -- there is a major new 3.0 release just around the corner!: https://pypi.org/project/Cython/#history
Well, they do have MLIR, which is probably the closest thing to "secret sauce" they've got. I'm excited by some of the performance-oriented features like loop tiling, but they'll need the multitude of optimization hints that GCC and Clang have, too. They also have that parallel runtime, which seems similar to Cilk to me, and which Python fundamentally can never have as I understand it.
The way I read it, Python code as-is won't see a huge bump; it's like 10x or something.
You have to use the special syntax and refactor to get the 1000x speedups. The secret sauce is essentially a whole new language within the language, which likely helps skip the GIL issues.
I think the point is that it's really hard to get those gains from "being compiled" when every function or method call, every use of an operator, everything is subject to multiple levels of dynamic dispatch and overrideability, where every variable can be any type and its type can change at any time and so has to be constantly checked.
A language that's designed to be compiled doesn't have these issues, since you can move most of the checking and resolution logic to compile-time. But doing that requires the language's semantics to be aligned to that goal, which Python's most certainly are not.
The achievements of Pypy are pretty incredible in the JIT space, but the immense effort that has gone in there definitely is a good reason to be skeptical of Mojo's claims of being both compiled and also a "superset" of Python.
So you want a system that figures out that the vast majority of function and method calls are not overridden, that these variables and fields are in practice always integers, etc.; generates some nice compiled code, and adds some interlock so that, if these assumptions do change, the code gets invalidated and replaced with new code (and maybe some data structures get patched).
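A toy illustration of the kind of assumption such a system has to guard (the class and the patch are hypothetical, not how any particular engine represents this internally):

    class Point:
        def __init__(self, x):
            self.x = x

    def get_x(p):
        return p.x  # speculated fast path: p is a Point, .x is a plain slot

    p = Point(1)
    get_x(p)  # compiled code can run under those assumptions...

    # ...until something like this happens, at which point the engine must
    # invalidate the fast path and fall back to generic attribute lookup:
    Point.x = property(lambda self: 42)
    print(get_x(p))  # 42: access now goes through a data descriptor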
I think that this kind of thing is in fact done with the high-performing Javascript engines used in major browsers. I imagine PyPy, being a JIT, is in a position to do this kind of thing. Perhaps Python is more difficult than Javascript, and/or perhaps a lot more effort has been put into the Javascript engines than into PyPy?
Python is considerably more dynamic than JavaScript. And I'm not trying to drag pypy, just saying that that's already the state of the art and there are good reasons that the authors of pypy didn't set out to do what Mojo apparently claims to have done.
> Python is considerably more dynamic than JavaScript
This is sometimes repeated but I don’t believe that is why Python is slow (nor do I think for most measures of “dynamic” it is even true). Which aspect of “dynamism” in particular are you concerned about that JavaScript lacks? The primary hindrance is keeping CPython extensions working while making Python fast. Add to that the hundreds of millions that went into V8 and other JavaScript implementation efforts.
For starters, consider that almost every operator in Python is a dynamic method call under the hood. This includes value equality and hashing, which also affects e.g. dict keys. Oh, and all built-in types can be extended and their behavior overridden.
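A quick demonstration that even +, ==, and dict keys are dynamic dispatch:

    class Weird:
        def __add__(self, other):   # `+` is really a method call
            return "anything at all"

        def __eq__(self, other):    # `==` too
            return True

        def __hash__(self):
            return 0

    w = Weird()
    print(w + 1)           # 'anything at all'
    print(w == object())   # True
    d = {w: "value"}       # insertion calls hash(w), and == on collision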
Those are hardly unsolvable issues with a JIT as they are static for the vast majority of cases. Solutions have been proposed since the 90s with Self and Strongtalk[1].
Doing that, plus 100% compatibility with CPython extension API, while preserving some expectations of deterministic destruction in Python, are the challenges one would face.
The point is that the part of mojo that is new allows you to write early bindings that can't change, immutable variables, values that are accessed without indirection, types, ownership annotations, etc, and uses all of that to do precisely what you are asking for.
If it can't do much until those extra annotations are in place, then I would imagine it's unlikely it'll ever get support from numpy, django, and the like. Uptake for mypy type annotations alone has been pretty slow.
Mojo offers both Just-In-Time and Ahead-Of-Time models. Proper AOT compilation is going to offer a significant speedup, but probably not enough to get to their stated goals. And good luck carrying all of Python's dynamic features across the gap.
The concept of compiling python has been tried again and again. It has its moments, but anything remotely important is glued in from compiled code anyways.
From a quick reading of the Mojo website, it sounds like gluing in compiled code is exactly what they're doing, except this time the separate compiled language happens to look sort of like Python a bit. For the actual Python code it still uses CPython, so that part doesn't get any faster.
Cython works and is widely used. Problem is that there aren't too many people working on it (from what I can tell), so the language has some very rough edges. It seems like it has mostly filled the niche of "making good wrappers of C, C++, and Fortran libraries".
Cython has always been advertised as a potential alternative to writing straight Python, and there are probably a decent number of people who do this. I work in computational science and don't personally know anyone that does. I use it myself, but it's usually a big lift because of the rough edges of the language and the sparse and low quality documentation.
If Cython had 10x as many people working on it, it could be turned into something significantly more useful. I imagine there's a smarter way of approaching the problem these days rather than compiling to C. I hope the Mojo guys pull off "modern and actually good Cython"!
We use Cython a lot. Currently two biggest annoyances are the lack of tooling, in particular a language server and code formatter. Besides that, even though Cython looks a lot like Python, you need to have some familiarity with C or C++ to avoid shooting yourself in the foot and to check that the generated code is not suboptimal.
Cython's main benefit is very deep integration with Python (compared to eg. Rust and PyO3).
Cython was, of course, "modern and actually good Pyrex."
In the end, the way the industry works guarantees an endless stream of these before some combination of boredom and rentiership results in each getting abandoned. It's just a question of whether Mojo lasts 1, 3, or god willing 5 years on top.
Is it the way the industry works, or is the way open source works? Pyrex had basically one person behind it (again, from what I can tell), and Cython currently has ~1 major person behind it. Not enough manpower to support such large endeavors.
Ideally, government or industry would get behind these projects and back them up. Evidently Microsoft is doing that for Python. For whatever reason, Cython and Pyrex both failed to attract that kind of attention. Hopefully it will be different with Mojo.
Sure, but we're talking about Cython as a proof of concept for Mojo, for which substantial perf gains are supposed to come from compilation even before you do anything special.
It has mostly failed because, contrary to the other dynamic languages going all the way back to the early Lisp compilers, there is community resistance to JIT adoption and, most relevantly, to refactoring the C API of CPython.
In no way is Python more dynamic than Smalltalk, SELF, or Common Lisp, which can at any given time redefine any object across the whole execution image and were/are mostly bootstrapped environments.
There's Numba, CuPy, JAX, and torch.compile. Arguably they are more like DSLs that happen to integrate into Python than regular Python.
Of course I don't know what Mojo will actually bring to the table, since their documentation doesn't mention anything GPU-specific, but the idea isn't completely novel.
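For flavor, the Numba route looks like this (the distance function is just an illustrative example):

    import numpy as np
    from numba import njit

    @njit  # compiles the function body to native code on first call
    def l2_distance(a, b):
        total = 0.0
        for i in range(a.shape[0]):
            d = a[i] - b[i]
            total += d * d
        return total ** 0.5

    a = np.random.rand(1_000_000)
    b = np.random.rand(1_000_000)
    print(l2_distance(a, b))  # the explicit loop runs at native speed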
That comparison seems quite cherry-picked; it's unlikely to be generalizable. In my tests with Mojo, it doesn't seem to behave like Python at all so far. Once they add the necessary dynamism, we can see where they land. I'm still optimistic (mostly since Python has absolutely garbage performance and there's low-hanging fruit), but it's no panacea. I feel their bet is not to run Python as-is but to have Python developers travel some distance and adapt to some constraints. They just need enough momentum to make that transition happen, but it's a transition to a distinct language, not a Python implementation. Sort of what Hack is to PHP.
[1] https://speed.python.org/comparison/?exe=12%2BL%2B3.11%2C12%...