How fast can we make interpreted Python? (2013)

piquadrat · on April 18, 2016

Title should say 2013. These days, PyPy is way past 25% average speedup with respect to CPython.

/edit to be fair, the project README points out the difference in goals and performance between Falcon and PyPy quite nicely: https://github.com/rjpower/falcon/blob/master/README.md

joshmaker · on April 19, 2016

The Pyston team at Dropbox has a presentation where they say the "real world" performance metrics aren't as good for PyPy as the benchmarks would suggest, allegedly slower than cpython on many metrics.

http://www.slideshare.net/KevinModzelewski/pyston-talk-11101...

It's an argument I haven't heard anywhere else (and one I'm not in a position to substantiate) but presumably Dropbox, who use Python at scale, have as valid a perspective on this issue as anyone.

PostOnce · on April 19, 2016

I wrote a game renderer with pygame, it's >10x faster in pypy... it's voxels (in the comanche sense).

That's just an anecdote but obviously you should check your use case and find what tool is best suited for your project. There is no magic cure-all, but in my experience pypy comes close.

I'm pretty confident it's also going to be a magic speed boost for a decent size web app I work on, but it remains to be tested with this particular app so I'll check it out... The most bulletproof way to find out what's faster is to test it.

kdeldycke · on April 24, 2016

Comanche's "voxels" are an elevation map.

PostOnce · on May 3, 2016

That's why I said "in the comanche sense", since this style of raycasting was commonly and erroneously referred to as "voxels" in the 90s (all other raycasters having basically flat floors).

"Elevation map" doesn't mean anything in terms of projection: you can project an elevation map onto a 2D display in a hundred ways. "Raycaster" does mean something here; a raycasted elevation map or raycasted heightfield is what this is. Anyway, it's just an example of the types of problems that can be solved 10x faster with pypy.

haberman · on April 18, 2016

I thought the 25% number was Falcon vs. CPython, not PyPy vs. CPython. The paper suggests (though does not say directly) that PyPy is faster than Falcon, and that the only benefit of Falcon is compatibility with existing C extensions.

Animats · on April 18, 2016

That's about right. That's about where Unladen Swallow maxed out. To get beyond that point, you have to do more global analysis and JIT compilation and recompilation, as PyPy does.

Every Python symbol reference requires a dictionary lookup. Every function call and operator requires that the type of the left hand side be examined for dispatching purposes. More than 99% of the time, the symbol and type will be the same as last time, and it's a huge win to assume that it will be and compile code for that case. You still have to be prepared for the times when it isn't, and have a backup system. That's basically what PyPy does.
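The "assume it's the same as last time, but keep a backup" idea above can be sketched in plain Python. This is only an illustration of a guarded cache at a single call site, not how PyPy is actually implemented; the `InlineCache` class and its counters are hypothetical names made up for the example, and a real JIT would also have to invalidate the cache when a class is mutated.

```python
class InlineCache:
    """Hypothetical per-call-site cache keyed on the receiver's type."""

    def __init__(self, method_name):
        self.method_name = method_name
        self.cached_type = None
        self.cached_func = None
        self.hits = 0
        self.misses = 0

    def call(self, receiver, *args):
        receiver_type = type(receiver)
        if receiver_type is self.cached_type:
            # Fast path: the guard held, so the lookup is skipped.
            self.hits += 1
            return self.cached_func(receiver, *args)
        # Slow path (the "backup system"): do the full lookup,
        # then re-specialise the cache for the type just seen.
        self.misses += 1
        func = getattr(receiver_type, self.method_name)
        self.cached_type = receiver_type
        self.cached_func = func
        return func(receiver, *args)


site = InlineCache("__add__")
# int, int, int, then a float (guard fails), then int again (guard fails).
results = [site.call(x, 1) for x in (1, 2, 3, 2.5, 4)]
print(results, site.hits, site.misses)
```

As long as the same type keeps arriving, only the cheap identity check on the type runs; every change of type pays for one full lookup.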

More stuff is mutable in Python than really needs to be mutable.

rcarmo · on April 18, 2016

That is true. I've been fooling around with Hy (hylang.org) for a fairly long time because it's nicer than Clojure for doing glue, and that has made me wonder how fast some things would go if Python had native immutable data types.

gsnedders · on April 18, 2016

> That's about right. That's about where Unladen Swallow maxed out. To get beyond that point, you have to do more global analysis and JIT compilation and recompilation, as PyPy does.

Unladen Swallow did JIT compilation and recompilation, despite not doing better.

ProblemFactory · on April 18, 2016

But compatibility with C extensions is a major selling point.

25% speedup for pure Python code is not that appealing in numerical analysis code, when you can use NumPy, or sprinkle a few typedefs into a critical function and compile with Cython for a 1000x speedup.

If C extension compatibility is broken, then many Python programs will altogether still be slower, despite the pure python speedup.

awinter-py · on April 18, 2016

cython community can definitely use some love -- there's a ton of low-hanging fruit (unboxing arrays of extension classes, for example) they haven't gotten to because of time / funding constraints.

Not an apples to apples comparison, but my guess is the state of the art of JIT is way ahead of what the cython compiler can detect; some of the JIT tricks will likely work in the cython static translation step.

awinter-py · on April 18, 2016

And to be clear I love cython, it's very useful as-is -- there's a large community of people for whom C & python expertise have gone hand in hand for years, and cython is the tool they end up using to max productivity and minimize surprises.

piquadrat · on April 18, 2016

Rereading my comment, it can easily be misunderstood. Sorry about that. The 25% average speedup they cite is indeed relative to CPython. Thanks for clarifying!

awinter-py · on April 18, 2016

I think the newest version of RPython has better FFI; they've been blogging about lxml compatibility for a few months.

joejev · on April 19, 2016

pypy only does python 2. Some people want to use the new features of python.

dang · on April 18, 2016

Thanks, we added the year above.

epx · on April 18, 2016

Quite old, plain Python is still slow but PyPy is in the league of Node.js.

I rewrote the same program (compare two images, generate a third that shows the diff) in a number of languages. Considering CPython as the reference implementation (1x), I got 100x in Rust, 60x in Go, 12x in Node.js and 10-11x in PyPy.

Initially I got 4x with PyPy but I did a light refactor, removing some map()s and zip()s that were gratuitous (3-element lists) and then PyPy went real fast.
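A rough sketch of the kind of refactor described above, assuming a per-pixel difference over 3-element RGB lists. The function names and sample values are invented for illustration; the original program is not shown in the thread, and the timings will vary by interpreter.

```python
import timeit


def diff_generic(p, q):
    # Generic style: builds a zip iterator, a map object and a lambda call
    # per channel, all for just three items.
    return list(map(lambda pair: abs(pair[0] - pair[1]), zip(p, q)))


def diff_unrolled(p, q):
    # Unrolled style: plain indexing, no intermediate iterators.
    return [abs(p[0] - q[0]), abs(p[1] - q[1]), abs(p[2] - q[2])]


a, b = [10, 200, 30], [40, 50, 60]
assert diff_generic(a, b) == diff_unrolled(a, b)

for fn in (diff_generic, diff_unrolled):
    seconds = timeit.timeit(lambda: fn(a, b), number=100_000)
    print(f"{fn.__name__}: {seconds:.3f}s for 100,000 calls")
```

Both versions compute the same result; running the script under CPython and under PyPy is the way to see how much each interpreter cares about the difference.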

joshvm · on April 18, 2016

You've picked a poor example, I think. And you're probably coding in an inefficient way.

Here's the OpenCV way for a pair of 2048x2048 images:

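    import time

    import cv2

    # process_time() reports CPU time (time.clock was removed in Python 3.8)
    t = time.process_time()

    a = cv2.imread("./image_1.tiff", cv2.IMREAD_GRAYSCALE)
    b = cv2.imread("./image_0.tiff", cv2.IMREAD_GRAYSCALE)

    # Note: on uint8 arrays, NumPy subtraction wraps around on underflow
    c = a - b

    cv2.imwrite("out.tiff", c)

    print(time.process_time() - t)

Takes about 0.2 CPU seconds on average for me (note: this measures CPU time, as time.clock did on UNIX, not wall-clock time as time.time does).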

    #include <opencv2/opencv.hpp>
    #include <ctime>

    using namespace cv;
    using namespace std;

    int main(void){

      clock_t begin = clock();

      Mat a = imread("./image_1.tiff", IMREAD_GRAYSCALE);
      Mat b = imread("./image_0.tiff", IMREAD_GRAYSCALE);

      Mat c = a - b;

      imwrite("out.tiff", c);
  
      clock_t end = clock();
      double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;
      cout << elapsed_secs << endl;
    }

Again, about 0.2 seconds. The difference is negligible if you use the right libraries. Python should not be your bottleneck for high performance code.

true_religion · on April 18, 2016

> The difference is negligible if you use the right libraries.

OpenCV is written in C++, it's going to be fairly efficient to call out to it in any language.

For me, most of the time spent in our code is in 'business logic' which necessarily must be in the main language of the codebase.

That's where PyPy gets its wins.

coryrc · on April 18, 2016

My conclusion is: as long as you don't do anything new, Python is fast enough.

ProblemFactory · on April 19, 2016

That's true, but most of the time you can express the "new" stuff in terms of existing fast libraries.

I've had a few cases where you can't, and have become a big fan of Cython for that. It lets you add C typedefs to Python code, and then compile the module at import time. Example here: http://pastebin.com/sF8KmyiU

All of pure Python is still allowed in these modules, but the typedeffed variables become pure C variables instead of objects, and loops become pure C loops. For this particular function, I got a 1000x speedup compared to the original Python code.

In the end, this isn't Python any more - but it's close enough, and only needed for loops that run over millions of items.

Someone · on April 18, 2016

Now, try implementing that matrix subtraction, or something more complex such as blurring or edge detection, in Python, and compare results.

Also, are you sure that doesn't measure disk speed?

joshvm · on April 18, 2016

I get around 0.07 seconds without the write in C++ and the same in Python (good call though).

I agree that in pure Python it'd be slower, but realistically why would you do that? Unless you work somewhere where you're forced to write your own libraries... but even then you could still implement your own.

Someone · on April 18, 2016

If you only removed the write and not the reads from the timing, I would guess the reads (even with warm caches) still dominate the time.

And you would want to do it in pure Python if you want to answer the question "how fast can we make interpreted Python?". Using C extensions for that is cheating, as it isn't Python and it isn't interpreted. You don't answer the question "how fast can you run?" with "30 km an hour, using a bicycle", either.

If you make those images large enough (and I guess 2k x 2k is large enough), any language that uses OpenCV to do the job will give results in the same ballpark. For example, you can make the difference between Python implementations that can call OpenCV as small as you want it.

RussianCow · on April 18, 2016

Not the parent, but I think the point is that, in the real world, the vast majority of use cases for which Python is slow are ones where you would use an existing library written in a lower-level language. So questions like "How fast can we make matrix multiplication in Python?" are irrelevant for the vast majority of Python developers because NumPy exists, and it's always going to be faster than anything you can write in pure Python.

joshvm · on April 19, 2016

In Ipython %timeit gives 4.3ms per loop, just on a-b. In C++ it's about the same.

I agree that the question is valid - making vanilla Python faster is cool. My point was that this particular example (image processing) was flawed, because it's not something a sane person would ever do in pure Python.

Someone · on April 19, 2016

"In Ipython %timeit gives 4.3ms per loop, just on a-b. In C++ it's about the same."

Of course it is about the same. Except for function entry and function exit, which should be a few thousand instructions, at the most, it runs the exact same instruction sequence (if you are using identical versions, compiler and compiler flags)

If you want an easily measurable difference, use way smaller images, and make a few thousand or even a few million calls, or look at the python sources to see how efficiently it calls into C.
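A minimal sketch of that measurement, using the built-in `sum` as a stand-in for a call into C (the thread's OpenCV subtraction would need the library and image files). The helper name and the sizes are arbitrary choices for the example.

```python
import timeit


def per_call_seconds(size, calls):
    # Average cost of one call into a C-implemented function on `size` items.
    data = list(range(size))
    total = timeit.timeit(lambda: sum(data), number=calls)
    return total / calls


# Tiny input, many calls: the time is dominated by call overhead.
small = per_call_seconds(size=4, calls=200_000)
# Large input, few calls: the time is dominated by the work done in C.
large = per_call_seconds(size=100_000, calls=200)

print(f"tiny input:  {small * 1e9:.0f} ns per call")
print(f"large input: {large * 1e6:.0f} us per call")
```

With the tiny input, the per-call figure is mostly the cost of getting from Python into C and back, which is the part that differs between interpreters.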

tanlermin · on April 18, 2016

Ok. But I would do it in Numba.

joelg236 · on April 18, 2016

Be aware that python opencv uses c++ calls, it's only a wrapper. SciPy or numpy might be better examples.

joshvm · on April 18, 2016

I was under the impression that Numpy also just calls BLAS underneath? Hence why doing element-wise calls in Numpy is far, far faster than simply doing nested for loops.

But I think this is the great strength of Python. It's a glue language. If you need speed, you can always write a wrapper around a C/C++ library.

dietrichepp · on April 19, 2016

Funny you should say C/C++, because BLAS is Fortran.

howeman · on April 19, 2016

Actually, it's mostly assembly (depending on the particular implementation). Lapack is Fortran though.

ihnorton · on April 18, 2016

> SciPy or numpy might be better examples.

Python comprises less than 50% of the code in both of those repositories, but they are certainly great for learning the CPython API.

sorenjan · on April 18, 2016

I've never used Rust, how would it compare to a C++ solution?

grayrest · on April 18, 2016

The standard response from the Rust team is that Rust should match or beat non-SIMD C++ performance and if it doesn't, you should file a bug.

Note: The first thing anybody will ask when you complain about Rust being slow is whether you compiled with optimizations turned on (`cargo build --release`) since it tends to make a 10-15x difference.

im_down_w_otp · on April 18, 2016

Theoretically the Rust borrow-checker also knows enough about your code's protection and dispatch semantics such that additional information could be used to create deeper optimizations than are available in either C or C++. Numerical analysis in Rust could compete with Fortran in performance, but I don't know if any of that has been actualized in Rust yet.

viperscape · on April 18, 2016

I think some of that might start to come along with increase in compiler plugins, which is feature gated to nightly build right now.

hardwaresofton · on April 18, 2016

By what measurement?

I think the only thing I could say would be that it would be safer? And possibly terser and possibly easier to understand.

frozenport · on April 18, 2016

As a C++ developer I find it hilarious that Node.js is in a high performance league! (Your benchmark, for example, shows Rust as 6x faster.)

chrisseaton · on April 18, 2016

But it's high performance given the semantics of the language. The work that has gone into making V8 perform as it does is extraordinary and should be respected, not mocked as 'hilarious'.

yeukhon · on April 18, 2016

I believe the high performance usually refers to NodeJS' non-blocking I/O.

frozenport · on April 21, 2016

The benchmarks show that by using another language his code went 8x faster. Perhaps if he optimized his C++ it would go even faster. It's funny that people are saying 10x off the theoretical is high performance. I wonder if these people are living in a Javascript bubble.

_ihaque · on April 18, 2016

Stack->register JIT compiler for Python that maintains compatibility with existing C extensions because it runs as an extension within CPython. Average of 25% faster to up to 2.5x on the benchmarks in the paper.

Source link: https://github.com/rjpower/falcon/ (doesn't show up in the paper till the very end!)

antman · on April 18, 2016

I have been using cython, as presented here [0], with huge speedups. Prototype in python and then make a few changes to produce python code.

[0] https://spacy.io/blog/writing-c-in-cython

smortaz · on April 18, 2016

there's also Microsoft's effort:

https://github.com/Microsoft/Pyjion

[disclaimer]

oldmanjay · on April 19, 2016

What are you disclaiming? I have to admit, having been a bit baffled when people started disclaiming their credentials against all definitional expectation, just seeing a bare disclaimer with nothing obviously being disclaimed, I am completely confused.

seabrookmx · on April 18, 2016

This is actually really cool!

heydenberk · on April 18, 2016

Previous posting: https://news.ycombinator.com/item?id=6112995

25 comments and good discussion there.

pepijndevos · on April 18, 2016

For all those talking about PyPy, have a look at Pyston, an ongoing effort by Dropbox to build a fast Python.

https://github.com/dropbox/pyston

forgotpwtomain · on April 18, 2016

Any idea why they decided to write their own JITed Python implementation rather than putting in additional support for PyPy?

joshmaker · on April 19, 2016

According to this slide, their "real world" tests of PyPy didn't live up to the benchmarks, and in fact showed "no clear improvement" compared with cpython.

http://www.slideshare.net/KevinModzelewski/pyston-talk-11101...

Fede_V · on April 18, 2016

C-API compatibility, I'm guessing. I agree it was kind of a strange choice though.

andreasvc · on April 19, 2016

PyPy also has higher memory usage.

txdv · on April 19, 2016

I'm kinda jealous that there is so much effort put into Python... I like to use ruby as a scripting language...

mangeletti · on April 19, 2016

Something worth noting:

Python 3.6 has a number of refactorings of standard library components[1], and while that doesn't affect the CPython interpreter itself, these enhancements should do a lot to speed up applications that make heavy use of the standard library.

1. Some are mentioned at https://docs.python.org/3.6/whatsnew/3.6.html (search page for "fast")

vegabook · on April 18, 2016

Ah yes. 25% faster.

I get 50x faster (that would be 5000%) just dipping into Numpy when I need to (admittedly with AVX), and C when I have to. Both are like, trivially easy.

Why are we bothering with making native Python marginally faster when it is already the perfect tool for the mission it is there to accomplish (glue), and there are dozens of other tools which are optimized for performance?

Animats · on April 18, 2016

"Both are like, trivially easy."

Until you screw up memory allocation and have to debug.

There's a race condition in Python 3's CPickle that corrupts memory.[1] I can't reproduce it well enough to submit a bug report that won't be ignored.

[1] http://bugs.python.org/issue23655

burfog · on April 19, 2016

Debugging gets way easier if you finish the C-to-Python transition, ripping out the last bit of Python. Having the interpreter running makes debugging way harder than it needs to be. Lose that, and suddenly you can take advantage of all sorts of powerful debugging tools. (valgrind, less-painful use of a standard debugger, -fsanitize= compiler options, C interpreters, coverity, etc.)

Ditching the python also dramatically improves start-up latency.

rcarmo · on April 18, 2016

Because we need faster glue? :)

vegabook · on April 18, 2016

super glue!

rcarmo · on April 18, 2016

Precisely. :)

Also, see above my comment on using Hy (hylang.org) for LISPy glue.

exabrial · on April 19, 2016

But why? I'm not trolling, I mean this as sincere non hostile criticism... Static types alone improve code quality and execution speed, and it's been shown hardly anyone actually exploits dynamic types anyway.

eveningcoffee · on April 18, 2016

Is it at all possible to make Python really fast given that it depends on the GIL, which to my understanding makes Python performance memory bound?

andreasvc · on April 19, 2016

The GIL is an issue when you want parallelism. It is not relevant when talking about single thread performance.
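A small sketch of that distinction, assuming a standard (GIL-enabled) CPython build. The countdown loop and the value of `N` are arbitrary; timings vary by machine and interpreter, so no expected numbers are given.

```python
import threading
import time


def count_down(n):
    # Pure-Python CPU-bound loop; it holds the GIL while it runs.
    while n > 0:
        n -= 1


N = 2_000_000

# Same work done twice on one thread.
start = time.perf_counter()
count_down(N)
count_down(N)
sequential = time.perf_counter() - start

# Same work split across two threads.
threads = [threading.Thread(target=count_down, args=(N,)) for _ in range(2)]
start = time.perf_counter()
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.perf_counter() - start

print(f"sequential: {sequential:.3f}s, two threads: {threaded:.3f}s")
```

On a GIL-enabled build the two-thread run is typically no faster than the sequential one, which is the parallelism cost; the single-thread loop itself is not slowed down by contending for the lock.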

eveningcoffee · on April 19, 2016

As I understand it, it eliminates some optimization possibilities and adds overhead.

I do not have time to check the Python code to see how it is actually implemented (hence my question) but based on my knowledge it would imply at least a CAS operation to check and take the lock, writing the register values into memory (cache) and applying a memory barrier.

You cannot keep the values in the registers (elimination of optimization possibilities) and you add considerable overhead by needing memory barriers and CAS operations.

I am not claiming that Python does it like this, I am just assuming that it should do it like this to obtain the guarantees of the GIL.

smegel · on April 19, 2016

The GIL is a condition variable and a mutex. Nothing fancy. In a single threaded program, it gets acquired once, if at all.

eveningcoffee · on April 19, 2016

A mutex and a shared condition variable are expensive compared to a single instruction.

But I had the impression that it is acquired before every atomic Python instruction, and it looks like it is actually acquired for a group of a predefined number of instructions (100), which are then executed inside one GIL time frame [0].

Therefore it actually should not be a big obstacle to make Python code run fast by a JIT compiler.

[0] http://www.dabeaz.com/python/GIL.pdf
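For reference, the 100-instruction figure describes the older tick-based scheme (exposed in Python 2 as `sys.getcheckinterval()`). CPython 3.2 and later use a time-based switch interval instead, which can be inspected from the standard library as sketched below; the value 0.01 is just an arbitrary example setting.

```python
import sys

# How long a thread may run before it is asked to release the GIL.
interval = sys.getswitchinterval()
print(f"GIL switch interval: {interval * 1000:.1f} ms")

# The interval is tunable at runtime; restore the original value afterwards.
sys.setswitchinterval(0.01)
changed = sys.getswitchinterval()
sys.setswitchinterval(interval)
print(f"temporarily set to: {changed * 1000:.1f} ms")
```

Either way, the lock is handed over between batches of work rather than around every single bytecode instruction.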

hodwik · on April 18, 2016

Could you explain why GIL wouldn't make it CPU bound, rather than memory bound?

eveningcoffee · on April 19, 2016

I am sorry, but by memory I meant cache memory that is in fact inside the CPU. This is an error on my side.

I read [0] and could not infer anything to confirm my understanding, but if we consider the semantics of the GIL, then there should be some form of guarantee in place that two CPU cores see a coherent picture of the cache (otherwise two threads would see incoherent values of the same variable or would not see the changes at all). This is usually achieved by some form of memory barriers and flushing the registers into memory, and in my experience (which comes from Java) this makes the code an order of magnitude or two slower.

[0] https://wiki.python.org/moin/GlobalInterpreterLock

baccheion · on April 19, 2016

Very, if a VM or JIT compiler is added to the mix, along with optional type hints. Many potential Python optimizations that can't be done today suddenly become possible if something can provide real-time code monitoring/optimization.

continuations · on April 18, 2016

How does Falcon compare to PyPy performance wise?

amelius · on April 18, 2016

Why not write a Py->JS transpiler to solve the efficiency problem?

talideon · on April 18, 2016

A big part of the reason why the likes of V8 are fast is because they can concentrate on being single-threaded. If you wanted to write a transpiler, it would need to be targeted at a JS implementation that allowed for multiple threads to be run at the same time. If you did that, you'd have a JS runtime with similar problems to CPython.

rpearl · on April 18, 2016

I can't tell if you're joking, but if not... No, JS isn't more efficient than Python.

infogulch · on April 18, 2016

I couldn't tell either. Poe's law in full effect.

RubyPinch · on April 19, 2016

having a Python JITing VM written in python, compiled to javascript, can actually be faster than CPython (after warmup!)

https://www.rfk.id.au/blog/entry/pypy-js-faster-than-cpython...

Johnny_Brahms · on April 19, 2016

Maybe not, but if you are doing single threaded stuff in python you will probably get better performance by using node. Now, I'm no python or javascript programmer, but I generally get a lot better performance for small scripts using node.

outworlder · on April 18, 2016

Sarcasm detector is also failing.

But even compiling to C would not magically solve the issues.