Quite old. Plain Python is still slow, but PyPy is in the same league as Node.js.
I rewrote the same program (compare two images, generate a third that shows the diff) in a number of languages. Considering CPython as the reference implementation (1x), I got 100x in Rust, 60x in Go, 12x in Node.js and 10-11x in PyPy.
Initially I got 4x with PyPy, but I did a light refactor, removing some map()s and zip()s that were gratuitous (over 3-element lists), and then PyPy got really fast.
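For a rough idea of the shape of that code, here is a minimal sketch of such a per-pixel diff loop (assuming images as nested lists of (r, g, b) tuples; the actual benchmarked program differs):

    # Hedged sketch: per-pixel absolute difference of two images stored
    # as rows of (r, g, b) tuples. Not the original program.
    def diff_images(a, b):
        out = []
        for row_a, row_b in zip(a, b):
            out_row = []
            for (r1, g1, b1), (r2, g2, b2) in zip(row_a, row_b):
                # Unrolled arithmetic: map()/zip() over 3-element tuples
                # here is exactly the kind of overhead PyPy punishes
                out_row.append((abs(r1 - r2), abs(g1 - g2), abs(b1 - b2)))
            out.append(out_row)
        return out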
You've picked a poor example, I think. And you're probably coding in an inefficient way.
Here's the OpenCV way for a pair of 2048x2048 images:
    import cv2
    import time

    t = time.clock()
    # Load both images as single-channel grayscale
    a = cv2.imread("./image_1.tiff", cv2.IMREAD_GRAYSCALE)
    b = cv2.imread("./image_0.tiff", cv2.IMREAD_GRAYSCALE)
    c = a - b  # element-wise difference (wraps around on uint8 underflow)
    cv2.imwrite("out.tiff", c)
    print(time.clock() - t)
Takes about 0.2 CPU seconds on average for me (note: time.clock measures CPU time on UNIX, unlike time.time).
    #include <opencv2/opencv.hpp>
    #include <iostream>
    #include <ctime>

    using namespace cv;
    using namespace std;

    int main() {
        clock_t begin = clock();
        // Load both images as single-channel grayscale
        Mat a = imread("./image_1.tiff", IMREAD_GRAYSCALE);
        Mat b = imread("./image_0.tiff", IMREAD_GRAYSCALE);
        Mat c = a - b;  // element-wise difference (saturates on underflow)
        imwrite("out.tiff", c);
        clock_t end = clock();
        double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;
        cout << elapsed_secs << endl;
        return 0;
    }
Again, about 0.2 seconds. The difference is negligible if you use the right libraries. Python should not be your bottleneck for high-performance code.
That's true, but most of the time you can express the "new" stuff in terms of existing fast libraries.
I've had a few cases where you can't, and have become a big fan of Cython for that. It lets you add C type declarations to Python code, and then compile the module at import time. Example here: http://pastebin.com/sF8KmyiU
All of pure Python is still allowed in these modules, but the typed variables become plain C variables instead of Python objects, and loops become plain C loops. For this particular function, I got a 1000x speedup compared to the original Python code.
In the end, this isn't Python any more - but it's close enough, and only needed for loops that run over millions of items.
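For flavor, a minimal Cython sketch of the idea (not the pastebin code; the file and function names are made up):

    # hot.pyx -- hypothetical example, not the pastebin function.
    # The cdef-typed variables compile to plain C locals, and the
    # range() loop compiles to a plain C for loop.
    def sum_squares(double[:] data):
        cdef double total = 0.0
        cdef Py_ssize_t i
        for i in range(data.shape[0]):
            total += data[i] * data[i]
        return total

Compile-at-import works via pyximport (import pyximport; pyximport.install(), then import hot as usual), and you'd call it with any double buffer, e.g. a NumPy array.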
Excluding the write, I get around 0.07 seconds in C++ and the same in Python (good call, though).
I agree that in pure Python it'd be slower, but realistically, why would you do that? Unless you work somewhere where you're forced to write your own libraries... but even then you could write them in a lower-level language and call them from Python.
If you only removed the write and not the reads from the timing, I would guess the reads (even with warm caches) still dominate the time.
And you would want to do it in pure Python if you want to answer the question "how fast can we make interpreted Python?". Using C extensions for that is cheating, as it isn't Python and it isn't interpreted. You don't answer the question "how fast can you run?" with "30 km an hour, using a bicycle", either.
If you make those images large enough (and I guess 2048x2048 qualifies), any language that uses OpenCV to do the job will give results in the same ballpark. For example, by scaling up the images you can make the difference between Python implementations that call OpenCV as small as you want.
Not the parent, but I think the point is that, in the real world, the vast majority of use cases for which Python is slow are ones where you would use an existing library written in a lower-level language. So questions like "How fast can we make matrix multiplication in Python?" are irrelevant for the vast majority of Python developers because NumPy exists, and it's always going to be faster than anything you can write in pure Python.
In IPython, %timeit gives 4.3 ms per loop just on a-b. In C++ it's about the same.
I agree that the question is valid - making vanilla Python faster is cool. My point was that this particular example (image processing) was flawed, because it's not something a sane person would ever do in pure Python.
"In Ipython %timeit gives 4.3ms per loop, just on a-b. In C++ it's about the same."
Of course it is about the same. Except for function entry and exit, which should cost a few thousand instructions at most, both run the exact same instruction sequence (assuming identical OpenCV versions, compiler, and compiler flags).
If you want an easily measurable difference, use much smaller images and make a few thousand or even a few million calls, or look at the Python sources to see how efficiently it calls into C.
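As a hedged sketch of that measurement (using NumPy arrays as stand-ins for the images; the sizes are arbitrary):

    import numpy as np
    import timeit

    # With tiny arrays, the Python-to-C call overhead dominates each call;
    # with large arrays, the actual C loop over the data dominates.
    small_a = np.ones((8, 8), np.uint8)
    small_b = np.zeros((8, 8), np.uint8)
    big_a = np.ones((2048, 2048), np.uint8)
    big_b = np.zeros((2048, 2048), np.uint8)

    print(timeit.timeit(lambda: small_a - small_b, number=100000))  # mostly overhead
    print(timeit.timeit(lambda: big_a - big_b, number=100))         # mostly real work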
I was under the impression that Numpy also just calls BLAS underneath? Hence why doing element-wise calls in Numpy is far, far faster than simply doing nested for loops.
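A quick way to see that gap (a hedged sketch; exact ratios vary by machine):

    import numpy as np
    import timeit

    a = np.random.rand(500, 500)
    b = np.random.rand(500, 500)

    def loop_add(a, b):
        # One interpreter round-trip per element
        return [[a[i, j] + b[i, j] for j in range(a.shape[1])]
                for i in range(a.shape[0])]

    print(timeit.timeit(lambda: a + b, number=100))         # one C loop per call
    print(timeit.timeit(lambda: loop_add(a, b), number=1))  # orders of magnitude slower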
But I think this is the great strength of Python. It's a glue language. If you need speed, you can always write a wrapper around a C/C++ library.
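As a minimal illustration of that glue role, a sketch using ctypes against the standard C math library (the library lookup is platform-dependent):

    import ctypes
    import ctypes.util

    # Locate and load the C math library, then call its cos() directly.
    libm = ctypes.CDLL(ctypes.util.find_library("m"))
    libm.cos.restype = ctypes.c_double
    libm.cos.argtypes = [ctypes.c_double]
    print(libm.cos(0.0))  # 1.0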
The standard response from the Rust team is that Rust should match or beat non-SIMD C++ performance and if it doesn't, you should file a bug.
Note: The first thing anybody will ask when you complain about Rust being slow is whether you compiled with optimizations turned on (`cargo build --release`) since it tends to make a 10-15x difference.
Theoretically, the Rust borrow checker also knows enough about your code's aliasing and dispatch semantics that this additional information could enable deeper optimizations than are available in either C or C++. Numerical code in Rust could compete with Fortran in performance, but I don't know whether any of that has been realized in Rust yet.
But it's high performance given the semantics of the language. The work that has gone into making V8 perform as it does is extraordinary and should be respected, not mocked as 'hilarious'.
The benchmarks show that by using another language his code went 8x faster. Perhaps if he optimized his C++ it would go even faster. It's funny that people are calling 10x off the theoretical maximum "high performance". I wonder if these people are living in a JavaScript bubble.