Intel Distribution for Python (intel.com)
139 points by EntICOnc on July 21, 2021 | 82 comments



> the Intel CPU dispatcher does not only check which instruction set is supported by the CPU, it also checks the vendor ID string. If the vendor string says "GenuineIntel" then it uses the optimal code path. If the CPU is not from Intel then, in most cases, it will run the slowest possible version of the code, even if the CPU is fully compatible with a better version.[1]

I’ve been a little shy about using Intel software since reading about this years ago.

[1] https://www.agner.org/optimize/blog/read.php?i=49


This has gotten a bit better. Last time I checked, MKL now uses Zen-specific kernels for sgemm/dgemm. Unfortunately, those kernels are slower than MKL's own AVX2 kernels. But at least it no longer uses the pre-modern SIMD kernels on AMD Zen.

Edit, comparison:

    $ perf record target/release/gemm-benchmark  -d 1024
    Threads: 1
    Iterations per thread: 1000
    Matrix shape: 1024 x 1024
    GFLOPS/s: 96.36
    $ perf report --stdio -q | head -n3
        97.18%  gemm-benchmark  gemm-benchmark      [.] mkl_blas_def_sgemm_kernel_0_zen
         1.94%  gemm-benchmark  gemm-benchmark      [.] mkl_blas_def_sgemm_scopy_down16_bdz
         0.78%  gemm-benchmark  gemm-benchmark      [.] mkl_blas_def_sgemm_scopy_right4_bdz
After disabling Intel CPU detection:

    $ perf record target/release/gemm-benchmark  -d 1024
    Threads: 1
    Iterations per thread: 1000
    Matrix shape: 1024 x 1024
    GFLOPS/s: 129.12
    $ perf report --stdio -q | head -n3
        97.02%  gemm-benchmark  libmkl_avx2.so.1        [.] mkl_blas_avx2_sgemm_kernel_0
         1.77%  gemm-benchmark  libmkl_avx2.so.1        [.] mkl_blas_avx2_sgemm_scopy_down24_ea
         1.02%  gemm-benchmark  libmkl_avx2.so.1        [.] mkl_blas_avx2_sgemm_scopy_right4_ea
Benchmarked using https://github.com/danieldk/gemm-benchmark and oneMKL 2021.3.0.


How could one do your trick on Windows?


That's just plain sinister.

I'm really surprised popular numerical computing Python packages don't already have optimized hardware back-ends for things like NumPy... similar to ORC (the Oil Runtime Compiler), which has been around for quite some time:

https://github.com/GStreamer/orc

But I don't know that much about Python under the hood, and I'm willing to bet, since so many academics work on this, that there are already optimized FFIs. I've used TensorFlow and it can offload tensor math to GPUs, but only NVIDIA's AFAIK.


> I'm really surprised popular numerical computing Python packages don't already have optimized hardware back-ends for things like NumPy

I think it is hard to beat modern BLAS implementations for common operations. E.g. Apple Accelerate (which also implements the BLAS/LAPACK APIs) uses undocumented AMX instructions for large speedups compared to an ARM NEON implementation.
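
If you want to see which BLAS your NumPy build actually dispatches to, there's a quick check (hedged: the output format varies between NumPy versions):

    import numpy as np

    # Prints the BLAS/LAPACK libraries this NumPy build links against
    # (e.g. MKL, OpenBLAS, or Accelerate).
    np.show_config()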


There is a longstanding issue around MKL and OpenBLAS optimization flags making Intel systems artificially faster than AMD ones for NumPy computations. https://stackoverflow.com/questions/62783262/why-is-numpy-wi...

If there are true optimizations to be had, wonderful. But those should be added to the core binaries on PyPI / conda. I am worried that Intel here may again be trying to artificially segment the optimization work on their math libraries for business rather than technical reasons.



At least for single-threaded "large" GEMM, OpenBLAS has always been similar to MKL once it has the micro-architecture covered. If there's some problem with the threaded version (which one?), has it been reported, like it would be for use in Julia? Anyway, on AMD, why wouldn't you use AMD's BLAS (just a version of BLIS)? That tends to do well multi-threaded, though I'm normally only interested in single-threaded performance. I don't understand why people are so obsessed with MKL, especially when they don't measure and understand the measurements.


What do you mean by ‘artificially faster’?


Intel libraries whitelist their own CPUs for certain extension instruction sets, instead of checking the relevant CPUID feature flag for that feature as their own documentation tells you to.


CPUID is insufficient. CPUID can tell you that a CPU has a working PDEP/PEXT, but it can't tell you that a CPU's PDEP sucks like the one on all AMD processors prior to Zen3.
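
To make that distinction concrete, here's a minimal Linux-only sketch; note the "fast PDEP" cutoff (AMD family 0x19, i.e. Zen 3) is an assumption based on the claim above, not an authoritative table:

    # Vendor string and feature flags are separate pieces of information
    # in /proc/cpuinfo; having a flag doesn't mean the feature is fast.
    def cpu_info():
        info = {}
        with open("/proc/cpuinfo") as f:
            for line in f:
                if ":" in line:
                    key, _, value = line.partition(":")
                    info.setdefault(key.strip(), value.strip())
        return info

    info = cpu_info()
    flags = set(info.get("flags", "").split())
    has_bmi2 = "bmi2" in flags  # PDEP/PEXT are part of BMI2
    is_amd = info.get("vendor_id") == "AuthenticAMD"
    # Assumed cutoff: AMD family 0x19 (Zen 3) is where PDEP became fast.
    fast_pdep = has_bmi2 and not (is_amd and int(info.get("cpu family", "0")) < 0x19)
    print(f"bmi2={has_bmi2} vendor={info.get('vendor_id')} fast_pdep={fast_pdep}")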


This argument crops up every time, but it's irrelevant: MKL does, and always has, worked absolutely fine on AMD processors with the checks disabled. And no, reproducibility is not a feature of MKL that is enabled by default, and it never was. Intel even had to add a disclaimer that MKL doesn't work properly on non-Intel processors after legal threats, and they still ran with that for literally years despite knowing it could just be fixed.

When this first cropped up, I was using Digg.

Edit: removed the note that they fixed the cripple-AMD function; they didn't. They actually just removed the workaround that made it easier to disable the checks; I was misinformed. Apparently some software now does runtime patching to fix it, including Matlab...


Recent MKL would generate reasonable code for Zen if you set a magic environment variable, but support was very limited (possibly only sgemm and/or dgemm when I looked). Once you've generated AVX2 code with a reasonable block size, you're most of the way there. But why not just use a free BLAS which has been specifically tuned for your AMD CPU (and probably your Intel one)?
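
For anyone searching: the variable usually cited is MKL_DEBUG_CPU_TYPE=5. A sketch of using it from Python (it has to be set before MKL is loaded, and, per the reply below, recent MKL releases ignore it):

    import os

    # Must be set before MKL is loaded, i.e. before importing numpy.
    # Reportedly ignored (support removed) in recent MKL releases.
    os.environ["MKL_DEBUG_CPU_TYPE"] = "5"

    import numpy as np
    a = np.random.rand(1024, 1024).astype(np.float32)
    _ = a @ a  # should now hit the AVX2 kernels on Zen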


Nope, they removed support for the magic environment variable in the latest MKL release.


I stand corrected. No loss, anyway. (I probably saw it in oneapi from sometime last year, and was surprised.)


Yeah, I don't think all the hacks are out yet. But my point is only that the availability of some feature is not the only input to the decision to use that feature at runtime. Some of these conditions may look suspiciously like shorthand for IsGenuineIntel(), even if they are legit, like blacklisting BMI2 on AMD, because BMI2 on AMD was useless over most of its history.


The real answer is to do feature probing and benchmarking the underlying implementation. In the cloud you never really know the hardware backing your instance.
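
A toy sketch of that idea in Python; the two candidates are stand-ins for real kernel variants, and a real library would run this probe once at startup:

    import timeit

    def sum_builtin(data):
        return sum(data)

    def sum_loop(data):
        total = 0
        for x in data:
            total += x
        return total

    # Benchmark each candidate on representative input, keep the fastest.
    data = list(range(10_000))
    candidates = [sum_builtin, sum_loop]
    best = min(candidates, key=lambda f: timeit.timeit(lambda: f(data), number=50))
    print("selected:", best.__name__)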


From a practical perspective you have to use some BLAS library. If there is a working alternative from AMD, it would be great if you shared it. They did have one in the past, although I don't recall its name.


Thanks for bringing up that link; I'd had a nagging question about how specific Intel's performance libraries are to Intel hardware. At least in this case, it seems not very.


That SO performance benchmark would be so much more useful if the OP had also run OpenBLAS on the Xeon.


What, no Debian/Ubuntu? Sigh.


Of course:

    echo "deb https://apt.repos.intel.com/oneapi all main" | sudo tee /etc/apt/sources.list.d/oneAPI.list
You can read the "apt" section of the package manager documentation, if that's what you prefer. https://software.intel.com/content/www/us/en/develop/documen...



Do AMD even have optimized packages available? Don't get me wrong, I'm not a huge fan of what Intel get up to, but AMD's profiling software is dreadful, so I'm not exactly surprised that Intel don't even entertain the option.


Quite unsurprisingly, this distribution has no support for ARM: https://software.intel.com/content/www/us/en/develop/article...

I once was excited about Intel releasing their own Linux distro (Clear Linux), but it has the same problem. It looks like Intel is trying to make custom optimized versions of popular open-source projects just to get people to use their CPUs, as they lose their leadership in hardware.


I'm not sure I see why you would expect anything different? The entire point of this framework is to provide a bunch of tools for squeezing the most you can out of SSE, which is specific to x86.

I don't know if there's an ARM-specific equivalent, but if you want to use TensorFlow or PyTorch or whatever on ARM, they'll work quite happily with the Free Software implementations of BLAS & friends. If you code at an appropriately high level, the nice thing about these libraries is that you get vendor-specific optimizations without having to code against vendor-specific APIs. Which is great. I sincerely wish I'd had that for the vector-optimized code I was writing 20 years ago. In any case, if ARM Holdings or a licensee wants to code up their own optimized libraries that speak the same standard APIs (assuming they haven't already), that would be awesome, too. The more the merrier. How about we all get on the vendor-optimized-libraries-for-standard-APIs bandwagon? Who doesn't want all the vendor-specific optimizations without all the vendor lock-in?
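
That's the whole appeal in a couple of lines of NumPy (sizes arbitrary): the same high-level call hits whichever BLAS the build links, with no vendor API in sight:

    import numpy as np

    # Dispatches to whatever BLAS this NumPy build links:
    # MKL, OpenBLAS, Accelerate, AOCL-BLIS, ...
    a = np.random.rand(2048, 2048).astype(np.float32)
    b = np.random.rand(2048, 2048).astype(np.float32)
    c = a @ b  # sgemm under the hood, vendor-optimized if available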

Alternatively, if you would rather get really good and locked in to a specific vendor, you could opt instead to spam the CUDA button. That's a popular (and, as far as I'm concerned, valid, if not necessarily suited to my personal taste) option, too.


"Their" CPUs meaning x86 platforms, in this case.

Plus, who's surprised? This is how Intel makes money. The consumer segment is a plaything for them; the real high-rollers are in the server segment, where they butter them up with fancy technology and the finest digital linens. Is it dumb? A little, but it's hardly a "problem" unless you intended to ship this software on first-party hardware, which, hint-hint, the license forbids in the first place.

At the end of the day, this doesn't really irk me. I can buy a compatible processor for less than $50, that's accessible enough.


No, their CPUs as in ones from Intel. Intel has long done a thing in their compilers where they detect the CPU model and run less optimized code if it isn't Intel. They claim it is because they can't be sure "other" processors have correctly implemented SSE and other extensions. So Intel's Linux distro is going to run faster on an Intel CPU because it was compiled with ICC.


I don't know much about it, but Intel's Clear Linux does not use ICC; this is in their FAQ: https://docs.01.org/clearlinux/latest/FAQ/index.html#does-it...


This is trivially easy to defeat, just so you know. If anyone reading is ever in need of optimized math library performance on AMD, just speak to your hardware/cloud vendor; they all know the tricks.


The link says Core Gen 10 or Xeon, so you may be out of luck on AMD or at less than $50.

I think this is more likely aimed at AMD than Arm (I don't think Arm is yet a threat in this space), and whilst they're entitled to do what they want, it does make me less enthused about Intel and frankly more likely to support their competitors.


AMD has their own equivalent: https://developer.amd.com/amd-aocl/

I'm not sure it's a sin for hardware manufacturers to support their products? In the days of yore, we even expected it of them.


Not a sin, but it's not really just about supporting (or optimising) their products; it's about doing so whilst trying to increase the lock-in beyond what is achieved on performance grounds alone.

I may be wrong, but my experience is that AMD has been a bit better on this in the past, e.g. their OpenCL libraries supported both Intel and AMD, whereas Intel's were Intel-only.


I would assume that's not entirely a fair comparison, though. Intel's 3D acceleration hardware only ever appears in Intel-manufactured chipsets, which only ever contain Intel-manufactured CPUs.

AMD, on the other hand, also supplies Radeon GPUs for use with Intel CPUs. For example, that's the setup in the computer on which I'm typing this.

So I have a hard time seeing anything nefarious there. The one is obviously a business necessity, while the other would obviously be silly. Perhaps that changes with the new Xe GPUs?


Sorry, I should have been clearer: Intel's CPU OpenCL drivers only supported Intel and not AMD, whereas AMD's CPU OpenCL drivers supported both, so GPUs aren't relevant in this case.

I can see how, if you've invested a lot in software, you'd like to get a competitive advantage over your nearest rival, so maybe it's a price we have to pay.


Yes. The difference is that they may be "theirs", but I think it's all free software. At least the linear algebra stuff is. They supply changes for BLIS (which seem not to get included for ages). Their changes may well be relevant to Haswell, for instance. I don't remember what the difference in implementation was between Zen and Haswell, but they were roughly the same code at one time.


I wonder what features are missing from a Comet Lake generation Pentium; those can be had for ~$70 these days. Other than the feature that the box says "Core" on it instead of "Pentium".

EDIT: Ah, I found it, AVX2.


The capital model for cost recovery and earnings is one thing, but in modern times the amount of money that flows through Intel Inc. is not the same thing. Intel played dirty for long years to crush competitors, not to "make money" like they need it. "Greed is good", remember that? So, no. Apologists, count your quarterly dividends, but you have no platform for social advocacy here, IMO.


Clear Linux looked unconvincing to me. When I looked at their write-up, the example of what they say they do with vectorization was FFTW. That depends on hand-coded machine-specific stuff for speed, and the example was actually from the testing harness, i.e. quite irrelevant. I did actually run the patching script, for amusement.


Alder Lake looks seriously impressive if the rumoured performance is even close to accurate, so I wouldn't count them out just yet - that being said, they will never get a run like they did over the last 10 years again.


You can easily try it yourself [1]:

    conda create -n intel -c intel intel::intelpython3_core
Or [2]:

    docker pull intelpython/intelpython3_core
Note that it is quite bloated but includes many high-quality libraries.

You can think of it as a recompilation in addition to a collection of patches to make use of their proprietary libraries.

Other useful links to reduce the noise in this thread: [3], [4], [5], [6].

[1] https://software.intel.com/content/www/us/en/develop/article...

[2] https://software.intel.com/content/www/us/en/develop/article...

[3] https://www.nersc.gov/assets/Uploads/IntelPython-NERSC.pdf

[4] https://hub.docker.com/u/intelpython

[5] https://anaconda.org/intel

[6] https://github.com/IntelPython


Any benchmark comparison data?

   For example:   .... benchmarks with this python is XXX % higher than ... (std python, AMD, ARM)
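
Not authoritative numbers, but you can generate your own comparison by running something like this under the stock interpreter and then under the Intel one (matrix size and iteration count are arbitrary):

    import time
    import numpy as np

    n = 2048
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)

    start = time.perf_counter()
    for _ in range(10):
        a @ b
    elapsed = time.perf_counter() - start
    # A gemm does ~2*n^3 floating-point operations.
    print(f"sgemm: {10 * 2 * n**3 / elapsed / 1e9:.1f} GFLOP/s")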


I haven't done a comparison in a long time, and, even then, it wasn't very thorough, so take this with a grain of salt.

But, 6 years ago, when I was in grad school, just swapping to the Intel build of numpy was an instant ~10x speedup in the machine learning pipeline I was working on at the time.

No idea if that's typical or specific to what I was doing at the time. I don't use MKL anymore because ops doesn't want to deal with it and the standard packages are already plenty good enough for what I'm doing nowadays. If you forced me to guess, I guess I'd have to guess that my experience was atypical.


Why are they making their own distro instead of putting the code back into mainline if it's useful? Do they have some particular IP that makes this impossible?


Here's the list of CPUs which incorporate the AVX2 instructions that enable some of these optimizations:

https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#CPU...

You could have your distro check the flags in /proc/cpuinfo to tell whether or not these instructions are available. Or you could check whether the CPU is in the Intel half of the list or the AMD half. Or you could write your own distro that only runs on the first half of the list.

I get that Intel's contributions aren't purely altruistic. There are likely to be subtle tuning problems that require slight changes to optimize on different platforms, and they can't really be expected to do free work for AMD. But it looks to me like they're being unnecessarily anticompetitive.


> being unnecessarily anticompetitive

Isn't setting up barriers to entry generally considered to be a part of healthy competition? I'd hazard to say that as long as a company is playing within the boundaries of what's allowed, there's nothing they could do that's anticompetitive; at the most, you could accuse them of being somewhat unsportsmanlike.


> Isn't setting up barriers to entry generally considered to be a part of healthy competition?

No, it is not. This is better described as vendor lock-in than as a barrier to entry. But vendor lock-in also works against healthy competition.

Healthy competition means that users choose your product because it suits their needs the best, not because they are somehow forced to choose your product.


Competition is desirable because it aligns with society’s goals of innovation and progress which also imply increased productivity and lower prices.

Artificial barriers to entry are contrary to that and if they’re not illegal they should be.


Where do you draw the line at which barriers become 'artificial'?


It's artificial when the vendor expends additional time, effort, or funds to construct a barrier, or chooses an equally-priced non-interoperable design that a rational, informed consumer with a choice would reject. If you're expending great effort to write custom DRM or to reinvent open industry standards that you could have installed cheaply, that's artificial.

I fully admit that there are natural barriers that occur at times. I don't think that you should be expected to reverse-engineer your competitor's products and bend over backwards to make them work better.

Here, for a concrete example, Intel had a clear choice to test whether a processor supported a feature by checking a feature flag - It's in the name, they're literally implemented for that exact purpose - or they could expend extra effort in building their own feature flag database by checking manufacturer and part number. They could have either expended extra effort to launch and distribute their own entire custom Python distribution, or submitted pull requests to the existing main distribution. For another example, Apple could have used industry-standard Phillips or Torx screws in their hardware: Manufacturers had lines to produce them, distributors had inventory of the fasteners, users had tools to turn them. Instead, they went to great expense to build their own incompatible tri-lobe screws, requiring probably millions of dollars in investment in custom tooling and production lines, all for the sake of creating an artificial barrier.


We could start with something similar to the concept of Pareto optimality; Intel could have delivered their maximum performance without preventing optimizations from being applied equally on AMD hardware, but instead they choose to disadvantage AMD without providing anything extra on top of what they could do while remaining "neutral".


I think there is a pretty big base of people who do big-data work using NumPy and Pandas (fintech, etc.). They want to squeeze every bit of computing power out of their specific Intel chipsets, GPUs, etc., and Intel's distro really helps them out.

A 10% speed improvement on thousands of jobs could in theory save you a nice chunk of time. This becomes very important in the financial markets, where you need batch jobs to be finished before markets open, or when you just want to save 10% on your EC2 bill.


10% is around the noise level for HPC, especially for throughput, depending on scheduling. I rather doubt you couldn't do the same with free software.


Yet plenty of HPC installations prefer IBM's XL or NVIDIA's PGI compiler suites.

So they definitely don't agree that you can do the same with free software.


They may or may not disagree on the basis of measured performance in different cases. (The US labs are investing heavily in free-software LLVM.) Often not with ifort, at least. However, I was talking about the Intel stuff, and partly from experience with R versus the Microsoft version.

However, I recently ran the Polyhedron Fortran benchmarks with the compilers to hand (~2020 vintage). XL (on POWER) was the only one that gave a significantly better bottom line; obviously IBM know how to compile Fortran well by now (or by Fortran-H). As far as I remember, that was essentially due to the treatment (vectorization) of maths intrinsics, which probably aren't so dominant in typical HPC code. One bad case -- fmod inlining -- has since been fixed in GCC. Without GCC's unfortunate longstanding failure to vectorize sincos (or equivalent), gfortran should have beaten ifort significantly in the bottom line, and at least got close to XLF. PGI was distinctly worse than GCC, but may do better at OpenACC/GPU offload, for instance. XL may win on OpenMP, since some of the current standard was for Sierra's needs. I should find time for the NAS benchmarks.


You are correct: nothing Intel provides in their Python distro can't be obtained elsewhere; this is just a nice wrapper.


To me this just looks like Intel saw what Nvidia accomplished with CUDA, locking in large portions of the scientific computing community with a hardware-specific API, and went "yeah, me too, thanks".

Thankfully, accelerated math libraries already exist for Python without the vendor lock-in.


Intel has been releasing MKL/Math Kernel Library builds for Java for a really long time. Hopefully core Python devs can learn a few tricks and similar changes can make it upstream.


Looks like a recompilation. I am guessing the gains are in NumPy and SciPy. For a Python-heavy code base, I doubt it can be more performant than PyPy.
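
A toy illustration of that split (names and sizes arbitrary): PyPy targets the first kind of workload, a recompiled NumPy/SciPy stack the second:

    import time
    import numpy as np

    def py_loop(n):  # interpreter-bound: dominated by bytecode dispatch
        total = 0
        for i in range(n):
            total += i * i
        return total

    a = np.random.rand(1024, 1024)  # BLAS-bound work lives in numpy

    t0 = time.perf_counter(); py_loop(10**7)
    t1 = time.perf_counter(); a @ a
    t2 = time.perf_counter()
    print(f"pure-Python loop: {t1 - t0:.3f}s  BLAS gemm: {t2 - t1:.3f}s")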


Python 3.7.4, when 3.10 is just around the corner.


Maybe I'm missing something but it seems to me that this can only cause fragmentation in the Python space.

Why not use the original distributions?


There are a number of alternate interpreters available. The typical selling point is that they are faster, and that seems to be the value proposition of Intel's.

One use might be improving the throughput of a compute-bound system, like an ETL pipeline written in Python, with little effort. Ideally by just downloading the new interpreter.


Ok. If they offer Python without the GIL then I'm all ears :)


I don't think Python is ever going to get rid of the GIL. I haven't looked, but there are two things that may speed it up quite a bit: use native types, and provide the ability to turn "off" the GIL if you know you will not be using multi-threading within a process.

I guess that is my naive wish list for a short term speed up :)


Jython doesn't have a GIL, but it doesn't support Python 3, and I've never used it.


Jython would also have issues with the many C libraries that Python code relies on today.


Numba might be what you're looking for: http://numba.pydata.org/
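
For what it's worth, Numba can compile hot loops and release the GIL inside the compiled region; a minimal sketch using its documented @njit options:

    import numpy as np
    from numba import njit, prange

    # Compiled code runs without the GIL (nogil=True) and parallelizes
    # loops written with prange (parallel=True).
    @njit(parallel=True, nogil=True)
    def row_sums(a):
        out = np.empty(a.shape[0])
        for i in prange(a.shape[0]):
            out[i] = a[i].sum()
        return out

    print(row_sums(np.random.rand(1000, 1000))[:3])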


That looks really interesting. I'll definitely be trying it out. Thanks!!!


A pythonic language that included something analogous to Golang's channels/goroutines would be my ideal.


Julia does have channels similar to Go's, although whether you want to call it pythonic or not is up to you.


I've seen hype for Julia over and over, but this is the first piece of information that's made me genuinely interested.

Thanks for the heads up!

EDIT: Oh god, it's 1-indexed


While people discuss it a lot, in the end 1-indexing doesn't really matter. I think it comes from Fortran/Matlab.


I agree, it doesn't really matter, but I've been programming long enough that I can see it being that top step that's always half an inch too tall, the one I'm going to stub my toe on.


For sure. I switch between Python, C/C++, and Julia a lot, and, well, let's say bounds errors are pretty common for me.


The "idiomatic" way to access the first element in an array/sequence in julia is to use the `first` function, e.g. `first(arr)` vs. `arr[1]`. This works across a larger number of array types, including OffsetArrays with 0-based index offsets.


My advice would be to use `begin` and `end`. Then you don't have to think about the indexing.


Mystique (PR)?


I don't know what Intel did for the proprietary version, but the first thing you should do for Python is to compile it with GCC's -fno-semantic-interposition. I don't know if there's a benefit from vectorization in parts of the interpreter, for instance, or whether -Ofast helps generally if so, but I doubt there's anything Intel-CPU-specific involved if there is. I've never looked at it; has the interpreter not been well profiled, with such optimizations provided? Anyway, if you want speed, don't use Python.

It's obviously not relevant to Python per se, but you get basically equivalent performance to MKL with OpenBLAS or, perhaps, BLIS, possibly with libxsmm on x86. BLIS may do better than OpenBLAS on operations other than {s,d}gemm, and/or threaded, but they're both generally competitive.


So I see Intel and Microsoft both like naming things the Wrong(TM) way around? This name makes about as much sense as WSL... :D


We tried using Intel Python in one of my previous data science jobs, and ultimately gave up because compatibility with some packages from pip was a nightmare. Alas, I can't quite remember exactly what went wrong.


Is there a pip package?


I wonder who the person is who saw python and was like "You know what this needs? INTEL."



