Is sorted using SIMD instructions (0x80.pl)
177 points by tomerv on April 15, 2018 | 67 comments



When compiling with GCC, the option `-fopt-info-vec-all` gives you information about the vectorization of the code. In this case, GCC reports

   // the for block
   <source>:10:24: note: ===== analyze_loop_nest =====
   <source>:10:24: note: === vect_analyze_loop_form ===
   <source>:10:24: note: not vectorized: control flow in loop.
   <source>:10:24: note: bad loop form.
   <source>:5:6: note: vectorized 0 loops in function.
   
   // the if block inside the for block
   <source>:11:9: note: got vectype for stmt: _4 = *_3;
   const vector(16) int
   <source>:11:9: note: got vectype for stmt: _8 = *_7;
   const vector(16) int
   <source>:11:9: note: === vect_analyze_data_ref_accesses ===
   <source>:11:9: note: not vectorized: no grouped stores in basic block.
   <source>:11:9: note: ===vect_slp_analyze_bb===
   <source>:11:9: note: ===vect_slp_analyze_bb===
   <source>:11:9: note: === vect_analyze_data_refs ===
   <source>:11:9: note: not vectorized: not enough data-refs in basic block.
Edit: with the Intel compiler, using `-qopt-report=5 -qopt-report-phase=vec -qopt-report-file=stdout`:

   Begin optimization report for: is_sorted(const int32_t *, size_t)

        Report from: Vector optimizations [vec]
    
    LOOP BEGIN at <source>(12,5)
    
       remark #15324: loop was not vectorized: unsigned types for induction
                      variable and/or for lower/upper iteration bounds make
                      loop uncountable

    LOOP END


You can get automatic vectorization (with -O3) like this:

    bool is_sorted(const int32_t* input, size_t n) {
      int32_t sorted = true;

      for (size_t i = 1; i < n; ++i) {
        sorted &= input[i - 1] <= input[i];
      }

      return sorted;
    }
And the performance is similar to the AVX version (benchmarked on a MacBook Air early 2015):

    $ ./benchmark_avx2 1048576
    input size 1048576, iterations 10
    scalar         : 6379 us
    SSE (generic)  : 3544 us
    SSE            : 3704 us
    My example     : 2769 us
    AVX2 (generic) : 2679 us
    AVX2           : 3360 us
So I'm getting 2769us with the above 5 simple lines of code. It's just 3% slower (that might be noise).


Though this does throw away early-exit, which means it will be many times slower for unsorted cases.


Maybe have a parent function which passes in the array as 4k chunks/offsets and checks the return? 4k is a random size, there is probably a value which hits the sweet spot here.
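
Something like this, perhaps? A rough sketch built on the branchless is_sorted above; the 4096-element chunk size is just a placeholder to tune:

    #include <cstddef>
    #include <cstdint>

    // Branchless inner loop from above, applied to one chunk.
    static bool is_sorted_chunk(const int32_t* input, size_t n) {
      int32_t sorted = true;
      for (size_t i = 1; i < n; ++i)
        sorted &= input[i - 1] <= input[i];
      return sorted;
    }

    // Wrapper: scan the array chunk by chunk and bail out between chunks,
    // so badly unsorted inputs still get an (amortized) early exit.
    bool is_sorted(const int32_t* input, size_t n) {
      const size_t chunk = 4096;  // placeholder, needs tuning
      for (size_t off = 0; off < n; off += chunk) {
        const size_t len = (n - off < chunk) ? (n - off) : chunk;
        if (!is_sorted_chunk(input + off, len)) return false;
        // also check the pair straddling the chunk boundary
        if (off + len < n && input[off + len - 1] > input[off + len]) return false;
      }
      return true;
    }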


True, I had hoped GCC would optimize that -- it doesn't :(.


It would make no sense for GCC to automatically add early-exit: that requires a judgement call about the vectors the function is intended to run on. A priori, the function might be intended to run on vectors that are almost always sorted, in which case the extra branch would be severely suboptimal.


I agree that the optimization isn't possible, but that's for correctness reasons. The performance penalty of the extra branch is almost negligible due to branch prediction.


Which compiler?


Only tried GCC on Godbolt


Am I reading this right:

GCC cannot vectorise because the loop terminates early in case of a non-sorted pair. It thereby contains control flow.

I guess that's kind of logical. Even if the compiler recognised the early termination of the loop as the optimisation it is, it would still have to make the decision to give up on it in favour of vectorisation (?)


Actually, this optimisation is illegal. I could line up memory such that it is illegal to read past the first location where the array is not sorted. The original code would be fine, the vectorised code would segfault.

Similar annoying problems arise when people try to be clever with C strings, and read past the null. You can write optimisations which work, but they require care.


In this case, though, the data array is supposed to be at least n elements long; it isn't terminated by the first unsorted pair. Reading past the first unsorted pair is valid as long as you don't read past n.

Is there a way to tell the compiler this? I've been trying with std::vector and std::array instead of raw pointer but without any luck. std::array also constrains you to static length.


Again the lack of a built-in array type with both data+length shows. Leaving this out of C must be the most expensive mistake in programming history, after null. Imagine all the security holes stemming from that, and now it also turns out to get in the way of optimizers. At least it's finally available with the cppcoreguidelines.
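
For reference, the pointer+length type makes the signature look like this (a sketch using C++20's std::span; in 2018 the equivalent is gsl::span from the Guidelines Support Library, spelled much the same way). Whether it actually helps the vectorizer is a separate question:

    #include <cstddef>
    #include <cstdint>
    #include <span>

    // The span carries pointer and length together, so callers can't
    // hand over a mismatched (pointer, n) pair.
    bool is_sorted(std::span<const int32_t> input) {
      int32_t sorted = true;
      for (std::size_t i = 1; i < input.size(); ++i)
        sorted &= input[i - 1] <= input[i];
      return sorted;
    }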


https://port70.net/~nsz/c/c11/n1570.html#6.7.6.3p7

"A declaration of a parameter as ''array of type'' shall be adjusted to ''qualified pointer to type'', where the type qualifiers (if any) are those specified within the [ and ] of the array type derivation. If the keyword static also appears within the [ and ] of the array type derivation, then for each call to the function, the value of the corresponding actual argument shall provide access to the first element of an array with at least as many elements as specified by the size expression."

According to this, the following syntax could be used for optimization:

    void foo(size_t n, int array[static n]) {...}
Pointers to arrays could be used too:

    #include <stdio.h>

    void foo(size_t n, int (*array)[n])
    {
        printf("array %zu, *array %zu\n", sizeof(array), sizeof(*array));
    }

    int main(void)
    {
        foo( 10, NULL); // array 8, *array 40
        foo(100, NULL); // array 8, *array 400
    }


Actually, the illegal memory access would be undefined behavior, so it's fine for the compiler to assume that it's living in a world where the segfault never happens. Thus, it can optimize away the extra reads. If this weren't allowed, it would be very hard for compilers to eliminate any unnecessary reads.

This sort of optimization reasoning can result in quite surprising behavior: http://blog.llvm.org/2011/05/what-every-c-programmer-should-...


I disagree.

If I write a routine which walks an array one element at a time, in order, up to some max index N, and also stops early when it reaches some other condition (like an element equal to zero), then I am allowed to pass in memory which is only 3 elements long, and an N greater than 3, if I know that there is a zero in the first 3 elements. The function must not be optimized to read past the zero element, regardless of the N passed in, so removing the early exit would be an invalid optimization.

There's no undefined behavior in the above that justifies that optimization.
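
A minimal illustration of that scenario (hypothetical names, not code from the article):

    #include <cstddef>

    // Walks at most n elements, but stops at the first zero.
    static std::size_t count_until_zero(const int* p, std::size_t n) {
      std::size_t i = 0;
      while (i < n && p[i] != 0) ++i;
      return i;
    }

    int main() {
      int buf[3] = {7, 0, 7};  // only 3 valid elements
      // Legal even though 100 > 3: the loop must stop at buf[1], so a
      // transformation that reads past the zero would be invalid.
      return static_cast<int>(count_until_zero(buf, 100));
    }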


No, not in this case. The original function is well defined and has no undefined behaviour (in the case I describe) as it would return before it reached "bad memory". The optimised version is what reaches further through memory (while vectorising).


Oh, good point, I didn't read what you wrote carefully enough.


The compiler is allowed to eliminate unnecessary reads; but vectorization requires introducing additional reads. That's not allowed in general. In this case the compiler could vectorize the loop, though it would have to take care that the additional reads are within the same 4K page as the original reads, to prevent introducing segfaults. But that's usually the case for vectorized reads, as long as the compiler takes care of alignment.
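
A sketch of that page-boundary reasoning, assuming 4 KiB pages and a 16-byte vector load (the constants are hard-coded for illustration, not queried from the OS):

    #include <cstdint>

    // True if a 16-byte load starting at p touches only bytes in the
    // same 4 KiB page as *p, i.e. it cannot fault if *p itself is readable.
    static bool load16_stays_in_page(const void* p) {
      return (reinterpret_cast<std::uintptr_t>(p) & 4095u) <= 4096u - 16u;
    }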


Even if you adjust it to not terminate early, it finds other reasons (doesn't vectorise boolean operations).


I think the reduction in the number of branches inside the loop is a significant factor here, i.e. the if statement is performed once per 4, 8 or 16 elements instead of once per element. Branches invoke branch prediction logic, which is non-trivial, so even though it may not slow execution it may increase power consumption.

On this basis another way of approaching the problem is to loop over the whole array and return the result at the end, but this of course gives up the early exit (which saves roughly N/2 iterations on average if the first unsorted pair sits at a random position). A compromise might be to loop over short sub-spans of the array, and do an early-exit test at the end of each sub-span.

A good sub-span length for scalar code might be around 16, partly because we hit the law of diminishing returns for longer spans.

Also I think is_sorted_asc() or is_sorted_ascending() might be a better name if this were for a function in a general purpose library.


> A compromise might be to loop over short sub-spans of the array, and do an exit early test at the end of each sub-span.

Another good use of Duff's device!



It seems that the early exit (the return false) in the middle of the loop prohibits the compiler from vectorizing the loop. The compiler can't know whether part of the memory is inaccessible, so reading memory without being sure the loop would actually have run that far is illegal.

If you introduce a result variable and set it during the loop, the compiler can vectorize the loop. At least icc does, but I didn't play with the compiler settings of the other compilers too much.

Also compilers can still exit the loop early and at least MSVC does, because a segfault is UB and can be "optimized away".


FWIW HeroicKatora gives an improved version on Reddit: https://www.reddit.com/r/cpp/comments/8bkaj3/is_sorted_using...


Using a slightly more arcane (but not crazy) method of determining sortedness - https://godbolt.org/g/MKN9HP - clang and icc seem to have no problem vectorising the inner loop.

GCC manages it too, but emits a huge amount of code. And setting -Os seems to stop it from vectorising the code. Shame.

Edit: replaced with a more correct version.


It would be interesting to see how the SIMD versions work for small arrays. I suspect in this case the naive version is better, and this could be the reason why compilers do not convert the code to SIMD instructions...


SIMD tends to always win for pretty small arrays, like around ~25 elements, and probably any size that is evenly divisible by the vector width.


Generic code yields better results than SSE/AVX optimized ones. I wonder why that could be.


It's just replacing extra loads with a bunch of other instructions. In reality loads are cheap (and cached), it turns out to be cheaper than doing permutes to shuffle the vectors around.
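
The overlapping-loads idea looks roughly like this (a sketch with SSE intrinsics, not the article's exact code): each step does two unaligned loads one element apart instead of one load plus a permute.

    #include <immintrin.h>
    #include <cstddef>
    #include <cstdint>

    bool is_sorted_overlapping_loads(const int32_t* a, size_t n) {
      __m128i any_gt = _mm_setzero_si128();
      size_t i = 0;
      for (; i + 4 < n; i += 4) {
        const __m128i curr = _mm_loadu_si128((const __m128i*)(a + i));
        const __m128i next = _mm_loadu_si128((const __m128i*)(a + i + 1));
        // lanes where a[j] > a[j+1] become all-ones
        any_gt = _mm_or_si128(any_gt, _mm_cmpgt_epi32(curr, next));
      }
      if (_mm_movemask_epi8(any_gt) != 0) return false;
      for (i = (i ? i : 1); i < n; ++i)  // scalar tail
        if (a[i - 1] > a[i]) return false;
      return true;
    }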


    i += 7;

Wouldn't this cause a sizable performance hit due to being misaligned most of the time?


Author here. This was true several generations ago (core2, for instance), now the performance penalty is negligible.


What about ARM?


Sorry, have no idea.


No: many (most?) modern SIMD instructions don’t require alignment. From the Intel Intrinsics Guide (can’t figure out how to link directly to it, sorry) on _mm_loadu_si128:

> Load 128-bits of integer data from memory into dst. mem_addr does not need to be aligned on any particular boundary.


Doesn't need, but is there a performance difference? I seem to remember there is no difference between _mm_load_si128 and _mm_loadu_si128 on modern CPUs, but I'm not sure.


loadu is not the best example because its sole purpose is loading unaligned data. There is a separate load for aligned data.


Another possible implementation, which is log^2(n): https://en.wikipedia.org/wiki/Bitonic_sorter


No, you can’t even read the whole array in O(log^2(n)). It’s not possible to do better than O(n) without “cheating.”


I think the trick is you can run it in parallel. Which under ideal circumstances may give that kind of performance. But this is the first I've heard of this algorithm.


I always miss multithreaded benchmarks when using SSE/AVX instructions. AFAIK AVX processing units are oversubscribed; there are fewer of them than CPU cores.

I can imagine that running AVX is_sorted (or any other AVX procedure) in multiple threads would actually be slower than running the non-vectorized procedure.

Of course, that's my purely anecdotal opinion.


> AFAIK AVX processing units are oversubscribed, there are less of them than CPU cores.

Typically I see a throughput of about 2 SIMD instructions per cycle per core on Intel CPUs. SIMD execution units are not shared between cores in any way.

Clock throttling might happen, but SIMD is usually still a pretty huge net win.


Here is an experience report on how AVX-512 instructions impact CPU performance https://blog.cloudflare.com/on-the-dangers-of-intels-frequen...


Note that Skylake-SP and Xeon-W/i7/i9s behave very differently in this regard. On Skylake-SP (e.g. Xeon Silvers like they're using) it's over a 50% clockrate reduction when AVX-512 is in the pipe; on Xeon-W and the HEDT chips it's more like 10-20%.

https://twitter.com/InstLatX64/status/934093081514831872


On HEDT (and mainstream desktop) you can actually adjust AVX offset manually. With 0 offset and 5GHz clock, you can consume 500W (in Prime95 AVX) :D


I think AVX units on Intel cores have always been separate (i.e. not shared).

AMD Bulldozer processors have a shared floating point unit and some early pipeline stages like the instruction decoder per pair of cores. AMD Zen processors have since reverted to a more conventional design.


It's not that the execution units are shared, but that the frequency is throttled when AVX2 or AVX512 instructions are encountered. In general AVX512 is not yet worthwhile when overall system throughput is at stake and you don't have very vector-heavy workloads. AVX2 is worthwhile most of the time. As of Haswell, one lane of the AVX2 units is powered down when not in use, and instructions execute more slowly when the CPU first encounters them; it executes them basically by stitching together two SSE operations. But this doesn't necessarily make it slower than SSE, just that the performance benefits might not materialize if there isn't enough AVX2 code being executed. I don't know if Skylake works that way as well.


That doesn't make sense to me. On Haswell/Skylake, ports 0 and 1 do most of the vector lifting. I don't think cores share any of the vector hardware.

Or are you trying to make some claim about hyperthreading?


I simply may be wrong :) There is another AVX-related thing [0]: the CPU is underclocked in Turbo when AVX is used.

[0]: https://www.anandtech.com/show/11544/intel-skylake-ep-vs-amd...


I'm pretty sure each core has its own AVX units for the version it supports.

AFAIK the only difference in AVX performance (ignoring clock speed) is that Gold/Platinum (5000+) Xeons have 2x512 FMA ports available, but everything else only supports FMA on port 1/2 and not 5. Stabbing a bit in the dark here, it's been a bit since I was looking at this stuff.


That's only correct for AVX3 instructions, i.e. the 512-bit-wide vector unit.


This assumes unaligned access is cheap.


See this thread (posted 7 hours before your comment): https://news.ycombinator.com/item?id=16842012


Which is OK in this case, but for any of the arch-independent (SWAR) bit hacks it would probably be better to have a loop to align first.
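
Sketch of such an alignment prologue: handle elements one at a time until the pointer hits a 16-byte boundary, then hand off to the wide loop (elided here, with a scalar stand-in):

    #include <cstddef>
    #include <cstdint>

    bool is_sorted_aligned(const int32_t* a, size_t n) {
      size_t i = 1;
      // scalar prologue: advance until a + i is 16-byte aligned
      while (i < n && (reinterpret_cast<std::uintptr_t>(a + i) & 15u) != 0) {
        if (a[i - 1] > a[i]) return false;
        ++i;
      }
      // ... aligned wide loop (SIMD or SWAR) would go here ...
      for (; i < n; ++i)  // scalar stand-in for the wide loop
        if (a[i - 1] > a[i]) return false;
      return true;
    }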


Unfortunately SSE doesn't speed up huge data much. It's only fast when everything's in the cache.


I coded up a 5x5 Gaussian blur using NEON instructions. I used the standard separable filter technique, one horizontal pass and then one vertical pass. Benchmarking effectively measured 2x the memory round-trip time on the Raspberry Pi. My second implementation rotated 8x16 image blocks using register permute instructions. Only the block borders needed to be cached, meaning everything shared between blocks for a 640x480 image fit in L1. Despite being substantially more complex, I cut the execution time by 30%.

Lesson learned: sometimes the easiest performance gains are found by not being naive about memory access. The extra instructions were inconsequential.


Curious if you have encountered Halide (http://halide-lang.org/) and if so, your impression or thoughts. The main advantage is being able to easily express optimizations like the one you mentioned, allowing you to experiment with different optimization parameters/ideas more quickly.


I actually drew inspiration from Halide. I think it performs well because memory access / cache misses are more expensive than a few extra instructions, but it has its limits. For example, I'm not sure how you'd implement the in-register 90-degree rotation in Halide. Therefore you'd probably wind up with an extra round trip to L1.

Impressive, but definitely beatable given sufficient free time.


Not sure what you mean, I can process non-cached sequential data from RAM at over 20 GB/s by using SSE/AVX.

There's no chance you could achieve the same by using scalar instructions. SIMD can access memory a lot faster than scalar.

Random access is another matter. The trick is of course to avoid non-sequential access patterns.


Isn't the bottleneck in any sequential access case the memory bandwidth?

IOW: Are the scalar instructions slower than memory bandwidth?


You can saturate memory bandwidth without SIMD, since you can issue at least two 8-byte scalar loads per cycle. It does not leave much room for actual processing, though.


Isn't that just a prefetcher win? AVX/SSE lets you do more computational work per cycle but I don't see how it would improve memory bandwidth/access.


The CPU's load/store unit is usually designed with SIMD in mind, and the access width is 16 bytes (or 32 bytes for Haswell-and-later Intel CPUs). This means you get more bandwidth by using SIMD.


Do you have any explanation of why you think so? And how much is huge data for you?

I work with image processing, >1 GiB/s. We optimize for each platform by hand, many embedded platforms but also x86_64. Most of our work is detectors and lossless compression, and we could not do these things in realtime if it weren't for SIMD/MIMD.


I was not very specific. What I meant to get at was that if your bottleneck is memory, then optimizing instructions is not going to help. is_sorted is an example of an algorithm that's far more dependent on memory than on instruction throughput.

If your bottleneck isn't memory, then yeah, SIMD is a real boon.


It’s not all about compute. SIMD instructions allow you to load more bits per instruction cycle. When you are doing sequential access on a modern processor, this can make a huge difference. It allows you to use the memory bandwidth to its full capacity.


It's rare to have a meaningful algorithm where memory is the absolute only bottleneck. Not much qualifies aside from memcpy, and even that can see a benefit from SIMD tuning on many systems.

Especially since high-performance software is glad to have even a 1% boost in speed.



