When compiling with GCC, the option `-fopt-info-vec-all` gives you information about the vectorization of the code. In this case, GCC reports:
// the for block
<source>:10:24: note: ===== analyze_loop_nest =====
<source>:10:24: note: === vect_analyze_loop_form ===
<source>:10:24: note: not vectorized: control flow in loop.
<source>:10:24: note: bad loop form.
<source>:5:6: note: vectorized 0 loops in function.
// the if block inside the for block
<source>:11:9: note: got vectype for stmt: _4 = *_3;
const vector(16) int
<source>:11:9: note: got vectype for stmt: _8 = *_7;
const vector(16) int
<source>:11:9: note: === vect_analyze_data_ref_accesses ===
<source>:11:9: note: not vectorized: no grouped stores in basic block.
<source>:11:9: note: ===vect_slp_analyze_bb===
<source>:11:9: note: ===vect_slp_analyze_bb===
<source>:11:9: note: === vect_analyze_data_refs ===
<source>:11:9: note: not vectorized: not enough data-refs in basic block.
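For context, a scalar implementation along these lines (a sketch; the exact code under discussion may differ in details) is the kind of loop these diagnostics refer to. The early `return` is the "control flow in loop" that makes GCC give up:

```cpp
#include <cstddef>
#include <cstdint>

// Sketch of the scalar version: walk adjacent pairs and bail out at the
// first violation.  The early return is the control flow the vectorizer
// complains about.
bool is_sorted(const int32_t *a, size_t n) {
    for (size_t i = 0; i + 1 < n; ++i)
        if (a[i] > a[i + 1])
            return false;   // early exit
    return true;
}
```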
Edit: with the Intel compiler, using `-qopt-report=5 -qopt-report-phase=vec -qopt-report-file=stdout`, the report is:
Begin optimization report for: is_sorted(const int32_t *, size_t)
Report from: Vector optimizations [vec]
LOOP BEGIN at <source>(12,5)
remark #15324: loop was not vectorized: unsigned types for induction
variable and/or for lower/upper iteration bounds make
loop uncountable
LOOP END
And the performance is similar to the AVX version (benchmarked on an early-2015 MacBook Air):
$ ./benchmark_avx2 1048576
input size 1048576, iterations 10
scalar : 6379 us
SSE (generic) : 3544 us
SSE : 3704 us
My example : 2769 us
AVX2 (generic) : 2679 us
AVX2 : 3360 us
So I'm getting 2769 us with the above 5 simple lines of code. That's just ~3% slower than the generic AVX2 version (which might be noise).
Maybe have a parent function which passes the array in as 4k chunks/offsets and checks the return value? 4k is an arbitrary size; there is probably a value that hits the sweet spot here.
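A sketch of that idea (the helper names and the 4k chunk size are mine, purely illustrative): the inner loop has no early exit, so the compiler is free to vectorize it, while the parent still bails out between chunks. Chunks overlap by one element so the pair straddling a boundary is still checked.

```cpp
#include <cstddef>
#include <cstdint>

// Branch-free check of one chunk: no early exit inside the loop.
static bool chunk_sorted(const int32_t *a, size_t n) {
    bool ok = true;
    for (size_t i = 0; i + 1 < n; ++i)
        ok &= (a[i] <= a[i + 1]);
    return ok;
}

// Parent: scan fixed-size chunks and stop at the first one that reports a
// violation.  4096 elements is an arbitrary choice, as noted above.
bool is_sorted_chunked(const int32_t *a, size_t n) {
    const size_t chunk = 4096;
    for (size_t i = 0; i + 1 < n; i += chunk) {
        size_t remaining = n - i;
        size_t len = remaining < chunk + 1 ? remaining : chunk + 1;
        if (!chunk_sorted(a + i, len))
            return false;
    }
    return true;
}
```

Whether the inner loop actually gets vectorized still depends on the compiler handling the boolean reduction; accumulating violations with a bitwise OR or a counter is a common alternative formulation.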
It would make no sense for GCC to automatically add an early exit: that requires a judgement call about the inputs the function is intended to run on. A priori, the function might be intended to run on arrays that are almost always sorted, in which case the extra branch would be severely suboptimal.
I agree that the optimization isn't possible, but that's for correctness reasons. The performance penalty of the extra branch is almost negligible due to branch prediction.
GCC cannot vectorise because the loop terminates early in case of a non-sorted pair. It thereby contains control flow.
I guess that's kind of logical.
Even if the compiler recognised the early termination of the loop as the optimisation it is, it would still have to make the decision to give up on it in favour of vectorisation (?)
Actually, this optimisation is illegal. I could line up memory such that it is illegal to read past the first ___location where the array is not sorted. The original code would be fine; the vectorised code would segfault.
Similar annoying problems arise when people try to be clever with C strings, and read past the null. You can write optimisations which work, but they require care.
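A minimal sketch of how such a layout could be set up (POSIX mmap/mprotect, error handling omitted): the unsorted pair sits at the very end of a readable page, the next page is inaccessible, and n is larger than the readable region. The scalar loop is well defined because it returns at the unsorted pair; anything that keeps reading past it would fault.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <sys/mman.h>
#include <unistd.h>

bool is_sorted(const int32_t *a, size_t n);   // the function under discussion

int main() {
    long page = sysconf(_SC_PAGESIZE);
    // One readable/writable page followed by an inaccessible guard page.
    char *mem = static_cast<char *>(mmap(nullptr, 2 * page,
                                         PROT_READ | PROT_WRITE,
                                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
    mprotect(mem + page, page, PROT_NONE);

    int32_t *a = reinterpret_cast<int32_t *>(mem);
    size_t elems = page / sizeof(int32_t);
    for (size_t i = 0; i < elems; ++i) a[i] = static_cast<int32_t>(i);
    a[elems - 1] = -1;   // unsorted pair right at the end of the readable page

    // n is deliberately larger than the readable region; this is fine for the
    // scalar loop only because it returns at the unsorted pair.
    std::printf("%d\n", is_sorted(a, 4 * elems));
    return 0;
}
```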
In this case, though, the length of the data array is supposed to be no less than n, not terminated by the first unsorted pair. Reading past the first unsorted pair is valid as long as you don't read past n.
Is there a way to tell the compiler this? I've been trying with std::vector and std::array instead of a raw pointer, but without any luck. std::array also constrains you to a static length.
Again the lack of a built-in array type carrying both data and length shows. Leaving this out of C must be the most expensive mistake in programming history, after null. Imagine all the security holes stemming from that; and now it turns out it gets in the way of optimizers too. At least it's finally available with the C++ Core Guidelines' span.
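For reference, the facility alluded to is span (gsl::span in the Guidelines Support Library, std::span since C++20), which carries pointer and length together; a minimal sketch:

```cpp
#include <cstddef>
#include <cstdint>
#include <span>   // C++20; pre-C++20 code would use gsl::span instead

// Same loop as before, but the pointer and the length travel together.
bool is_sorted(std::span<const int32_t> a) {
    for (std::size_t i = 0; i + 1 < a.size(); ++i)
        if (a[i] > a[i + 1])
            return false;
    return true;
}
```

This mainly fixes the interface, though; by itself it doesn't remove the early-exit control flow that blocks the vectorizer.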
"A declaration of a parameter as ''array of type'' shall be adjusted to ''qualified pointer to type'', where the type qualifiers (if any) are those specified within the [ and ] of the array type derivation. If the keyword static also appears within the [ and ] of the array type derivation, then for each call to the function, the value of the corresponding actual argument shall provide access to the first element of an array with at least as many elements as specified by the size expression."
According to this, syntax along the following lines could be used to promise the compiler that at least n elements are accessible (note that `static` in an array parameter declarator is C99 syntax, not valid C++, and the length parameter has to come first so the size expression can refer to it): `bool is_sorted(size_t n, const int32_t a[static n]);`
Actually, the illegal memory access would be undefined behavior, so it's fine for the compiler to assume that it's living in a world where the segfault never happens. Thus, it can optimize away the extra reads. If this weren't allowed, it would be very hard for compilers to eliminate any unnecessary reads.
If I write a routine which walks an array one element at a time, in order, up to some max index N, and also stops early when it reaches some other condition (like an element equal to zero), then I am allowed to pass in memory which is only 3 elements long, and an N greater than 3, if I know that there is a zero in the first 3 elements. The function must not be optimized to read past the zero element, regardless of the N passed in, so removing the early exit would be an invalid optimization.
There's no undefined behavior in the above that justifies that optimization.
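A small sketch of that scenario (names are mine): the buffer holds only 3 elements, the caller passes a much larger N, and the call is still well defined because the loop stops at the zero before it could ever run off the end of the buffer.

```cpp
#include <cstddef>

// Walk the array in order, up to at most n elements, stopping early at the
// first zero.  Later elements are never read once the zero is found.
static size_t index_of_first_zero(const int *a, size_t n) {
    for (size_t i = 0; i < n; ++i)
        if (a[i] == 0)
            return i;   // early exit
    return n;
}

int main() {
    int buf[3] = {7, 0, 9};
    // Well defined: the loop returns at i == 1, so passing n == 100 is legal
    // even though buf only has 3 elements.  An "optimization" that reads
    // ahead of the zero could touch memory that was never allocated.
    return static_cast<int>(index_of_first_zero(buf, 100));
}
```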
No, not in this case. The original function is well defined and has no undefined behaviour (in the case I describe) as it would return before it reached "bad memory". The optimised version is what reaches further through memory (while vectorising).
The compiler is allowed to eliminate unnecessary reads; but vectorization requires introducing additional reads. That's not allowed in general.
In this case the compiler could vectorize the loop, though it would have to take care that the additional reads are within the same 4K page as the original reads, to prevent introducing segfaults. But that's usually the case for vectorized reads, as long as the compiler takes care of alignment.
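To illustrate the alignment point, a hand-vectorized sketch (SSE2 intrinsics; the function name is mine): the vector of "next" elements is loaded with an aligned 16-byte load, and an aligned 16-byte load never crosses a page boundary, so every extra element it reads lives on a page the scalar loop would have touched anyway.

```cpp
#include <cstddef>
#include <cstdint>
#include <immintrin.h>

bool is_sorted_sse(const int32_t *a, size_t n) {
    size_t i = 0;
    // Scalar prologue until a + i + 1 is 16-byte aligned (or we run out).
    while (i + 1 < n && (reinterpret_cast<uintptr_t>(a + i + 1) & 15) != 0) {
        if (a[i] > a[i + 1]) return false;
        ++i;
    }
    // Compare a[i..i+3] against a[i+1..i+4].  The second load is aligned, so
    // it stays within one page and cannot introduce a new fault; the first
    // load may be unaligned, but its bytes all fall on the pages holding
    // a[i] or a[i+1], which the scalar loop reads too.
    for (; i + 4 < n; i += 4) {
        __m128i cur = _mm_loadu_si128(reinterpret_cast<const __m128i *>(a + i));
        __m128i nxt = _mm_load_si128(reinterpret_cast<const __m128i *>(a + i + 1));
        if (_mm_movemask_epi8(_mm_cmpgt_epi32(cur, nxt)) != 0)
            return false;   // some lane had a[i+k] > a[i+k+1]
    }
    // Scalar epilogue for the remaining tail.
    for (; i + 1 < n; ++i)
        if (a[i] > a[i + 1]) return false;
    return true;
}
```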