When compiling with GCC, the option `-fopt-info-vec-all` gives you information about the vectorization of the code. In this case, GCC reports:
// the for block
<source>:10:24: note: ===== analyze_loop_nest =====
<source>:10:24: note: === vect_analyze_loop_form ===
<source>:10:24: note: not vectorized: control flow in loop.
<source>:10:24: note: bad loop form.
<source>:5:6: note: vectorized 0 loops in function.
// the if block inside the for block
<source>:11:9: note: got vectype for stmt: _4 = *_3;
const vector(16) int
<source>:11:9: note: got vectype for stmt: _8 = *_7;
const vector(16) int
<source>:11:9: note: === vect_analyze_data_ref_accesses ===
<source>:11:9: note: not vectorized: no grouped stores in basic block.
<source>:11:9: note: ===vect_slp_analyze_bb===
<source>:11:9: note: ===vect_slp_analyze_bb===
<source>:11:9: note: === vect_analyze_data_refs ===
<source>:11:9: note: not vectorized: not enough data-refs in basic block.
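For context, a scalar implementation along these lines (a sketch; the exact code under discussion may differ in details) is the kind of loop these diagnostics refer to. The early `return` is the "control flow in loop" that makes GCC give up:

```cpp
#include <cstddef>
#include <cstdint>

// Sketch of the scalar version: walk adjacent pairs and bail out at the
// first violation.  The early return is the control flow the vectorizer
// complains about.
bool is_sorted(const int32_t *a, size_t n) {
    for (size_t i = 0; i + 1 < n; ++i)
        if (a[i] > a[i + 1])
            return false;   // early exit
    return true;
}
```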
Edit: with the Intel compiler, using `-qopt-report=5 -qopt-report-phase=vec -qopt-report-file=stdout`, the report is:
Begin optimization report for: is_sorted(const int32_t *, size_t)
Report from: Vector optimizations [vec]
LOOP BEGIN at <source>(12,5)
remark #15324: loop was not vectorized: unsigned types for induction
variable and/or for lower/upper iteration bounds make
loop uncountable
LOOP END
And the performance is similar to the AVX version (benchmarked on an early-2015 MacBook Air):
$ ./benchmark_avx2 1048576
input size 1048576, iterations 10
scalar : 6379 us
SSE (generic) : 3544 us
SSE : 3704 us
My example : 2769 us
AVX2 (generic) : 2679 us
AVX2 : 3360 us
So I'm getting 2769 us with the above 5 simple lines of code. That's just ~3% slower than the generic AVX2 version (which might be noise).
Maybe have a parent function which passes the array in as 4k chunks/offsets and checks the return value? 4k is an arbitrary size; there is probably a value that hits the sweet spot here.
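A sketch of that idea (the helper names and the 4k chunk size are mine, purely illustrative): the inner loop has no early exit, so the compiler is free to vectorize it, while the parent still bails out between chunks. Chunks overlap by one element so the pair straddling a boundary is still checked.

```cpp
#include <cstddef>
#include <cstdint>

// Branch-free check of one chunk: no early exit inside the loop.
static bool chunk_sorted(const int32_t *a, size_t n) {
    bool ok = true;
    for (size_t i = 0; i + 1 < n; ++i)
        ok &= (a[i] <= a[i + 1]);
    return ok;
}

// Parent: scan fixed-size chunks and stop at the first one that reports a
// violation.  4096 elements is an arbitrary choice, as noted above.
bool is_sorted_chunked(const int32_t *a, size_t n) {
    const size_t chunk = 4096;
    for (size_t i = 0; i + 1 < n; i += chunk) {
        size_t remaining = n - i;
        size_t len = remaining < chunk + 1 ? remaining : chunk + 1;
        if (!chunk_sorted(a + i, len))
            return false;
    }
    return true;
}
```

Whether the inner loop actually gets vectorized still depends on the compiler handling the boolean reduction; accumulating violations with a bitwise OR or a counter is a common alternative formulation.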
It would make no sense for GCC to automatically add an early exit: that requires a judgement call about the inputs the function is intended to run on. A priori, the function might be intended to run on arrays that are almost always sorted, in which case the extra branch would be severely suboptimal.
I agree that the optimization isn't possible, but that's for correctness reasons. The performance penalty of the extra branch is almost negligible due to branch prediction.
GCC cannot vectorise because the loop terminates early in case of a non-sorted pair. It thereby contains control flow.
I guess that's kind of logical.
Even if the compiler recognised the early termination of the loop as the optimisation it is, it would still have to make the decision to give up on it in favour of vectorisation (?)
Actually, this optimisation is illegal. I could line up memory such that it is illegal to read past the first ___location where the array is not sorted. The original code would be fine; the vectorised code would segfault.
Similar annoying problems arise when people try to be clever with C strings, and read past the null. You can write optimisations which work, but they require care.
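A minimal sketch of how such a layout could be set up (POSIX mmap/mprotect, error handling omitted): the unsorted pair sits at the very end of a readable page, the next page is inaccessible, and n is larger than the readable region. The scalar loop is well defined because it returns at the unsorted pair; anything that keeps reading past it would fault.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <sys/mman.h>
#include <unistd.h>

bool is_sorted(const int32_t *a, size_t n);   // the function under discussion

int main() {
    long page = sysconf(_SC_PAGESIZE);
    // One readable/writable page followed by an inaccessible guard page.
    char *mem = static_cast<char *>(mmap(nullptr, 2 * page,
                                         PROT_READ | PROT_WRITE,
                                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
    mprotect(mem + page, page, PROT_NONE);

    int32_t *a = reinterpret_cast<int32_t *>(mem);
    size_t elems = page / sizeof(int32_t);
    for (size_t i = 0; i < elems; ++i) a[i] = static_cast<int32_t>(i);
    a[elems - 1] = -1;   // unsorted pair right at the end of the readable page

    // n is deliberately larger than the readable region; this is fine for the
    // scalar loop only because it returns at the unsorted pair.
    std::printf("%d\n", is_sorted(a, 4 * elems));
    return 0;
}
```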
In this case, though, the length of the data array is supposed to be no less than n, not terminated by the first unsorted pair. Reading past the first unsorted pair is valid as long as you don't read past n.
Is there a way to tell the compiler this? I've been trying with std::vector and std::array instead of a raw pointer, but without any luck. std::array also constrains you to a static length.
Again the lack of a built-in array type carrying both data and length shows. Leaving this out of C must be the most expensive mistake in programming history, after null. Imagine all the security holes stemming from that; and now it turns out it gets in the way of optimizers too. At least it's finally available with the C++ Core Guidelines' span.
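For reference, the facility alluded to is span (gsl::span in the Guidelines Support Library, std::span since C++20), which carries pointer and length together; a minimal sketch:

```cpp
#include <cstddef>
#include <cstdint>
#include <span>   // C++20; pre-C++20 code would use gsl::span instead

// Same loop as before, but the pointer and the length travel together.
bool is_sorted(std::span<const int32_t> a) {
    for (std::size_t i = 0; i + 1 < a.size(); ++i)
        if (a[i] > a[i + 1])
            return false;
    return true;
}
```

This mainly fixes the interface, though; by itself it doesn't remove the early-exit control flow that blocks the vectorizer.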
"A declaration of a parameter as ''array of type'' shall be adjusted to ''qualified pointer to type'', where the type qualifiers (if any) are those specified within the [ and ] of the array type derivation. If the keyword static also appears within the [ and ] of the array type derivation, then for each call to the function, the value of the corresponding actual argument shall provide access to the first element of an array with at least as many elements as specified by the size expression."
According to this, syntax along the following lines could be used to promise the compiler that at least n elements are accessible (note that `static` in an array parameter declarator is C99 syntax, not valid C++, and the length parameter has to come first so the size expression can refer to it): `bool is_sorted(size_t n, const int32_t a[static n]);`
Actually, the illegal memory access would be undefined behavior, so it's fine for the compiler to assume that it's living in a world where the segfault never happens. Thus, it can optimize away the extra reads. If this weren't allowed, it would be very hard for compilers to eliminate any unnecessary reads.
If I write a routine which walks an array one element at a time, in order, up to some max index N, and also stops early when it reaches some other condition (like an element equal to zero), then I am allowed to pass in memory which is only 3 elements long, and an N greater than 3, if I know that there is a zero in the first 3 elements. The function must not be optimized to read past the zero element, regardless of the N passed in, so removing the early exit would be an invalid optimization.
There's no undefined behavior in the above that justifies that optimization.
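A small sketch of that scenario (names are mine): the buffer holds only 3 elements, the caller passes a much larger N, and the call is still well defined because the loop stops at the zero before it could ever run off the end of the buffer.

```cpp
#include <cstddef>

// Walk the array in order, up to at most n elements, stopping early at the
// first zero.  Later elements are never read once the zero is found.
static size_t index_of_first_zero(const int *a, size_t n) {
    for (size_t i = 0; i < n; ++i)
        if (a[i] == 0)
            return i;   // early exit
    return n;
}

int main() {
    int buf[3] = {7, 0, 9};
    // Well defined: the loop returns at i == 1, so passing n == 100 is legal
    // even though buf only has 3 elements.  An "optimization" that reads
    // ahead of the zero could touch memory that was never allocated.
    return static_cast<int>(index_of_first_zero(buf, 100));
}
```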
No, not in this case. The original function is well defined and has no undefined behaviour (in the case I describe) as it would return before it reached "bad memory". The optimised version is what reaches further through memory (while vectorising).
The compiler is allowed to eliminate unnecessary reads; but vectorization requires introducing additional reads. That's not allowed in general.
In this case the compiler could vectorize the loop, though it would have to take care that the additional reads are within the same 4K page as the original reads, to prevent introducing segfaults. But that's usually the case for vectorized reads, as long as the compiler takes care of alignment.
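To illustrate the alignment point, a hand-vectorized sketch (SSE2 intrinsics; the function name is mine): the vector of "next" elements is loaded with an aligned 16-byte load, and an aligned 16-byte load never crosses a page boundary, so every extra element it reads lives on a page the scalar loop would have touched anyway.

```cpp
#include <cstddef>
#include <cstdint>
#include <immintrin.h>

bool is_sorted_sse(const int32_t *a, size_t n) {
    size_t i = 0;
    // Scalar prologue until a + i + 1 is 16-byte aligned (or we run out).
    while (i + 1 < n && (reinterpret_cast<uintptr_t>(a + i + 1) & 15) != 0) {
        if (a[i] > a[i + 1]) return false;
        ++i;
    }
    // Compare a[i..i+3] against a[i+1..i+4].  The second load is aligned, so
    // it stays within one page and cannot introduce a new fault; the first
    // load may be unaligned, but its bytes all fall on the pages holding
    // a[i] or a[i+1], which the scalar loop reads too.
    for (; i + 4 < n; i += 4) {
        __m128i cur = _mm_loadu_si128(reinterpret_cast<const __m128i *>(a + i));
        __m128i nxt = _mm_load_si128(reinterpret_cast<const __m128i *>(a + i + 1));
        if (_mm_movemask_epi8(_mm_cmpgt_epi32(cur, nxt)) != 0)
            return false;   // some lane had a[i+k] > a[i+k+1]
    }
    // Scalar epilogue for the remaining tail.
    for (; i + 1 < n; ++i)
        if (a[i] > a[i + 1]) return false;
    return true;
}
```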