From the godbolt link, it looked like most of the vector operations were not getting inlined. You'll need -O2 or higher to have representative results.
I could counter by just writing a Virgil program and turning off bounds check with a compiler flag. We could stare at the machine code together; it's literally an additional load, compare, and branch. The Virgil compiler loves eliminating bounds checks when it can. I know your original comment was in the context of the STL, but it really muddies the waters to see huge piles of code get inlined or not depending on compiler flags. Machine code is what matters.
Regardless, this is still microbenchmarking. Maybe matrix multiply is an important kernel, but we need to benchmark whole programs. If I turn off bounds checks in a big program like the Virgil compiler, I cannot measure more than about 1.5% performance difference.
I just used godbolt to quickly share the code. On my computer I tried with -Ofast -march=native (Broadwell).
> I could counter by just writing a Virgil program and turning off bounds check with a compiler flag. We could stare at the machine code together; it's literally an additional load, compare, and branch.
This sounds like Virgil is not vectorizing, which makes the comparison much less useful.
That's just even more confounding variables. We are, after all, not even doing experiments with Rust vectors, which is what your original comment was about. You wrote examples in C++ and we went down a wormhole already, but I let it slip since at least it was empirical. But I think we shouldn't get distracted with even more side alleys now. Bounds checks are literally just a couple machine instructions, and often eliminated by the compiler anyway, which enables vectorization.
> We are, after all, not even doing experiments with Rust vectors,
they are very close to the C++ ones, down to the stack unwinding to report the panic in case of a bounds error, if I'm not mistaken.
> That's just even more confounding variables.
No they are not. We are trying to see how much bounds checks cost. If you compare between suboptimal programs the comparison is meaningless (or rather not interesting to anyone) - the starting point has to be the absolute best performance it is possible to get, and then you add the worst-case bounds checking on top of that (I'm happy for you if you never have to worry about the worst case though!)
> Bounds checks are literally just a couple machine instructions, and often eliminated by the compiler anyway
please provide a source for this? sure, if you use spans as other commenters mentioned, that moves the checking to span creation time, but that only works for the simplest cases where you are going to access linearly - and I would say that it's a library feature rather than a compiler one.
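For the record, here is what "moving the check to span creation" looks like in Rust terms (a sketch I'm adding for concreteness, not code from anyone upthread): slicing does one range check up front, and the iteration over the resulting slice needs no per-element check.

```rust
// One bounds check at slice creation; none inside the loop.
fn sum_window(v: &[f64], start: usize, len: usize) -> f64 {
    let w = &v[start..start + len]; // panics here if the range is bad
    w.iter().sum()                  // check-free linear traversal
}
```

As the text says, this only helps when the access pattern really is a contiguous range.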
for(double v : vec) { } // or any sub-span you can take from it
in C++ also needs no bounds checks by design, but this kind of construct utterly does not matter for HPC workloads.
I can look into my pdfs folder and bring out dozens of papers where the core algorithms all use funky non-linear indexing schemes where you cannot just iterate over a range (algorithms based on accessing i, i-1, i+1, or accessing i and N-i, or accessing even/odd values, etc. etc.) - how would you implement an FFT, for instance? This is the code that matters!
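To make the access pattern concrete, here is a sketch of the butterfly indexing an FFT-family transform uses (this particular one is the twiddle-free Hadamard-style skeleton, not a full complex FFT): each stage touches a[i] and a[i + half] where half changes per stage, so there is no single linear range for a checker to hoist trivially.

```rust
// Butterfly passes over strided pairs (a[i], a[i + half]);
// `half` halves each stage, so the stride is non-constant.
fn butterflies(a: &mut [f64]) {
    let n = a.len();
    let mut half = n / 2;
    while half >= 1 {
        let mut start = 0;
        while start < n {
            for i in start..start + half {
                let (x, y) = (a[i], a[i + half]); // two strided reads
                a[i] = x + y;
                a[i + half] = x - y;
            }
            start += 2 * half;
        }
        half /= 2;
    }
}
```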
A compiler can do loop versioning for that. And they do. Hotspot C2 (and Graal, too) does a ton of loop optimizations, partitioning the index space into in-bounds and potentially out-of-bounds ranges, unrolling loops, peeling the first iteration off a loop, generating code before a loop to dispatch to a customized version if everything will be in bounds.
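A hand-written sketch of the loop-versioning idea described above (the function name and structure are illustrative, not any compiler's actual output): emit a guard before the loop and dispatch to a check-free version when the whole index range is provably in bounds.

```rust
// Guard once, then run one of two loop versions.
fn saxpy(y: &mut [f64], x: &[f64], a: f64, n: usize) {
    if n <= y.len() && n <= x.len() {
        // In-bounds version: re-slicing to length n proves i < len
        // for every i in 0..n, so per-iteration checks can be elided.
        let (y, x) = (&mut y[..n], &x[..n]);
        for i in 0..n {
            y[i] += a * x[i];
        }
    } else {
        // Fallback version: keep every check (a bad index panics).
        for i in 0..n {
            y[i] += a * x[i];
        }
    }
}
```

A compiler doing this automatically produces essentially the same shape, which is exactly why the measured "cost of a bounds check" depends so heavily on whether the guard fires.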
When a compiler is doing that sophisticated of loop transforms, you are not measuring the cost of bounds checks anymore, you are measuring a whole host of other things. And sometimes if a loop is just a little different, the results can be disastrously bad or miraculously good. Which is why microbenchmarking is so fraught with peril.

A microbenchmark might be written assuming a simple model of the computation and then a sophisticated compiler with analyses never dreamed of comes along and completely reorganizes the code. And it cuts both ways; a C++ compiler might do some unrolling, fusion, tiling, vectorization, software pipelining or interchange on your loops and suddenly you aren't measuring bounds check cost anymore, but loop optimization heuristics that have been perturbed by their presence. You end up studying second-order effects and not realizing it. And it keeps running away from you the more you try to study it.
>> That's just even more confounding variables.
> No they are not. We are trying to see how much bound checks costs. If you compare between suboptimal programs the comparison is meaningless (or rather not interesting to anyone) - the starting point has to be the absolute best performance that it is possible to get, and add the worst-case bound-checking (I'm happy for you if you never have to worry about the worst
We are always comparing suboptimal programs. No compiler is producing optimal code, otherwise they would dead-code eliminate everything not explicitly intertwined into program output, statically evaluate half of our microbenchmarks, and replace them with table lookups.
You're going down a linear algebra rabbit hole trying to come up with a result that paints bounds checks in the worst possible light. If this is the real problem you have, maybe your linear algebra kernels would be better off just using pointers, or you could even try Fortran or handwritten assembly instead, if it is so important. Unsafe by default is bad IMHO. For real programs bounds checks really don't matter that much. Where this thread is going is all about loop optimizers, and they don't really get a chance to go nuts on most code, so I think we're way off in the weeds.
Note that Rust has unchecked indexing, you just have to explicitly ask for it if you want it. You can even - if you insist on writing cursed code, which it seems jcelerier does - write your own type in which indexing is always unchecked and it is Undefined Behaviour when you inevitably make a mistake. Just implement std::ops::Index (and, if you also want to mutate through these probably invalid indices, std::ops::IndexMut), and in your implementation say you're unsafe and just don't bother doing bounds checks.
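That cursed type would look roughly like this (a sketch; the wrapper name is made up, and every out-of-range index through it is Undefined Behaviour):

```rust
use std::ops::{Index, IndexMut};

// Hypothetical wrapper whose indexing skips the bounds check entirely.
struct UncheckedVec<T>(Vec<T>);

impl<T> Index<usize> for UncheckedVec<T> {
    type Output = T;
    fn index(&self, i: usize) -> &T {
        // SAFETY: none whatsoever - the caller promises i < self.0.len().
        unsafe { self.0.get_unchecked(i) }
    }
}

impl<T> IndexMut<usize> for UncheckedVec<T> {
    fn index_mut(&mut self, i: usize) -> &mut T {
        // SAFETY: as above.
        unsafe { self.0.get_unchecked_mut(i) }
    }
}
```

Note the footgun is still opt-in: you had to write `unsafe` and `get_unchecked` to get here.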
You can shoot yourself in the foot with Rust, it's just that you need to explicitly point a loaded gun at your foot and pull the trigger, whereas C++ feels like any excuse to shoot you in the foot is just too tempting to ignore even if you were trying pretty hard not to have that happen.
C++ devs that actually care about security the same way titzer does can turn on the security checks that are disabled by default, and thus have much the same experience.
For example, on Visual Studio define _ITERATOR_DEBUG_LEVEL to 1 and enable /analyze as part of the build.
While not Rust-like, it is already much better than not caring.
I think you're confusing two different senses of "vector". There is the contiguous-series-of-items "vector" data structure that both C++ and Rust have in their standard libraries, which is what's used in the code example, and there's "vectorizing", which is an optimization that uses things like SIMD to operate on multiple items at the same time.