Not really mysterious or a surprise. Passing count as a reference was a huge red...

Diggsey · on May 11, 2022

It would be a surprise to anyone not very familiar with C++'s strict-aliasing rule, which I imagine is quite a lot of people (and even many C++ programmers...)

Very few other languages have a similar rule (eg. Rust explicitly does not have this rule)

kevincox · on May 11, 2022

Although Rust doesn't have this rule, it isn't nearly as relevant there because of the way immutable (shared) references work. While `a: &Foo` and `b: &Bar` may alias, it doesn't matter much: Rust knows that nothing can write through them, so it can do basically all of the optimization it needs. Rust also knows that `c: &mut T` doesn't alias anything, so in that case it can make loads of assumptions.

One chink in this rule is interior mutability. This is why it is better to think of `&T` as a shared reference than an immutable reference. For example in the following code Rust can't assume that `a.get() == 5`. (Even if the second argument is changed to i8)

    pub fn test(a: &std::cell::Cell<u8>, b: &std::cell::Cell<u8>) -> u8 {
        a.set(5);
        b.set(7);
        a.get()
    }

https://rust.godbolt.org/z/e1c6Kx4eb

Commenting out the write to b does allow Rust to hardcode the return value as 5.

ahefner · on May 11, 2022

The surprising thing to me is that 'char8_t' (a new C++20 thing I'd never heard of) is not just a typedef for char with all the same aliasing implications, but a new and distinct type to which the magic 'char' aliasing rules don't apply (also, unsigned).

ncmncm · on May 11, 2022

And also distinct from std::byte, which they are hoping will pick up aliasing properties of char, allowing use and also abuse of char* to be someday eliminated from new code. std::byte does alias everything, but no operations are defined on it except copying (and, weirdly, bitwise operators).
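
To make the "copying and bitwise operators only" point concrete, here is a minimal sketch, assuming C++17. The helper name `xor_bytes` is made up for illustration and is not from the article or the standard library.

```cpp
#include <cstddef>      // std::byte, std::to_integer, std::size_t
#include <cstdint>      // std::uint8_t
#include <type_traits>  // std::is_trivially_copyable_v

// Hypothetical helper: XOR together the bytes of an object's representation.
// Inspecting the object through std::byte* is permitted by the aliasing rules.
template <typename T>
std::uint8_t xor_bytes(const T& value) {
    static_assert(std::is_trivially_copyable_v<T>,
                  "only meaningful for trivially copyable types");
    const std::byte* p = reinterpret_cast<const std::byte*>(&value);
    std::byte acc{0};
    for (std::size_t i = 0; i < sizeof(T); ++i) {
        acc ^= p[i];      // bitwise operators are defined for std::byte
        // acc += p[i];   // would not compile: std::byte has no arithmetic
    }
    // Leaving the std::byte world needs an explicit conversion.
    return std::to_integer<std::uint8_t>(acc);
}
```

The result does not depend on byte order, since XOR is commutative.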

MauranKilom · on May 11, 2022

Also, char is a distinct type from both signed char and unsigned char (even though it has the same size as both and the same signedness as one of them).
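
A small sketch of what "distinct" means in practice, assuming C++17. The overload set `which` is a made-up name for illustration: all three overloads can coexist, which would be a redefinition error if any two of the types were the same.

```cpp
#include <type_traits>

// The three character types are pairwise distinct...
static_assert(!std::is_same_v<char, signed char>);
static_assert(!std::is_same_v<char, unsigned char>);
static_assert(!std::is_same_v<signed char, unsigned char>);

// ...even though they all have the same size.
static_assert(sizeof(char) == sizeof(signed char));
static_assert(sizeof(char) == sizeof(unsigned char));

// Hypothetical overload set: legal only because the types differ.
constexpr int which(char)          { return 0; }
constexpr int which(signed char)   { return 1; }
constexpr int which(unsigned char) { return 2; }
```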

jhgb · on May 11, 2022

Are signed char and unsigned char subject to the same "can alias anything" rule? I never bothered to think about this in the past. Now I'm not sure. I've known that the three types are distinct, but that just means that the rule doesn't have to apply to all of them, not that it doesn't.

saagarjha · on May 11, 2022

Yes, the rule applies to all three character types.

jhgb · on May 11, 2022

Thanks. This seems mildly ungoogleable, or at least I haven't been able to find a good search result for this. So I'll keep in mind that these are three distinct types with the same aliasing behavior.

saagarjha · on May 11, 2022

Actually, I take that back and should clarify: it's all three only in C. In C++ it is just char and unsigned char. (If you want to search for this, "signed char aliasing" gave me good results.)

quibono · on May 12, 2022

Is there any difference in using char vs unsigned char then? I understand these two types are different as far as the compiler's concerned. Would they behave differently too?

MauranKilom · on May 12, 2022

The signedness of char is up to the implementation. You should not rely on char behaving like unsigned char.

Sharlin · on May 11, 2022

It basically has to be distinct from `char` because you can't portably use `char` to hold a UTF-8 code unit (because the guaranteed valid range is only 0x00 to 0x7F). Also, this way you can overload based on legacy char vs UTF-8 char, and have `std::basic_string<char>` and `std::basic_string<char8_t>` (aka `std::string` and `std::u8string`) be distinct types as well. So finally in C++20 we actually have a portable UTF-8 string type!

ncmncm · on May 11, 2022

Anyway, a UTF-8 sequence transport type. Few of u8string member operations make sense for UTF-8 as such. Usually that doesn't matter. Sometimes it matters a great deal, and we will need a whole new API for that.

Sharlin · on May 11, 2022

Yeah, good point.

planede · on May 11, 2022

> you can't portably use `char` to hold a UTF-8 code unit

That's not true: in C++ a byte is guaranteed to be able to hold a UTF-8 code unit.

https://timsong-cpp.github.io/cppwp/n4868/intro.memory#1.sen...

Sharlin · on May 11, 2022

Yes, but if `char` is signed, as it usually is, its bit patterns correspond to values -0x80 to 0x7F. So yeah, you can no-cost encode the >=0x80 code units as their two’s complement counterparts but it feels suspicious. At least to me, after writing some Rust lately which very much does not do implicit signed–unsigned conversions. Much better for char to always represent the "basic character set" (ie. usually ASCII) and have a distinct type for UTF-8.
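
For illustration, a minimal sketch of reading code units back out of a `std::string` without depending on the signedness of `char`. The helper name `utf8_code_units` is hypothetical.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: return the UTF-8 code units stored in a std::string
// as values in 0x00..0xFF, whether plain char is signed or unsigned.
std::vector<unsigned> utf8_code_units(const std::string& s) {
    std::vector<unsigned> units;
    units.reserve(s.size());
    for (char c : s) {
        // Converting through unsigned char maps a negative char such as
        // -0x3D back to the code unit 0xC3.
        units.push_back(static_cast<unsigned char>(c));
    }
    return units;
}
```

Without the cast, a comparison like `c >= 0x80` is always false where `char` is signed.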

matheusmoreira · on May 11, 2022

I agree, it's surprising. Types like uint8_t* are widely used to reinterpret other structures as byte arrays which implies they are universal aliases just like char*. Not sure what makes char8_t different.

wahern · on May 11, 2022

> Types like uint8_t* are widely used to reinterpret other structures as byte arrays which implies they are universal aliases just like char*

Alternatively, it implies that there's a lot of broken code out there. So much broken code that they've accidentally found safety in numbers, and compilers are unlikely to change a coincidental behavior upon which they wrongly relied. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66110
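
For code that wants bytes without depending on whether `uint8_t` happens to be a character type, one commonly used alternative is to copy the object representation with `std::memcpy`. A minimal sketch, assuming C++17 and trivially copyable types; `to_bytes` and `from_bytes` are hypothetical names.

```cpp
#include <array>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical helper: copy an object's bytes into a uint8_t array.
// memcpy is well defined here regardless of how uint8_t aliases.
template <typename T>
std::array<std::uint8_t, sizeof(T)> to_bytes(const T& value) {
    static_assert(std::is_trivially_copyable_v<T>);
    std::array<std::uint8_t, sizeof(T)> out{};
    std::memcpy(out.data(), &value, sizeof(T));
    return out;
}

// Hypothetical inverse: rebuild a T from bytes produced by to_bytes.
template <typename T>
T from_bytes(const std::array<std::uint8_t, sizeof(T)>& bytes) {
    static_assert(std::is_trivially_copyable_v<T>);
    static_assert(std::is_default_constructible_v<T>);
    T value;
    std::memcpy(&value, bytes.data(), sizeof(T));
    return value;
}
```

Compilers generally turn small fixed-size `memcpy` calls like these into plain loads and stores, though that is an optimization, not a guarantee.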

rsstack · on May 11, 2022

TIL: https://stackoverflow.com/questions/98650/what-is-the-strict...

ReactiveJelly · on May 11, 2022

It surprised me, cause I forgot that aliasing is a problem for C. Guess I'm one of today's 10,000.

jhgb · on May 11, 2022

I'm sort-of, kind-of aware of the issue but what surprised me was how std::string pulls this into the picture here. Is this because of what str[i] gets compiled into when str is a basic_string of chars? That seems non-obvious to me because I'm just accustomed to dealing with C++ strings as fairly opaque entities. If they somehow expose to the compiler that you're really manipulating characters via pointers, I can understand how this complicates things.

bonzini · on May 11, 2022

Replace *count with this->count and you can see why this can cause pessimization.

fguerraz · on May 11, 2022

What would that change? If this->count was a pointer you'd still have to de-reference it explicitly. Or am I missing the point of your argument?

bonzini · on May 11, 2022

I agree that in this toy example passing a count argument by reference makes little sense, but you'd get the same thing if you accessed the count field of

    struct s {
        int count;
        char *chars;
    };