uint32_t foo(uint8_t ubyte) { return ubyte << 24; } Look ma, undefined behavior ...

jstimpfle · on March 15, 2018

The first is nice, but I think you should make sure that ubyte really has that many bits for it to make sense, no? And doesn't your compiler emit a warning if it takes advantage of that undefined behaviour? (e.g. using the upper part of the register where that byte is stored).

The second, why do you cast as uintptr_t? It makes no sense to me at all. Yep, it might be a little bit advanced, but it's well known that you should only have correctly typed pointers, or void pointers, or char pointers, to any object. It's not a totally obscure thing, just as most people understand the need for "restrict". I'd say the code looks bad to me at first sight, and even a beginner is unlikely to write it because it's a complicated construction.

You have your point, but anyway the pointer-type rule makes sense and I don't know a better version of the rule that could prevent many unnecessary reloads.

aidenn0 · on March 15, 2018

With the first, John Regehr found many, many examples of it in real world code, and in some cases the bug he filed was closed without fixing it because "no compiler currently will do the wrong thing with it." I'm not aware of any compiler that emits a warning for this with default compilation flags.

The second is based off of code where an operating system call would get the mapped ranges of a shared memory region. For whatever reason, it took a MemoryAddress which was essentially a pre-c99 uintptr_t defined by the system headers (an unsigned value large enough to hold an address). The casts were there because the compiler warned about incompatible pointer types.

I was never able to convince the engineer who wrote the code (~20 years of C experience) that this code was wrong (it worked fine until intermodule inlining was enabled). Instead we just enabled the compiler option to disable optimizations based off of aliasing rules.

You are both greatly overestimating the degree to which programmers understand C, and overestimating the degree to which programmers who do understand C do not make mistakes. I've had lots of people say that the code in the second example "looks bad to me" but the commit still made it through code review and was being run for 5 years before optimization flag changes caused bugs to appear.

jstimpfle · on March 15, 2018

> "no compiler currently will do the wrong thing with it."

Well then, that's nice no? I don't like to play language lawyer. If there are machines were it can lead to errors, it might be bad machine design. The compiler or the user should work together to catch the error / specify more precisely what is the intent. If we can assume that (ubyte << n) == 0 for any n >= 8, I'm very fine with that, too.

I don't think compilers should exploit every undefined behaviour in the C standard. Some of those might be there only to support rare, quirky architectures.

> For whatever reason, it took a MemoryAddress which was essentially a pre-c99 uintptr_t defined by the system headers

That's inconvenient that the API forces such a strange type to the user. But I think the logical usage is to declare "start" and "end" as uintptr_t then. In this case there should be no problem, or do I miss something?

> You are both greatly overestimating the degree to which programmers understand C, and overestimating the degree to which programmers who do understand C do not make mistakes.

All I can say is I've never had this kind of problem really, so far. The kinds of problems I get by using safe toys on the other hand (e.g. array access only possible through proxy containers) are incredibly more painful, because these approaches are extremely bad for modularity and elegance. You have to specify the container, and also its type, or at least implement all kinds of boilerplate interfaces. This is incredibly restricting.

Everybody makes mistakes and while some memory safety problems definitely happen regularly even to very experienced coders, and are harder to detect than e.g. an exception, they are also the minority of the problems I encounter, even when counting the occasional buffer overrun or similar. Even in terms of data security, I figure more exploits are high-level (i.e. logic errors not catched in a high-level language - SQL injection or other auth bypass) than clever low-level ones. And not every type of program is extremely security-sensitive.

aidenn0 · on March 16, 2018

>> "no compiler currently will do the wrong thing with it."

> Well then, that's nice no? I don't like to play language lawyer. If there are machines were it can lead to errors, it might be bad machine design. The compiler or the user should work together to catch the error / specify more precisely what is the intent. If we can assume that (ubyte << n) == 0 for any n >= 8, I'm very fine with that, too.

> I don't think compilers should exploit every undefined behaviour in the C standard. Some of those might be there only to support rare, quirky architectures.

In this case, it would be legal for a compiler to assume that ubyte is less than 128. It's not unreasonable to assume that at some point in the future a compiler writer will discover a way to get a 0.1% improvement on whatever the current artificial benchmark of the day is and implement an optimization for it. The landscape of C libraries is littered with undefined code that used to work everywhere and now doesn't for just such a reason.

>> For whatever reason, it took a MemoryAddress which was essentially a pre-c99 uintptr_t defined by the system headers

> That's inconvenient that the API forces such a strange type to the user. But I think the logical usage is to declare "start" and "end" as uintptr_t then. In this case there should be no problem, or do I miss something?

Yes, the correct way is to declare them as uintptr_t and then assign them to a pointer of whatever type you need. It's not uncommon for systems software to treat addresses as integers in places, because they actually do represent specific addresses. I'd have to re-read the specification, but I think that having it take a void instead of a uintptr_t * would have the exact same issue, so I don't think it's the odd API choice that causes this.

jstimpfle · on March 16, 2018

I don't think there should be a problem. I think you can cast to void * and back as much as you like, but you aren't supposed to ever access the object as something else than char or its original type.

Yoric · on March 16, 2018

> I don't think compilers should exploit every undefined behaviour in the C standard. Some of those might be there only to support rare, quirky architectures.

Well, that is one way of looking at it. My personal experience is that if there is an undefined behavior in the C standard, it will be exploited, eventually. I have seen this break millions-of-lines applications during compiler upgrades and/or flag-twiddling. As you can imagine, debugging that problem was a nightmare.

[...]

> Everybody makes mistakes and while some memory safety problems definitely happen regularly even to very experienced coders, and are harder to detect than e.g. an exception, they are also the minority of the problems I encounter, even when counting the occasional buffer overrun or similar. Even in terms of data security, I figure more exploits are high-level (i.e. logic errors not catched in a high-level language - SQL injection or other auth bypass) than clever low-level ones. And not every type of program is extremely security-sensitive.

I believe that you are right in terms of number of exploits, but for the wrong reasons.

I can't find the statistics on this, but from the top of my head, ~20 years ago, 80% of the security advisories were buffer overflows/underflows. At the time, just about everything was written in C/C++.

These days, we see lots of high-level exploits, but I would tend to assume that the reason is that 1/ the web makes it very easy to carry out attacks; 2/ C and C++ have a very small attack perimeter on the web, even if you include Apache/nginx/... as part of the web.

Also, yes, web applications, especially in dynamically typed languages also would require military-grade testing :)

wahern · on March 15, 2018

The first is only undefined on 16 bit targets as ubyte is first promoted to int, and int is always at least 16 bits but on typical general purpose machines 32 bits. And in any event modern C compilers will emit a diagnostic when constant shifts are undefined--just not in this case because it's not actually wrong.

The latter problem[1] is applicable to other languages, including Rust, when they permit such type punning. C supports type punning because occasionally it's useful and necessary. The above invokes UB on Rust, too. Type punning has such a horrible code smell that you don't need unsafe{} as a gateway. Maybe for very inexperienced programmers, but they shouldn't be writing C code in situations where type punning suggests itself anymore than they should be writing assembly or unsafe Rust code.

C has many pitfalls, but I don't think I'm being overly pedantic here. There's no question C gives you a significant amount of rope to hang yourself (sometimes affirmatively but usually by turning its head when you steal some), but all languages give you slack to escape the limitations of how they model the environment--the points of contention are how much slack and at what cost. Arguing that requires more precision.

For example, even if we can agree that automatic arithmetic conversions are on balance a bad idea, it's still the fact that there's some benefit to that behavior, such as making your first example well-defined. That's not coincidental.

[1] It's actually only a problem if that routine actually stores the value through a uintptr_t pointer. If it casts the parameter to pointer-to-pointer-to-char before the store it's fine. You can cast a pointer to whatever you want, as many times as you want, and never will the casting, alone, change the semantics of the code. It's only actual load or store expressions that matter, and specifically the immediate pointer type they occur through.

Where you typically find code like this you do the correct thing--the abuse of function parameter types in this particular manner is usually because of const-ness headaches and the lack of function overloading (though now there's _Generic), and in those cases you're already paying attention to type punning pitfalls because you're already at a point where you're trying to skirt around relatively strict, inconvenient typing. If you're not then, again, you're not going to be doing the right thing when writing unsafe Rust code--and there are many more cases where Rust's strict typing is inconvenient, so arguably there's _greater_ risk with a language like Rust when the programmer is lazy or tired and resorts to subverting the type system.

Moreover, this isn't even a pointer aliasing problem. Pointer aliasing problems occur when the _order_ of loads or stores matters, but because of type punning the compiler can neither determine nor assume that access through variables of _different_ type point to the same object and thus it mustn't reorder the sequence. Not only is your example not a case of a single object accessed through multiple types across dependent loads and stores, it's not even a case of accessing through the same type. Unless you use the restrict qualifier, the compiler always assumes that two variables of the same pointer type alias the same object. Whether it's the _wrong_ type is a different, less consequential problem related to the potential for trap representations or alignment violations. But if the hardware doesn't care you'll get expected behavior; it's not the type of pitfall that threatens nasal daemons.

aidenn0 · on March 16, 2018

> The first is only undefined on 16 bit targets as ubyte is first promoted to int, and int is always at least 16 bits but on typical general purpose machines 32 bits. And in any event modern C compilers will emit a diagnostic when constant shifts are undefined--just not in this case because it's not actually wrong.

If ubyte is greater than or equal to 128, then converting it to a signed 32-bit integer (as on a 32 bit machine) and then shifting it left by 24 causes signed integer overflow. This is undefined behavior. Therefore the compiler is allowed to make optimizations that assume ubyte is less than 128.

jstimpfle · on March 16, 2018

Why is the ubyte promoted to int? I'd always assumed it would stay the same size.

kccqzy · on March 17, 2018

Look up integer promotion rules. Basically almost all arithmetic operators promote their operands to int/unsigned int before doing anything.

And you are the person who replied to my earlier post saying "don't do things that do not have intuitively clear semantics." Do integer promotion rules count as intuitively clear semantics? Most C programmers never learned them and therefore the semantics is confusing. For the few C programmers who learned them, the semantics is clearer when they can remember the rules, and confusing otherwise.

Now I hope I've convinced you that there are a lot of such subtle semantics issues in C, and hardly anything is intuitively clear, unless you've been a language lawyer or programmed in C for long enough.

jstimpfle · on March 17, 2018

Point taken. I agree that the weakly typed nature of C (integer promotions / implicit casts) can be problematic. Would be interesting to see if there are programs that would be painful to write with a stricter model.

It seems I've never really done "overshifting", otherwise I would have easily noticed that the shifted value is promoted first. If you don't overshift there's no way to trigger the undefined behaviour, even when you shift multiple times - since by assigning to the smaller type the result gets cut down again, effectively leading to the behaviour I'd assumed. I would hardly call this a gotcha.

Where it might be confusing is in a situation like this:

    char x = 42;
    printf("%d\n", x << 4);

But then again I'm undecided if the behaviour is that bad. It might be even useful.