Summary of C/C++ integer rules (nayuki.io)
112 points by dmulholl on April 2, 2021 | 95 comments



[Dons language lawyer hat]

> floating-point number types will not be discussed at all, because that mostly deals with how to analyze and handle approximation errors that stem from rounding. By contrast, integer math is a foundation of programming and computer science, and all calculations are always exact in theory (ignoring implementation issues like overflow).

Integer overflow is no mere implementation issue, any more than errors are an implementation issue with floating-point.

> Unqualified char may be signed or unsigned, which is implementation-defined.

> Unqualified short, int, long, and long long are signed. Adding the unsigned keyword makes them unsigned.

There's an additional point here that's not mentioned: char, signed char, and unsigned char are three distinct types, and char is the only integer type with this quirk. For the others, the signed keyword is redundant; signed int describes the same type as int. You can see this using the std::is_same type-trait with a conforming compiler. Whether char behaves like a signed integer type or an unsigned integer type depends on the platform.
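
A quick sketch of how to check this (these static_asserts should pass on any conforming C++11 compiler):

    #include <type_traits>

    static_assert(std::is_same<int, signed int>::value,
                  "signed int names the same type as int");
    static_assert(!std::is_same<char, signed char>::value,
                  "char is distinct from signed char");
    static_assert(!std::is_same<char, unsigned char>::value,
                  "char is distinct from unsigned char");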

> Signed numbers may be encoded in binary as two’s complement, ones’ complement, or sign-magnitude; this is implementation-defined.

This is no longer true of C++. As of C++20, signed integer types are defined to use two's complement. [0] I don't think C intends to do the same.

> Character literals (in single quotes) have the type (signed) int in C, but (signed or unsigned) char in C++.

That's not correct. In C++, the type of a character literal is simply char, never signed char nor unsigned char. As I mentioned above, whether char is signed depends on the platform, but it's always a distinct type.

> Signed division can overflow – e.g. INT_MIN / -1.

This isn't just overflow, it's undefined behaviour.

> Counting down

> Whereas an unsigned counter would require code like:

> for (unsigned int i = len; i > 0; i--) { process(array[i - 1]); }

That's one solution, but it might be a good place for a do/while loop.
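
For instance, a minimal sketch (reusing the hypothetical array, len, and process from the article's example):

    if (len > 0) {
        unsigned int i = len;
        do {
            i--;
            process(array[i]);
        } while (i > 0);
    }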

[0] https://stackoverflow.com/q/57363324/


> char, signed char, and unsigned char are distinct types, but that's only true of char.

That's correct, I was going to bring that up too.

This is particularly important because char and unsigned char are special in that they are an exception to the aliasing rules. That is, in this function:

    float foo(char* cp, float* fp) {
        *fp = 7;
        return *(float*)cp;
    }
    /* ... */
    float f = 2;
    float g = foo((char*)&f, &f);
Then g should end up equal to 7. That's true even if you change the type of the cp parameter to const char*! If you change "char" to "unsigned char" in both places then its behaviour stays the same, but if you change it to "signed char" in both places then it has undefined behaviour (if I've remembered everything correctly). Now that I think about it, this conflation of char's uses in the C standard has probably prevented a lot of optimisations where code was just using char* for strings rather than for potential aliasing.

Another point, which is very related, is that uint8_t and int8_t do not necessarily have to be a typedef for unsigned char / signed char or char, even if char is 8 bits wide. So you could end up with (at least) 5 types that are 8-bit wide!

Combined with the above aliasing rules only applying to char and unsigned char, that means you cannot reliably expect uint8_t to have that aliasing exception. Indeed, gcc originally made uint8_t and int8_t distinct new types, but that caused so many bugs that they ended up switching them to unsigned char and char (and I think Visual Studio has always done so).

> > Character literals (in single quotes) have the type (signed) int in C, but (signed or unsigned) char in C++.

> That's not correct. In C++, the type of a character literal is simply char, never signed char nor unsigned char.

I was going to bring this up too, although I wouldn't quite say it's outright incorrect, because I'm not sure they were making the claim you think they were. It could be interpreted to mean that the type is always char in C++, but that char may be a signed or an unsigned type (note the lack of monospace font for their use of "signed" and "unsigned"). But it's probably best not to overanalyse it, since they probably didn't know the types were distinct. The main thing is to reiterate, as you've done, that it's always `char`, regardless of whether that char is signed or unsigned.


> So you could end up with (at least) 5 types that are 8-bit wide!

Don't forget std::byte.


It is not a full-featured arithmetic type though. It doesn't implement operator+/-/* etc.


>> Character literals (in single quotes) have the type (signed) int in C, but (signed or unsigned) char in C++.

> That's not correct. In C++, the type of a character literal is simply char, never signed char nor unsigned char.

I'd assume the author meant (signed `char` | unsigned `char`) rather than (`signed char` | `unsigned char`).


To put it better, I meant to write that C++ character literals have type `char`, which in turn maps to either `signed char` or `unsigned char`.


What is the reference to "Dons"?


"Don" is a somewhat uncommon verb meaning to put on clothing. https://en.wiktionary.org/wiki/don#Verb


Signed overflow is undefined behavior.


That seems to be exactly what the parent comment said.


I think saagarjha's point was that the article already points out that signed overflow causes undefined behaviour. That's true, but I think it still bears emphasising that (INT_MIN / -1) causes undefined behaviour.


> This is no longer true of C++. As of C++20, signed integer types are defined to use two's complement. [0] I don't think C intends to do the same.

As no good language-lawyer discussion should be free of pedantry: there is no such thing as "as of C++20". C++20 is just a new version of the C++ standard. Projects that target C++11 or C++14 or C++17 are all still here and won't go away any time soon, and the respective C++ rules still apply to them. Publishing a new revision of the C++ standard changes nothing with regard to which rules actually apply to those projects, unless the project maintainers explicitly decide to migrate.


> C++20 is just a new version of the C++ standard.

and per ISO rules, older versions are withdrawn (as can be confirmed for C++ here: https://www.iso.org/standard/79358.html) and not to be used anymore: https://www.iso.org/files/live/sites/isoorg/files/store/en/P...

    Other reasons why a committee may decide to propose a standard for withdrawal include the following:
    - the standard does not reflect current practice or research
    - it is not suitable for new and existing applications (products, systems or processes)
    - it is not compatible with current views and expectations regarding quality, safety and the environment


Wow. Didn't know this. It doesn't have any bearing whatsoever on reality though. If it did, I wouldn't still be writing C++98 conformant C++.


Well, your code is nonstandard, that is all. Just like a house with power plugs installed 30 years ago is not standard, even if it "works".


> Having signed and unsigned variants of every integer type essentially doubles the number of options to choose from. This adds to the mental burden, yet has little payoff because signed types can do almost everything that unsigned ones can.

Unsigned types are quite useful when doing bit twiddling because they don't overflow or have a bit taken up by the sign.
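
A small sketch of why (assuming 32-bit int; note that right-shifting a negative signed value is implementation-defined in C, and was only pinned down as an arithmetic shift in C++20):

    unsigned int u = 0xFFFFFFF0u;
    unsigned int a = u >> 4;   /* well-defined: zero-fills, a == 0x0FFFFFFF */

    int s = -16;
    int b = s >> 4;            /* implementation-defined: sign-filling is
                                  common, but the standard does not
                                  guarantee it */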


> Unsigned types are quite useful when doing bit twiddling because they don't overflow or have a bit taken up by the sign.

That's essentially their only application. The rest are stupid single-bit memory-size optimizations. As Jens Gustedt noted, it's one of the (many) misnomers in the C language. It should be better called "modulo" instead of "unsigned". Other such misnomers that I recall:

    unsigned -> modulo
    char     -> byte
    union    -> overlay
    typedef  -> typealias
    const    -> immutable
    inline   -> negligible
    static   -> intern
    register -> addressless
EDIT: found the reference https://gustedt.wordpress.com/2010/08/18/misnomers-in-c/


> That's essentially their only application.

What about when it doesn't make semantic sense to have negative values? E.g. for counting things, indexing into a vector, sizes of things. If negative doesn't make sense, I use unsigned types. It's not about the memory size in that case.


While I also like to use unsigned numbers when that is the correct type of a variable, the C language does not really have support for unsigned integers.

As someone else already said, the so-called "unsigned" integers in C are in fact remainders modulo 2^N, not unsigned integers.

While the sum and the product of 2 unsigned integers is also an unsigned integer, the difference of 2 unsigned integers is a signed integer.

The best behavior for a programming language would be to correctly define the type of the difference of 2 unsigned integers, and the second-best behavior would be to specify that the result is unsigned but automatically insert checks for out-of-___domain results, to detect the negative ones.

As C does not implement either of these behaviors, whenever you use unsigned integers you must either avoid subtraction or always check for negative results, unless you can guarantee that negative results cannot happen.
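
A minimal sketch of the trap and of the required guard (assuming 32-bit unsigned int):

    unsigned int a = 3, b = 5;
    unsigned int d = a - b;   /* wraps modulo 2^32: d == 4294967294 */

    /* the check must come before the subtraction,
       because (a - b < 0) can never be true */
    if (a >= b) {
        d = a - b;            /* a genuine difference */
    } else {
        /* handle the would-be-negative case */
    }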

This is a source of frequent errors in C when unsigned integers are used.

The remainders modulo 2^N can be very useful, so an ideal programming language would support signed integers, unsigned integers and modular numbers.


If negative doesn't make sense then you are saving one bit with this method, but introducing a ton of fun footguns involving things like conversions. Further, the compiler cannot assume the absence of overflow and must now do extra work to handle those cases in a conforming fashion, even if your value width doesn't match the CPU width. This can make your code slower!


Also, go to the Compiler Explorer and compare the generated code for C++ "num / 2" when num is an int, and when num is an unsigned int.

While there are a few cases where the compiler tends to do a better job of optimizing signed ints than unsigned ints (generally by exploiting the fact that signed integer overflow is undefined), they are not as fundamental as "num / 2". Being forced to write "num >> 1" all over the place whenever I care about performance is basically a dealbreaker for me in many projects; and I haven't even gotten into the additional safety issues introduced by undefined overflow.
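
A sketch of the difference (the exact instruction sequences vary by compiler and target): dividing an unsigned value by two is a single logical shift, while signed division must round toward zero, so the compiler has to emit a fix-up for negative inputs:

    unsigned int udiv2(unsigned int num) {
        return num / 2;   /* typically one logical right shift */
    }

    int sdiv2(int num) {
        return num / 2;   /* typically a shift plus extra instructions: an
                             arithmetic shift alone would round negative
                             values toward -infinity, but C and C++ require
                             rounding toward zero */
    }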


Positive values are a particular case of signed values; you can still use signed ints to store positive values. There is no need to enforce your semantics through types, especially not when the values of the type are trivially a particular case of the values of another type. For example, when you write a function in C that computes the prime factors of an int, do you need a type for prime numbers? No, you just use int. The same goes for positive numbers, and for even numbers, and for odd numbers. You can and should do everything with signed integers, except bitfields, of course.


> No need to enforce your semantics through type

Maybe I'm spoiled by other languages with more powerful type systems, but this is exactly what I want my types to do! Isn't this why we have type traits and concepts and whatnot in C++ now? If not for semantics, why have types at all? The compiler could figure out how many bytes it needs to store my data, after all.

I use types for two things: to map semantics to hardware (if memory or performance optimization are important, which is rare) and to enforce correctness in my code. You're telling me that the latter is not a valid use of types and I say that's the single-biggest reason I use statically typed languages over dynamically typed languages, when I do so.

But even if that's not the case, why would I use a more general type than I need, when I know the constraints of my code? If I know that negative values are not semantically valid, why not use a type that doesn't allow those? What benefit would I get from not doing that? I mean, why do we have different sizes of integers when all the possible ones I could want can be represented in a machine-native size and I can enforce size constraints in software instead? We could also just use doubles for all numbers, like some languages do.


> Maybe I'm spoiled by other languages with more powerful type systems, but this is exactly what I want my types to do! Isn't this why we have type traits and concepts and whatnot in C++ now? If not for semantics, why have types at all, the compiler could figure out what amount of bytes it needs to store my data in, after all.

yes, but understand that, despite the name, what unsigned models in C / C++ is not "positive numbers" but "modulo 2^N" arithmetic (while signed models the usual arithmetic).

There is no good type that says "always positive" by default in C or C++ - any type which gives you an infinite loop if you do

    for({int,unsigned,whatever} i = 0; i < n - 1; i++) {
       // oops, n was zero, n - 1 is 4.something billion, see you tomorrow
    }
is not a good type.

If you want an "always positive" type, use some safe_int template such as https://github.com/dcleblanc/SafeInt - here, if you do "x - y" and the result should be negative, you'll get the rightful runtime error that you want, not some arbitrarily large and incorrect number.

The correct uses of unsigned are, for instance, computations of hashes, crypto algorithms, random number generation, etc., as those are in general defined in modular arithmetic.


+1 for this. I was just bitten by this last week, when I switched from a custom container where size() was an int to a std::vector, where size() is a size_t.

The code was check-all-pairs, e.g.

  for (int i = 0; i < container.size() - 1; ++i) {
    for (int j = i + 1; j < container.size(); ++j) {
      stuff(container[i], container[j]);
    }
  }
Which worked just fine for an int size, but failed spectacularly for a size_t size when size == 0.

I totally should have caught that one, but I just couldn't see it until someone else pointed it out. And then it was obvious, like many bugs.
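
The usual fix is to move the addition to the other side of the comparison, so the underflowing subtraction never happens (a sketch, reusing the same hypothetical container and stuff()):

  for (size_t i = 0; i + 1 < container.size(); ++i) {
    for (size_t j = i + 1; j < container.size(); ++j) {
      stuff(container[i], container[j]);
    }
  }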


I recommend using -fsanitize=undefined -fsanitize=integer if you can build with clang - it will print a warning when an unsigned int wraps around below zero, which catches a terrifying number of similar bugs the first time it is run. (There are a lot of false positives in hash functions and the like, but IMHO it's well worth using regularly.)


Would you really write a function find_prime_factors() that takes an input of type "integer" and returns an output of type "prime", which you have previously defined? Then if you want to sum or multiply such primes you have to cast them back to integers. Maybe it makes sense for you, but for me this is the textbook example of useless over-engineering.

The same ugliness occurs when using unsigned types to store values that happen to be positive. Well, in that case it is even worse, because it is incomplete and asymmetric. What's so special about the lower bound of the possible set of values? If it's an index to an array of length N, you'll surely want an integer type whose values cannot exceed N. And this is a can of worms that I prefer not to open...


> Would you really write a function find_prime_factors() that takes an input of type "integer" and an output of type "prime", that you have previously defined?

If the language allows me to and its an important semantic part of my program, then yes. The same way as I would create types for units that need conversion.

Unless I'm writing low level performance sensitive code, yes, I want to encode as much of my semantics as I can, so that I can catch mistakes and mismatches at compile time, make sure units get properly converted and whatnot.

> What's so special about the lower bound of the possible set of values?

Nothing, I would encode a range if I could. But many things don't have a knowable upper bound yet do have a lower bound at zero: you can't have a negative size (for most definitions of size); usually when you have a count of things you don't have negatives; you know that a dynamically sized array can never have an element index less than 0, but you may not know the upper bound.

Also, the language has limitations, so I have to work within them. I don't understand your objection for using what is available to make sure software is correct. Also, remember that many of the security bugs we've seen in recent years came about because of C not being great at enforcing constraints. Are you really suggesting not to even try?

> And this is a can of worms that I prefer not to open...

And yet many languages do and even C++20 is introducing ranges which kind of sort of fall into this space.


To me it could totally make sense. It depends on the context, but I can very well imagine contexts where such a choice makes sense. For example, in principle it would make sense, for an RSA implementation, to allow constructing a PublicKey type only from the product of two Primes, and not from two arbitrary numbers. And the Prime type would only be constructible by procedures that provably (perhaps with high probability) generate an actual prime number. It would be a totally sensible form of defensive programming. You don't want to screw up your key generation algorithm, so it makes sense to have your compiler help you avoid constructing keys from anything else.

For the same reason, say, in an HTTP server I could store a request as a char* or std::string, but I would definitely create a class that ensures, upon construction, that the request is valid and legitimate. Code that processes the request would accept HTTPRequest, but not char*, so that unverified requests cannot even cross the trust boundary.


But "unsigned" doesn't actually enforce the semantics you want. Missing an overflow check means your value will never be negative, but it is almost certainly still a bug. And because unsigned overflow is defined, the compiler isn't allowed to prevent you from doing it!

This is just enough type semantics to injure oneself.


So, because it's not perfect, should you throw it all out?


No. Because people tend to make more mistakes if they try to use unsigned values in this manner in C/C++.


I’ve personally never encountered a bug that turned out to be caused by an unsigned value. YMMV, I guess.


I've seen all sorts of bugs caused by surprise conversions, as well as overflows that cause bugs that would be statically detectable, but can't become blocking errors because unsigned overflow is well defined.


> Positive values are a particular case of signed values, you can still use signed ints to store positive values.

And yet Java's lack of unsigned integers is considered a major example of its (numerous) design errors.

> No need to enforce your semantics through type, and especially not when the values of the type are trivially particular cases of the values of another type.

Of course not, there's no need for any type at all, you can do everything with just the humble byte.

> The same thing for positive numbers

No?

> You can and should do everything with signed integers

You really should not. If a value should not have negative values, then making it so it can not have negative values is strictly better than the alternative. Making invalid values impossible makes software clearer and more reliable.

> except bitfields, of course.

There's no more justification for that than for the other things you object to.


Java's lack of unsigned int is widely (but not universally) seen as a deficiency. This is especially true when Java is compared to C#, a very similar language at its core but which does have uint types. Anyway, I have a separate article arguing why Java should not have uint, and many ideas from there can be adapted to C/C++ too: https://www.nayuki.io/page/unsigned-int-considered-harmful-f...


Well, you and I are different people, and we don't have to agree on everything. In this case, it seems that we don't agree on anything. But that's still OK, if it works for you ;)


Thanks for this, gonna add some #defines to my headers :)


> const -> immutable

const -> read_only_view is better


Which is why sane languages of the time had a bitfield type.


In the myths section:

> char is always 8 bits wide. int is always 32 bits wide

> Signed overflow is guaranteed to be wrap around. (e.g. INT_MAX + 1 == INT_MIN.)

Are there any current, relevant hardware architectures where this is not true (e.g. bytes are not 8 bits, and integers are not 2's complement)?

E.g. what's the point of "portability" if there is no physical hardware around anymore where those restrictions would apply?


This is the trap with 'undefined behaviour': it has nothing to do with portability; it is a language-level rule.

I.e., if the C standard says something is 'undefined', it is not merely to be avoided for portability reasons (hardware, assembler); it must not be used, end of story. The portability stuff is called 'implementation-defined' in C, not 'undefined behaviour'. The problem is that the compiler can (and will!) exploit the undefined-behaviour rules. E.g., the following code is officially broken (and not just on weird hardware, but everywhere, as defined by the C standard):

  int saturated_increment(int i)
  {
      if ((i + 1) < i) { /* if it overflows, do not inc */
          return i;
      }
      return i + 1;
  }
The compiler may (and many will) remove the whole if() block, because i+1<i is trivially false, because int cannot overflow (says the C standard).
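
A well-defined version tests the bound before doing the arithmetic (a sketch):

    #include <limits.h>

    int saturated_increment(int i)
    {
        if (i == INT_MAX) { /* detect the edge case without overflowing */
            return i;
        }
        return i + 1;
    }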

As one can imagine, when compilers started exploiting this, a lot of discussion about sensibility followed. And gcc added -fwrapv among other things.

(And the code would be fine if 'unsigned' was used instead of 'int', because this is only a problem for signed ints.)


"Undefined behavior" really means that the standard doesn't define what should happen and that the compiler is therefore free to do whatever it pleases, under the assumption that such code will never occur.

Reminds me of the examples where the code gets compiled such that a branch that returns from the function is unintuitively always taken: the compiler was able to detect that there is undefined behavior later in the function, and since undefined behavior isn't legal, it assumed that point can never be reached, so the branch must always be taken and the actual condition check got optimised away (IIRC).

So yeah, undefined behavior isn't "implementation defined" nor "unportable" but rather "illegal not allowed wrong code".
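
A sketch of that shape of example (hypothetical code, not necessarily the one being recalled): the read of table[4] on the last iteration is out of bounds, so an optimiser may assume that iteration is never reached and fold the whole function into "return 1":

    int table[4];

    int contains(int v) {
        for (int i = 0; i <= 4; i++) {   /* bug: i == 4 reads past the end */
            if (table[i] == v)
                return 1;
        }
        return 0;   /* "unreachable" once the compiler assumes no UB */
    }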


> So yeah, undefined behavior isn't "implementation defined" nor "unportable" but rather "illegal not allowed wrong code".

There are edge-cases even there. Calling a function generated by a JIT compiler is undefined behaviour, but there's a gentleman's agreement that the compiler won't screw it up for you.

Almost all C/C++ compilers promise that floating-point division-by-zero results in an infinity, or NaN for 0/0 (the IEEE 754 behaviour), but according to the C/C++ standards themselves, it's undefined behaviour.

You're right though that in general, one should not be complacent about UB.


> There are edge-cases even there. Calling a function generated by a JIT compiler is undefined behaviour, but there's a gentleman's agreement that the compiler won't screw it up for you.

Though you're not writing C/C++ in that case. You're writing "C/C++ for that particular architecture, ABI, OS and compiler".

In general C/C++, if your code is correct, every present and future compiler, known and unknown, is supposed to generate a correct executable. If one doesn't, it has a bug. You can pretend to be smarter and rely on UB, but then the responsibility shifts to you: you have (in principle) to validate each compiler and environment, and you can claim no bug against anybody other than yourself.


Sounds right. If you're doing floating-point work it's not generally a problem to assume that division by zero will result in an infinity or NaN. Virtually all C and C++ compilers commit to this behaviour in the name of IEEE 754 compliance (even if the IEEE 754 compliance is incomplete).


DSPs often have uncommon sizes. The tms320c5502, for example, has the following sizes: char-- 16 bits short --16 bits int --16 bits long-- 32 bits long long -- 40 bits float-- 32 bits double -- 64 bits


Indeed. The C28x line by the same company shares CHAR_BIT == 16 with C55. C28x is quite popular in power electronics applications.

"Relevant" is in the eyes of the beholder, and its all too easy to no-true-scotsman your way out of existing architectures. I claim that both of these architectures are relevant by virtue of suppliers continuing to make new chips that use them, and system builders continuing to select those chips in new products.


> 40 bits float-- 32 bits double

Isn't double required to have more precision than float?


I think their formatting got swallowed by HN:

char-- 16 bits

short --16 bits

int --16 bits

long-- 32 bits

long long -- 40 bits

float-- 32 bits

double -- 64 bits


> long long -- 40 bits

Isn't this in direct contradiction to what the article says?

> long long: At least 64 bits, and at least as wide as long.


I heard that one case where defining int overflow as wrapping would be very bad for performance is pointer arithmetic - e.g. offsetting a pointer by i times sizeof(type). I think the x64 instruction "lea" accomplishes this. If this instruction is used, it is impossible to simulate 32-bit 2's complement overflow by just discarding the upper 32 bits of a 64-bit integer.

So the UB that is associated with overflowing an int is required to efficiently compile loops that use a counter `int i` to index an array. There is a huge number of these loops in the wild.

This problem might be just some unfortunate coincidence with how array indexing is defined in C. I don't understand this deeply, but just wanted to bring it up. I believe I read this on Fabian Giesen's blog.
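
A sketch of the kind of loop usually cited (my example, not one from the blog): on a 64-bit target the compiler wants to keep i in a 64-bit register and fold the scaling into the addressing mode, and it may do so only because a signed i is assumed never to wrap:

    long sum(const int *a, int n) {
        long s = 0;
        for (int i = 0; i < n; i++) {
            /* a[i] can compile to one scaled load (lea-style addressing)
               because i cannot legally overflow; with defined wrapping,
               the compiler would have to re-truncate i to 32 bits on
               every iteration */
            s += a[i];
        }
        return s;
    }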


I think the nasty cases are in supporting subregister-sized arithmetic. ARMv8 can perform almost any integer operation on its registers either as 64-bit or 32-bit registers.

The classic RISC machines could only perform full-register arithmetic. RISC-V has a small handful of instructions that can accelerate signed subregister arithmetic, but none that accelerate unsigned subregister arithmetic. So, if you need a 32-bit unsigned integer operation to guarantee wrap-around behavior on 64-bit RISC-V, the compiler may have to insert additional zero-extension instruction sequences if it cannot prove the absence of overflow.


The part about lea doesn't seem especially convincing, it's not hard to imagine that pointer arithmetic could be defined such that overflow is still UB, while allowing regular signed integer arithmetic to overflow safely.


I can't say much about this. What I know is that in C, pointer arithmetic is defined in terms of "normal" arithmetic. p[i] is defined as *(p + i). And (p + i) means offsetting p by (i * sizeof *p), where that multiplication is computed in the type of i (e.g. (32-bit) int or an even smaller type)


That multiplication is entirely implicit, so there is no reason the compiler needs to handle it the same way it handles an explicit multiplication. Given that `p + i` is obviously not an integer addition, and that it already has much more UB than `i + j`, there is no reason why `i + j` having defined overflow rules would need to mean `p + i` also has them. (Just as `i + j` is safe for any small enough i and j, p + i is only meaningful if it points within the same object as p; to be fair, it's not UB to compute p + i for any i, it's UB to use the value.)


> Are there any current, relevant hardware architectures where this is not true (e.g. bytes are not 8 bits, and integers are not 2's complement)?

For char, not sure, but the problem with signed overflow is not that you can't be sure whether it's 2's complement, it's that the compiler is allowed to assume it won't happen. So, if you read two numbers into 2 ints and add them up, then check for overflow somehow, the compiler will just remove your check while optimizing, since integer addition can't overflow in a valid program.
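
The portable pattern is to test for overflow before performing the addition, using only arithmetic that cannot itself overflow (a sketch):

    #include <limits.h>

    /* returns 1 and stores a + b in *out if it fits, otherwise returns 0 */
    int add_checked(int a, int b, int *out) {
        if ((b > 0 && a > INT_MAX - b) ||
            (b < 0 && a < INT_MIN - b)) {
            return 0;   /* the sum would overflow */
        }
        *out = a + b;   /* now known to be in range: no UB */
        return 1;
    }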


And the compiler is allowed to remove the check, *even if not optimizing*. -O0 doesn't guarantee it'll be kept.

In practice, compilers should be considered to follow Murphy's Law: Undefined Behavior will work perfectly fine on a developer's machine or when observed by any support or QA staff, but will occasionally cause intermittent problems on production machines when observed by users or during demonstrations to executives.


Signed integers wrapping and 2’s complement are separate issues. C++20 specifies that signed integer are 2’s complement, but signed overflow is still undefined.


There are loads of DSPs, MCUs, and other non-PC junk where CHAR_BIT is not 8. For example, on the SHARC, CHAR_BIT is 32; absolutely every type is 32 bits wide.


So they can actually support raw strings! Nice


Those are two different things:

For 'char has 8 bits': the bitwidth is 'implementation defined' in C. If you know your target architectures, you can assume it's 8 bits, because it's indeed a question of portability.

For 'int must not overflow': this is 'undefined behaviour' in C. You must not do it, regardless of what you know about your target architectures, because this is a language level prohibition.


Remember to add: that can actually run standard C++ (i.e. with exceptions)?

Certainly you can find an architecture which may run some type of C-like language with strange arithmetic rules (e.g. DSPs). I would bet it's harder to find one such architecture where one can run standard C, and impossible to find one which can run standard C++.


This. I don't understand why everyone must suffer the pain of the possibility of weird char widths instead of just settling on using a non-standard C in a bunch of DSPs. It's not like you're going to link a bunch of regular run of the mill C libraries on them anyway.


Isn't this an artifact of the age of C? When it was first created it was a major concern to support every architecture, so they put it in the standard. I don't think anyone has wanted to go through the pain of removing it ever since.

After all, who are language nerds to dictate chip manufacturers what the ISA should look like? :P

And it was only in the last 2 decades that everything got dominated by x86...


C code doesn't just run on the architecture you compile for. It first "runs" on a C virtual machine simulated by the optimizer. This low-level virtual machine (you may call it LLVM) usually implements signed overflow by deleting the code that caused it.


On ARM, char is always unsigned, whereas on Intel it's usually signed. This silly inconsistency broke a lot of code.


What a coincidence this gets posted today, I posted something[0] a couple of hours ago about how specifically a combination of these rules can bite you very hard.

0: https://rmpr.xyz/Integers-in-C/


> Signed numbers may be encoded in binary as two’s complement, ones’ complement, or sign-magnitude; this is implementation-defined.

Thankfully, in addition to what MaxBarraclough helpfully pointed out, every (u)intN_t type provided by <stdint.h> is guaranteed to use two's complement even in C99.


One thing I think should have been mentioned: size_t is guaranteed to be large enough to index all of memory, which is why it is the return type of sizeof.


size_t is only guaranteed to be large enough to store the size of the largest object. This is not the same as being able to index all of memory. You could imagine a platform with a restricted contiguous allocation size, where the maximum object size is smaller than the size of the address space.


Furthermore, size_t bears no relationship with int; it could be wider/equal/narrower. If size_t is narrower than int, then doing any arithmetic on size_t variables will result in automatic promotion to signed int, which can lead to dangerous signed overflow. C/C++ are full of footguns.


It’s funny that you would end up with a similar conclusion for other parts of the language (e.g. operators) as well. Just a gigantic set of inane rules everywhere causing you to constantly be in danger of introducing bugs and portability issues.


It's even funnier that although the language is full of traps, in practice it works quite well. I don't think any C developer (or let's say 95%) knows all the rules mentioned in the article, yet we are still in one piece.

Does anybody know of any paper on bugs per line of code for different languages, or something similar?


C/C++ developers not knowing the rules does bite them. For example, it made the 32-bit/64-bit transition much more painful. See https://www.viva64.com/en/a/0004/ ; https://www.viva64.com/en/a/0065/


But! And that's important -- it allows for great performance, so you can make ten/hundred times more mistakes per second than in other, "safer" languages.


> it allows for great performance, so you can make ten/hundred times more mistakes per second than in other, "safer" languages.

This is false. For a long time C performance was inferior to Fortran's, which is arguably a safer language than C. It's hilarious that strict aliasing and the `restrict` keyword were born out of making C competitive with Fortran, and that UB became a major issue for C programmers as a result!


Yes, that's why C has undefined behavior. Absolutely


Nowadays, it doesn't provide any performance gain. I didn't see those days, but maybe it was important for performance back in the 70s/80s/90s, even if it was risky? E.g. null-terminated strings were chosen due to their low space overhead.


It depends on what you are doing. For some kinds of programs, C/C++ are going to be much faster than most "modern" languages.


I didn't mean that C is not fast or not faster than other languages. It's still the fastest one, I believe.

What I meant is that undefined behavior allows compilers to optimize in ways that would not be possible otherwise. So it might have been a deliberate decision back then, to leverage performance. I don't know, just an idea.


It used to be "folk knowledge" that only Fortran and hand-crafted ASM were faster. Not sure if that's still (or ever was) true.


I guess it was maybe true one time.

http://www.catb.org/jargon/html/story-of-mel.html


I agree, but I didn't consider these when saying C is the fastest. These are not "general purpose": you don't write your db, browser, http server or game engine with these.


Most, but not all. Languages like Rust and Zig show that you can have the performance without the landmines.


Also, theoretical performance is overrated. Almost all the things that lend themselves to speed make code brittle and incapable of future modification.

Once you've got your C code doing safety checks with data types that won't break under the littlest change, the code becomes much slower than code golf would suggest. A common example is passing void pointers everywhere. You either check every call every time (aka dynamic typing) or you trust that the programmer understands the system completely and never forgets or messes up. Better types give you all the speed AND all the safety here.


It's discouraging. If the language requires you to actually know what you're doing, you can't hire dirt-cheap, easily-replaced code monkeys to bang out your ideas, and the end result is you get to keep less of the investors' money for yourself.


It can feel good to imagine yourself an enlightened master among code monkeys, yet in practice everybody is a code monkey sometimes, and when that happens in C/C++ it leaves a ticking time bomb in the codebase that will lie there until a customer blows up on it, no matter how many millions went into QA of the product.

And in practice, C/C++ developers are among the lower-paid programmers - probably because "banging out ideas" and producing actual programs that actually work are valued more than language elitism.


It’s actually not true at all that C++ developers are lower paid. Rather, their pay is highly bimodal. Most work at all FAANGMULA companies is C++.


It's a feature, not a bug.


This is one of the misconceptions:

> sizeof(T) represents the number of 8-bit bytes (octets) needed to store a variable of type T.

That's a misconception I had and I've never run into a problem. What's a platform where sizeof works differently?

Also, what's the reasoning behind sizeof being an operator rather than a function?


See https://stackoverflow.com/questions/2098149/what-platforms-h.... As for `sizeof` being an operator, well, C doesn't have generics, so it has no choice but to make `sizeof` somehow special.

If you don't want to bother supporting platforms where byte is not 8-bit (a reasonable choice I would say), use `int8_t`/`uint8_t` instead. Those types won't exist on platforms that don't have 8-bit bytes.
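
One consequence of sizeof being an operator: it can take either an unparenthesized expression or a parenthesized type name, and its operand is not evaluated (a small sketch):

    #include <stddef.h>
    #include <stdio.h>

    int main(void) {
        int x = 0;
        size_t a = sizeof x;      /* expression operand: no parens needed */
        size_t b = sizeof(int);   /* type-name operand: parens required   */
        size_t c = sizeof x++;    /* x stays 0: operand is not evaluated  */
        printf("%zu %zu %zu %d\n", a, b, c, x);   /* e.g. "4 4 4 0" */
        return 0;
    }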


> If you don't want to bother supporting platforms where byte is not 8-bit (a reasonable choice I would say), use `int8_t`/`uint8_t` instead. Those types won't exist on platforms that don't have 8-bit bytes.

You'll have the issue that, as one of the commenters explained above, `char` is its own thing, independent and separate from `signed char` and `unsigned char` to say nothing of `int8_t` and `uint8_t`. This means that while you can use your own thing for your own functions you can not do so if your values have to interact with libc functions (or most of the ecosystem at large).

If you only want to support platforms using 8-bit chars, you should check CHAR_BIT. That is actually reliable and correct.


> C doesn't have generics, so it has no choice but to make `sizeof` somehow special

It could have used a different syntax though. Ada has a special syntax for compile-time inquiries like this, so there's no way to confuse them with function calls. Ada calls these attributes.

https://en.wikibooks.org/wiki/Ada_Programming/Attributes#Lan...


> Python only has one integer type, which is a signed bigint. Compared to C/C++, this renders moot all discussions about bit widths, signedness, and conversions – one type rules all the code. But the price to pay includes slow execution and inconsistent memory usage.

Well, the beauty of C is that you can have that too, if you wish, and you have many options to choose from.



