
From the comments:

> C "strings" work the way they do because C is a low level language, where you want to be able to do low-level things when necessary. It's a feature, not a deficiency.

Are NUL-terminated strings really considered preferable, even for low-level work? I always just considered them an unfortunate design choice C was stuck with.

Many O(1) operations/checks become O(n) because you have to linearly traverse the entire string (or keep a second pointer) to know where it ends/how long it is; you can't take a substring within another string without reallocating and copying that part over with a new NUL appended at the end; you can't store data that may contain a NUL (which text shouldn't, in theory, but then you need a separate approach for binary data); and plenty of security issues arise from missing or extra NULs.
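
To make that concrete, a quick sketch in plain C (purely illustrative):

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        const char *s = "hello, world";

        /* Finding the length means walking the buffer until the NUL. */
        size_t len = strlen(s);              /* O(n) every time you ask */

        /* You can't point at "hello" inside s as a string of its own:
         * there is no NUL after the 'o', so you must copy and terminate. */
        char word[6];
        memcpy(word, s, 5);
        word[5] = '\0';

        printf("%zu %s\n", len, word);       /* prints: 12 hello */
        return 0;
    }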

C's design is probably the most post-hoc rationalized thing in the world, second only to Abrahamic scripture.

"Of course the null-terminated strings of C are more low-level than the length-prefixed strings of Pascal, because the elders of C wisely designed them to be so." Alternatively: something is considered low-level because it works like C, because C semantics have simply become the universal definition of "low-level", regardless of any mismatch with the actual machine.

Likewise, maybe it's not such a good idea that UNIXv6 or other educational unix-likes are used in operating systems classes at universities. The knowledge is widely applicable, sure, but that's not the point of that education. Maybe we should use a Japanese or German clone of some IBM mainframe system instead, so that people actually get exposed to different ideas, instead of slightly simpler and less sophisticated versions of the ideas they are already familiar with. Too much unix-inbreeding in CS education isn't good.


I agree there's a teaching problem happening somewhere. I'm not sure I blame CS education, since I'd wager that most developers don't have a formal CS background.

However, I too regularly come across people who believe some or all of the following:

- "Everything is ultimately just C"

- "All other languages just compile to C, so you should use it to be fast"

- "C is faster because it's closer to bare metal"

- "C is fast because it doesn't need to be interpreted unlike all other languages"

The special elevated position of C, as some kind of "ground truth" of computers, is bizarre. It leads to all kinds of false optimizations by practitioners of other languages, out of misplaced confidence in the speed of C relative to all other languages.

The idea that C is "naturally faster" due to being some kind of representation of a computer that no other language could achieve is a hard myth to shake.


Especially when C advocates tend to ignore the history of systems programming languages predating C by a decade, a history C's authors set aside because they decided it was cooler to do their own thing. Notice a similar pattern with other languages?

> Although we entertained occasional thoughts about implementing one of the major languages of the time like Fortran, PL/I, or Algol 68, such a project seemed hopelessly large for our resources: much simpler and smaller tools were called for. All these languages influenced our work, but it was more fun to do things on our own.

-- https://www.nokia.com/bell-labs/about/dennis-m-ritchie/chist...

And using Pascal as a counter-example gets tiresome: not only was it never designed for systems programming, most of its dialects also fixed those issues, including its revised report (ISO Extended Pascal). By 1978 Niklaus Wirth had created Modula-2, based on Mesa (Xerox PARC's replacement for their use of BCPL), and neither of those ever had a problem with string lengths.


Well it's just the common name for that particular string representation, even though it certainly existed before Pascal - just like C did not invent null-terminated strings, either.

The name has nothing to do with the insecure way it was implemented in C.

Zero-termination is not lower level than having a separate size, or an end pointer.

What is low level is deciding on a memory representation and working with it directly. A high-level language will just have a "string" object; its internal representation is hidden from the programmer and could potentially change between versions.

In C, "string" has a precise meaning: it is a pointer to a statically allocated array of bytes containing the characters 's', 't', 'r', 'i', 'n', 'g' followed by a zero. That is the low-level part: C programmers manipulate the memory directly and need such guarantees. Had it been defined as the number of characters in 4 bytes followed by each character in 2 bytes, in native endianness, it would be just as low level. Defining it as "it is a character string, use the standard library and don't look too closely", as is the case in Java, is high level.
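
For example, both of these are equally "low-level" layouts for the same text; neither hides the representation from the programmer (a sketch, not any real ABI):

    const char nul_terminated[]        = { 's', 't', 'r', 'i', 'n', 'g', '\0' };
    const unsigned char len_prefixed[] = {  6,  's', 't', 'r', 'i', 'n', 'g' };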

The "feature" is that the memory representation of strings is well defined. The choice of zero-termination has some pros and cons.

Note that in many cases you can use size+data instead, with the mem* functions rather than the str* ones. And you can also print such strings with "%.*s" in printf(). Not ideal, but workable.
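
Something like this, say (the struct is made up; memcmp and the "%.*s" trick are standard):

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical size+data string, illustrative only. */
    struct str {
        size_t len;
        const char *data;
    };

    int main(void) {
        struct str s = { 5, "hello, world" };   /* a "view" of the first 5 bytes */

        /* mem* functions take explicit lengths, so no terminator is needed
         * and embedded NULs are fine. */
        if (memcmp(s.data, "hello", s.len) == 0)
            puts("prefix matches");

        /* "%.*s" prints exactly s.len bytes; the cast matters because the
         * precision argument must be an int. */
        printf("%.*s\n", (int)s.len, s.data);
        return 0;
    }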


Yeah, it was clearly an old design mistake. There's never a situation now where null-terminated strings make more sense than length-prefixed ones. I'm dubious they were ever better.

Null-terminated strings make writing parsers really clean. The null byte becomes just another character for your parser code to check against, so you don't need a separate check for the length (usually checks are inclusive: is it character 'a'? if not, defer to the caller; so the check for the null byte can happen in a single ___location, whereas a length check would need to happen in every function). And it means you have lots of *ptr++ spread around your code, rather than having to pass around a struct and modify it, or call methods on it.
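
Roughly this style (a made-up toy parser, just to illustrate the point):

    #include <ctype.h>
    #include <stdio.h>

    /* Parse an unsigned decimal number; stops at the first non-digit.
     * The terminating '\0' is simply "not a digit", so there is no
     * separate end-of-input check anywhere. */
    static unsigned parse_number(const char **p) {
        unsigned n = 0;
        while (isdigit((unsigned char)**p))
            n = n * 10 + (unsigned)(*(*p)++ - '0');
        return n;
    }

    int main(void) {
        const char *p = "42abc";
        unsigned n = parse_number(&p);
        printf("parsed %u, stopped at \"%s\"\n", n, p);  /* parsed 42, stopped at "abc" */
        return 0;
    }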

It's really not hard to check the length. Checking for null bytes also adds a memory data dependency that can make SIMD awkward. And it makes strlen O(n), which is kinda shit - for example, it led to that famous GTA5 accidental O(n^2) bug.
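
The underlying pattern is easy to reproduce (a generic sketch, not the actual GTA code):

    #include <string.h>

    /* Counts commas in a buffer. Because strlen(s) re-walks the whole
     * string on every iteration, the loop is O(n^2) instead of O(n). */
    static size_t count_commas_slow(const char *s) {
        size_t count = 0;
        for (size_t i = 0; i < strlen(s); i++)   /* strlen re-evaluated each pass */
            if (s[i] == ',')
                count++;
        return count;
    }

The fix is to hoist strlen out of the loop, or better, carry the length around in the first place.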

For situations where a null terminator really is better it's easy to add them to a length-prefixed string, whereas the reverse is not true.
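
E.g., if you control the allocation, you can keep the length and still reserve a trailing '\0', so the cheap direction costs nothing (a sketch, error handling omitted):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical length-prefixed string that also keeps a trailing NUL,
     * so handing it to a str*-style API is free. Illustrative only. */
    struct lpstr {
        size_t len;
        char *data;              /* len bytes of text, then '\0' */
    };

    static struct lpstr lpstr_from(const char *src, size_t len) {
        struct lpstr s = { len, malloc(len + 1) };   /* error handling omitted */
        memcpy(s.data, src, len);
        s.data[len] = '\0';      /* length-prefixed -> NUL-terminated, O(1) */
        return s;
    }

    int main(void) {
        struct lpstr s = lpstr_from("hello, world", 5);
        printf("%s\n", s.data);  /* usable directly by anything expecting a C string */
        free(s.data);
        return 0;
    }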

They clearly got this wrong.


They have one advantage, which is saving 3 bytes of memory (depending on what you decide your max supported string length should be) per string. It's hard to imagine an environment where that's a worthwhile tradeoff, even in the most constrained embedded systems (where you can probably get away with a 16-bit length field and thus only save one byte), but they're not completely without merit.


