Sample Video Frame

Created by Zed A. Shaw Updated 2024-02-17 04:54:36
 

Exercise 31: Common Undefined Behavior

At this point in the book, it's time to introduce you to the most common kinds of UB that you will encounter. C has 191 behaviors that the standards committee has decided aren't defined by the standard, and therefore anything goes. Some of these behaviors are legitimately not the compiler's job, but the vast majority are simply lazy capitulations by the standards committee that cause annoyances, or worse, defects. An example of laziness:

An unmatched ‘ or ” character is encountered on a logical source line during tokenization.

In this instance, the C99 standard actually allows a compiler writer to fail at a parsing task that a junior in college could get right. Why is this? Who knows, but most likely someone on the standards committee was working on a C compiler with this defect and managed to get this in the standard rather than fix their compiler. Or, as I said, simple laziness.

The crux of the issue with UB is the difference between the C abstract machine, defined in the standard and real computers. The C standard describes the C language according to a strictly defined abstract machine. This is a perfectly valid way to design a language, except where the C standard goes wrong: It doesn't require compilers to implement this abstract machine and enforce its specification. Instead, a compiler writer can completely ignore the abstract machine in 191 instances of the standard. It should really be called an "abstract machine, but", as in, "It's a strictly defined abstract machine, but..."

This allows the standards committee and compiler implementors to have their cake and eat it, too. They can have a standard that is full of omissions, lax specification, and errors, but when you encounter one of these, they can point at the abstract machine and simply say in their best robot voice, "THE ABSTRACT MACHINE IS ALL THAT MATTERS. YOU DO NOT CONFORM!" Yet, in 191 instances that compiler writers don't have to conform, you do. You are a second class citizen, even though the language is really written for you to use.

This means that you, not the compiler writer, are left to enforce the rules of an abstract computational machine, and when you inevitably fail, it's your fault. The compiler doesn't have to flag the UB, do anything reasonable, and it's your fault for not memorizing all 191 rules that should be avoided. You are just stupid for not memorizing 191 complex potholes on the road to C. This is a wonderful situation for the classic know-it-all type who can memorize these 191 finer points of annoyance with which to beat beginners to intellectual death.

There's an additional hypocrisy with UB that is doubly infuriating. If you show a C fanatic code that properly uses C strings but can overwrite the string terminator, they will say, "That's UB. It's not the C language's fault!" However, if you show them UB that has while(x) x <<= 1 in it, they will say, "That's UB idiot! Fix your damn code!" This lets the C fanatic simultaneously use UB to defend the purity of C's design, and also beat you up for being an idiot who writes bad code. Some UB is meant as, "you can ignore the security of this since it's not C's fault", and other UB is meant as, "you are an idiot for writing this code," and the distinction between the two is not specified in the standard.

As you can see, I am not a fan of the huge list of UB. I had to memorize all of these before the C99 standard, and just didn't bother to memorize the changes. I'd simply found a way to avoid as much UB as I possibly could, trying to stay within the abstract machine specification while also working with real machines. This turns out to be almost impossible, so I just don't write new code in C anymore because of its glaringly obvious problems.

NOTE: The technical explanation as to why C UB is wrong comes from Alan Turing:

  1. C UB contains behaviors that are lexical, semantic, and execution based.
  2. The lexical and semantic behaviors can be detected by the compiler.
  3. The execution-based behaviors fall into Turing's definition of the halting problem, and are therefore NP-hard.
  4. This means that to avoid C UB, it requires solving one of the oldest proven unsolveable problems in computer science, making UB effectively impossible for a computer to avoid.
  5. This leaves it to you to avoid, and you're human who makes many mistakes, especially in C.

To put it more succinctly: "If the only way to know that you've violated the abstract machine with UB is to run your C program, then you will never be able to completely avoid UB."

Previous Lesson Next Lesson

Register for Learn C the Hard Way

Register today for the course and get the all currently available videos and lessons, plus all future modules for no extra charge.