C pointers - main rules

📄 Wiki page | 🕑 Last updated: May 25, 2023

There are four main rules about pointers in C. The first two are pretty basic rules that shouldn't be controversial:

You cannot use a pointer outside of its lifetime (e.g., use-after-free is UB).

You cannot advance a pointer from one object to another object (so out-of-bounds is UB, even if there is another live object there).

The third rule is one that causes issues, but needs to exist given how C code works in practice:

The pointer just past the end of the object is a valid pointer for the object, but it cannot be dereferenced. It may be identical in value to a pointer for another object, but even then, it still cannot be used to access the second object.

The final rule is simultaneously necessary for optimization to occur, not explicitly stated in C itself, and I'm stating vaguely in large part because trying to come up with a formal definition is insanely challenging:

You cannot materialize a pointer to an object out of thin air; you have to be "told" about it somehow.

So the immediate corollary of rule 4, the most obvious instantiation of it: if a variable never has its address taken, then no pointer may modify it without reaching UB. And that's why it's necessary to state: without this rule, then anything that might modify memory would be a complete barrier to optimizations. In a language without integer-to-pointer conversions, there is no way to violate this rule without also violating rules 1-3. But with integer-to-pointer conversions, it is possible to adhere to rules 1-3 and violate this rule, and thus this becomes an important headache for any language that permits this kind of transformation.

So how do we actually give it a formal semantics? Well, the first cut is the simple rule that no pointer may access a no-address-taken variable. Except that's not really sufficient for optimization purposes; in LLVM, all variables start with their address taken, so the optimizer needs to reason about when all uses of the address are known. So you take it to the next level and rule that so long as the address doesn't escape and you can therefore track all known uses, it's illegal for anyone to come up with any other use. So now you need to define escaping, and the classic definition suddenly shifts back to describing a data-dependent relationship.

Let me take a little detour. In the C++11 memory model, one of the modes that was introduced was the release/consume mode, which expressed a release/acquire relationship for any load data-dependent on the consume load. This was added to model the cases where you only need a fence on the Alpha processors. It turns out that no compiler implements this mode; all of them pessimize it to a release/acquire. That's because implementing release/consume would require eliminating every optimization that might not preserve data dependence, of which there is a surprising number. You could get away without doing that if you first proved that the code wasn't in a chain that required preserving data dependence, but that's not really possible for any peephole-level optimization.

And this is where the tension really comes into play. For pointers, it's easy to understand that preserving data dependence is necessary, and special-case them. But now your semantics to adhere to rule 4 also says that you need to do the same to integers, which is basically a non-starter for many optimizations. So the consequence is that the burden of the mismatch needs to lie on integer-to-pointer conversions (which, as I've established before, is already the element that causes the pain in the first place; additionally, in terms of how you compute alias analysis internally in the compiler, it's also where you're going to be dealing with the fallout anyways).

In summary, as you work through the issues to develop a formal semantics, you find that a) pointers have provenance, and need to have some sort of provenance; b) compilers are unwilling to give integers provenance; c) therefore pointers aren't integers, and everything assuming such is wrong (this affects both user code and compiler optimizations!); and d) this is all really hard and at the level of needing academic-level research into semantics.

Is N2676 the final word on pointer provenance? No, it's not; as I said, it's hard and there's still more research that needs to be done on different options. The status quo, in terms of semantics, is broken. The solution needs to minimize the amount of user code that is broken. Maybe N2676 is that solution; maybe it isn't. But to refuse clarification of the situation is unacceptable, and suggests to me noncomprehension of the (admittedly complex!) issues involved.

(from the HN discussion on A Provenance-aware Memory Object Model for C [pdf])