Saturday, January 30, 2010

Weird bugs due to gcc 4.4 and strict aliasing

I was just running the unit tests for GMime and I got a couple of failures on my openSUSE 11.2 machine that I have never gotten before. I started debugging and I noticed something very odd. One of my functions that returned a linked-list of 'word' tokens was returning NULL for something that it should not be returning NULL from, especially since I had just stepped through that method and seen with my own eyes that it was creating and appending nodes to my linked list as it should!

At this point I split out a tiny test case:

#include <stdio.h>
#include <stdlib.h>

typedef struct _node {
    struct _node *next;
    int value;
} Node;

int main (int argc, char **argv)
{
    Node *list, *node, *tail;
    int i;
    
    list = NULL;
    tail = (Node *) &list;
    
    for (i = 0; i < 10; i++) {
        node = malloc (sizeof (Node));
        node->next = NULL;
        node->value = i;
        
        tail->next = node;
        tail = node;
    }
    
    if (list == NULL)
        printf ("oops, list is null???\n");
    
    return 0;
}

I then built this test program using gcc -Wall -g -o list-test list-test.c. When I ran the program, no error.

Huh.

Luckily I thought to rebuild with -O2. This time when I ran my test program, it printed out the error I expected. The list was NULL.

Doing a bit of research suggests that gcc 4.4 has decided to enforce strict aliasing for -O2 and higher, which explains why it works without -O2.

Oddly enough, I get no strict aliasing warnings from gcc even when I use -Wall -O2.

So be warned... gcc 4.4 may break your perfectly valid code in mysterious ways.

Update: Looks like MySQL had similar problems and according to this article, it is suggested that the gcc developers have interpreted the c99 specification far too strictly.

Update 2: Josh's comment had an excellent explanation of the problem this code hits, so I thought I'd add it to my blog post for anyone who may otherwise miss his comment:

I imagine that gcc's aliasing thinks that dereferencing a Node* can only write to Node objects, so it won't have any effect on Node*s like list. Thus it thinks that list is never modified in that function, and it propagates the NULL constant throughout.

I also liked Josh's explanation of why the above code works (since a number of people have been confused by it already, I'll post that too):

The only reason it "works" in the -O0 case is because the next field happens to be the first field, as Fabian pointed out. The first time you write to tail->next, you're writing to &list with no offset, so list happens to get the value of node. When next does have an offset, then the value of list is not changed, but something else on the stack after it will be clobbered.

33 comments:

Fabian said...

Hmm I really had to think quite a while to figure out what the code was doing.
Which is probably because I program in C++ and C-style cast are frowned.

So is this really good programming style, relying on the memory layout of the node structure?

I mean it e.g. will also break immediately if you change the structure to:

typedef struct _node {
int x;
struct _node *next;
int value;
} Node;

So at least add a BIG note, the next pointer must be the very first member of the node struct.

Marcus Meissner said...

Your code is broken actually.

&list will give you a Node** pointer, not a Node* ...

Your initial setup of "tail" should be different, otherwise you might even overwrite random stack values.

just think, what is "tail->next" pointing at during the first iteration.

Richard said...

Or just avoid this by not doing

Node *list, *node, *tail;
tail = (Node *) &list; // wtf?


but instead

Node list, *node, *tail;
tail = &list;
tail->next = node;

which would have worked fine.

Learn some C.

Josh said...

This cast should be a warning sign: "tail = (Node *) &list;" The tail is a Node*, but &list is a Node**!

The only reason it "works" in the -O0 case is because the next field happens to be the first field, as Fabian pointed out. The first time you write to tail->next, you're writing to &list with no offset, so list happens to get the value of node. When next does have an offset, then the value of list is not changed, but something else on the stack after it will be clobbered.

I imagine that gcc's aliasing thinks that dereferencing a Node* can only write to Node objects, so it won't have any effect on Node*s like list. Thus it thinks that list is never modified in that function, and it propagates the NULL constant throughout.

Josh said...

I know this code is just an example, but here's how I would fix it:

Node **tail;
...
tail = &list;
...
for (...) {
...
*tail = node;
tail = &node->next;
}

So tail really tracks the next pointer of the last element, which starts out as the pointer to whole list itself.

Josh said...

Sorry to keep replying... I realized after re-reading this that you probably already understood just fine what I tried to explain. I still think code like that is bad form, but there's always -fno-strict-aliasing for you.

flo said...

This looks really hacky, and i don't think it's a gcc bug ... this is clearly against the strict aliasing rule.

Why not just use a stack variable:

Node list;
tail=&list;

Then return list.next as the pointer to the list.

ben said...

Sorry, I cannot follow. Where does your code violate the strict aliasing rules? Your code does not reach the printf when I run it via gcc 4.4.2 at any -O level.

Jeffrey Stedfast said...

Marcus:

You're wrong. The code is not incorrect. On the first iteration, tail->next points to list.

tail->next = node will set the list pointer to node.

Jeffrey Stedfast said...

Richard: see my response to Marcus. My C code is fine.

In fact, if you add the following code before returning 0 in the example program I pasted, you'll see for yourself:

printf ("list = ");
for (node = list; node; node = node->next)
printf ("%d->", node->value);
printf ("NULL\n");

prints:
list = 0->1->2->3->4->5->6->7->8->9->NULL

Jeffrey Stedfast said...

Josh: yea, that's probably how I will end up fixing it in the long run. In the short term I've simply added -fno-strict-aliasing to the build.

Jeffrey Stedfast said...

Ben:

Odd. If I use -O2, list is NULL after the loop with gcc 4.4.1 [gcc-4_4-branch revision 150839]

In any case, Josh's explanation of the problem is dead on afaict.

Damien said...

I disagree, the code IS bad, it uses a trick to work.

Node *list, *node, *tail;
list = NULL;
tail = (Node *)&list;

Here, "tail", which is a pointer, has a value of "&list", which is the address of the pointer "list".

So "tail" points to an address where there's only space for a pointer, in itself it's not very safe.

Ok, the trick then, in the for loop:

tail->next = node;

Here, "tail" still points to the "list" pointer, tail->next is the same as "*tail" because "next" is the first field of the structure and it's a similar pointer too.

"tail = &list" from the previous pre-for-loop line, so:
"tail->next = node" is the same as:
"*tail = node", which is the same as:
"*(&list) = node", which translates in the end in:
"list = node".

For me, this is a trick. Any code that need a cast to compile cleanly with "-Wall -Wextra" has a chance to be buggy.

This code uses a *trick*, so if it breaks, it's not that odd.
I could add that the code is not very explicit so it's easy to make mistakes or forget that you shall always have "next" as the first attribute of the structure.

There's two ways to write the same thing without tricks, the first is to modify "list" in the for loop when needed, the second is to use a variable to act like a "first" element.

Both ways works with gcc 4.4, don't use any cast and compile with no real warning with "-Wall -Wextra".
Agreed, the first adds two tests to the loop, and the second takes a little more memory.
I would go for the second one if performance in this code was the bottleneck of my program.

==== 1 ====

int main (int argc, char **argv)
{
Node *list = NULL, *tail = NULL, *node = NULL;
int i;

for (i = 0; i < 10; i++) {
node = malloc (sizeof (Node));
node->next = NULL;
node->value = i;

if(tail != NULL)
tail->next = node;
tail = node;
if(list == NULL)
list = tail;
}

if (list == NULL)
printf ("oops, list is null???\n");

return 0;
}

==== 2 ====

int main (int argc, char **argv)
{
Node first;
Node *list, *node, *tail;
int i;

first.next = NULL;
tail = &first;

for (i = 0; i < 10; i++) {
node = malloc (sizeof (Node));
node->next = NULL;
node->value = i;

tail->next = node;
tail = node;
}

list = first.next;

if (list == NULL)
printf ("oops, list is null???\n");

return 0;
}

Jeffrey Stedfast said...

Damien:

So... the code is bad because it relies on assumptions that we can guarantee to be true?

Yes, the loop relies on the fact that the first member of Node is the next pointer. However, the next pointer is the first member. We have control of that.

Damien said...

I understand that you expect it to work just fine.
But on the other hand modern compilers do apply a lot of rules to the given source code before creating a binary.
And I belive myself that a non-tricky code is better for maintainability anyway.

In the end, I think that everyone has a point, it's the c99 spec that will decide which behaviour to enforce.

Jeffrey Stedfast said...

Damien:

> And I belive myself that a non-tricky
> code is better for maintainability
> anyway.

That's a fair argument, but it's not the same as saying code is broken ;-)

taschenorakel said...

Jeffrey, you have a very strange definition of "correct code". I'd prefer to never work in the same team with you, cause I'd hate to debug your weird and broken hacks.

Jeffrey Stedfast said...

taschenorakel:

It's not my fault you don't understand pointers ;-)

Jeff Walden said...

I agree that Josh's way is better and, I would argue, canonical for this sort of trick. I'm sure I've seen it before, and it's exactly what I came up with when I decided to work out a fix prior to reading comments here.

Josh said...

RE Update 3: You should know that with optimized code, gdb is going to be pretty misleading. Think about how the unroll will look with 2 iterations:

list = NULL;
tail = (Node *) &list;
// loop 0
node = malloc (sizeof (Node));
node->next = NULL;
node->value = 0;
tail->next = node;
tail2 = node;
// loop 1
node2 = malloc (sizeof (Node));
node2->next = NULL;
node2->value = 1;
tail2->next = node2;
tail3 = node;
// ending
if (list == NULL)
printf ("oops, list is null???\n");

I've given a new variable name for each new assignment. Because of the strict aliasing, the NULL list is propagated to the "if", which is now constant so it needs no code generated. Also, tail and tail2 are only used in one place, so we can propagate them. And tail3 can be dropped.

// loop 0
node = malloc (sizeof (Node));
node->next = NULL;
node->value = 0;
((Node *) &list)->next = node;
// loop 1
node2 = malloc (sizeof (Node));
node2->next = NULL;
node2->value = 1;
node->next = node2;
// ending
printf ("oops, list is null???\n");

With that, you have roughly the sequence gdb showed you. Check out the disassembled code to see what's really going on.

Jeffrey Stedfast said...

Josh:

Good point, gdb might not be correctly giving me what it's really doing there due to optimizing out temporary variables.

I'll have to check the asm later.

Anonymous said...

Jeffrey, I'm the author of the article you reference talking about the GCC interpretation of the aliasing rules. First thing I should point out is that Mysql had a problem because they actually broke the rules, not because of GCC.

That being said, according to the standard, your code example here is legal. Those people calling it "broken" simply don't know what they're talking about.

Relevant section of the C99 spec is 6.5 paragraph 7; values may be accessed via "an aggregate or union type that includes" a type compatible to the declared type. In this case you're accessing a "Node *" via a pointer to "Node", which is fine because Node is an aggregate type that includes a "Node *" member.

Jeffrey Stedfast said...

davmac: thanks for the info.

I've submitted a bug report to the gcc devs but they've marked it INVALID :-(

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42907

ben said...

> Relevant section of the C99 spec is 6.5 paragraph 7; values may be accessed via "an aggregate or union type that includes" a type compatible to the declared type

I do not understand how that is even relevant, tail->next is an lvalue expression of type Node*, list is a stored object of declared and thus effective type Node*.

Aliasing restrictions should not even play into this. And it is guaranteed that the struct's first element is right at the beginning of the struct, so I really see no wiggle room to declare this undefined behaviour.

Anonymous said...

The code does not comply with C99, and therefore the compiler can do whatever it wants, as per section 3.4.3 and section 4 para 2.

Regarding compliance, the relevant section of C99 is indeed 6.5 para 7, but you need to read the whole thing. The relevant access is the dereference of &list (i.e. a Node **) via a Node *. The assignments of next pointers, or anything else involving aggregate types, are irrelevant. It'd be similarly non-conformant if list and tail were int * types, for example.

The problem is that &list has type Node **, not Node *, as Josh pointed out. At this point if you dereference the resulting pointer then all bets are off; that it works at -O0 is lucky (or unlucky, depending on your point of view).

The important bullet in 6.5p7 is the first one - "a type compatible with the effective type of the object", which we contravene. The remaining clauses (we must only satisfy one of them) do not apply: we aren't using qualifiers (#2), we are using integer types (#3, #4), we aren't using an aggregate type - Node * and Node ** are pointer types, not structs or unions; that they point to aggregate types is irrelevant (#5), and we aren't using character types (#6).

Saying "tail = (Node *)&list; ...; tail->next = node" therefore does not comply with 6.5p7, and is therefore not valid C. By 3.4.3 and 4p2, gcc could reformat your hard drive, or anything else that it feels like doing.

flo said...

ben: but i think the cast from Node** to Node* is invalid (or the dereferencation of the resulting pointer). so i think gcc is still (and rightfully) thinking that Node** cannot alias Node*

Anonymous said...

prqjkuemdz2dbxtf - you are wrong.

You are correct that the declared type of "&list" is "Node **", but you are wrong that the object consisting of that "Node **" is ever dereferenced.

The assignment:

tail = &list;

... takes the "Node **", casts it to "Node *" (effectively creating a new object) and then copies that new object into "tail".

Eamon Nerbonne said...

The very fact that this discussion is possible in the first place suggests to me that this code is poor practice; whatever the spec says, code should avoid being confusing for no purpose (and there are alternatives which work just fine).

Nevertheless, it's also worrying that gcc isn't warning you about a bug it should trivially be able to detect. Read/write dependencies are presumably standard fare for compilers, and it should be able to detect the fact that you're making an invalid cast and then writing to the result (which, due to strict aliasing, is compile time known to be invalid). A user friendly error message might even take into consideration the shape of the struct (to differentiate between nonsense code and invalidly aliasing code).

If a compiler won't even warn you about broken code, it's a broken compiler, IMHO.

mmeeks said...

Urgh; I'm with gcc on this - the optimisations you can do with good aliasing detection are -so- important - that to turn them off to allow this sort of gross-ness to work would be really silly IMHO etc. :-)
Writing code that can be easily read, debugged and understood - turns out to have a bonus: the compiler can understand it too :-)

Anonymous said...

davmac: "The assignment: tail = &list; ... takes the "Node **", casts it to "Node *" (effectively creating a new object) and then copies that new object into "tail".
"

Casting doesn't "create a new object" as you suggest - it just overrides type-checking (in this case). The fact remains that you're taking a pointer to X (aka Node **), casting it to a pointer to Y (aka Node * - they're type-incompatible; that they're one level of indirection apart is irrelevant), and then dereferencing it via the -> operator. That's undefined.

At no point do you actually have a pointer-to-a-Node. You have what you're *claiming* (via the cast) is a pointer-to-a-Node, but it isn't.

Eamon Nerbonne: "[the compiler] should be able to detect the fact that you're making an invalid cast". Indeed: the compiler can detect (via type checking) that what you're attempting is illegal. The cast is telling the compiler to shut up, because I know best. Without the cast it won't compile, and neither should it.

Anonymous said...

prqjkuemdz2dbxtf: if you're going to bandy the C99 standard about in support of your argument, you'd better read it first.

To quote your earlier post,
"Regarding compliance, the relevant section of C99 is indeed 6.5 para 7, but you need to read the whole thing."

Too right you do. 6.5p7 concerns access to objects: "An object shall have its stored value accessed only by an lvalue expression that has one of
the following types:"

As it happens the term object is clearly defined by the standard (in the definitions part, funnily enough). If you read it you'll find that "&list" is not an object (it yields a value which might then be stored in an object, sure, but that's another matter).

You say: "The relevant access is the dereference of &list (i.e. a Node **) via a Node *"

But, I repeat, "&list" is never dereferenced and in fact "&list" is not an object (check the definition), which means 6.5p7 doesn't apply in the sense you seem to think it does. You can't dereference one pointer "via" another one. They are two separate pointers. We could argue about whether the result of casting a "Node **" to a "Node *" can possibly be a valid pointer but 6.5p7 doesn't say anything about it. The result of the cast is a value which happens to be the same (if you look at it bitwise) as "&list", and that value is then stored in the "tail" variable.

Even the GCC devs, who decided to ignore 6.5p7, at least understand what it means - the decision to ignore it was/is a conscious one.

In your recent post: "Casting doesn't "create a new object" as you suggest"

... alright, fine, because there is no object anyway. Assigning the result of the cast to "tail" is what "creates a new object". This is what you're not getting.

Eamon: the code might not be the best practice, but the C99 standard is pretty clear on this issue.

mmeeks: Following 6.5p7 doesn't imply you need to disable all type-based alias analysis, not by a long shot. Also, you seem to imply that it's better to generate optimized code than correct code. Your complaint is a (valid) criticism of the C language, but your suggestion that GCC is doing the right thing is plain wrong.

Anonymous said...

"if you're going to bandy the C99 standard about in support of your argument, you'd better read it first."

Indeed - it would be rather silly not to. That you feel qualified to suggest that I haven't is disappointing.


"If you read it you'll find that "&list" is not an object"

Indeed - it's an expression (6.5.3).


"But, I repeat, "&list" is never dereferenced"

Yes it is.

tail = (Node *)&list
...
tail -> ...

The "->" is a dereference of &list. That we have a type cast en route doesn't alter that fact.


"They are two separate pointers."

No, there's one. We're referring to it by two (incompatible) types, but there's only ever one pointer.


"The result of the cast is a value which happens to be the same (if you look at it bitwise) as "&list""

The standard doesn't specify how a pointer type is represented; making any assumption whatsoever about its "bitwise" representation is therefore outside the standard. Your statement /might/ be correct for some given C implementation, but there's nothing in the standard that guarantees that. As you suggest, reading the standard is illuminating. (Or not, because it doesn't illuminate what it doesn't specify.)


"Assigning the result of the cast to "tail" is what "creates a new object"."

No. An "object", as defined in section 3.14 (as you graciously refer me to) is a "region of data storage in the execution environment". To create a new object we therefore need a new region of data storage; no assignment to any pointer type does this. You don't create objects via assignment.


"This is what you're not getting."

That I don't "get" something that is not true doesn't unduly concern me.

Anonymous said...

prqjkuemdz2dbxtf:

This is the last time, hopefully I can persuade you, or at least help others not make the same mistakes.

"Indeed - it's an expression (6.5.3)"

So 6.5p7 doesn't apply to it, but before you insisted that it did.

(Actually, its value might be stored in a temporary object. But that doesn't change anything, because that object is only accessed via its effective type; the key point is yet to come).

"Yes it is."

No, it's not.

"No, there's one. We're referring to it by two (incompatible) types, but there's only ever one pointer."

This is really a quibble about the definition of "pointer".

Here's the key point:

Nowhere in the whole standard does it say that the conversion from one pointer type to another yields the same pointer. Not even direct assignment yields the same pointer - it yields two pointers with the same value.

Section 6.7.5.1 (if you read it carefully) defines variables which are declared to be of pointer type to be pointers; i.e. a variable declared as a pointer is a pointer. If I point two of them at the same object via assignment from one to the other then I still have two pointer variables, and therefore two pointers.

Assigning via a cast makes no difference; I still have two pointers.

"The standard doesn't specify how a pointer type is represented; making any assumption whatsoever about its "bitwise" representation is therefore outside the standard."

Yes I know, but that's irrelevant anyway. I had assumed you thought the pointer was the "same" after the cast because it was the same "value", and was trying to explain that the same bitwise representation doesn't mean it's the same pointer. It's irrelevant to the argument though.

"No. An "object", as defined in section 3.14"

See 6.7.5p2, "Each declarator declares one identifier, and asserts that when an operand of the same form as the declarator appears in an expression, it designates a function or object with the scope, storage duration, and type indicated by the declaration specifiers."

The emphasis is mine, but it clearly says that a variable declaration designates a object. Variable assignment, then, stores a value in the object; I'll grant that this is not specifically the same as creating the object - it's the declaration which did that. The point is that the "tail" variable designates its own distinct object. "&list" is NOT the same object even if its value is copied via assignment.

"To create a new object we therefore need a new region of data storage; no assignment to any pointer type does this"

An assignment can be part of a declaration, and a declaration does allocate a region of data storage. As it happens, the code we are talking about could have been written that way (i.e. declaration and assignment at the same time). My mistake for temporarily forgetting that the declaration and assignment were separate; but it doesn't change the validity of my argument.

"That I don't "get" something that is not true doesn't unduly concern me."

My point was that you're not getting something that is true.

Code Snippet Licensing

All code posted to this blog is licensed under the MIT/X11 license unless otherwise stated in the post itself.