Saturday, January 30, 2010

Weird bugs due to gcc 4.4 and strict aliasing

I was just running the unit tests for GMime and I got a couple of failures on my openSUSE 11.2 machine that I have never gotten before. I started debugging and I noticed something very odd. One of my functions that returned a linked-list of 'word' tokens was returning NULL for something that it should not be returning NULL from, especially since I had just stepped through that method and seen with my own eyes that it was creating and appending nodes to my linked list as it should!

At this point I split out a tiny test case:

#include <stdio.h>
#include <stdlib.h>

typedef struct _node {
    struct _node *next;
    int value;
} Node;

int main (int argc, char **argv)
{
    Node *list, *node, *tail;
    int i;
    
    list = NULL;
    tail = (Node *) &list;
    
    for (i = 0; i < 10; i++) {
        node = malloc (sizeof (Node));
        node->next = NULL;
        node->value = i;
        
        tail->next = node;
        tail = node;
    }
    
    if (list == NULL)
        printf ("oops, list is null???\n");
    
    return 0;
}

I then built this test program using gcc -Wall -g -o list-test list-test.c. When I ran the program, no error.

Huh.

Luckily I thought to rebuild with -O2. This time when I ran my test program, it printed out the error I expected. The list was NULL.

Doing a bit of research suggests that gcc 4.4 has decided to enforce strict aliasing for -O2 and higher, which explains why it works without -O2.

Oddly enough, I get no strict aliasing warnings from gcc even when I use -Wall -O2.

So be warned... gcc 4.4 may break your perfectly valid code in mysterious ways.

Update: Looks like MySQL had similar problems and according to this article, it is suggested that the gcc developers have interpreted the c99 specification far too strictly.

Update 2: Josh's comment had an excellent explanation of the problem this code hits, so I thought I'd add it to my blog post for anyone who may otherwise miss his comment:

I imagine that gcc's aliasing thinks that dereferencing a Node* can only write to Node objects, so it won't have any effect on Node*s like list. Thus it thinks that list is never modified in that function, and it propagates the NULL constant throughout.

I also liked Josh's explanation of why the above code works (since a number of people have been confused by it already, I'll post that too):

The only reason it "works" in the -O0 case is because the next field happens to be the first field, as Fabian pointed out. The first time you write to tail->next, you're writing to &list with no offset, so list happens to get the value of node. When next does have an offset, then the value of list is not changed, but something else on the stack after it will be clobbered.

Code Snippet Licensing

All code posted to this blog is licensed under the MIT/X11 license unless otherwise stated in the post itself.