A Moment of Zen: sorting

Showing posts with label sorting. Show all posts

Thursday, April 7, 2011

Optimizing Merge Sort

A number of years ago I wrote about the Merge Sort algorithm. One of the advantages of Merge Sort is that it is a stable sort, meaning that elements that compare as being equal remain in their original order after being sorted.

Well, today I had need of employing a stable sorting routine for sorting elements by a ZIndex in Moonlight. Up until today, we had been using qsort() which, while not guaranteed to be a stable sort on any platform, happens to be implemented in glibc as a stable sort except in out-of-memory conditions. Since we'd like Moonlight to work on platforms other than Linux+glibc (such as Mac OS or BSD), it has become important enough to implement properly.

To start, I dusted off my generic MergeSort() implementation from years ago when I was writing articles about various sorting algorithms. This is what I had to start with:

#define MID(lo, hi) (lo + ((hi - lo) >> 1))

static void
msort (void *array, void *buf, size_t low, size_t high, size_t size,
       int (* compare) (const void *, const void *))
{
    register char *lo, *hi, *b;
    char *al, *am, *ah;
    size_t mid;
    
    mid = MID (low, high);
    
    if (mid + 1 < high)
        msort (array, buf, mid + 1, high, size, compare);
    
    if (mid > low)
        msort (array, buf, low, mid, size, compare);
    
    ah = ((char *) array) + ((high + 1) * size);
    am = ((char *) array) + ((mid + 1) * size);
    al = ((char *) array) + (low * size);
    
    b = (char *) buf;
    lo = al;
    hi = am;
    
    while (lo < am && hi < ah) {
        if (compare (lo, hi) <= 0) {
            memcpy (b, lo, size);
            lo += size;
        } else {
            memcpy (b, hi, size);
            hi += size;
        }
        
        b += size;
    }
    
    if (lo < am)
        memcpy (b, lo, am - lo);
    else if (hi < ah)
        memcpy (b, hi, (ah + size) - hi);
    
    memcpy (al, buf, ah - al);
}

int
MergeSort (void *base, size_t nmemb, size_t size,
           int (* compare) (const void *, const void *))
{
    void *tmp;
    
    if (nmemb < 2)
        return 0;
    
    if (!(tmp = malloc (nmemb * size))) {
        errno = ENOMEM;
        return -1;
    }
    
    msort (base, tmp, 0, nmemb - 1, size, compare);
    
    free (tmp);
    
    return 0;
}

Since performance is very important, I clocked this implementation against qsort() and got the following results on my Intel Core2 Quad Q6600 2.4 GHz machine using arrays of 10 million ints:

Randomized input: 14.13s vs qsort()'s 6.77s
Sorted input: 4.41s vs qsort()'s 1.54s
Reversed input: 4.26s vs qsort()'s 1.90s

Clearly the above MergeSort() implementation did not fare well against glibc's qsort() on my system, so it was time to look at what I could do to improve the performance.

The most obvious optimization I could see was to try and batch my memcpy() calls. In other words, instead of calling memcpy() to copy each and every element into our temporary buffer, it'd be more efficient to copy blocks of elements at a time:

static void
msort (void *array, void *buf, size_t low, size_t high, size_t size,
       int (* compare) (const void *, const void *))
{
    char *al, *am, *ah, *ls, *hs, *lo, *hi, *b;
    size_t mid;
    
    mid = MID (low, high);
    
    if (mid + 1 < high)
        msort (array, buf, mid + 1, high, size, compare);
    
    if (mid > low)
        msort (array, buf, low, mid, size, compare);
    
    ah = ((char *) array) + ((high + 1) * size);
    am = ((char *) array) + ((mid + 1) * size);
    al = ((char *) array) + (low * size);
    
    b = (char *) buf;
    lo = al;
    hi = am;
    
    do {
        ls = lo;
        hs = hi;
        
        if (lo > al || hi > am) {
            /* our last loop already compared lo & hi and found lo <= hi */
            lo += size;
        }
        
        while (lo < am && compare (lo, hi) <= 0)
            lo += size;
 
        if (lo > ls) {
            memcpy (b, ls, lo - ls);
            b += (lo - ls);
        }
 
        if (lo < am) {
            /* our last compare tells us hi < lo */
            hi += size;
            
            while (hi < ah && compare (hi, lo) < 0)
                hi += size;
            
            memcpy (b, hs, hi - hs);
            b += (hi - hs);
        }
    } while (lo < am && hi < ah);
    
    if (lo < am)
        memcpy (b, lo, am - lo);
    else if (hi < ah)
        memcpy (b, hi, ah - hi);
    
    memcpy (al, buf, ah - al);
}

The results were promising. For the exact same inputs (including the exact same random array), we now get:

Randomized input: 10.45s
Sorted input: 2.08s
Reversed input: 2.03s

The only other way that we can reduce the number of memcpy() calls we make is to avoid copying leading and trailing elements into our temporary buffer if it's not necessary to merge them. Here's the solution I came up with:

static void
msort (void *array, void *buf, size_t low, size_t high, size_t size,
       int (* compare) (const void *, const void *))
{
    char *a1, *al, *am, *ah, *ls, *hs, *lo, *hi, *b;
    size_t copied = 0;
    size_t mid;
    
    mid = MID (low, high);
    
    if (mid + 1 < high)
        msort (array, buf, mid + 1, high, size, compare);
    
    if (mid > low)
        msort (array, buf, low, mid, size, compare);
    
    ah = ((char *) array) + ((high + 1) * size);
    am = ((char *) array) + ((mid + 1) * size);
    a1 = al = ((char *) array) + (low * size);
    
    b = (char *) buf;
    lo = al;
    hi = am;
    
    do {
        ls = lo;
        hs = hi;
        
        if (lo > al || hi > am) {
            /* our last loop already compared lo & hi and found lo <= hi */
            lo += size;
        }
        
        while (lo < am && compare (lo, hi) <= 0)
            lo += size;
        
        if (lo < am) {
            if (copied == 0) {
                /* avoid copying the leading items */
                a1 = lo;
                ls = lo;
            }
            
            /* our last compare tells us hi < lo */
            hi += size;
            
            while (hi < ah && compare (hi, lo) < 0)
                hi += size;
            
            if (lo > ls) {
                memcpy (b, ls, lo - ls);
                copied += (lo - ls);
                b += (lo - ls);
            }
            
            memcpy (b, hs, hi - hs);
            copied += (hi - hs);
            b += (hi - hs);
        } else if (copied) {
            memcpy (b, ls, lo - ls);
            copied += (lo - ls);
            b += (lo - ls);
            
            /* copy everything we needed to re-order back into array */
            memcpy (a1, buf, copied);
            return;
        } else {
            /* everything already in order */
            return;
        }
    } while (hi < ah);
    
    if (lo < am) {
        memcpy (b, lo, am - lo);
        copied += (am - lo);
    }
    
    memcpy (a1, buf, copied);
}

Once again, reducing the amount of copying paid off:

Randomized input: 9.80s
Sorted input: 0.95s
Reversed input: 2.05s

Update 2011-05-18: One final optimization that can be tried is pre-calculating the optimum way to copy elements between buffers. This calculation, while not terribly expensive itself, adds up with every call to memcpy(). Let's start off by writing some handy macros:

#define COPYBY(TYPE, a, b, n) {         \
    long __n = (n) / sizeof (TYPE);     \
    register TYPE *__a = (TYPE *) (a);  \
    register TYPE *__b = (TYPE *) (b);  \
                                        \
    do {                                \
        *__a++ = *__b++;                \
    } while (--__n > 0);                \
}

#define MEMCOPY(dest, src, n) {                 \
    switch (copy_mode) {                        \
    case 1: COPYBY (long, dest, src, n); break; \
    case 2: COPYBY (int, dest, src, n); break;  \
    default: memcpy (dest, src, n);             \
    }                                           \
}

Now that these handy macros are written, we can plug them into our Merge Sort implementation:

static void
msort (void *array, void *buf, size_t low, size_t high, size_t size,
       int copy_mode, int (* compare) (const void *, const void *))
{
    char *a1, *al, *am, *ah, *ls, *hs, *lo, *hi, *b;
    size_t copied = 0;
    size_t mid;
    
    mid = MID (low, high);
    
    if (mid + 1 < high)
        msort (array, buf, mid + 1, high, size, compare);
    
    if (mid > low)
        msort (array, buf, low, mid, size, compare);
    
    ah = ((char *) array) + ((high + 1) * size);
    am = ((char *) array) + ((mid + 1) * size);
    a1 = al = ((char *) array) + (low * size);
    
    b = (char *) buf;
    lo = al;
    hi = am;
    
    do {
        ls = lo;
        hs = hi;
        
        if (lo > al || hi > am) {
            /* our last loop already compared lo & hi and found lo <= hi */
            lo += size;
        }
        
        while (lo < am && compare (lo, hi) <= 0)
            lo += size;
        
        if (lo < am) {
            if (copied == 0) {
                /* avoid copying the leading items */
                a1 = lo;
                ls = lo;
            }
            
            /* our last compare tells us hi < lo */
            hi += size;
            
            while (hi < ah && compare (hi, lo) < 0)
                hi += size;
            
            if (lo > ls) {
                MEMCOPY (b, ls, lo - ls);
                copied += (lo - ls);
                b += (lo - ls);
            }
            
            MEMCOPY (b, hs, hi - hs);
            copied += (hi - hs);
            b += (hi - hs);
        } else if (copied) {
            MEMCOPY (b, ls, lo - ls);
            copied += (lo - ls);
            b += (lo - ls);
            
            /* copy everything we needed to re-order back into array */
            MEMCOPY (a1, buf, copied);
            return;
        } else {
            /* everything already in order */
            return;
        }
    } while (hi < ah);
    
    if (lo < am) {
        MEMCOPY (b, lo, am - lo);
        copied += (am - lo);
    }
    
    MEMCOPY (a1, buf, copied);
}

int
MergeSort (void *base, size_t nmemb, size_t size,
           int (* compare) (const void *, const void *))
{
    int copy_mode;
    void *tmp;
    
    if (nmemb < 2)
        return 0;
    
    if (!(tmp = malloc (nmemb * size))) {
        errno = ENOMEM;
        return -1;
    }
    
    if ((((char *) base) - ((char *) 0)) % sizeof (long) == 0 && (size % sizeof (long)) == 0)
        copy_mode = 1;
    else if ((((char *) base) - ((char *) 0)) % sizeof (int) == 0 && (size % sizeof (int)) == 0)
        copy_mode = 2;
    else
        copy_mode = 0;
    
    msort (base, tmp, 0, nmemb - 1, size, copy_mode, compare);
    
    free (tmp);
    
    return 0;
}

This handy trick seems to have worked out rather well:

Randomized input: 7.79s
Sorted input: 0.99s
Reversed input: 1.69s

At this point, I can't think of any other obvious optimizations so I'm going to call it a day.

For a recap, here are the results of all 4 implementations compared side-by-side with the results from qsort():

	qsort()	msort() v1	msort() v2	msort() v3	msort() v4
random:	6.77	14.13	10.45	9.80	7.79
sorted:	1.54	4.41	2.08	0.95	0.99
reversed:	1.90	4.26	2.03	2.05	1.69

Saturday, March 3, 2007

Quick Sort

Like Merge Sort, Quick Sort promises O(n lg n) time in the average case, and, like Merge Sort, is also a recursive algorithm. However, it is important to be aware that Quick Sort is O(n²) worst case, where, ironically, the input is already mostly sorted in either direction.

The way Quick Sort works is that it first chooses a pivot point (in our case, we'll just choose the middle element) with which to partition the working sequence into two partitions. Next, it puts all items with values smaller than the pivot value into the first partition and all items with values greater than the pivot value into the second partition. Recursively repeat the process on each partition.

Now that we understand how things are supposed to work, lets implement our QuickSort function (keeping our same call signature from previous sorting posts):

#define MID(lo, hi) (lo + ((hi - lo) >> 1))

#define SWAP(a, b) tmp = array[a], array[a] = array[b], array[b] = tmp

static void
QuickSortR (int *array, size_t lo, size_t hi)
{
    register size_t i, k;
    register int pivot;
    int tmp;
    
    /* make sure we have at least 2 elements */
    if ((hi - lo) < 1)
        return;
    
    /* cache our pivot value */
    pivot = array[MID (lo, hi)];
    
    i = lo;
    k = hi;
    
    do {
        /* find the first element with a value >= pivot value */
        while (i < k && array[i] < pivot)
            i++;
        
        /* find the last element with a value <= pivot value */
        while (k > i && array[k] > pivot)
            k--;
        
        if (i <= k) {
            SWAP (i, k);
            i++;
            k--;
        } else {
            break;
        }
    } while (1);
    
    if (lo < k) {
        /* recursively sort the first partition */
        QuickSortR (array, lo, k);
    }
    
    if (i < hi) {
        /* recursively sort the second partition */
        QuickSortR (array, i, hi);
    }
}

void
QuickSort (int *array, int n)
{
    QuickSortR (array, 0, n - 1);
}

Before we go any further, lets see how this compares to our Merge Sort implementation from earlier.

Since I no longer have the same machine that I timed Merge Sort with, I had to re-time it on my laptop (which is where I've been doing this since I can hack while laying back on my comfortable couch). My laptop is an IBM T40 ThinkPad which sports an Intel Pentium M Processor rated at 1600 MHz.

Since I left off sorting 10,000,000 items in the Merge Sort post, I figured I'd use the same array size here.

For Merge Sort, I'm consistently getting around 6.96s. For our Quick Sort implementation, I'm getting about 5.40s.

It would be nice if we could get rid of the function call overhead of recursion like we did with Binary Insertion Sort.

We're going to have to get crafty in order to work around the need to make our function recursive. What we can do is keep our own stack, but growing it dynamically could potentially kill any performance gains we could hope for... so, lets see what Donald Knuth has to say about the mathematical properties of this sort algorithm.

An auxiliary stack with at most [lg N] entries is needed for temporary storage.

As it happens, log₂ of the max unsigned value of any integer type is the number of bits in said integer type. This means that on my system, where integers are 32bit, I'll need a stack size of 32.

What do we store on our stack? Well, what variables do we need for QuickSortR()? We need array, high, and low. If we're going to make QuickSort() non-recursive, then that means we'll always have access to array which means the only variables we need to hold on our stack are high and low.

So here it is, your Moment of Zen:

typedef struct qstack {
    size_t lo, hi;
} qstack_t;

void
QuickSort (int *array, size_t n)
{
    qstack_t stack[32], *sp;
    register size_t i, k;
    register int pivot;
    size_t lo, hi;
    int tmp;
    
    /* initialize our stack */
    sp = stack;
    sp->hi = n - 1;
    sp->lo = 0;
    sp++;
    
    do {
        /* pop our lo and hi indexes off the stack */
        sp--;
        lo = sp->lo;
        hi = sp->hi;
        
        if ((hi - lo) < 1)
            continue;
        
        /* cache our pivot value */
        pivot = array[MID (lo, hi)];
        
        i = lo;
        k = hi;
        
        do {
            /* find the first element with a value >= pivot value */
            while (i < k && array[i] < pivot)
                i++;
            
            /* find the last element with a value <= pivot value */
            while (k > i && array[k] > pivot)
                k--;
            
            if (i <= k) {
                SWAP (i, k);
                i++;
                k--;
            } else {
                break;
            }
        } while (1);
        
        if (lo < k) {
            /* push the first partition onto our stack */
            sp->lo = lo;
            sp->hi = k;
            sp++;
        }
        
        if (i < hi) {
            /* push the second partition onto our stack */
            sp->lo = i;
            sp->hi = hi;
            sp++;
        }
    } while (sp > stack);
}

Once again, we see a slight improvement in execution time: 5.10s. Didn't seem to help all that much, but was worth giving it a shot. Might be more beneficial on non-x86 platforms, though...

As I mentioned in the opening paragraph, one of the major problems with Quick Sort is that for already sorted (or nearly sorted) inputs, Quick Sort hits its pathological corner case and becomes extremely inefficient compared to other sorts. One of the more popular solutions to this problem (as used by most libc qsort() implementations) is to fall back to Insertion Sort once an input segment is small enough. I'll leave this up to the reader to consider ways of implementing this approach.

To learn more about sorting, I would recommend reading Art of Computer Programming, Volume 3: Sorting and Searching (2nd Edition)

25 Byte Sort Routine

On the topic of sorting, I came across something cool the other day in Michael Abrash's Graphics Programming Black Book that I thought I'd share: a 25-byte long integer sort routine (that is to say, the routine itself is only 25 bytes long).

Here it is, your Moment of Zen:

;---------------------------------------------------------------
; Sorts an array of ints. C-callable (small model). 25 bytes.
; void sort (int n, int a[]);
;
; Courtesy of David Stafford.
;---------------------------------------------------------------

      .model small
      .code
        public _sort

top:    mov     dx,[bx]     ; swap two adjacent ints
        xchg    dx,[bx+2]
        xchg    dx,[bx]

        cmp     dx,[bx]     ; did we put them in the right order?
        jl      top         ; no, swap them back

        inc     bx          ; go to the next integer
        inc     bx
        loop    top

_sort:  pop     dx          ; get return address (entry point)
        pop     cx          ; get count
        pop     bx          ; get pointer
        push    bx          ; restore pointer
        dec     cx          ; decrement count
        push    cx          ; save count
        push    dx          ; save return address
        jg      top         ; if cx > 0

        ret

      end

Sunday, February 25, 2007

Merge Sort

Note: This is a repost from an old weblog.

Like Binary Insertion Sort, Merge Sort is a recursive algorithm, but Merge Sort is O(n lg n) whereas Binary Insertion Sort had been O(n²). It works by splitting up the set into halves until each set is down to 1 item. It then merges them back together again in sorted order. It's a very simple concept although a bit hard to explain in words. Allow me to illustrate instead:

Given: [6 8 1 3 7 2 5 4]

The first thing we do is split this set in half:

[6 8 1 3] [7 2 5 4]

We keep splitting in half until we can't split anymore:

[6 8] [1 3] [7 2] [5 4]

[6] [8] [1] [3] [7] [2] [5] [4]

We now merge the items back into their previous sets in sorted order:

[6 8] [1 3] [2 7] [4 5]

And again:

[1 3 6 8] [2 4 5 7]

And again:

[1 2 3 4 5 6 7 8]

Not so bad, right? So how do we implement this? Well, breaking the sets into halves is fairly straight forward so I won't bother explaining that. The “hard” part is the merging, and even that's not so hard. Let me illustrate step by step how we merged 2 sets back into a single set:

set #1: [2 7]
set #2: [4 5]

We start by comparing the first item in each set. Is 2 less than 4? Yes, so we put 2 into our “merged” set:

set #1: [7]
set #2: [4 5]
merged: [2]

Now compare 7 and 4. Since 4 is less than 7, we append 4 into our merged set:

set #1: [7]
set #2: [5]
merged: [2 4]

Now compare 7 and 5. Since 5 is less than 7, we append 5 into our merged set:

set #1: [7]
set #2: []
merged: [2 4 5]

Since set #2 is out of items, we just append the remainder of set #1 into our merged set:

merged: [2 4 5 7]

Tada!

As you can probably see, it looks like we'll need a temporary array to hold our merged set as we merge the 2 subsets back together. If we want to keep to the same API we've been using for the previous sorting routines, we'll need a second function that will do most of the work. Since we know we'll never need our temporary array to be larger than the original input array, we might as well create it in our main function and pass it off to our recursive worker function:

void
MergeSort (int a[], int n)
{
    int *b;

    if (!n || !(b = malloc (sizeof (int) * n)))
        return;

    msort (a, b, 0, n - 1);

    free (b);
}

As you can see here, I called my worker function msort(), which takes as input the original array, the temporary array (b[]), the offset of the first item, and the offset of the last item. Now let's take a look at msort()'s implementation:

static void
msort (int a[], int b[], int lo, int hi)
{
    register int h, i, l;
    int mid;

    if (lo >= hi)
        return;

    mid = lo + ((hi - lo) / 2);
    msort (a, b, lo, mid);
    msort (a, b, mid + 1, hi);

    for (i = 0, l = lo, h = mid + 1; l <= mid && h <= hi; i++) {
        if (a[l] <= a[h])
            b[i] = a[l++];
        else
            b[i] = a[h++];
    }

    while (l <= mid)
        b[i++] = a[l++];
    while (h <= hi)
        b[i++] = a[h++];

    memcpy (a + lo, b, sizeof (int) * i);
}

You'll notice that the first thing we do is find the midpoint of our set starting with lo and ending with hi. We then call msort() on each of the 2 resulting subsets (lo to mid, and mid + 1 to hi). After msort() is finished with each of those subsets, the 2 subsets will be in sorted order and will be stored back in array a[] at positions lo through mid and mid + 1 through hi.

We then need to merge the 2 subsets back together again into a sorted set starting at position lo and ending with position hi. We'll use i to represent the next position to add an item into our temporary array, b[]. We'll use l and h to represent the cursor for each of the subsets (l for low and h for high). Notice that I check a[l] <= a[h] rather than a[l] > a[h], this is so that items with the same value in the array are in the same order they were in originally. For integer arrays, this is no big deal – but it is often a preferred feature when sorting arrays of custom data. Sorting algorithms that have this property are called “stable sorts”.

The for-loop merges 1 item from one of the 2 subsets into the temporary array until one of the subsets becomes empty. At this point, one of the 2 while-loops will be executed, appending the remainder of the non-empty set to the merged set.

Now that we've merged the 2 subsets back together into our temporary array, b[], we need to copy the result back into a[] at the subrange we started with.

And there you have it.

Since I'm sure you're as anxious to find out how fast this is compared to our previous sorts for sorting 100,000 items as I am, let's give it a whirl and find out. Wow, that was fast – it averaged 0.044s for sorting that 100,000 item array of random integers. That's the fastest time yet! Now that we have such a fast algorithm for sorting, we'll need to bump up our array size in order to compare further sorting algorithms. Let's go for 10,000,000. This time I get 6.344s which is still faster than most of our previous sort algorithms at 100,000 items!

The first optimization I see is that we could replace those while-loops with memcpy()'s. Turns out, that doesn't really change the results for our particular input (it might make a difference if the majority of values of one set or the other was larger than those in the other, but this wouldn't be the case for our input).

To learn more about sorting, I would recommend reading Art of Computer Programming, Volume 3: Sorting and Searching (2nd Edition)

Update: I've written a follow-up article on Optimizing Merge Sort that may be of interest as well. This follow-up article deals with optimizing a general-purpose Merge Sort implementation to be on par with the libc qsort() function.

Shell Sort

Note: This is a repost from an old weblog.

The Shell Sort algorithm is designed to move items over large distances each iteration. The idea behind this is that it will get each item closer to its final destination quicker saving a lot of shuffling by comparing items farther apart.

The way it works is it subdivides the dataset into smaller groups where each item in the group is a set distance apart. For example, if we use h to represent our distance and R to represent an item, we might have groups: { R₁, R_h+1, R_2h+1, ... }, { R₂, R_h+2, R_2h+2, ... }, ...

We then sort each subgroup individually.

Keep repeating the above process, continually reducing h, until h becomes 1. After one last run-through where h is a value of 1, we stop.

At this point, I'm sure you are wondering where h comes from and what values to reduce it by each time though. If and when you ever figure it out, let me know - you might also want to publish a paper and submit it to ACM so that your name might go down in the history (and algorithm) books. That's right, as far as I'm aware, no one knows the answer to this question, except, perhaps, God Himself :-)

If you are interested, you might take a look at Donald Knuth's book, Art of Computer Programming, Volume 3: Sorting and Searching (starting on page 83), for some mathematical discussion on the matter.

As far as I understand from Sorting and Searching, it is theoretically possible to get the Shell Sort algorithm to approach O(n^1.5) given an ideal increment table which is quite impressive.

Knuth gives us a couple of tables to start with: [8 4 2 1] and [7 5 3 1] which seem to work okay, but are far from being ideal for our 100,000 item array that we are trying to sort in this exercise, however, for the sake of keeping our first implementation simple, we'll use the [7 5 3 1] table since it has the charming property where each increment size is 2 smaller than the previous. Yay for simplicity.

void
ShellSort (int a[], int n)
{
    int h, i, j;
    int tmp;

    for (h = 7; h >= 1; h -= 2) {
        for (i = h; i < n; i++) {
            tmp = a[i];
            for (j = i; j >= h && a[j - h] > tmp; j -= h)
                a[j] = a[j - h];
            a[j] = tmp;
        }
    }
}

The nice thing about Shell Sort is that, while the increment table is a complete mystery, the algorithm itself is quite simple and well within our grasp.

Let's plug this into our sort program and see how well it does.

I seem to get a pretty consistent 6.3 seconds for 100,000 items on my AMD Athlon XP 2500 system which is almost as good as the results we were getting from our Binary Insertion Sort implementation.

Now for some optimizations. We know that an ideal set of increment sizes will get us down to close to O(n^1.5) and that it is unlikely that the [7 5 3 1] set is ideal, so I suggest we start there.

On a hunch, I just started adding more and more primes to our [7 5 3 1] table and noticed that with each new prime added, it seemed to get a little faster. At some point I decided to experiment a bit and tried using a set of primes farther apart and noticed that with a much smaller set of increments, I was able to get about the same performance as my much larger set of primes. This spurred me on some more and I eventually came up with the following set:

{ 14057, 9371, 6247, 4177, 2777, 1861, 1237, 823, 557, 367, 251, 163, 109, 73, 37, 19, 11, 7, 5, 3, 1 }

In order to use this set, however, we need a slightly more complicated method for determining the next value for h than just h -= 2, so I used a lookup table instead:

static int htab[] = { 14057, 9371, 6247, 4177, 2777, 1861, 1237, 823, 557, 367, 251, 163, 109, 73, 37, 19, 11, 7, 5, 3, 1, 0 };

void
ShellSort (int a[], int n)
{
    int *h, i, j;
    int tmp;

    for (h = htab; *h; h++) {
        for (i = *h; i < n; i++) {
            tmp = a[i];
            for (j = i; j >= *h && a[j - *h] > tmp; j -= *h)
                a[j] = a[j - *h];
            a[j] = tmp;
        }
    }
}

With this new table of increments, I was able to achieve an average sort time of about 0.09 seconds. In fact, ShellSort() in combination with the above increment table will sort an array of 2 million items in about the same amount of time as our optimized BinaryInsertionSort() took to sort 100 thousand items. That's quite an improvement, wouldn't you say!?

To learn more about sorting, I would recommend reading Art of Computer Programming, Volume 3: Sorting and Searching (2nd Edition)

Binary Insertion Sort

Note: this is a repost from an old weblog.

The Binary variant of Insertion Sort uses a binary search to find the appropriate location to insert the new item into the output. Before we can implement this variant of the insertion sort, we first need to know and understand how the binary search algorithm works:

Binary Search

Binary searching, for those who don't know (no need to raise your hand in shame), is a method of searching (duh) in which you attempt to compare as few items in the dataset to the item you are looking for as possible by halving your dataset with each comparison you make. Hence the word binary (meaning two). Allow me to illustrate:

Given the following dataset of 9 items in sorted order, find the value 8:

[1 2 3 4 5 6 7 8 9]

The first step is to check the midpoint (there are 9 elements, so the midpoint would be (9+1)/2 which is the 5th element):

[1 2 3 4 5 6 7 8 9]

Well, 5 is not 8 so we're not done yet. Is 8 less than 5? No. Is 8 greater than 5? Yes. So we know that the value we seek is in the second half of the array (if it's there). We are now left with:

[6 7 8 9]

Oops. 4 does not have a midpoint, so we round off any way we want to (it really doesn't matter). (4+1)/2 = 2.5, so for the sake of argument lets just drop the decimal and check the 2nd element (this is how integer math works anyway, so it's simpler):

[6 7 8 9]

Well, 7 is not 8 so we're not done yet. Is 8 less than 7? No. Is 8 greater than 7? Yes. So we know that the value we seek is in the second half of the array (if it's there). We are now left with:

[8 9]

Again we find the midpoint, which in this case is 1.

[8 9]

Does 8 equal 8? Yes! We're done!

How many tries did that take us? Three. Three tries and we found what we were looking for. And, in the worst possible case for this particular array (searching for the 9), we would have been able to do it in 4 tries.

Not bad, huh?

To implement our BinarySearch() function, the easiest way will be to write it using recursion. Recursion is a technique that allows us to break the problem into smaller and smaller pieces (or, the array in this case - as illustrated above) until we arrive at the solution (or, in this case, the item we are looking for). Recursion is implemented by having a function call itself to process a subset of the data, which, may again call itself (repeatedly until the problem is solved).

Notice in my above explanation of how to search the 9-item array using the binary search algorithm, how I continually break the array into smaller and smaller subsets? That's why binary search is often described as a recursive algorithm - you keep repeating the the process on smaller and smaller subsets until your subset cannot get any smaller or until you find what you are looking for.

Hopefully I've explained that well enough, if not - perhaps taking a look at the following implementation will help?

int
BinarySearch (int a[], int low, int high, int key)
{
    int mid;

    mid = low + ((high - low) / 2);

    if (key > a[mid])
        return BinarySearch (a, mid + 1, high, key);
    else if (key < a[mid])
        return BinarySearch (a, low, mid, key);

    return mid;
}

Note: To get the midpoint, we use the formula low + ((high - low) / 2) instead of (high + low) / 2 because we want to avoid the possibility of an integer overflow.

In my above implementation, a[] is our integer array, low is the low-point of the subset in which we are looking, high is the high-point of the subset in which we are looking, and key is the item that we are looking for. It returns the integer index of the value we are looking for in the array. Here's how we would call this function from our program: BinarySearch (a, 0, 9, 8);

Remember that in C, array indexes start at 0. We pass 9 as the high-point because there are 9 elements in our array. I don't think I need to explain why we pass 8 as the key.

If we then plug this into our Insertion Sort algorithm from yesterday (instead of doing a linear search), we get a Binary Insertion Sort... almost. There's one minor change we have to make to our BinarySearch() function first - since we are not necessarily trying to find an already-existing item in our output, we need to adjust BinarySearch() to return the ideal location for our key even in the event that it doesn't yet exist. Don't worry, this is a very simple adjustment:

int
BinarySearch (int a[], int low, int high, int key)
{
    int mid;

    if (low == high)
        return low;

    mid = low + ((high - low) / 2);

    if (key > a[mid])
        return BinarySearch (a, mid + 1, high, key);
    else if (key < a[mid])
        return BinarySearch (a, low, mid, key);

    return mid;
}

Okay, now we're ready to plug it in. Here's the result:

void
BinaryInsertionSort (int a[], int n)
{
    int ins, i, j;
    int tmp;

    for (i = 1; i < n; i++) {
        ins = BinarySearch (a, 0, i, a[i]);
        tmp = a[i];
        for (j = i - 1; j >= ins; j--)
            a[j + 1] = a[j];
        a[ins] = tmp;
    }
}

There's a couple of optimizations we can make here. If ins is equal to i, then we don't need to shift any items nor do we need to set a[ins] to the value in a[i], so we can wrap those last 4 lines in an if-statement block:

void
BinaryInsertionSort (int a[], int n)
{
    int ins, i, j;
    int tmp;

    for (i = 1; i < n; i++) {
        ins = BinarySearch (a, 0, i, a[i]);
        if (ins < i) {
            tmp = a[i];
            for (j = i - 1; j >= ins; j--)
                a[j + 1] = a[j];
            a[ins] = tmp;
        }
    }
}

If you remember from our Insertion Sort optimizations, the memmove() function really helped speed up shifting a lot of items all at once. But before we do that, lets see what kind of times our current implementation takes to sort 100,000 items (just so we have something to compare against after we make that memmove() optimization).

On my AMD Athlon XP 2500, it takes roughly 25 seconds to sort 100,000 random items (using the main() function from our Bubble Sort analysis a few days ago). That's already on par with yesterday's optimized insertion sort implementation and we haven't even plugged in memmove() yet!

I'm pretty excited and can't wait to plug in memmove() to see what kind of results we get, so here we go:

void
BinaryInsertionSort (int a[], int n)
{
    int ins, i, j;
    int tmp;

    for (i = 1; i < n; i++) {
        ins = BinarySearch (a, 0, i, a[i]);
        if (ins < i) {
            tmp = a[i];
            memmove (a + ins + 1, a + ins, sizeof (int) * (i - ins));
            a[ins] = tmp;
        }
    }
}

After plugging in memmove(), our BinaryInsertionSort() sorts 100,000 items in about 5.5 seconds! Whoo!

Lets recap: We started off with the goal of sorting 100,000 items using Bubble Sort which took over over 103 seconds to accomplish. We then scratched our heads and optimized the Bubble Sort algorithm the best we could and obtained a result that was twice as fast. Then we traded our BubbleSort() implementation in for another O(n²) algorithm (generally not something I would suggest wasting time on, but for the sake of learning about a bunch of sorting algorithms, why not?). Using InsertionSort(), we got that down to 26.5 seconds which is 4 times faster than our original sort! But wait, then we took our InsertionSort() one step farther and used a binary search algorithm to help us find the ideal location in which to insert our item and we got it down to a whopping 5.5 seconds - 20 times faster than what we started with!

That's pretty awesome if you ask me...

There's still one more optimization we can try on our BinaryInsertionSort() implementation to attempt to get faster speeds, but it may be over your head if you're a beginner so feel free to ignore my following code dump:

void
BinaryInsertionSort (int a[], int n)
{
    register int i, m;
    int hi, lo, tmp;

    for (i = 1; i < n; i++) {
        lo = 0, hi = i;
        m = i / 2;

        do {
            if (a[i] > a[m]) {
                lo = m + 1;
            } else if (a[i] < a[m]) {
                hi = m;
            } else
                break;

            m = lo + ((hi - lo) / 2);
        } while (lo < hi);
 
        if (m < i) {
            tmp = a[i];
            memmove (a + m + 1, a + m, sizeof (int) * (i - m));
            a[m] = tmp;
        }
    }
}

The basic idea here was that I wanted to get rid of recursion because, unfortunately, as nice as recursion is for visualizing (and implementing) our binary search, there is a cost penalty for every time we recurse - there is both function call overhead (which slows us down) and a memory overhead (we need to grow the stack). If we're clever enough, we can sometimes bypass the need for recursion like I did above which can often lead to significant performance gains.

In this case, however, we didn't improve the performance all that much (down from 5.5 to 5.2 seconds) but on other architectures (such as SPARC), it may be a lot more noticeable. It may also be a lot more noticeable for larger datasets. In fact, lets try sorting 200,000 items and see what sort of times we get.

Ah, much more noticeable now. With recursion, it takes 46 seconds and without, it takes 37 seconds - so there is definitely an advantage to cleverly working around the need for recursion.

Okay, that's all I've got for Binary Insertion Sort, but there are still plenty more sorting algorithms to try before we call it a day - we've just barely scratched the surface - and so far we've only looked into the O(n²) algorithms - there are others that promise O(n lg n) performance!

To learn more about sorting, I would recommend reading Art of Computer Programming, Volume 3: Sorting and Searching (2nd Edition)

Insertion Sort

Note: this is a repost from an old weblog.

Insertion Sort is another O(n²) sorting algorithm (just like Bubble Sort was) but is a little bit faster. Now, ideally, when trying to optimize by picking a new algorithm - you don't want to go from one slow algorithm to another, but I figure while I'm explaining sorting techniques - I might as well explain a bunch of them. Anyways... it turns out that insertion sort works out to be faster than Bubble Sort in the general case.

The way Insertion Sort works, is you move one item from your input dataset into an output dataset one item at a time. Take an item from the input and place it in the output at the proper position such that the output dataset is always in sorted order. Allow me to demonstrate:

Given: [5 8 2 9 6 3 7 1 0 4]

That's our input dataset. Our output dataset is so far empty. The first thing we do, is to take the first item of our input dataset (or any item, really, but we might as well grab them out of the input dataset in order, right?) and place it in the output dataset. Since the output dataset will only have 1 item this first round, there's nothing to sort. The result is:

Input: [8 2 9 6 3 7 1 0 4] Output: [5]

Now take the next input item and place it in the output, but in the proper sorted order:

Input: [2 9 6 3 7 1 0 4] Output: [5 8]

Lather. Rinse. Repeat.

Input: [9 6 3 7 1 0 4] Output: [2 5 8]
Input: [6 3 7 1 0 4] Output: [2 5 8 9]
Input: [3 7 1 0 4] Output: [2 5 6 8 9]
Input: [7 1 0 4] Output: [2 3 5 6 8 9]
Input: [1 0 4] Output: [2 3 5 6 7 8 9]
Input: [0 4] Output: [1 2 3 5 6 7 8 9]
Input: [4] Output: [0 1 2 3 5 6 7 8 9]
Input: [] Output: [0 1 2 3 4 5 6 7 8 9]

Tada! We have a sorted dataset. This is the basis of all insertion sort variants, the difference between the various insertion sorts (I only know of 2, but wouldn't be surprised if there were more) is the way in which we find the ideal place to insert the new item from the input into the output.

Before I get into the different variants, however, let me address one of the concerns you likely have thus far: if we're sorting a large-ish array (note: if we are sorting a linked list, then this isn't an issue), we don't want to have two arrays because that doubles the amount of memory we're using!

Indeed, but if you'll notice: as the output size increases, the input size decreases - together, both the input and output sizes total the same number of elements and if we iterate through the input elements in order, then we can share the same array as both the input and output arrays. Let me re-illustrate:

Output | Input
[][5 8 2 9 6 3 7 1 0 4]
[5 ][8 2 9 6 3 7 1 0 4]
[5 8 ][2 9 6 3 7 1 0 4]
[2 5 8 ][9 6 3 7 1 0 4]
...

See how we can cheat now? Visualization is the key to solving so many problems. Don't be afraid to take a pencil and paper and "draw" the problem - many times the solution will present itself.

Linear Insertion Sort

Generally when someone refers to Insertion Sort, this is the variant that they are talking about. As I was saying above, the difference between the variants is how they find the proper position in which to insert the new item into the output. As the name suggests, Linear insertion sort uses a linear search in order to find this position. Basically, this comes down to scanning through the output array starting at the first position and comparing the values: if the new item is larger than the first output value, move to the next output item and so on until the new item's value is less than the item we are comparing against in the output - once we find that, we have found the position to insert the new item.

Note: we could actually get away with using <= to minimize the number of comparisons we have to do, but then if we have multiple items with the same values, they won't remain in the same order they were in in the input, but if that doesn't bother you, then definitely use <=. Sorts that guarantee that items in the output with the same comparison values are in the same order as they were in the input are called "stable sorts". This is usually a pretty desirable trait (unless performance is more important). The lucky thing for us, though, is that in this particular case, keeping duplicate items in the same order we found them in the input, would force us to work backwards starting at the end of the output array rather than working forwards starting at the beginning of the output array which just so happens to allows us a more promising optimization. Pretty crafty, eh? I thought so...

One thing I should remind you about is that when you insert an item into the output, you'll need to shift all the items after that position one place to the right.

Okay, on to the implementation...

The first optimization you should notice is that after the first "loop", the first item of the input is always the first item of the output, so we might as well start with the second item of the input array. So with that, go ahead and implement this algorithm - you should get something similar to this:

void
InsertionSort (int a[], int n)
{
    int i, j = 0;
    int key;

    for (i = 1; i < n; j = i, i++) {
        key = a[i];
        while (j >= 0 && a[j] > key) {
            a[j + 1] = a[j];
            j--;
        }
        a[j + 1] = key;
    }
}

You'll notice that I took the "work backwards starting at the end of the output array" approach that I noted above. As I started to mention, this approach has the added benefit of allowing us to shift the output items as we go which means we get to shift fewer items per loop on average than if we had worked in the forward direction, resulting in a performance gain (shifting a bunch of items is more expensive than an integer comparison).

If we feed this into our sort program and have main() call this new function rather than the BubbleSort() implementation we wrote yesterday, we find that we have a sort that is significantly faster - down to 36 seconds from the optimized BubbleSort()'s 51 seconds. Not bad...

See any obvious improvements we can make that might result in a significantly faster implementation of our InsertionSort() implementation? I don't see anything I would consider major, but we might try using memmove() instead of manually moving items one space to the right. We'll get the most bang for the buck if we wait until we find our optimal insertion point before moving anything around, so:

void
InsertionSort (int a[], int n)
{
    int i, j = 0;
    int key;

    for (i = 1; i < n; j = i, i++) {
        key = a[i];
        while (j >= 0 && a[j] > key)
            j--;
        memmove (a + (j + 2), a + (j + 1), sizeof (int) * ((i - 1) - j));
        a[j + 1] = key;
    }
}

Okay, so with that very simple adjustment, we got our total sort time from 36 seconds down to 26.5 seconds. Nothing to scoff at, surely, but at the same time not as drastic an improvement as changing algorithms had been (see why algorithms are so important for optimizations?).

Now just imagine if we used an algorithm that promised better than O(n²) performance!

Before we jump to something better than O(n²) though, I still have one more trick up my sleeve to improve our insertion sort algorithm (not implementation this time, but rather: algorithm). I dub thee: Binary Insertion Sort.

To learn more about sorting, I would recommend reading Art of Computer Programming, Volume 3: Sorting and Searching (2nd Edition)

Bubble Sort

Note: This is a repost from an old weblog.

Bubble Sort works by iterating through the dataset, comparing two neighboring items at a time, and swapping them if the first item is larger than the second item. In order to sort a dataset of N items, this operation must be done N-1 times. Let me illustrate:

Given:

We start by comparing the first two items, 5 and 8:

Since 5 is smaller than 8, we don't need to swap them.

Next, we compare the second set of items, 8 and 2:

8 is larger than 2, so we have to swap them, resulting in:

We keep doing this until we reach the end of the dataset and then we'll start all over again - repeating the process N-1 times (in this example, 9 times since we have 10 items).

At the end of our first loop, we end up with:

Now we start the whole process over again (we have 8 more loops to go!).

Notice anything interesting at the very end? You guessed it! That last comparison is pretty worthless - the last item in the dataset is always going to be the largest value after the first loop, right? Right, but that's not all - each loop through the items, we can ignore more and more items from the end of the dataset (after the second loop, we can ignore the last 2 items; after the third loop, we can ignore the last 3 items and so on). That little trick is our first optimization for this algorithm!

Let's see what the code would look like:

void
BubbleSort (int a[], int n)
{
    int i, j, tmp;
    
    for (i = n - 1; i >= 0; i--) {
        for (j = 0; j < i; j++) {
            if (a[j] > a[j + 1]) {
                tmp = a[j];
                a[j] = a[j + 1];
                a[j + 1] = tmp;
            }
        }
    }
}

Now that we have some code to test, lets see how long it takes to sort a random array of 100,000 integers. To do that, we'll need to write a little program to use our BubbleSort() routine:

#include <stdlib.h>
#include <time.h>

int main (int argc, char **argv)
{
    int array[100000], i;
    
    srand (time (NULL));
    
    for (i = 0; i < 100000; i++)
        array[i] = rand ();
    
    BubbleSort (array, 100000);
    
    return 0;
}

We'll just use our system's time command to get an estimate of how long it takes to sort.

Go ahead and run our little program a few times. For me, it seems to average about 103 seconds on my AMD Athlon XP 2500.

I don't know about you, but I saw a pretty obvious optimization that we could make to our BubbleSort() implementation that might make it a bit faster. Remember how we noticed that each time through the inner loop, the net result was that the largest item was moved all the way to the right (discounting the largest items from previous loops)? What if, instead of swapping each time through the inner loop, we waited until the inner loop finished and then swapped? Let's try it:

void
BubbleSort (int a[], int n)
{
    int i, j, max, tmp;
    
    for (i = n - 1; i >= 0; i--) {
        for (max = 0, j = 1; j < i + 1; j++) {
            if (a[j] > a[max])
                max = j;
        }
        
        if (max < i) {
            tmp = a[max];
            a[max] = a[i];
            a[i] = tmp;
        }
    }
}

In the above code, we use max to hold the index of the largest item we find. We initialize it to the index of the first item (0 in languages like C) and then start our inner for-loop. You'll notice a change here: instead of starting j at 0, we start it at 1 because we've already "looked" at the first item. Also, we need to loop the same number of times - so we continue to iterate as long as j is less than i + 1 (rather than j < i of the previous implementation).

If I now compile our benchmarking program to use this new BubbleSort() implementation, we notice a huge performance increase - or at least I did on my machine. The average execution time for this new implementation seems to be about 51 seconds. That sure beats the pants off 103 seconds, doesn't it?

Even so, 51 seconds to sort 100,000 items is a long wait, especially since they are just simple integers. There's only one more optimization that I can think of. If you think about how bubble sort works, each time through the outer loop, you notice that if we ever go through one of those N-1 iterations without having to swap any items that it is safe to conclude that we're done and so can skip performing the remainder of the N-1 iterations. I'll leave this optimization as an exercise for my readers rather than posting a new implementation here.

Since I don't see any other real obvious optimizations that we can make which would drastically improve performance, I think it's time we consider another sorting algorithm.

To learn more about sorting, I would recommend reading Art of Computer Programming, Volume 3: Sorting and Searching (2nd Edition)

A Moment of Zen

Thursday, April 7, 2011

Optimizing Merge Sort

Saturday, March 3, 2007

Quick Sort

25 Byte Sort Routine

Sunday, February 25, 2007

Merge Sort

Shell Sort

Binary Insertion Sort

Insertion Sort

Bubble Sort

Code Snippet Licensing

Links

Blog Portals