Monday, March 10, 2014

GMime gets a Speed Boost

With all of the performance improvements I've been putting into MimeKit recently, it was about time to port some of these optimizations back to GMime.

In addition to other fixes that were in the queue, GMime 2.6.20 includes the "SIMD" optimization hack that I blogged about doing for MimeKit and I wanted to share the results. Below is a comparison of GMime 2.6.19 and 2.6.20 parsing the same 2GB mbox file on my 2011 Core-i5 iMac with the "persistent stream" option enabled on the GMimeParser:

[fejj@localhost gmime-2.6.19]$ ./gmime-mbox-parser really-big.mbox
Parsed 29792 messages in 5.15 seconds.

[fejj@localhost gmime-2.6.20]$ ./gmime-mbox-parser really-big.mbox
Parsed 29792 messages in 4.70 seconds.

That's a pretty respectable improvement. Interestingly, though, it's still not as fast as MimeKit utilizing Mono's LLVM backend:

[fejj@localhost MimeKit]$ mono --llvm ./mbox-parser.exe really-big.mbox
Parsed 29792 messages in 4.52 seconds.

Of course, to be fair, without the --llvm option, MimeKit doesn't fare quite so well:

[fejj@localhost MimeKit]$ mono ./mbox-parser.exe really-big.mbox
Parsed 29792 messages in 5.54 seconds.

I'm not sure what kind of optimizations LLVM utilizes when used from Mono vs clang (used to compile GMime via homebrew, which I suspect uses -O2), but nevertheless, it's still very impressive.

After talking with Rodrigo Kumpera from the Mono runtime team, it sounds like the --llvm option is essentially the -O2 optimizations minus a few of the options that cause problems with the Mono runtime, so effectively somewhere between -O1 and -O2.

I'd love to find out why MimeKit with the LLVM optimizer is faster than GMime compiled with clang (which also makes use of LLVM) with the same optimizations, but I think it'll be pretty hard to narrow down exactly because MimeKit isn't really a straight port of GMime (they are similar, but a lot of MimeKit is all-new in design and implementation).

Monday, February 3, 2014

Introducing MailKit, a cross-platform .NET mail-client library

Once I announced MimeKit, I knew it would only be a matter of time before I started getting asked about SMTP, IMAP, and/or POP3 support.

Let's just say,

Challenge... ACCEPTED!

I started off back in early December writing an SmtpClient so that developers using MimeKit wouldn't have to convert a MimeMessage to a System.Net.Mail.MailMessage in order to send it using System.Net.Mail.SmtpClient. This went pretty quickly because I've implemented several SMTP clients in the past. Implementing the various SASL authentication mechanisms probably took as much or more time than implementing the SMTP protocol.

The following weekend, I ended up implementing a Pop3Client. Originally, I had planned on more-or-less cloning the API we had used in Evolution, but I decided that I would take a different approach. I designed a simple IMessageSpool interface which more closely follows the limited functionality of POP3 and mbox spools instead of trying to map the Pop3Client to a Store/Folder paradigm like JavaMail and Evolution do (Evolution's mail library was loosely based on JavaMail). Mapping mbox and POP3 spools to Stores and Folders in Evolution was, to my recollection, rather awkward and I wanted to avoid that with MailKit.

At first I was loath to do it, but over the past 2 weeks I ended up writing an ImapClient as well. I'm sure Philip van Hoof will be pleased to note that I have a very nice BODYSTRUCTURE parser, although that API is not publicly exported.

Unlike the SmtpClient and Pop3Client, the ImapClient does not have all of its functionality on a single public class. Instead, ImapClient implements an IMessageStore which has a limited API, mostly meant for getting IFolders. I imagine that those who are familiar with the JavaMail and/or Evolution (Camel) APIs will recognize this design.

The IFolder interface isn't designed to be exactly like the JavaMail Folder API, though. I've been designing the interface incrementally as I implement the various IMAP extensions (I've found at least 37 of them at the time of this blog post, although I don't think I'll bother with ACL, MAILBOX-REFERRAL, or LOGIN-REFERRAL), so the API may continue to evolve as I go, but I think what I've got now will likely remain - I'll probably just be including additional APIs for the new stuff.

So far, I've implemented the following IMAP extensions: LITERAL+, NAMESPACE, CHILDREN, LOGIN-DISABLED, STARTTLS, MULTIAPPEND, UNSELECT, UIDPLUS, CONDSTORE, ESEARCH, SASL-IR, SORT, THREAD, SPECIAL-USE, MOVE, XLIST, and X-GM-EXT1. Phew, that was exhausting listing all of those!

Also newsworthy is that MimeKit is now as fast as GMime, which is pretty impressive considering that it is fully managed C# code.

Download MailKit 0.2 now and let the hacking begin!

Monday, October 7, 2013

Optimization Tips & Tricks used by MimeKit: Part 2

In my previous blog post, I talked about optimizing the most critical loop in MimeKit's MimeParser by:

Extending our read buffer by an extra byte (which later became 4 extra bytes) that I could set to '\n', allowing me to do the bounds check after the loop as opposed to in the loop, saving us roughly half the instructions.
Unrolling the loop in order to check for 4 bytes at a time for that '\n' by using some bit twiddling hacks (for 64-bit systems, we might gain a little more performance by checking 8 bytes at a time).

After implementing both of those optimizations, the time taken for MimeKit's parser to parse nearly 15,000 messages in a ~1.2 gigabyte mbox file dropped from around 10s to about 6s on my iMac with Mono 3.2.3 (32-bit). That is a massive increase in performance.

Even after both of those optimizations, that loop is still the most critical loop in the parser and the MimeParser.ScanContent() method, which contains it, is still the most critical method of the parser.

While the loop itself was a huge chunk of the time spent in that method, the next largest offender was writing the content of the MIME part into a System.IO.MemoryStream.

MemoryStream, for those that aren't familiar with C#, is just what it sounds like it is: a stream backed by a memory buffer (in C#, this happens to be a byte array). By default, a new MemoryStream starts with a buffer of about 256 bytes. As you write more to the MemoryStream, it resizes its internal memory buffer to either the minimum size needed to hold its existing content plus whatever number of bytes your latest Write() was called with or double the current internal buffer size, whichever is larger.

The performance problem here is that for MIME parts with large amounts of content, that buffer will be resized numerous times. Each time that buffer is resized, due to the way C# works, it will allocate a new buffer, zero the memory, and then copy the old content over to the new buffer. That's a lot of copying, and each resize costs more than the last because the internal buffer keeps getting larger. Since MemoryStream contains a GetBuffer() method, its internal buffer really has to be a single contiguous block of memory. This means that there's little we could do to reduce the overhead of zeroing the new buffer every time it resizes beyond trying to come up with a different formula for calculating the next optimal buffer size.
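
To put rough numbers on that cost, here is a minimal C sketch that simulates the growth rule described above (grow to the larger of "what is needed" and "double the current size"). It is only a model of that rule, not the actual MemoryStream source; the names GrowthStats and simulate_growth are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Totals gathered while simulating a growable contiguous buffer. */
typedef struct {
    size_t resizes;       /* number of times the buffer was reallocated */
    size_t bytes_copied;  /* old content copied into each new buffer */
    size_t bytes_zeroed;  /* combined size of every newly allocated buffer */
} GrowthStats;

/* Simulate writing `total` bytes, `chunk` bytes at a time, into a buffer
 * that starts at `initial` bytes and grows to max(needed, 2 * capacity).
 * This models the rule described in the text, nothing more. */
static GrowthStats
simulate_growth (size_t initial, size_t chunk, size_t total)
{
    GrowthStats stats = { 0, 0, 0 };
    size_t capacity = initial;
    size_t length = 0;

    assert (chunk > 0);

    while (length < total) {
        size_t n = total - length < chunk ? total - length : chunk;
        size_t needed = length + n;

        if (needed > capacity) {
            size_t grown = capacity * 2;

            if (grown < needed)
                grown = needed;

            /* a resize zeroes the new buffer and copies the old content */
            stats.bytes_copied += length;
            stats.bytes_zeroed += grown;
            stats.resizes++;
            capacity = grown;
        }

        length = needed;
    }

    return stats;
}
```

Calling simulate_growth (256, 4096, 1048576), i.e. a 256-byte initial buffer, 4096-byte writes, and 1 MiB of content, reports 9 resizes, 1,044,480 bytes copied, and 2,093,056 bytes zeroed in this model.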

At first I decided to try the simple approach of using the MemoryStream constructor that allows specifying an initial capacity. By bumping up the initial capacity to 2048 bytes, things did improve, but only by a very disappointing amount. Larger initial capacities such as 4096 and 8192 bytes also made very little difference.

After brainstorming with my coworker and Mono runtime hacker, Rodrigo Kumpera, we decided that one way to solve this performance problem would be to write a custom memory-backed stream that didn't use a single contiguous block of memory, but instead used a list of non-contiguous memory blocks. When this stream needed to grow its internal memory storage, all it would need to do is allocate a new block of memory and append it to its internal list of blocks. This would allow for minimal overhead because only the new block would need to be zeroed and no data would need to be re-copied, ever. As it turns out, this approach would also allow me to limit the amount of unused memory used by the stream.

I dubbed this new memory-backed stream MimeKit.IO.MemoryBlockStream. As you can see, the implementation is pretty trivial (doesn't even require scary looking bit twiddling hacks like my previous optimization), but it made quite a difference in performance. By using this new memory stream, I was able to shave a full second off of the time needed to parse that mbox file I mentioned earlier, getting the total time spent down to 5s. That's starting to get pretty respectable, performance-wise.
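
For readers who would rather see the idea than read about it, here is a rough C sketch of a block-list buffer. It is not MimeKit's MemoryBlockStream; the BlockBuffer type, the function names, and the 2048-byte block size are all assumptions picked for illustration. The point it shows is that growing only allocates (and zeroes) one new block and never copies existing content; only the small table of block pointers is reallocated.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK_SIZE 2048  /* assumed block size, chosen for illustration */

typedef struct {
    unsigned char **blocks;  /* table of pointers to fixed-size blocks */
    size_t nblocks;          /* number of blocks allocated so far */
    size_t length;           /* total number of bytes written */
} BlockBuffer;

/* Append n bytes. Returns 0 on success or -1 if out of memory. */
static int
block_buffer_write (BlockBuffer *bb, const void *data, size_t n)
{
    const unsigned char *src = data;

    assert (bb != NULL);

    while (n > 0) {
        size_t index = bb->length / BLOCK_SIZE;
        size_t offset = bb->length % BLOCK_SIZE;
        size_t avail = BLOCK_SIZE - offset;
        size_t count = n < avail ? n : avail;

        if (index == bb->nblocks) {
            /* grow: extend the pointer table, then zero just one new block */
            unsigned char **table;

            table = realloc (bb->blocks, (bb->nblocks + 1) * sizeof (*table));
            if (table == NULL)
                return -1;

            bb->blocks = table;
            if (!(bb->blocks[bb->nblocks] = calloc (1, BLOCK_SIZE)))
                return -1;

            bb->nblocks++;
        }

        memcpy (bb->blocks[index] + offset, src, count);
        bb->length += count;
        src += count;
        n -= count;
    }

    return 0;
}

/* Copy up to n bytes starting at `position` into out; returns bytes copied. */
static size_t
block_buffer_read (const BlockBuffer *bb, size_t position, void *out, size_t n)
{
    unsigned char *dest = out;
    size_t total = 0;

    while (n > 0 && position < bb->length) {
        size_t offset = position % BLOCK_SIZE;
        size_t avail = BLOCK_SIZE - offset;
        size_t left = bb->length - position;
        size_t count = n < avail ? n : avail;

        if (count > left)
            count = left;

        memcpy (dest + total, bb->blocks[position / BLOCK_SIZE] + offset, count);
        position += count;
        total += count;
        n -= count;
    }

    return total;
}

static void
block_buffer_free (BlockBuffer *bb)
{
    size_t i;

    for (i = 0; i < bb->nblocks; i++)
        free (bb->blocks[i]);

    free (bb->blocks);
    bb->blocks = NULL;
    bb->nblocks = 0;
    bb->length = 0;
}
```

Start with a zeroed BlockBuffer ({ NULL, 0, 0 }), call block_buffer_write() as content arrives, and the unused memory is never more than one partially filled block.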

As a comparison, let's compare the performance of MimeKit with what seem to be the 2 most popular C# MIME parsers out there (OpenPOP.NET and SharpMimeTools) and see how we do. I've been hyping up the performance of MimeKit a lot, so it had better live up to expectations, right? Let's see if it does.

Now, since none of the other C# MIME parsers I could find support parsing the Unix mbox file format, we'll write some test programs to parse the same message stream over and over (say, 20 thousand times) to compare MimeKit to OpenPOP.NET and SharpMimeTools.

Here's the test program I wrote for OpenPOP.NET:

using System;
using System.IO;
using System.Diagnostics;
using OpenPop.Mime;

namespace OpenPopParser {
    class Program
    {
        public static void Main (string[] args)
        {
            var stream = File.OpenRead (args[0]);
            var stopwatch = new Stopwatch ();

            stopwatch.Start ();
            for (int i = 0; i < 20000; i++) {
                var message = Message.Load (stream);
                stream.Position = 0;
            }
            stopwatch.Stop ();

            Console.WriteLine ("Parsed 20,000 messages in {0}", stopwatch.Elapsed);
        }
    }
}

Here's the SharpMimeTools parser I wrote for testing:

using System;
using System.IO;
using System.Diagnostics;
using anmar.SharpMimeTools;

namespace SharpMimeParser {
    class Program
    {
        public static void Main (string[] args)
        {
            var stream = File.OpenRead (args[0]);
            var stopwatch = new Stopwatch ();

            stopwatch.Start ();
            for (int i = 0; i < 20000; i++) {
                var message = new SharpMessage (stream);
                stream.Position = 0;
            }
            stopwatch.Stop ();

            Console.WriteLine ("Parsed 20,000 messages in {0}", stopwatch.Elapsed);
        }
    }
}

And here is the test program I used for MimeKit:

using System;
using System.IO;
using System.Diagnostics;
using MimeKit;

namespace MimeKitParser {
    class Program
    {
        public static void Main (string[] args)
        {
            var stream = File.OpenRead (args[0]);
            var stopwatch = new Stopwatch ();

            stopwatch.Start ();
            for (int i = 0; i < 20000; i++) {
                var parser = new MimeParser (stream, MimeFormat.Default);
                var message = parser.ParseMessage ();
                stream.Position = 0;
            }
            stopwatch.Stop ();

            Console.WriteLine ("Parsed 20,000 messages in {0}", stopwatch.Elapsed);
        }
    }
}

Note: Unfortunately, OpenPOP.NET's message parser completely failed to parse the Star Trek message I pulled out of my test suite at random (first message in the jwz.mbox.txt file included in MimeKit's UnitTests project) due to the Base64 decoder not liking some byte or another in the stream, so I had to patch OpenPOP.NET to no-op its base64 decoder (which, if anything, should make it faster).

And here are the results running on my 2011 MacBook Air:

[fejj@localhost OpenPopParser]$ mono ./OpenPopParser.exe ~/Projects/MimeKit/startrek.msg
Parsed 20,000 messages in 00:06:26.6825190

[fejj@localhost SharpMimeParser]$ mono ./SharpMimeParser.exe ~/Projects/MimeKit/startrek.msg
Parsed 20,000 messages in 00:19:30.0402064

[fejj@localhost MimeKit]$ mono ./MimeKitParser.exe ~/Projects/MimeKit/startrek.msg
Parsed 20,000 messages in 00:00:15.6159326

Whooooosh!

Not. Even. Close.

MimeKit is nearly 25x faster than OpenPOP.NET even after making its base64 decoder a no-op and 75x faster than SharpMimeTools.

Since I've been ranting against C# MIME parsers that make heavy use of regex, let me show you just how horrible regex is for parsing messages (performance-wise). There's a C# MIME parser called MIMER that is nearly pure regex, so what better library to illustrate my point? I wrote a loop very similar to the ones listed above, so I'm not going to bother repeating it again. Instead, I'll just skip to the results:

[fejj@localhost MimerParser]$ mono ./MimerParser.exe ~/Projects/MimeKit/startrek.msg
Parsed 20,000 messages in 00:16:51.4839129

Ouch. MimeKit is roughly 65x faster than a fully regex-based MIME parser. It's actually rather pathetic that this regex parser beats SharpMimeTools.

This is why, as a developer, it's important to understand the limitations of the tools you decide to use. Regex is great for some things but it is a terrible choice for others. As Jamie Zawinski might say,

Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.

Monday, September 30, 2013

Optimization Tips & Tricks used by MimeKit: Part 1

One of the goals of MimeKit, other than being the most robust MIME parser, is to be the fastest C# MIME parser this side of the Mississippi. Scratch that, fastest C# MIME parser in the World.

Seriously, though, I want to get MimeKit to be as fast and efficient as my C parser, GMime, which is one of the fastest (if not the fastest) MIME parsers out there right now, and I don't expect that any parser is likely to smoke GMime anytime soon, so using it as a baseline to compare against means that I have a realistic goal to set for MimeKit.

Now that you know the why, let's examine the how.

First, I'm using one of those rarely used features of C#: unsafe pointers. While that alone is not all that interesting, it's a cornerstone for one of the main techniques I've used. In C#, the fixed statement (which is how you get a pointer to a managed object) pins the object to a fixed location in memory to prevent the GC from moving that memory around while you operate on that buffer. Keep in mind, though, that telling the GC to pin a block of memory is not free, so you should not use this feature without careful consideration. If you're not careful, using pointers could actually make your code slower. Now that we've got that out of the way...

MIME is line-based, so a large part of every MIME parser is going to be searching for the next line of input. One of the reasons most MIME parsers (especially C# MIME parsers) are so slow is because they use a ReadLine() approach and most TextReaders likely use a naive algorithm for finding the end of the current line (as well as all of the extra allocating and copying into a string buffer):

    // scan for the end of the line
    while (inptr < inend && *inptr != (byte) '\n')
        inptr++;

The trick I used in GMime was to make sure that my read buffer was 1 byte larger than the max number of bytes I'd ever read from the underlying stream at a given time. This allowed me to set the first byte in the buffer beyond the bytes I just read from the stream to '\n', thus allowing for the ability to remove the inptr < inend check, opting to do the bounds check after the loop has completed instead. This nearly halves the number of instructions used per loop, making it much, much faster. So, now we have:

    // scan for the end of the line
    while (*inptr != (byte) '\n')
        inptr++;
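
Since the trick comes from GMime, which is C, here is a small self-contained C sketch of it. This is not GMime's actual code; the function name scan_line_sentinel is made up, and it assumes the caller has left one spare writable byte after the data for the sentinel.

```c
#include <assert.h>
#include <stddef.h>

/* Scan buf[0..len) for '\n'. The caller must guarantee that buf has room
 * for at least len + 1 bytes, because buf[len] is overwritten with the
 * sentinel. Returns the index of the first '\n', or len if there is none. */
static size_t
scan_line_sentinel (unsigned char *buf, size_t len)
{
    unsigned char *inptr = buf;
    unsigned char *inend = buf + len;

    assert (buf != NULL);

    /* plant the sentinel just past the real data */
    *inend = '\n';

    /* no inptr < inend test needed: the sentinel always stops the loop */
    while (*inptr != '\n')
        inptr++;

    /* the bounds check happens once, after the loop: inptr == inend
     * means that we only found the sentinel */
    return (size_t) (inptr - buf);
}
```

A return value equal to len tells the caller that the line continues past the end of the buffered data and that more input needs to be read.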

But is that the best we can do?

Even after using this trick, it was still the hottest loop in my parser.

We've got no choice but to use a linear scan, but that doesn't mean that we can't do it faster. If we could somehow reduce the number of loops and likewise reduce the number of pointer increments, we could eliminate a bunch of the overhead of the loop. This technique is referred to as loop unrolling. Here's what brianonymous (from the ##csharp irc channel on freenode) and I came up with (with a little help from Sean Eron Anderson's bit twiddling hacks):

    uint* dword = (uint*) inptr;
    uint mask;

    do {
        mask = *dword++ ^ 0x0A0A0A0A;
        mask = ((mask - 0x01010101) & (~mask & 0x80808080));
    } while (mask == 0);

[Figure: results of that optimization]

Now, keep in mind that on many architectures other than x86, in order to employ the trick above, inptr must first be 4-byte aligned (uint is 32-bit) or the read could raise a SIGBUS and crash. This is fairly easy to solve, though. All you need to do is increment inptr until you know that it is 4-byte aligned and then you can switch over to reading 4 bytes at a time as in the above loop. We'll also need to figure out which of those 4 bytes contained the '\n'. An easy way to solve that problem is to just linearly scan those 4 bytes using our previous single-byte-per-loop implementation starting at dword - 1. Here it is, your moment of Zen:

    // Note: we can always depend on byte[] arrays being
    // 4-byte aligned on 32bit and 64bit architectures
    int alignment = (inputIndex + 3) & ~3;
    byte* aligned = inptr + (alignment - inputIndex);
    byte* start = inptr;
    uint mask;

    while (inptr < aligned && *inptr != (byte) '\n')
        inptr++;

    if (inptr == aligned) {
        // -funroll-loops
        uint* dword = (uint*) inptr;

        do {
            mask = *dword++ ^ 0x0A0A0A0A;
            mask = ((mask - 0x01010101) & (~mask & 0x80808080));
        } while (mask == 0);

        inptr = (byte*) (dword - 1);
        while (*inptr != (byte) '\n')
            inptr++;
    }

Note: In the above code snippet, 'inputIndex' is the byte offset of 'inptr' into the byte array. Since we can safely assume that index 0 is 4-byte aligned, we can do a simple calculation to get the next multiple of 4, subtract 'inputIndex' from it, and add the difference to 'inptr' to get the next 4-byte aligned pointer.

That's great, but what does all that hex mumbo jumbo do? And why does it work?

Let's go over this 1 step at a time...

    mask = *dword++ ^ 0x0A0A0A0A;

This xor's the value of dword with 0x0A0A0A0A (0x0A0A0A0A is just 4 bytes of '\n'). The xor sets every byte that is equal to 0x0A to 0 in mask. Every other byte will be non-zero.

    mask - 0x01010101

When we subtract 0x01010101 from mask, the result will be that only bytes greater than 0x80 will contain any high-order bits (and any byte that was originally 0x0A in our input will now be 0xFF).

    ~mask & 0x80808080

This inverts the value of mask resulting in no bytes having the highest bit set except for those that had a 0 in that slot before (including the byte we're looking for). By then bitwise-and'ing it with 0x80808080, we get 0x80 for each byte that was originally 0x0A in our input or otherwise had the highest bit set after the bit inversion.

Because there's no way for any byte to have the highest bit set in both sides of the encompassing bitwise-and except for the character we're looking for (0x0A), the mask will always be 0 unless any of the bytes within were originally 0x0A, which would then break us out of the loop.
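
To make the steps concrete, here is the same test written as a small C sketch, plus a scanner built on top of it. The names has_newline and find_newline are made up for this example, and it deliberately sidesteps the alignment question by reading each group of 4 bytes with memcpy(); it assumes the caller leaves 4 writable padding bytes after the data for the '\n' sentinels.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Returns nonzero if any of the 4 bytes packed in dword is '\n' (0x0A). */
static uint32_t
has_newline (uint32_t dword)
{
    /* every byte equal to 0x0A becomes 0x00, all other bytes are non-zero */
    uint32_t mask = dword ^ 0x0A0A0A0AU;

    /* a zero byte wraps around to 0xFF when 1 is subtracted from it, and
     * its inverted high bit is also set, so it survives both sides of the & */
    return (mask - 0x01010101U) & (~mask & 0x80808080U);
}

/* Returns the index of the first '\n' in buf[0..len), or len if there is
 * none. The caller must provide 4 writable bytes at buf[len..len + 3]. */
static size_t
find_newline (unsigned char *buf, size_t len)
{
    uint32_t dword;
    size_t i = 0;

    assert (buf != NULL);

    /* sentinel padding so that neither loop needs a bounds check */
    memset (buf + len, '\n', 4);

    for (;;) {
        memcpy (&dword, buf + i, 4);  /* alignment-safe 4-byte read */
        if (has_newline (dword))
            break;
        i += 4;
    }

    /* one of these 4 bytes is a '\n'; find out which */
    while (buf[i] != '\n')
        i++;

    return i;
}
```

Working one value through by hand: for the dword 0x41420A43, the xor gives 0x4B480049, subtracting 0x01010101 gives 0x4A46FF48, and ~mask & 0x80808080 is 0x80808080, so the final result is 0x00008000, which is non-zero and flags the byte that held 0x0A.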

Well, that concludes part 1 as it is time for me to go to bed so I can wake up at a reasonable time tomorrow morning.

Good night!

Thursday, April 7, 2011

Optimizing Merge Sort

A number of years ago I wrote about the Merge Sort algorithm. One of the advantages of Merge Sort is that it is a stable sort, meaning that elements that compare as being equal remain in their original order after being sorted.

Well, today I had need of employing a stable sorting routine for sorting elements by a ZIndex in Moonlight. Up until today, we had been using qsort() which, while not guaranteed to be a stable sort on any platform, happens to be implemented in glibc as a stable sort except in out-of-memory conditions. Since we'd like Moonlight to work on platforms other than Linux+glibc (such as Mac OS or BSD), it has become important enough to implement properly.
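
As an aside, to illustrate what stability buys you when sorting by a key like ZIndex, here is a small C sketch of a common workaround that forces a stable ordering out of qsort() by recording each element's original position and using it as a tie-breaker. This is not what Moonlight did and not the approach taken in the rest of this post; the Item struct and its fields are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    int zindex;    /* the sort key */
    int id;        /* payload, used here only to tell the items apart */
    size_t order;  /* original position, used only to break ties */
} Item;

static int
compare_items (const void *a, const void *b)
{
    const Item *x = a;
    const Item *y = b;

    if (x->zindex != y->zindex)
        return x->zindex < y->zindex ? -1 : 1;

    /* equal keys: fall back to the original order */
    if (x->order != y->order)
        return x->order < y->order ? -1 : 1;

    return 0;
}

static void
sort_by_zindex (Item *items, size_t n)
{
    size_t i;

    assert (items != NULL || n == 0);

    /* remember where each item started */
    for (i = 0; i < n; i++)
        items[i].order = i;

    if (n > 1)
        qsort (items, n, sizeof (Item), compare_items);
}
```

The cost of this workaround is an extra field per element, which is one reason a sort that is stable by construction, like Merge Sort, can be the nicer fit.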

To start, I dusted off my generic MergeSort() implementation from years ago when I was writing articles about various sorting algorithms. This is what I had to start with:

#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define MID(lo, hi) ((lo) + (((hi) - (lo)) >> 1))

static void
msort (void *array, void *buf, size_t low, size_t high, size_t size,
       int (* compare) (const void *, const void *))
{
    register char *lo, *hi, *b;
    char *al, *am, *ah;
    size_t mid;
    
    mid = MID (low, high);
    
    if (mid + 1 < high)
        msort (array, buf, mid + 1, high, size, compare);
    
    if (mid > low)
        msort (array, buf, low, mid, size, compare);
    
    ah = ((char *) array) + ((high + 1) * size);
    am = ((char *) array) + ((mid + 1) * size);
    al = ((char *) array) + (low * size);
    
    b = (char *) buf;
    lo = al;
    hi = am;
    
    while (lo < am && hi < ah) {
        if (compare (lo, hi) <= 0) {
            memcpy (b, lo, size);
            lo += size;
        } else {
            memcpy (b, hi, size);
            hi += size;
        }
        
        b += size;
    }
    
    if (lo < am)
        memcpy (b, lo, am - lo);
    else if (hi < ah)
        memcpy (b, hi, ah - hi);
    
    memcpy (al, buf, ah - al);
}

int
MergeSort (void *base, size_t nmemb, size_t size,
           int (* compare) (const void *, const void *))
{
    void *tmp;
    
    if (nmemb < 2)
        return 0;
    
    if (!(tmp = malloc (nmemb * size))) {
        errno = ENOMEM;
        return -1;
    }
    
    msort (base, tmp, 0, nmemb - 1, size, compare);
    
    free (tmp);
    
    return 0;
}

Since performance is very important, I clocked this implementation against qsort() and got the following results on my Intel Core2 Quad Q6600 2.4 GHz machine using arrays of 10 million ints:

Randomized input: 14.13s vs qsort()'s 6.77s
Sorted input: 4.41s vs qsort()'s 1.54s
Reversed input: 4.26s vs qsort()'s 1.90s

Clearly the above MergeSort() implementation did not fare well against glibc's qsort() on my system, so it was time to look at what I could do to improve the performance.

The most obvious optimization I could see was to try and batch my memcpy() calls. In other words, instead of calling memcpy() to copy each and every element into our temporary buffer, it'd be more efficient to copy blocks of elements at a time:

static void
msort (void *array, void *buf, size_t low, size_t high, size_t size,
       int (* compare) (const void *, const void *))
{
    char *al, *am, *ah, *ls, *hs, *lo, *hi, *b;
    size_t mid;
    
    mid = MID (low, high);
    
    if (mid + 1 < high)
        msort (array, buf, mid + 1, high, size, compare);
    
    if (mid > low)
        msort (array, buf, low, mid, size, compare);
    
    ah = ((char *) array) + ((high + 1) * size);
    am = ((char *) array) + ((mid + 1) * size);
    al = ((char *) array) + (low * size);
    
    b = (char *) buf;
    lo = al;
    hi = am;
    
    do {
        ls = lo;
        hs = hi;
        
        if (lo > al || hi > am) {
            /* our last loop already compared lo & hi and found lo <= hi */
            lo += size;
        }
        
        while (lo < am && compare (lo, hi) <= 0)
            lo += size;
 
        if (lo > ls) {
            memcpy (b, ls, lo - ls);
            b += (lo - ls);
        }
 
        if (lo < am) {
            /* our last compare tells us hi < lo */
            hi += size;
            
            while (hi < ah && compare (hi, lo) < 0)
                hi += size;
            
            memcpy (b, hs, hi - hs);
            b += (hi - hs);
        }
    } while (lo < am && hi < ah);
    
    if (lo < am)
        memcpy (b, lo, am - lo);
    else if (hi < ah)
        memcpy (b, hi, ah - hi);
    
    memcpy (al, buf, ah - al);
}

The results were promising. For the exact same inputs (including the exact same random array), we now get:

Randomized input: 10.45s
Sorted input: 2.08s
Reversed input: 2.03s

The only other way that we can reduce the number of memcpy() calls we make is to avoid copying leading and trailing elements into our temporary buffer if it's not necessary to merge them. Here's the solution I came up with:

static void
msort (void *array, void *buf, size_t low, size_t high, size_t size,
       int (* compare) (const void *, const void *))
{
    char *a1, *al, *am, *ah, *ls, *hs, *lo, *hi, *b;
    size_t copied = 0;
    size_t mid;
    
    mid = MID (low, high);
    
    if (mid + 1 < high)
        msort (array, buf, mid + 1, high, size, compare);
    
    if (mid > low)
        msort (array, buf, low, mid, size, compare);
    
    ah = ((char *) array) + ((high + 1) * size);
    am = ((char *) array) + ((mid + 1) * size);
    a1 = al = ((char *) array) + (low * size);
    
    b = (char *) buf;
    lo = al;
    hi = am;
    
    do {
        ls = lo;
        hs = hi;
        
        if (lo > al || hi > am) {
            /* our last loop already compared lo & hi and found lo <= hi */
            lo += size;
        }
        
        while (lo < am && compare (lo, hi) <= 0)
            lo += size;
        
        if (lo < am) {
            if (copied == 0) {
                /* avoid copying the leading items */
                a1 = lo;
                ls = lo;
            }
            
            /* our last compare tells us hi < lo */
            hi += size;
            
            while (hi < ah && compare (hi, lo) < 0)
                hi += size;
            
            if (lo > ls) {
                memcpy (b, ls, lo - ls);
                copied += (lo - ls);
                b += (lo - ls);
            }
            
            memcpy (b, hs, hi - hs);
            copied += (hi - hs);
            b += (hi - hs);
        } else if (copied) {
            memcpy (b, ls, lo - ls);
            copied += (lo - ls);
            b += (lo - ls);
            
            /* copy everything we needed to re-order back into array */
            memcpy (a1, buf, copied);
            return;
        } else {
            /* everything already in order */
            return;
        }
    } while (hi < ah);
    
    if (lo < am) {
        memcpy (b, lo, am - lo);
        copied += (am - lo);
    }
    
    memcpy (a1, buf, copied);
}

Once again, reducing the amount of copying paid off:

Randomized input: 9.80s
Sorted input: 0.95s
Reversed input: 2.05s

Update 2011-05-18: One final optimization that can be tried is pre-calculating the optimum way to copy elements between buffers. This calculation, while not terribly expensive itself, adds up with every call to memcpy(). Let's start off by writing some handy macros:

#define COPYBY(TYPE, a, b, n) {         \
    long __n = (n) / sizeof (TYPE);     \
    register TYPE *__a = (TYPE *) (a);  \
    register TYPE *__b = (TYPE *) (b);  \
                                        \
    do {                                \
        *__a++ = *__b++;                \
    } while (--__n > 0);                \
}

#define MEMCOPY(dest, src, n) {                 \
    switch (copy_mode) {                        \
    case 1: COPYBY (long, dest, src, n); break; \
    case 2: COPYBY (int, dest, src, n); break;  \
    default: memcpy (dest, src, n);             \
    }                                           \
}

Now that these handy macros are written, we can plug them into our Merge Sort implementation:

static void
msort (void *array, void *buf, size_t low, size_t high, size_t size,
       int copy_mode, int (* compare) (const void *, const void *))
{
    char *a1, *al, *am, *ah, *ls, *hs, *lo, *hi, *b;
    size_t copied = 0;
    size_t mid;
    
    mid = MID (low, high);
    
    if (mid + 1 < high)
        msort (array, buf, mid + 1, high, size, copy_mode, compare);
    
    if (mid > low)
        msort (array, buf, low, mid, size, copy_mode, compare);
    
    ah = ((char *) array) + ((high + 1) * size);
    am = ((char *) array) + ((mid + 1) * size);
    a1 = al = ((char *) array) + (low * size);
    
    b = (char *) buf;
    lo = al;
    hi = am;
    
    do {
        ls = lo;
        hs = hi;
        
        if (lo > al || hi > am) {
            /* our last loop already compared lo & hi and found lo <= hi */
            lo += size;
        }
        
        while (lo < am && compare (lo, hi) <= 0)
            lo += size;
        
        if (lo < am) {
            if (copied == 0) {
                /* avoid copying the leading items */
                a1 = lo;
                ls = lo;
            }
            
            /* our last compare tells us hi < lo */
            hi += size;
            
            while (hi < ah && compare (hi, lo) < 0)
                hi += size;
            
            if (lo > ls) {
                MEMCOPY (b, ls, lo - ls);
                copied += (lo - ls);
                b += (lo - ls);
            }
            
            MEMCOPY (b, hs, hi - hs);
            copied += (hi - hs);
            b += (hi - hs);
        } else if (copied) {
            MEMCOPY (b, ls, lo - ls);
            copied += (lo - ls);
            b += (lo - ls);
            
            /* copy everything we needed to re-order back into array */
            MEMCOPY (a1, buf, copied);
            return;
        } else {
            /* everything already in order */
            return;
        }
    } while (hi < ah);
    
    if (lo < am) {
        MEMCOPY (b, lo, am - lo);
        copied += (am - lo);
    }
    
    MEMCOPY (a1, buf, copied);
}

int
MergeSort (void *base, size_t nmemb, size_t size,
           int (* compare) (const void *, const void *))
{
    int copy_mode;
    void *tmp;
    
    if (nmemb < 2)
        return 0;
    
    if (!(tmp = malloc (nmemb * size))) {
        errno = ENOMEM;
        return -1;
    }
    
    if ((((char *) base) - ((char *) 0)) % sizeof (long) == 0 && (size % sizeof (long)) == 0)
        copy_mode = 1;
    else if ((((char *) base) - ((char *) 0)) % sizeof (int) == 0 && (size % sizeof (int)) == 0)
        copy_mode = 2;
    else
        copy_mode = 0;
    
    msort (base, tmp, 0, nmemb - 1, size, copy_mode, compare);
    
    free (tmp);
    
    return 0;
}
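
A side note on the copy-mode selection in MergeSort() above: subtracting a null pointer to get at the address works on common compilers, but the same test can be written with uintptr_t instead. The following is only a sketch of that variation, with a made-up function name, and it assumes a platform that provides uintptr_t (it is optional in C99, although nearly every compiler has it).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if elements can be copied a long at a time, 2 if they can be
 * copied an int at a time, or 0 to fall back to memcpy(). These are the
 * same copy_mode values that the MEMCOPY macro switches on. */
static int
choose_copy_mode (const void *base, size_t size)
{
    uintptr_t addr = (uintptr_t) base;

    assert (size > 0);

    /* both the array's address and the element size must be multiples
     * of the word size we intend to copy with */
    if (addr % sizeof (long) == 0 && size % sizeof (long) == 0)
        return 1;

    if (addr % sizeof (int) == 0 && size % sizeof (int) == 0)
        return 2;

    return 0;
}
```

With this helper, the three-way if/else in MergeSort() could collapse to copy_mode = choose_copy_mode (base, size). The temporary buffer does not need its own check because malloc() returns memory aligned for any basic type.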

This handy trick seems to have worked out rather well:

Randomized input: 7.79s
Sorted input: 0.99s
Reversed input: 1.69s

At this point, I can't think of any other obvious optimizations so I'm going to call it a day.

For a recap, here are the results (in seconds) of all 4 implementations compared side-by-side with the results from qsort():

            qsort()   msort() v1   msort() v2   msort() v3   msort() v4
random:     6.77      14.13        10.45        9.80         7.79
sorted:     1.54      4.41         2.08         0.95         0.99
reversed:   1.90      4.26         2.03         2.05         1.69

Saturday, February 28, 2009

Text Layout Engines

As many of my loyal followers know, I wrote a really fast text layout engine for Moonlight 1.0 which was able to lay out text in more-or-less a single pass over the string. Hard to do better than that, especially with my superbly (I'm allowed to stroke my own ego, right?) designed font/glyph caches.

That said, the code had also been superbly disgusting and unmaintainable. Made worse when I had to add hacks to render text selection (Silverlight 1.0's TextBlock is like a GtkLabel in that it just renders text, but Silverlight 2.0's TextBox supports editing and selection and so is more akin to a multi-line GtkEntry widget).

Well, Thursday night, as I was watching House on Hulu, I had one of those "House moments" where he suddenly realizes what the patient is suffering from and how to solve the problem (usually when his friend, Wilson, is talking to him about something random).

I spent all day yesterday (and I mean all day, until 11pm last night) putting together my thoughts for a new design and working out the details and I think I now have a vastly improved solution that not only uses less memory in all but the pathological cases (I now use a UTF-8 string instead of a UCS4 string), but also doesn't require:

a pass over the text to break on CR/LF and construct a list of text runs, which is what my old TextLayout engine used instead of a char* (the new layout engine handles CR/LFs itself)
a whole new set of text runs every time selection boundaries change in a TextBox (because selection is no longer represented by text runs)

Of course, the same brilliance of the old design still applies: no need to re-layout when most text properties (underline, foreground, background, etc) change (obviously we still have to re-layout if font properties change because they change the metrics).

With my new design, my TextLayout class has a Select() method which allows the consumer to change the selected region of text. When you change the selection, my new logic can simply clear the cached glyph clusters for the affected area(s).

A "glyph cluster" is a cached (sub)run of glyphs in a particular text run. A "text run" is a substring of text that shares all the same text attributes and does not span line boundaries.

To break it down, a layout contains a list of lines. Each line contains a list of runs. Each run contains a list of glyph clusters.

Normally, a run will consist of only a single glyph cluster unless it overlaps the selection.

For example, if the first half of the run is within the selection, then the run will contain 2 glyph clusters (one for the selected portion and one for the non-selected portion). However, if the selection is fully contained within a single run but doesn't span the entire run, then it's possible to have up to 3 glyph clusters for that run: pre-selection, selection, post-selection.
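To make the splitting rule concrete, here is a minimal sketch in C. This is not Moonlight's actual code: the ClusterRange struct and split_run() function are hypothetical names, and I'm assuming that runs and selections are half-open [start, end) ranges of text offsets.

```c
#include <assert.h>

/* Hypothetical description of one glyph cluster: a half-open range
 * [start, end) of text offsets plus a flag saying whether it lies
 * inside the selection. */
typedef struct {
    int start;
    int end;
    int selected;
} ClusterRange;

/* Split the run [run_start, run_end) against the selection
 * [sel_start, sel_end). Writes between 1 and 3 clusters into out[]
 * and returns how many were written. */
static int
split_run (int run_start, int run_end, int sel_start, int sel_end,
           ClusterRange out[3])
{
    /* clamp the selection to the bounds of the run */
    int s = sel_start > run_start ? sel_start : run_start;
    int e = sel_end < run_end ? sel_end : run_end;
    int n = 0;

    assert (run_start <= run_end);

    if (s >= e) {
        /* no overlap: the whole run is a single unselected cluster */
        out[n++] = (ClusterRange) { run_start, run_end, 0 };
        return n;
    }

    if (run_start < s)
        out[n++] = (ClusterRange) { run_start, s, 0 };   /* pre-selection */

    out[n++] = (ClusterRange) { s, e, 1 };               /* selection */

    if (e < run_end)
        out[n++] = (ClusterRange) { e, run_end, 0 };     /* post-selection */

    return n;
}
```

For a run covering offsets 0 to 10, a selection of 0 to 5 yields 2 clusters and a selection of 3 to 7 yields 3, matching the cases described above.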

The brilliance of doing it this way is that it simplifies keeping track of kerning between selected regions, so that as you drag your selection across some text, the text following your mouse cursor doesn't appear to "jump" to the left or right as you move between characters that are kerned.

Friday, October 31, 2008

Optimizing Moonlight's InkPresenter

This past week I started out staring at endless amounts of javascript trying to figure out what it was supposed to do so that I could figure out what Moonlight was breaking on in order to fix some bugs. As you can imagine, this is a slow and boring process where one's eyes go dry and it feels like you are getting nowhere fast.

As occasionally happens, I give up and move onto the next bug hoping that the next bug will be easier to fix and/or give me some insight into the previous bug. As it turned out, I moved onto bug #409793 which was a performance bug on sites like Ink Journal and Ink Tattoo Studio.

What was happening on these sites was that as more points got added to the stroke, rendering would get slower and slower (thus causing the line to lag behind the mouse cursor) because Moonlight was invalidating the entire InkPresenter canvas on each frame and so having to render the entire thing even though it was unnecessary.

To optimize this, I added a 'dirty' rectangle that kept track of the actual regions we needed to redraw. As points got added to the collection between frames, I added the bounds of the new point plus the region between the new point and the previous/next points (the 'next' point is obviously only needed if the newly added point was an insertion). The result for sites like InkJournal and InkTattooStudio was that we only invalidated the newly appended points at each frame render, vastly improving our performance which now matches Microsoft's Silverlight performance for these sites afaict.
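The dirty-rectangle bookkeeping can be sketched roughly as follows. This is only an illustration under my own assumptions, not the code that went into Moonlight: the Rect type and the function names are hypothetical, and I pad each segment by a fixed amount to stand in for the stroke thickness.

```c
#include <assert.h>

/* Hypothetical axis-aligned rectangle; an empty rect has w <= 0 or h <= 0. */
typedef struct {
    double x, y, w, h;
} Rect;

static int
rect_is_empty (Rect r)
{
    return r.w <= 0 || r.h <= 0;
}

/* Smallest rectangle containing both a and b. */
static Rect
rect_union (Rect a, Rect b)
{
    double right, bottom;
    Rect r;

    if (rect_is_empty (a))
        return b;
    if (rect_is_empty (b))
        return a;

    r.x = a.x < b.x ? a.x : b.x;
    r.y = a.y < b.y ? a.y : b.y;
    right = a.x + a.w > b.x + b.w ? a.x + a.w : b.x + b.w;
    bottom = a.y + a.h > b.y + b.h ? a.y + a.h : b.y + b.h;
    r.w = right - r.x;
    r.h = bottom - r.y;

    return r;
}

/* Bounds of the segment between two stroke points, grown by pad on
 * every side to account for the thickness of the stroke. */
static Rect
segment_bounds (double x0, double y0, double x1, double y1, double pad)
{
    Rect r;

    assert (pad > 0);

    r.x = (x0 < x1 ? x0 : x1) - pad;
    r.y = (y0 < y1 ? y0 : y1) - pad;
    r.w = (x0 > x1 ? x0 : x1) + pad - r.x;
    r.h = (y0 > y1 ? y0 : y1) + pad - r.y;

    return r;
}

/* Called for each point added between frames: grow the dirty region
 * to cover the segment joining the new point to its neighbor. */
static void
dirty_add_segment (Rect *dirty, double x0, double y0,
                   double x1, double y1, double pad)
{
    *dirty = rect_union (*dirty, segment_bounds (x0, y0, x1, y1, pad));
}
```

At render time, only the accumulated dirty rect would be invalidated, after which it gets reset to empty for the next frame.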

Thursday, August 14, 2008

Moonlight Performance Enhancements

Spent today working with Sebastien "Master Profiler" Pouliot on finding and resolving one of our longest outstanding performance bottlenecks in Moonlight for which the GlyphMap Utility is a perfect test case.

Months ago, I had rewritten the font caching a bit so that we avoided (in most cases) loading the same font file more than once when it was shared between multiple textual XAML elements (Glyphs and TextBlock). At the time, font loading was at the top of the performance profile. While this fix did help a little as far as rendering speed went, it was barely noticeable visually. It wasn't a waste, however, because it greatly reduced memory overhead and later allowed for some better font scaling optimization tricks that I implemented.

Next up, I rewrote the Glyphs XAML element's ::ComputeBounds() routine such that not only did it calculate the rendering extents of the Glyphs element, but also cached the glyph string layout in a Cairo path such that later calls to ::Render() could simply blit the path rather than having to do its own layout.

Still, visual rendering performance seemed barely affected.

Then, today, Sebastien decided to take a look into the problem again and had discovered that sorting UIElements based on ZIndex was toward the top of the list as far as time-eaters went. The fix for this was to delay sorting the newly added UIElements until the engine went to render the next frame.
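The idea of deferring the sort can be sketched like this. It is a simplified illustration, not the actual Moonlight change: the ElementList type, its fixed capacity, and the sort_count field (there only to show how often we really sort) are all assumptions of mine.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_ELEMENTS 64

/* Hypothetical stand-in for a UIElement: just an id and a ZIndex. */
typedef struct {
    int id;
    int zindex;
} Element;

typedef struct {
    Element items[MAX_ELEMENTS];
    size_t count;
    int needs_sort;   /* set when items[] may be out of ZIndex order */
    int sort_count;   /* number of sorts performed, for illustration */
} ElementList;

static void
element_list_init (ElementList *list)
{
    list->count = 0;
    list->needs_sort = 0;
    list->sort_count = 0;
}

static int
compare_zindex (const void *a, const void *b)
{
    const Element *ea = a, *eb = b;

    return (ea->zindex > eb->zindex) - (ea->zindex < eb->zindex);
}

/* Adding an element no longer sorts; it only marks the list dirty. */
static void
element_list_add (ElementList *list, int id, int zindex)
{
    assert (list->count < MAX_ELEMENTS);

    list->items[list->count].id = id;
    list->items[list->count].zindex = zindex;
    list->count++;
    list->needs_sort = 1;
}

/* Called once per frame before rendering: sort only if something changed. */
static void
element_list_prepare_render (ElementList *list)
{
    if (!list->needs_sort)
        return;

    qsort (list->items, list->count, sizeof (Element), compare_zindex);
    list->needs_sort = 0;
    list->sort_count++;
}
```

Adding N elements between two frames now costs a single sort instead of N of them.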

This bumped the sorting code right out of the time-eaters list but still no visual improvement :(

Sebastien reported back to me that we seemed to be idling between Glyphs draws, which immediately made me recall that our Downloaders asynchronously download the font files that each Glyphs element references and that at each render tick, we only popped a single request from our async queue.

I immediately went to work and fixed the async queue logic to pop as many requests as we could in 1/30th of a second (this value may need to be tweaked a slight bit, but it should typically be acceptable).
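A time-budgeted drain loop along those lines might look something like the sketch below. The names are hypothetical and this is not Moonlight's queue code; in particular, the clock is passed in as a function pointer, and fake_clock() is a deterministic stand-in that I'm using so the behavior is easy to follow (a real implementation would read a monotonic clock instead).

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_MAX 32

/* Hypothetical queue of pending async requests (just ids here). */
typedef struct {
    int items[QUEUE_MAX];
    size_t head, tail;
} RequestQueue;

/* Returns the current time in milliseconds. */
typedef long (* ClockFunc) (void);

static void
queue_init (RequestQueue *q)
{
    q->head = q->tail = 0;
}

static void
queue_push (RequestQueue *q, int id)
{
    assert (q->tail < QUEUE_MAX);

    q->items[q->tail++] = id;
}

/* Handle queued requests until the queue is empty or the time budget
 * has been used up. At least one request is handled per call (if any
 * are pending), which matches the old one-per-tick behavior. */
static size_t
queue_drain (RequestQueue *q, long budget_ms, ClockFunc now)
{
    long deadline = now () + budget_ms;
    size_t handled = 0;

    while (q->head < q->tail) {
        q->head++;   /* a real handler would process items[head] here */
        handled++;

        if (now () >= deadline)
            break;
    }

    return handled;
}

/* Stand-in clock for demonstration: pretends that each call happens
 * 10ms after the previous one. */
static long fake_time_ms = 0;

static long
fake_clock (void)
{
    long t = fake_time_ms;

    fake_time_ms += 10;

    return t;
}
```

With a budget of 33ms (roughly 1/30th of a second) and requests that take 10ms each, every tick handles 4 requests rather than 1.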

The GlyphMap table now renders instantly.

Saturday, June 14, 2008

Calculating the Nearest Power of 2

The typical implementation for finding the nearest power of 2 for a given value is as follows:

static uint32_t
nearest_pow (uint32_t num)
{
    uint32_t n = 1;

    while (n < num)
        n <<= 1;

    return n;
}

This implementation's performance, unfortunately, suffers as the value of num increases. Luckily there is another approach that takes a constant time no matter how large the value:

static uint32_t
nearest_pow (uint32_t num)
{
    uint32_t n = num > 0 ? num - 1 : 0;

    n |= n >> 1;
    n |= n >> 2;
    n |= n >> 4;
    n |= n >> 8;
    n |= n >> 16;
    n++;

    return n;
}

A simple performance test might be:

int main (int argc, char **argv)
{
    uint32_t i, n = 0;

    for (i = 0; i < INT_MAX / 10; i++)
        n += nearest_pow (i);

    return n > 0 ? 1 : 0;
}

The run-time difference between the two implementations on my AMD Athlon (/proc/cpuinfo reports AMD Athlon(TM) XP 3200+ @ 2200.141 MHz) is impressive. For performance testing, I compiled with gcc -O2 which I figure is the typical default for most packaged software on Linux distributions.

The brain-dead approach has the following results:

[fejj@serenity cvs]$ time ./pow

real 0m12.034s
user 0m11.809s
sys 0m0.032s

The bitwise implementation is insanely fast:

[fejj@serenity cvs]$ time ./pow2

real 0m1.361s
user 0m1.304s
sys 0m0.008s

Now... to be fair, if you are using small values for num, then it's possible that the brain-dead approach might be faster. Let's try the same main() for-loop again, but this time let's request nearest_pow() with a value of 1 each time. Since it is likely that the results will be far too fast to really compare, let's also bump up the number of iterations to UINT_MAX.

[fejj@serenity cvs]$ time ./pow

real 0m0.003s
user 0m0.000s
sys 0m0.004s
[fejj@serenity cvs]$ time ./pow2

real 0m0.002s
user 0m0.000s
sys 0m0.000s

Unfortunately, both are still far too fast to really compare performance. Let's try bumping up the value of num to see if we can find the point at which the while-loop approach starts to fall behind the bitwise approach. To start, let's try passing the value of 2 as the num argument:

[fejj@serenity cvs]$ time ./pow

real 0m0.002s
user 0m0.000s
sys 0m0.004s
[fejj@serenity cvs]$ time ./pow2

real 0m0.002s
user 0m0.000s
sys 0m0.000s

It looks like the bitwise approach may be faster than the while-loop approach for the value of 2, but it's a bit hard to tell for sure with only UINT_MAX loops. We'd have to switch to using a 64bit i to know for sure and I'm not sure it's that important. Let's try 3 and see what we get:

[fejj@serenity cvs]$ time ./pow

real 0m6.053s
user 0m5.968s
sys 0m0.004s
[fejj@serenity cvs]$ time ./pow2

real 0m0.003s
user 0m0.000s
sys 0m0.004s

Well, hot diggity... I think we have ourselves a winner. This suggests that for all values of num larger than 2, the performance of the while-loop approach will be outmatched by the bitwise approach and that for values less-than-or-equal to 2, the performance is nearly identical.

Update: Thanks to the anonymous commenter that noticed that my original main() program was allowing the compiler to optimize out the call to nearest_pow() in the bitwise implementation. As suggested, I updated the for-loop to accumulate the output and then used it after the loop to avoid this problem. It only seemed to change the results for the bitwise implementation in the first test, however (before the change, it reported 0.002s). Still, on my machine it is approx. 10x faster for the first test case and seems to lose no performance even in the optimal conditions for the while-loop implementation.

Update2: I was just pointed to the Linux kernel's fls() implementation for x86. Here is a new implementation using inline assembler for x86:

static uint32_t
nearest_pow (uint32_t num)
{
    int bit;

    __asm__("bsrl %1,%0\n\t"
            "jnz 1f\n\t"
            "movl $-1,%0\n"
            "1:" : "=r" (bit) : "rm" (num));

    return (1 << (bit + 1));
}

Running the original INT_MAX / 10 iterations using i as the num argument yields the following results:

[fejj@serenity cvs]$ time ./pow3

real 0m1.335s
user 0m1.296s
sys 0m0.004s

The results seem negligibly faster than the C bitwise implementation and obviously less portable :(

Update3: A friend of mine, Stephane Delcroix, has just pointed me at a solution to this problem that he came up with the other day:

static uint32_t
nearest_pow (uint32_t num)
{
    uint32_t j, k;
    (j = num & 0xFFFF0000) || (j = num);
    (k = j & 0xFF00FF00) || (k = j);
    (j = k & 0xF0F0F0F0) || (j = k);
    (k = j & 0xCCCCCCCC) || (k = j);
    (j = k & 0xAAAAAAAA) || (j = k);
    return j << 1;
}

The results of this implementation are as follows:

[fejj@serenity cvs]$ time ./pow4

real 0m1.249s
user 0m1.204s
sys 0m0.004s

This is actually faster than both the bitwise and the assembler implementations above!

There are two things to be aware of, though:

When num is 0, the value of 0 is returned (which may not be desirable depending on what you are doing with it)
If num is a power of 2, then instead of returning num, this implementation will return the next higher power of 2
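If those two behaviors matter for your use case, one possible workaround is a thin wrapper around Stephane's routine. The sketch below is my own suggestion rather than part of his solution: nearest_pow_inclusive() is a hypothetical name, and the mask steps are restated as plain if statements (same logic) so that the compiler doesn't warn about unused expression results.

```c
#include <assert.h>
#include <stdint.h>

/* Stephane's routine: isolates the highest set bit of num and returns
 * twice that value (or 0 when num is 0). */
static uint32_t
nearest_pow (uint32_t num)
{
    uint32_t j, k;

    j = num & 0xFFFF0000;
    if (!j) j = num;
    k = j & 0xFF00FF00;
    if (!k) k = j;
    j = k & 0xF0F0F0F0;
    if (!j) j = k;
    k = j & 0xCCCCCCCC;
    if (!k) k = j;
    j = k & 0xAAAAAAAA;
    if (!j) j = k;

    return j << 1;
}

/* Returns the smallest power of 2 that is >= num, treating 0 like 1
 * (which is what the while-loop version does). */
static uint32_t
nearest_pow_inclusive (uint32_t num)
{
    /* anything larger has no answer that fits in 32 bits */
    assert (num <= 0x80000000u);

    if (num <= 1)
        return 1;

    /* Subtracting 1 first means an exact power of 2 maps to itself. */
    return nearest_pow (num - 1);
}
```

The extra branch and subtraction will cost a little, so whether this is still faster than the plain bitwise version would need to be measured.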

Wednesday, February 6, 2008

Optimizing GMime's UUEncoder

This past weekend I was talking with Andreia about how Pan is built on top of GMime and takes advantage of my awesomely speedy uuencode/uudecode routines which reminded me that I had done some performance comparisons of GMime's uuencode program vs. the one in the GNU sharutils package a number of years ago.

I had compared GMime 1.90.0 (which was a pre-release of GMime 2.0) and GNU sharutils 4.2.0 and the results were pretty impressive... GMime's uuencoder was on the order of 3 times faster than the one in sharutils and produced exactly the same results.

The uudecoder and the base64 encoder/decoder were all roughly on the order of 7 times faster than those in GNU sharutils, so all around GMime outperformed GNU sharutils by quite a bit.

Anyways, re-reading my test results got me thinking that my uuencode routines could probably be optimized a bit more as they were lagging a bit behind the base64 encoder routine and there's really no reason it should be that far off.

Well, tonight I finally got off my butt and decided to take a look and figure out why. Upon scrolling down to my uuencode_step() routine, I immediately saw why:

Each loop would collect up to 3 bytes from the input and bit shift them into a 32bit 'saved' variable (which is a state variable used for incremental uuencoding an input stream). Then, if I had successfully extracted 3 bytes from the input, I would extract them out of 'saved' into 3 unsigned char variables. At this point I would then encode them into a temporary output buffer. When this output buffer ('uubuf') grew to 60 bytes, I'd flush it to the real output buffer with a memcpy().

All of this extra copying of data around adds up after a while and really begins to impact performance.
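To illustrate the kind of copying that can be avoided, here is a minimal sketch that encodes each 3-byte group straight into the caller's output buffer. To be clear, this is not the code that went into GMime: uuencode_triplets() and UUENCODE_CHAR are hypothetical names, and the sketch only handles complete 3-byte groups, leaving out the line-length prefix, the line breaks and the saved state needed for incremental encoding.

```c
#include <assert.h>
#include <stddef.h>

/* Map a 6-bit value to its uuencoded character; 0 is conventionally
 * written as '`' rather than as a space. */
#define UUENCODE_CHAR(c) ((c) ? (char) ((c) + ' ') : '`')

/* Encode every complete 3-byte group of the input directly into out
 * (which must have room for 4 chars per group). Returns the number of
 * chars written; any 1 or 2 trailing input bytes are ignored. */
static size_t
uuencode_triplets (const unsigned char *in, size_t inlen, char *out)
{
    const unsigned char *inend;
    char *outptr = out;

    assert (in != NULL && out != NULL);

    inend = in + (inlen - inlen % 3);

    while (in < inend) {
        unsigned char b0 = in[0];
        unsigned char b1 = in[1];
        unsigned char b2 = in[2];

        in += 3;

        /* 3 bytes (24 bits) become four 6-bit values */
        *outptr++ = UUENCODE_CHAR (b0 >> 2);
        *outptr++ = UUENCODE_CHAR (((b0 << 4) | (b1 >> 4)) & 0x3f);
        *outptr++ = UUENCODE_CHAR (((b1 << 2) | (b2 >> 6)) & 0x3f);
        *outptr++ = UUENCODE_CHAR (b2 & 0x3f);
    }

    return (size_t) (outptr - out);
}
```

There is no intermediate 'saved' word, no per-byte temporaries beyond the three locals and no memcpy() from a staging buffer; each input byte is read once and each output char is written once.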

Before making any changes, I timed how long it took the original version of my uuencode_step() function to encode linux-2.6.24.tar.gz on my system[1]. An average result over numerous runs was as follows:

[fejj@localhost ~]$ time `gmime-uuencode linux-2.6.24.tar.gz linux-2.6.24.tar.gz > /dev/null`
real    0m0.470s
user    0m0.412s
sys     0m0.052s

After my rewrite, my new results were closer to:

[fejj@localhost ~]$ time `gmime-uuencode linux-2.6.24.tar.gz linux-2.6.24.tar.gz > /dev/null`
real    0m0.291s
user    0m0.252s
sys     0m0.024s

For the sake of comparison, the best time I could manage to get from GNU sharutils 4.6.2 was as follows:

[fejj@localhost ~]$ time `uuencode linux-2.6.24.tar.gz linux-2.6.24.tar.gz > /dev/null`
real    0m1.386s
user    0m1.276s
sys     0m0.092s

The new implementation of uuencode_step() in gmime/gmime-utils.c has been committed to the gmime svn module on GNOME's subversion server, revision 1216 - this change should appear in the next release of GMime which will likely be 2.2.17.

Notes:

1. The system I tested this on was my Lenovo T61 laptop w/ a 7200 RPM harddrive running OpenSuSE 10.3 with various updates. The kernel was version 2.6.22.13-0.3-bigsmp.

From /proc/cpuinfo:

model name : Intel(R) Core(TM)2 Duo CPU T7700 @ 2.40GHz
cpu MHz : 800.000

(i.e. my CPU was scaled down at the time of testing)

2. The GMime uuencode implementation uses a GMimeStreamFs for input as well as output. This stream class is a wrapper around the POSIX I/O system functions which unfortunately has a sub-optimal need to perform an lseek() before each read() or write() call in order to make sure that the underlying file descriptor is in the expected position. This is necessary because it is possible for multiple streams to re-use the same fd.

I mention this because an obvious rebuttal to GMime's superior performance might be to suspect that GMime's uuencode implementation "cheated" by using an mmap()'d input buffer where the GNU sharutils implementation might not.

Friday, September 28, 2007

Text Rendering

Moonlight has some fairly unique text rendering requirements that I've not seen done anywhere else in Linux, Gtk+ application or no. Most applications stick to rendering text horizontally or vertically; they rarely, if ever, perform unusual matrix transformations on text (e.g. rotations).

When I first implemented text rendering for Moonlight, I used the obvious choice: Pango, using the Cairo backend (since we're using cairo in Moonlight for 2D graphics rendering anyway).

Unfortunately, we ran into some problems...

The first major problem is that as we applied rotation transforms and called pango_cairo_show_layout(), we'd get rendering glitches in that each glyph seemed to have its own independent baseline and so each frame, glyphs appeared to jitter.

The second major problem was that calling pango_cairo_show_layout() each frame had some major performance problems.

Thirdly, there appears to be no way to tell pango to load a font face from a specific file path.

As for the rendering performance issue, we considered caching the layout path but my discussion with Owen suggested that it would gain us nothing. That, plus the perceived difficulty of doing this (since we may have to change brushes mid-stroke), shied me away from bothering to try.

These problems led me to consider implementing our own text layout/rendering engine to see if we could solve the above problems since the pango maintainers didn't seem to know what the problems could be offhand and thus had no suggestions for us.

At first, my text layout/rendering engine only handled rendering of glyphs via bitmaps, but even so, the result was that this new layout/rendering engine was quite a bit faster than pango.

Seeing this, Chris Toshok, Larry Ewing and I started digging into pango text rendering performance problems a bit more, not quite willing to give up on pango.

Toshok noticed that pango was loading the same font over and over again each frame, so started digging into that aspect a bit and came up with a patch to pango to fix a bug where it used the entire transform matrix as part of the hash instead of just the scaling component (which is all that was needed for uniqueness).

For one of the very simple text rendering test cases we had (the text "Moonlight in 21 Days" spinning and resizing via cairo matrix transforms), Toshok's patch nearly doubled the speed of pango rendering from something like 20 to 40fps (40fps was our cap, so it may have even rendered faster).

Meanwhile, I began looking into that cairo path caching idea and discovered it wasn't nearly as complicated to implement as I had originally feared. The results were just as amazing, again doubling the performance or better, although this was before I had applied Toshok's patch (so don't get the idea that my patch + Toshok's patch = 4x speed improvement).

Not only did my patch make a huge performance improvement, it also got rid of the glyph jittering.

Unfortunately, this still left us with problem #3 as well as a few other problems regarding layout dissimilarities between pango and Microsoft's text layout in Silverlight, so for now, it seems I needed to go back to my own text layout/rendering engine.

Once I had finished adding support for rendering glyph paths, I implemented a similar cairo_path_t caching hack for my own text rendering engine and made it possible to choose which text layout/rendering engine to use at runtime via an environment variable.

Out of curiosity, I decided to compare performance of my own text layout/rendering engine vs pango on a test case I had of several "Hello" strings each having different combinations of matrix transforms applied to them in an ongoing animation. One of the "Hello" strings was simply undergoing FontSize changes which cause each of the text layout engines to have to recalculate the layout (wrapping, etc).

The performance difference was shocking... the pango implementation only gets about 23fps while my home-rolled implementation gets 45fps. (The pango implementation doesn't even render the underline for one of the text strings, due to a bug in my cairo_path_t caching hack. If anyone has any suggestions on how to fix this in mango.cpp, don't hesitate to poke me.)

It might be possible to improve the performance of my home-rolled implementation if I were to fix my code to use a true FT_Face LRU cache... right now it simply keeps a hash of loaded FT_Faces with a ref_count, and when that ref_count hits 0, the face gets removed from the hash table. This means that each frame it has to load a new FT_Face from FontConfig because the FontSize attribute changes, and since the TextBlock held the only ref on that particular FT_Face, it goes away and has to be reloaded again next time it changes back to that size. Oh, and the 45 fps was with debug spew turned on showing me whenever a new font got loaded - so turning off that printf() would probably bump me up to 50fps (which is the new fps cap on my machine).
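For the curious, the difference between the ref-counted hash and an LRU cache can be sketched as below. This is a toy illustration with made-up names, not code from Moonlight's font.cpp: faces are keyed by an integer font size, an int stands in for the FT_Face, and load_face() is a placeholder for the expensive load.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CACHE_SIZE 4

typedef struct {
    int size;             /* key: the font size */
    int face;             /* stand-in for an FT_Face */
    unsigned long used;   /* tick of last use; 0 means the slot is empty */
} CacheEntry;

typedef struct {
    CacheEntry entries[CACHE_SIZE];
    unsigned long tick;
    int loads;            /* number of (expensive) loads performed */
} FaceCache;

static void
face_cache_init (FaceCache *cache)
{
    memset (cache, 0, sizeof (FaceCache));
}

/* Placeholder for the expensive operation of loading a face. */
static int
load_face (int size)
{
    return size * 100;
}

/* Look up the face for the given size, loading it on a miss and
 * evicting the least recently used entry if the cache is full. */
static int
face_cache_get (FaceCache *cache, int size)
{
    CacheEntry *victim;
    size_t i;

    assert (cache != NULL);

    victim = &cache->entries[0];

    for (i = 0; i < CACHE_SIZE; i++) {
        CacheEntry *entry = &cache->entries[i];

        if (entry->used != 0 && entry->size == size) {
            entry->used = ++cache->tick;   /* hit: mark as recently used */
            return entry->face;
        }

        /* remember an empty slot or the least recently used entry */
        if (entry->used < victim->used)
            victim = entry;
    }

    /* miss: (re)use the victim slot */
    victim->size = size;
    victim->face = load_face (size);
    victim->used = ++cache->tick;
    cache->loads++;

    return victim->face;
}
```

The point is that a face whose last user went away stays cached until its slot is actually needed, so an animation that cycles through a handful of font sizes would only pay for each load once.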

As a further test, I removed the "Hello" TextBlock that had the FontSize attribute changes each frame. The result was that both pango and my own text layout/rendering engines jumped to ~50fps.

This suggests pango's layout calculation is where the performance bottleneck is.

I guess I'll have to dig into this problem some more later... Or, even better, maybe one of the pango developers can take a peek at Moonlight's font.cpp and see if they can maybe glean some ideas from there that they can apply to pango :)

Monday, April 16, 2007

Childhood Memories

Today, Miguel asked me to take a look at fixing some of the System.Console 2.0 bugs. I managed to fix some of the ReadKey() and ReadLine() bugs, although the Backspace bug illustrated in Iron Python is still out there (it appears that the cursor X,Y position is not correctly kept track of in Mono).

Earlier today, when Miguel was explaining to me what sorts of problems existed in System.Console, one of the things he was hoping I could take a look at (although there was no bug # that I know of?) was optimizing Console.Write[Line]().

I didn't actually have time to look at that until tonight when I was sitting at my home computer waiting for my dinner to finish cooking. The solution was fairly simple (for those interested in seeing the patch, check out revision 75806).

An important part of optimizing a section of code is to compare actual running times of the old code vs the new code (and, obviously, to check that the results are correct), so I wrote a simple program that could be measured for performance improvements:

using System;

public class Program {
    static void Main () {
        Console.ForegroundColor = ConsoleColor.Cyan;
        string abc = new String ('a', 100000);
        Console.WriteLine (abc);
    }
}

Using the system `time' command, I got the following times:

pre optimization:

real 0m13.177s
user 0m0.308s
sys 0m0.232s

post optimization:

real 0m0.238s
user 0m0.124s
sys 0m0.004s

Wowzers!

Anyways, the reason the title of this blog is "Childhood Memories" is because after running this test program, I couldn't help but remember my first assembler program for the 6502 that I wrote back when I was 8 years old - it was a program which filled the screen with the character 'A' (which I then compared to a program written in BASIC that did the same thing). The difference in speed here was about the same as it was back then, too, funnily enough :)

Update: This morning I woke up realizing that I had a bug in my optimization patch from last night, but I had a fix that lost no performance. Later today (in part thanks to Alan prompting me to re-evaluate my need for a temporary buffer, which truly became unnecessary after my fix this morning), I was able to optimize it even further by blitting chunks of the input buffer between special escape sequences at a time (we have to handle certain escape sequences specially as they can relocate the cursor).

Oh, and the Iron Python bug is now solved... it was actually arguably a bug in Iron Python in that it goes behind Mono's back when writing to stdout so it was impossible for us to keep track of the cursor position. They do, however, use Console.In.ReadLine() (would be better if they simply used Console.ReadLine() but I digress), and so what I did was make ReadLine() query the terminal for the cursor position (rather than rely on our own state).

A Moment of Zen