Showing posts with label gmime. Show all posts

Sunday, April 9, 2017

GMime 2.99.0 released

After a long hiatus, I am pleased to announce the release of GMime 2.99.0!

See below for a list of new features and bug fixes.

About GMime

GMime is a C library which may be used for the creation and parsing of messages using the Multipurpose Internet Mail Extension (MIME), as defined by numerous IETF specifications.

GMime features an extremely robust high-performance parser designed to be able to preserve byte-for-byte information allowing developers to re-seralize the parsed messages back to a stream exactly as the parser found them. It also features integrated GnuPG and S/MIME v3.2 support.

Built on top of GObject (the object system used by the GNOME desktop), many developers should find its API design and memory management very familiar.

Noteworthy changes in version 2.99.0

Overhauled the GnuPG support to use GPGME under the hood rather than a custom wrapper.
Added S/MIME support, also thanks to GPGME.
Added International Domain Name support via GNU's libidn.
Improved the GMimeMessage APIs for accessing the common address headers. They now all return an InternetAddressList.
g_mime_init() no longer takes any flag arguments and the g_mime_set_user_charsets() API has also been dropped. Instead, GMimeParserOptions and GMimeFormatOptions have taken the place of these APIs to allow customization of various parser and formatting options in a much cleaner way. To facilitate this, many parsing functions and formatting functions have changed to now take these options arguments.
InternetAddress now has a 'charset' property that can be set to override GMime's auto-detection of the best charset to use when encoding names.
GMimeHeaderIter has been dropped in favor of a much simpler index-based API on GMimeHeaderList.
GMimeHeaderList no longer caches the raw message/mime headers in a stream. Instead, each GMimeHeader now has its own cache. This means that changing the GMimeHeaderList or any of its GMimeHeaders no longer invalidates the entire cache.
GMimeParser has been fixed to preserve (munged or otherwise) From-lines that sometimes appear at the start of the content of message/rfc822 parts.
GMimeParser now also scans for encapsulated PGP blocks within MIME parts as it is parsing them and sets a flag on each GMimePart that contains one of these blocks.
GMimePart now has APIs for dealing with said encapsulated PGP blocks.

Developers interested in migrating to the upcoming GMime 3.0 API (of which GMime 2.99.0 is a preview) should take a look at the PORTING document included with the source code as it contains a fairly comprehensive list of the API changes that they will need to be aware of.

Getting the Source Code

You can download official public release tarballs of GMime at https://download.gnome.org/sources/gmime/ or ftp://ftp.gnome.org/pub/GNOME/sources/gmime/.

If you would like to contribute to the GMime project, it is recommended that you grab the source code from the official GitHub repository at https://github.com/jstedfast/gmime. Cloning this repository can be done using the following command:

git clone https://github.com/jstedfast/gmime.git

Documentation

API reference documentation can be found at https://developer.gnome.org/gmime/2.99/.

Documentation for getting started can be found in the README.md.

Monday, March 10, 2014

GMime gets a Speed Boost

With all of the performance improvements I've been putting into MimeKit recently, it was about time to port some of these optimizations back to GMime.

In addition to other fixes that were in the queue, GMime 2.6.20 includes the "SIMD" optimization hack that I blogged about doing for MimeKit and I wanted to share the results. Below is a comparison of GMime 2.6.19 and 2.6.20 parsing the same 2GB mbox file on my 2011 Core-i5 iMac with the "persistent stream" option enabled on the GMimeParser:

[fejj@localhost gmime-2.6.19]$ ./gmime-mbox-parser really-big.mbox
Parsed 29792 messages in 5.15 seconds.

[fejj@localhost gmime-2.6.20]$ ./gmime-mbox-parser really-big.mbox
Parsed 29792 messages in 4.70 seconds.

That's a pretty respectable improvement. Interestingly, though, it's still not as fast as MimeKit utilizing Mono's LLVM backend:

[fejj@localhost MimeKit]$ mono --llvm ./mbox-parser.exe really-big.mbox
Parsed 29792 messages in 4.52 seconds.

Of course, to be fair, without the --llvm option, MimeKit doesn't fare quite so well:

[fejj@localhost MimeKit]$ mono ./mbox-parser.exe really-big.mbox
Parsed 29792 messages in 5.54 seconds.

I'm not sure what kind of optimizations LLVM utilizes when used from Mono vs clang (used to compile GMime via homebrew, which I suspect uses -O2), but nevertheless, it's still very impressive.

After talking with Rodrigo Kumpera from the Mono runtime team, it sounds like the --llvm option is essentially the -O2 optimizations minus a few of the options that cause problems with the Mono runtime, so effectively somewhere between -O1 and -O2.

I'd love to find out why MimeKit with the LLVM optimizer is faster than GMime compiled with clang (which also makes use of LLVM) with the same optimizations, but I think it'll be pretty hard to narrow down exactly because MimeKit isn't really a straight port of GMime (they are similar, but a lot of MimeKit is all-new in design and implementation).

Monday, February 3, 2014

Introducing MailKit, a cross-platform .NET mail-client library

Once I announced MimeKit, I knew it would only be a matter of time before I started getting asked about SMTP, IMAP, and/or POP3 support.

Let's just say,

Challenge... ACCEPTED!

I started off back in early December writing an SmtpClient so that developers using MimeKit wouldn't have to convert a MimeMessage to a System.Net.Mail.MailMessage in order to send it using System.Net.Mail.SmtpClient. This went pretty quickly because I've implemented several SMTP clients in the past. Implementing the various SASL authentication mechanisms probably took as much or more time than implementing the SMTP protocol.

The following weekend, I ended up implementing a Pop3Client. Originally, I had planned on more-or-less cloning the API we had used in Evolution, but I decided that I would take a different approach. I designed a simple IMessageSpool interface which more closely follows the limited functionality of POP3 and mbox spools instead of trying to map the Pop3Client to a Store/Folder paradigm like JavaMail and Evolution do (Evolution's mail library was loosely based on JavaMail). Mapping mbox and POP3 spools to Stores and Folders in Evolution was, to my recollection, rather awkward and I wanted to avoid that with MailKit.

At first I was loathe to do it, but over the past 2 weeks I ended up writing an ImapClient as well. I'm sure Philip van Hoof will be pleased to note that I have a very nice BODYSTRUCTURE parser, although that API is not publicly exported.

Unlike the SmtpClient and Pop3Client, the ImapClient does not have all of its functionality on a single public class. Instead, ImapClient implements an IMessageStore which has a limited API, mostly meant for getting IFolders. I imagine that those who are familiar with the JavaMail and/or Evolution (Camel) APIs will recognize this design.

The IFolder interface isn't designed to be exactly like the JavaMail Folder API, though. I've been designing the interface incrementally as I implement the various IMAP extensions (I've found at least 37 of them at the time of this blog post, although I don't think I'll bother with ACL, MAILBOX-REFERRAL, or LOGIN-REFERRAL), so the API may continue to evolve as I go, but I think what I've got now will likely remain - I'll probably just be including additional APIs for the new stuff.

So far, I've implemented the following IMAP extensions: LITERAL+, NAMESPACE, CHILDREN, LOGIN-DISABLED, STARTTLS, MULTIAPPEND, UNSELECT, UIDPLUS, CONDSTORE, ESEARCH, SASL-IR, SORT, THREAD, SPECIAL-USE, MOVE, XLIST, and X-GM-EXT1. Phew, that was exhausting listing all of those!

Also news-worthy is that MimeKit is now equally as fast as GMime, which is pretty impressive considering that it is fully managed C# code.

Download MailKit 0.2 now and let the hacking begin!

Monday, October 7, 2013

Optimization Tips & Tricks used by MimeKit: Part 2

In my previous blog post, I talked about optimizing the most critical loop in MimeKit's MimeParser by:

Extending our read buffer by an extra byte (which later became 4 extra bytes) that I could set to '\n', allowing me to do the bounds check after the loop as opposed to in the loop, saving us roughly half the instructions.
Unrolling the loop in order to check for 4 bytes at a time for that '\n' by using some bit twiddling hacks (for 64-bit systems, we might gain a little more performance by checking 8 bytes at a time).

After implementing both of those optimizations, the time taken for MimeKit's parser to parse nearly 15,000 messages in a ~1.2 gigabyte mbox file dropped from around 10s to about 6s on my iMac with Mono 3.2.3 (32-bit). That is a massive increase in performance.

Even after both of those optimizations, that loop is still the most critical loop in the parser and the MimeParser.ScanContent() method, which contains it, is still the most critical method of the parser.

While the loop itself was a huge chunk of the time spent in that method, the next largest offender was writing the content of the MIME part into a System.IO.MemoryStream.

MemoryStream, for those that aren't familiar with C#, is just what it sounds like it is: a stream backed by a memory buffer (in C#, this happens to be a byte array). By default, a new MemoryStream starts with a buffer of about 256 bytes. As you write more to the MemoryStream, it resizes its internal memory buffer to either the minimum size needed to hold the its existing content plus whatever number of bytes your latest Write() was called with or double the current internal buffer size, whichever is larger.

The performance problem here is that for MIME parts with large amounts of content, that buffer will be resized numerous times. Each time that buffer is resized, due to the way C# works, it will allocate a new buffer, zero the memory, and then copy the old content over to the new buffer. That's a lot of copying and creates a situation where the write operation can become exponentially worse as the internal buffer gets larger. Since MemoryStream contains a GetBuffer() method, its internal buffer really has to be a single contiguous block of memory. This means that there's little we could do to reduce overhead of zeroing the new buffer every time it resizes beyond trying to come up with a different formula for calculating the next optimal buffer size.

At first I decided to try the simple approach of using the MemoryStream constructor that allows specifying an initial capacity. By bumping up the initial capacity to 2048 bytes, things did improve, but only by a very disappointing amount. Larger initial capacities such as 4096 and 8192 bytes also made very little difference.

After brainstorming with my coworker and Mono runtime hacker, Rodrigo Kumpera, we decided that one way to solve this performance problem would be to write a custom memory-backed stream that didn't use a single contiguous block of memory, but instead used a list of non-contiguous memory blocks. When this stream needed to grow its internal memory storage, all it would need to do is allocate a new block of memory and append it to its internal list of blocks. This would allow for minimal overhead because only the new block would need to be zeroed and no data would need to be re-copied, ever. As it turns out, this approach would also allow me to limit the amount of unused memory used by the stream.

I dubbed this new memory-backed stream MimeKit.IO.MemoryBlockStream. As you can see, the implementation is pretty trivial (doesn't even require scary looking bit twiddling hacks like my previous optimization), but it made quite a difference in performance. By using this new memory stream, I was able to shave a full second off of the time needed to parse that mbox file I mentioned earlier, getting the total time spent down to 5s. That's starting to get pretty respectable, performance-wise.

As a comparison, let's compare the performance of MimeKit with what seems to be the 2 most popular C# MIME parsers out there (OpenPOP.NET and SharpMimeTools) and see how we do. I've been hyping up the performance of MimeKit a lot, so it had better live up to expectations, right? Let's see if it does.

Now, since none of the other C# MIME parsers I could find support parsing the Unix mbox file format, we'll write some test programs to parse the same message stream over and over (say, 20 thousand times) to compare MimeKit to OpenPOP.NET.

Here's the test program I wrote for OpenPOP.NET:

using System;
using System.IO;
using System.Diagnostics;
using OpenPop.Mime;

namespace OpenPopParser {
    class Program
    {
        public static void Main (string[] args)
        {
            var stream = File.OpenRead (args[0]);
            var stopwatch = new Stopwatch ();

            stopwatch.Start ();
            for (int i = 0; i < 20000; i++) {
                var message = Message.Load (stream);
                stream.Position = 0;
            }
            stopwatch.Stop ();

            Console.WriteLine ("Parsed 20,000 messages in {0}", stopwatch.Elapsed);
        }
    }
}

Here's the SharpMimeTools parser I wrote for testing:

using System;
using System.IO;
using System.Diagnostics;
using anmar.SharpMimeTools;

namespace SharpMimeParser {
    class Program
    {
        public static void Main (string[] args)
        {
            var stream = File.OpenRead (args[0]);
            var stopwatch = new Stopwatch ();

            stopwatch.Start ();
            for (int i = 0; i < 20000; i++) {
                var message = new SharpMessage (stream);
                stream.Position = 0;
            }
            stopwatch.Stop ();

            Console.WriteLine ("Parsed 20,000 messages in {0}", stopwatch.Elapsed);
        }
    }
}

And here is the test program I used for MimeKit:

using System;
using System.IO;
using System.Diagnostics;
using MimeKit;

namespace MimeKitParser {
    class Program
    {
        public static void Main (string[] args)
        {
            var stream = File.OpenRead (args[0]);
            var stopwatch = new Stopwatch ();

            stopwatch.Start ();
            for (int i = 0; i < 20000; i++) {
                var parser = new MimeParser (stream, MimeFormat.Default);
                var message = parser.ParseMessage ();
                stream.Position = 0;
            }
            stopwatch.Stop ();

            Console.WriteLine ("Parsed 20,000 messages in {0}", stopwatch.Elapsed);
        }
    }
}

Note: Unfortunately, OpenPOP.NET's message parser completely failed to parse the Star Trek message I pulled out of my test suite at random (first message in the jwz.mbox.txt file included in MimeKit's UnitTests project) due to the Base64 decoder not liking some byte or another in the stream, so I had to patch OpenPOP.NET to no-op its base64 decoder (which, if anything, should make it faster).

And here are the results running on my 2011 MacBook Air:

[fejj@localhost OpenPopParser]$ mono ./OpenPopParser.exe ~/Projects/MimeKit/startrek.msg
Parsed 20,000 messages in 00:06:26.6825190

[fejj@localhost SharpMimeParser]$ mono ./SharpMimeParser.exe ~/Projects/MimeKit/startrek.msg
Parsed 20,000 messages in 00:19:30.0402064

[fejj@localhost MimeKit]$ mono ./MimeKitParser.exe ~/Projects/MimeKit/startrek.msg
Parsed 20,000 messages in 00:00:15.6159326

Whooooosh!

Not. Even. Close.

MimeKit is nearly 25x faster than OpenPOP.NET even after making its base64 decoder a no-op and 75x faster than SharpMimeTools.

Since I've been ranting against C# MIME parsers that made heavy use of regex, let me show you just how horrible regex is for parsing messages (performance-wise). There's a C# MIME parser called MIMER that is nearly pure regex, so what better library to illustrate my point? I wrote a very similar loop to the other 2 that I listed above, so I'm not going to bother repeating it again. Instead, I'll just skip to the results:

[fejj@localhost MimerParser]$ mono ./MimerParser.exe ~/Projects/MimeKit/startrek.msg
Parsed 20,000 messages in 00:16:51.4839129

Ouch. MimeKit is roughly 65x faster than a fully regex-based MIME parser. It's actually rather pathetic that this regex parser beats SharpMimeTools.

This is why, as a developer, it's important to understand the limitations of the tools you decide to use. Regex is great for some things but it is a terrible choice for others. As Jamie Zawinski might say,

Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.

Monday, September 30, 2013

Optimization Tips & Tricks used by MimeKit: Part 1

One of the goals of MimeKit, other than being the most robust MIME parser, is to be the fastest C# MIME parser this side of the Mississippi. Scratch that, fastest C# MIME parser in the World.

Seriously, though, I want to get MimeKit to be as fast and efficient as my C parser, GMime, which is one of the fastest (if not the fastest) MIME parsers out there right now, and I don't expect that any parser is likely to smoke GMime anytime soon, so using it as a baseline to compare against means that I have a realistic goal to set for MimeKit.

Now that you know the why, let's examine the how.

First, I'm using one of those rarely used features of C#: unsafe pointers. While that alone is not all that interesting, it's a corner stone for one of the main techniques I've used. In C#, the fixed statement (which is how you get a pointer to a managed object) pins the object to a fixed location in memory to prevent the GC from moving that memory around while you operate on that buffer. Keep in mind, though, that telling the GC to pin a block of memory is not free, so you should not use this feature without careful consideration. If you're not careful, using pointers could actually make your code slower. Now that we've got that out of the way...

MIME is line-based, so a large part of every MIME parser is going to be searching for the next line of input. One of the reasons most MIME parsers (especially C# MIME parsers) are so slow is because they use a ReadLine() approach and most TextReaders likely use a naive algorithm for finding the end of the current line (as well as all of the extra allocating and copying into a string buffer):

    // scan for the end of the line
    while (inptr < inend && *inptr != (byte) '\n')
        inptr++;

The trick I used in GMime was to make sure that my read buffer was 1 byte larger than the max number of bytes I'd ever read from the underlying stream at a given time. This allowed me to set the first byte in the buffer beyond the bytes I just read from the stream to '\n', thus allowing for the ability to remove the inptr < inend check, opting to do the bounds check after the loop has completed instead. This nearly halves the number of instructions used per loop, making it much, much faster. So, now we have:

    // scan for the end of the line
    while (*inptr != (byte) '\n')
        inptr++;

But is that the best we can do?

Even after using this trick, it was still the hottest loop in my parser:

We've got no choice but to use a linear scan, but that doesn't mean that we can't do it faster. If we could somehow reduce the number of loops and likewise reduce the number of pointer increments, we could eliminate a bunch of the overhead of the loop. This technique is referred to as loop unrolling. Here's what brianonymous (from the ##csharp irc channel on freenode) and I came up with (with a little help from Sean Eron Anderson's bit twiddling hacks):

    uint* dword = (uint*) inptr;
    uint mask;

    do {
        mask = *dword++ ^ 0x0A0A0A0A;
        mask = ((mask - 0x01010101) & (~mask & 0x80808080));
    } while (mask == 0);

And here are the results of that optimization:

Now, keep in mind that on many architectures other than x86, in order to employ the trick above, inptr must first be 4-byte aligned (uint is 32bit) or it could cause a SIGBUS or worse, a crash. This is fairly easy to solve, though. All you need to do is increment inptr until you know that it is 4 byte aligned and then you can switch over to reading 4 bytes at a time as in the above loop. We'll also need to figure out which of those 4 bytes contained the '\n'. An easy way to solve that problem is to just linearly scan those 4 bytes using our previous single-byte-per-loop implementation starting at dword - 1. Here it is, your moment of Zen:

    // Note: we can always depend on byte[] arrays being
    // 4-byte aligned on 32bit and 64bit architectures
    int alignment = (inputIndex + 3) & ~3;
    byte* aligned = inptr + alignment;
    byte* start = inptr;
    uint mask;

    while (inptr < aligned && *inptr != (byte) '\n')
        inptr++;

    if (inptr == aligned) {
        // -funroll-loops
        uint* dword = (uint*) inptr;

        do {
            mask = *dword++ ^ 0x0A0A0A0A;
            mask = ((mask - 0x01010101) & (~mask & 0x80808080));
        } while (mask == 0);

        inptr = (byte*) (dword - 1);
        while (*inptr != (byte) '\n')
            inptr++;
    }

Note: In this above code snippet, 'inputIndex' is the byte offset of 'inptr' into the byte array. Since we can safely assume that index 0 is 4-byte aligned, we can do a simple calculation to get the next multiple of 4 and add that to our 'inptr' to get the next 4-byte aligned pointer.

That's great, but what does all that hex mumbo jumbo do? And why does it work?

Let's go over this 1 step at a time...

    mask = *dword++ ^ 0x0A0A0A0A;

This xor's the value of dword with 0x0A0A0A0A (0x0A0A0A0A is just 4 bytes of '\n'). The xor sets every byte that is equal to 0x0A to 0 in mask. Every other byte will be non-zero.

    mask - 0x01010101

When we subtract 0x01010101 from mask, the result will be that only bytes greater than 0x80 will contain any high-order bits (and any byte that was originally 0x0A in our input will now be 0xFF).

    ~mask & 0x80808080

This inverts the value of mask resulting in no bytes having the highest bit set except for those that had a 0 in that slot before (including the byte we're looking for). By then bitewise-and'ing it with 0x80808080, we get 0x80 for each byte that was originally 0x0A in our input or otherwise had the highest bit set after the bit inversion.

Because there's no way for any byte to have the highest bit set in both sides of the encompassing bitwise-and except for the character we're looking for (0x0A), the mask will always be 0 unless any of the bytes within were originally 0x0A, which would then break us out of the loop.

Well, that concludes part 1 as it is time for me to go to bed so I can wake up at a reasonable time tomorrow morning.

Good night!

Saturday, September 28, 2013

MimeKit: Coming to a NuGet near you.

If, like me, you've been trapped in the invisible box of despair, bemoaning the woeful inadequacies of every .NET MIME library you've ever found on the internets, cry no more: MimeKit is here.

I've just released MimeKit v0.5 as a NuGet Package. There's still plenty of work left to do, mostly involving writing more API documentation, but I don't expect to change the API much between now and v1.0. For all the mobile MIME lovers out there, you'll be pleased to note that in addition to the .NET Framework 4.0 assembly, the NuGet package also includes assemblies built for Xamarin.Android and Xamarin.iOS. It's completely open source and licensed under the MIT/X11 license, so you can use it in any project you want - no restrictions. Once MimeKit goes v1.0, I plan on adding it to Xamarin's Component Store as well for even easier mobile development. If that doesn't turn that frown upside down, I don't know what will.

For those that don't already know, MimeKit is a really fast MIME parser that uses a real tokenizer instead of regular expressions and string.Split() to parse and decode headers. Among numerous other things, it can properly handle rfc2047 encoded-word tokens that contain quoted-printable and base64 payloads which have been improperly broken apart (i.e. a quoted-printable triplet or a base64 quartet is split between 2 or more encoded-word tokens) as well as handling cases where multibyte character sequences are split between words thanks to the state machine nature of MimeKit's rfc2047 text and phrase decoders (yes, there are 2 types of encoded-word tokens - something most other MIME parsers have failed to take notice of). With the use of MimeKit.ParserOptions, the user can specify his or her own fallback charset (in addition to UTF-8 and ISO-8859-1 that MimeKit has built in), allowing MimeKit to gracefully handle undeclared 8bit text in headers.

When constructing MIME messages, MimeKit provides the user with the ability to specify any character encoding available on the system for encoding each individual header (or, in the case of address headers: each individual email address). If none is specified, UTF-8 is used unless the characters will fit nicely into ISO-8859-1. MimeKit's rfc2047 and rfc2231 encoders do proper breaking of text (i.e it avoids breaking between surrogate pairs) before the actual encoding step, thus ensuring that each encoded-word token (or parameter value) is correctly self-contained.

S/MIME support is also available in the .NET Framework 4.0 assembly (not yet supported in the Android or iOS assemblies due to the System.Security assembly being unavailable on those platforms). MimeKit supports signing, encrypting, decrypting, and verifying S/MIME message parts. For signing, you can either use the preferred multipart/signed approach or the application/[x-]pkcs7-signature mime-type, whichever you prefer.

I'd love to support PGP/MIME as well, but this is a bit more complicated as I would likely need to depend on external native libraries and programs (such as GpgME and GnuPG) which means that MimeKit would likely have to become 32bit-only (currently, libgpgme is only available for 32bit Windows).

I hope you enjoy using MimeKit as much as I have enjoyed implementing it!

Note: For those using my GMime library, fear not! I have not forgotten about you! I plan to bring many of the API and parser improvements that I've made to MimeKit back to GMime in the near future.

For those using the C# bindings, I'd highly recommend that you consider switching to MimeKit instead. I've based MimeKit's API on my GMime API, so porting to MimeKit should be fairly straightforward.

Sunday, September 15, 2013

Time for a rant on mime parsers...

Warning: Viewer discretion is advised.

Where should I begin?

I guess I should start by saying that I am obsessed with MIME and, in particular, MIME parsers. No, really. I am obsessed. Don't believe me? I've written and/or worked on several MIME parsers at this point. It started off in my college days working on Spruce which had a horrendously bad MIME parser, and so as you read farther along in my rant about shitty MIME parsers, keep in mind: I've been there, I've written a shitty MIME parser.

As a handful of people are aware, I've recently started implementing a C# MIME parser called MimeKit. As I work on this, I've been searching around on GitHub and Google to see what other MIME parsers exist out there to find out what sort of APIs they provide. I thought perhaps I'll find one that offers a well-designed API that will inspire me. Perhaps, by some miracle, I'd find one that was actually pretty good that I could just contribute to instead of writing my own from scratch (yea, wishful thinking). Instead, all I have found are poorly designed and implemented MIME parsers, many probably belong on the front page of the Daily WTF.

I guess I'll start with some softballs.

First, there's the fact that every single one of them were written as System.String parsers. Don't be fooled by the ones claiming to be "stream parsers", because all any of those did was to slap a TextReader on top of the byte stream and start using reader.ReadLine(). What's so bad about that, you ask? For those not familiar with MIME, I'd like for you to take a look at the raw email sources in your inboxes particularly if you have correspondence with anyone outside of the US. Hopefully most of your friends and colleagues are using more-or-less MIME compliant email clients, but I guarantee you'll find at least a few emails with raw 8bit text.

Now, if the language they were using was C or C++, they might be able to get away with doing this because they'd technically be operating on byte arrays, but with Java and C#, a 'string' is a unicode string. Tell me: how does one get a unicode string from a raw byte array?

Bingo. You need to know the charset before you can convert those bytes into unicode characters.

To be fair, there's really no good way of handling raw 8bit text in message headers, but by using a TextReader approach, you are really limiting the possibilities.

Next up is the ReadLine() approach. One of the 2 early parsers in GMime (pan-mime-parser.c back in the version 0.7 days) used a ReadLine() approach, so I understand the thinking behind this. And really, there's nothing wrong with this approach as far as correctness goes, it's more of a "this can never be fast" complaint. Of the two early parsers in GMime, the pan-mime-parser.c backend was horribly slow compared to the in-memory parser. Of course, that's not very surprising. More surprising to me at the time was that when I wrote GMime's current generation of parser (sometime between v0.7 and v1.0), it was just as fast as the in-memory parser ever was and only ever had up to 4k in a read buffer at any given time. My point is, there are far better approaches than ReadLine() if you want your parser to be reasonably performant... and why wouldn't you want that? Your users definitely want that.

Okay, now come the more serious problems that I encountered in nearly all of the mime parser libraries I found.

I think that every single mime parser I've found so far uses the "String.Split()" approach for parsing address headers and/or for parsing parameter lists on headers such as Content-Type and Content-Disposition.

Here's an example from one C# MIME parser:

string[] emails = addressHeader.Split(',');

Here's how this same parser decodes encoded-word tokens:

private static void DecodeHeaders(NameValueCollection headers)
{
    ArrayList tmpKeys = new ArrayList(headers.Keys);

    foreach (string key in headers.AllKeys)
    {
        //strip qp encoding information from the header if present
        headers[key] = Regex.Replace(headers[key].ToString(), @"=\?.*?\?Q\?(.*?)\?=",
            new MatchEvaluator(MyMatchEvaluator), RegexOptions.IgnoreCase | RegexOptions.Multiline);
        headers[key] = Regex.Replace(headers[key].ToString(), @"=\?.*?\?B\?(.*?)\?=",
            new MatchEvaluator(MyMatchEvaluatorBase64), RegexOptions.IgnoreCase | RegexOptions.Multiline);
    }
}

private static string MyMatchEvaluator(Match m)
{
    return DecodeQP(m.Groups[1].Value);
}

private static string MyMatchEvaluatorBase64(Match m)
{
    System.Text.Encoding enc = System.Text.Encoding.UTF7;
    return enc.GetString(Convert.FromBase64String(m.Groups[1].Value));
}

Excuse my language, but what the fuck? It completely throws away the charset in each of those encoded-word tokens. In the case of quoted-printable tokens, it assumes they are all ASCII (actually, latin1 may work as well?) and in the case of base64 encoded-word tokens, it assumes they are all in UTF-7!?!? Where in the world did he get that idea? I can't begin to imagine his code working on any base64 encoded-word tokens in the real world. If anything is deserving of a double facepalm, this is it.

I'd just like to point out that this is what this project's description states:

A small, efficient, and working mime parser library written in c#.
...
I've used several open source mime parsers before, but they all either
fail on one kind of encoding or the other, or miss some crucial
information. That's why I decided to finally have a go at the problem
myself.

I'll grant you that his MIME parser is small, but I'd have to take issue with the "efficient" and "working" adjectives. With the heavy use of string allocations and regex matching, it could hardly be considered "efficient". And as the code pointed out above illustrates, "working" is a bit of an overstatement.

Folks... this is what you get when you opt for a "lightweight" MIME parser because you think that parsers like GMime are "bloated".

On to parser #2... I like to call this the "Humpty Dumpty" approach:

public static StringDictionary parseHeaderFieldBody ( String field, String fieldbody ) {
    if ( fieldbody==null )
        return null;
    // FIXME: rewrite parseHeaderFieldBody to being regexp based.
    fieldbody = SharpMimeTools.uncommentString (fieldbody);
    StringDictionary fieldbodycol = new StringDictionary ();
    String[] words = fieldbody.Split(new Char[]{';'});
    if ( words.Length>0 ) {
        fieldbodycol.Add (field.ToLower(), words[0].ToLower().Trim());
        for (int i=1; i<words.Length; i++ ) {
            String[] param = words[i].Trim(new Char[]{' ', '\t'}).Split(new Char[]{'='}, 2);
            if ( param.Length==2 ) {
                param[0] = param[0].Trim(new Char[]{' ', '\t'});
                param[1] = param[1].Trim(new Char[]{' ', '\t'});
                if ( param[1].StartsWith("\"") && !param[1].EndsWith("\"")) {
                    do {
                        param[1] += ";" + words[++i];
                    } while ( !words[i].EndsWith("\"") && i<words.Length);
                }
                fieldbodycol.Add ( param[0], SharpMimeTools.parserfc2047Header (param[1].TrimEnd(';').Trim('\"', ' ')) );
            }
        }
    }
    return fieldbodycol;
}

I'll give this guy some credit, at least he saw that his String.Split() approach was flawed and so tried to compensate by piecing Humpty Dumpty back together again. Of course, with his String.Trim()ing, he just won't be able to put him back together again with any level of certainty. The white space in those quoted tokens may have significant meaning.

Many of the C# MIME parsers out there like to use Regex all over the place. Here's a snippet from one parser that is entirely written in Regex (yea, have fun maintaining that...):

if (m_EncodedWordPattern.RegularExpression.IsMatch(field.Body))
{
    string charset = m_CharsetPattern.RegularExpression.Match(field.Body).Value;
    string text = m_EncodedTextPattern.RegularExpression.Match(field.Body).Value;
    string encoding = m_EncodingPattern.RegularExpression.Match(field.Body).Value;

    Encoding enc = Encoding.GetEncoding(charset);

    byte[] bar;

    if (encoding.ToLower().Equals("q"))
    {
        bar = m_QPDecoder.Decode(ref text);
    }
    else
    {
        bar = m_B64decoder.Decode(ref text);
    }                    
    text = enc.GetString(bar);

    field.Body = Regex.Replace(field.Body,
        m_EncodedWordPattern.TextPattern, text);
    field.Body = field.Body.Replace('_', ' ');
}

Let's pretend that the regex pattern strings are correct in their definitions (because they are god-awful to read and I can't be bothered to double-check them), the replacing of '_' with a space is wrong (it should only be done in the "q" case) and the Regex.Replace() is just evil. Not to mention that there could be multiple encoded-words per field.Body which this code utterly fails to handle.

Guys. I know you love regular expressions and that they are very very useful, but they are no substitute for writing a real tokenizer. This is especially true if you want to be lenient in what you accept (and in the case of MIME, you really need to be).

Sunday, August 11, 2013

Why decoding rfc2047-encoded headers is hard

Somewhat inspired by a recent thread on the notmuch mailing-list, I thought I'd explain why decoding headers is so hard to get right. I'm sure just about every developer who has ever worked on an email client could tell you this, but I guess I'm going to be the one to do it.

Here's just a short list of the problems every developer faces when they go to implement a decoder for headers which have been (theoretically) encoded according to the rfc2047 specification:

First off, there are technically two variations of header encoding formats specified by rfc2047 - one for phrases and one for unstructured text fields. They are very similar but you can't use the same rules for tokenizing them. I mention this because it seems that most MIME parsers miss this very subtle distinction and so, as you might imagine, do most MIME generators. Hell, most MIME generators probably never even heard of specifications to begin with it seems.

This brings us to:
There are so many variations of how MIME headers fail to be tokenizable according to the rules of rfc2822 and rfc2047. You'll encounter fun stuff such as:
1. encoded-word tokens illegally being embedded in other word tokens
2. encoded-word tokens containing illegal characters in them (such as spaces, line breaks, and more) effectively making it so that a tokenizer can no longer, well, tokenize them (at least not easily)
3. multi-byte character sequences being split between multiple encoded-word tokens which means that it's not possible to decode said encoded-word tokens individually
4. the payloads of encoded-word tokens being split up into multiple encoded-word tokens, often splitting in a location which makes it impossible to decode the payload in isolation
You can see some examples here.
Something that many developers seem to miss is the fact that each encoded-word token is allowed to be in different character encodings (you might have one token in UTF-8, another in ISO-8859-1 and yet another in koi8-r). Normally, this would be no big deal because you'd just decode each payload, then convert from the specified charset into UTF-8 via iconv() or something. However, due to the fun brokenness that I mentioned above in (2c) and (2d), this becomes more complicated.

If that isn't enough to make you want to throw your hands up in the air and mutter some profanities, there's more...
Undeclared 8bit text in headers. Yep. Some mailers just didn't get the memo that they are supposed to encode non-ASCII text. So now you get to have the fun experience of mixing and matching undeclared 8bit text of God-only-knows what charset along with the content of (probably broken) encoded-words.

That said, I was able to help the notmuch developers solve this problem by letting them know about the GMIME_ENABLE_RFC2047_WORKAROUNDS flag that they could pass to g_mime_init(guint32 flags).

Any developer reading this blog post and thinking that they want to see how this is done in GMime, the source code for the rfc2047 decoder is located here. If the line numbers change in the future, just grep around for "rfc2047_token" and you should find it.

In other news... I cranked out a ton more code for MimeKit (my C# MIME parser library) yesterday. Yes, I know... I've got a serious problem with masochism having already written 2 MIME parsers and now I'm working on a third. When will the hurting stop? Never!

Oh, I guess I could point people at MimeKit's rfc2047 decoders. What you'll want to look at is MimeKit.Rfc2047.DecodePhrase(byte[] phrase) and MimeKit.Rfc2047.DecodeText(byte[] text).

Sunday, August 14, 2011

GMime 2.5.10: A Call For Testers

I've just released GMime 2.5.10 which I hope to be the last of the 2.5 releases before I release 2.6.0. I feel that I've stretched the development of 2.6.0 out for far too long (2.5 development began at the end of April, 2009) and even though I didn't get around to doing everything I had hoped to do, I feel that the latest 2.5.x releases are such an improvement over 2.4.x that I just want to get it out there for developers to start using. But before I make a 2.6.0 release, I'm hoping to get some feedback and some testing.

What's new?

New for the release of 2.5.10 is GMimePartIter which replaces the need for g_mime_object_foreach()and its awkward callback requirement, instead allowing you to take the far nicer iterator approach that is popular in the C# and Java worlds (known as IEnumerator in C#). This new iterator, like the foreach function it replaces, iterates over the MIME tree structure in depth-first order.

Inspired by IMAP's FETCH body part-specifier syntax, I've implemented a method allowing you to jump to a part based on a part-specifier string (aka a path): g_mime_part_iter_jump_to(). Also implemented is a function called g_mime_part_iter_get_path(), which can be used to tell you the current part-specifier path of the iterator.

For example, if you had the following MIME message structure:

multipart/related
  multipart/alternative
    text/plain
    text/html
  image/jpeg

The body part-specifier paths would be:

1    multipart/alternative
1.1  text/plain
1.2  text/html
2    image/jpeg

This means that g_mime_part_iter_jump_to(iter, "1.2") would jump to the part specified by the path "1.2" which, as we can see above, would be the text/html part. Calling g_mime_part_iter_next(iter) would iterate to the next part, being the image/jpeg, while calling g_mime_part_iter_prev(iter) would iterate backwards to the text/plain part and calling it again would iterate backwards to the multipart/alternative.

What I Need From Testers

My feeling is that developers will want to use this cool new body part-specifier path functionality for aiding them in implementing IMAP servers and/or clients. Because of this, it would be great if GMime's implementation matched IMAP's specification exactly. The problem is that I don't have the time or energy to verify that the paths work out to be identical in all cases. So... if you are one of those developers who is interested in using this functionality and need it to be identical to IMAP's syntax (or would really like it to be), I'm hoping that you could test it out and make sure that it matches. Especially worthwhile of testing, I'd imagine, is having message/rfc822 parts in the tree. I suspect that, if anywhere, this is where differences may be.

If body part-specifier paths aren't something you care about, don't fret; the rest of the iterator API needs testing as well and if you have no interest in the iterator API at all, perhaps you'd be willing to test the S/MIME functionality (especially since I haven't figured out how to test it myself, given that I don't have an S/MIME cert nor have I figured out how to generate one or add one to my gpgsm keyring).

Your help will be greatly appreciated.

Tuesday, April 28, 2009

Making GMime Even More Awesome

I've started working on GMime 2.6, which unfortunately will need to break API with 2.4 in order to achieve the next level of awesome.

If you use GMime (or are thinking of using it) in your project and have some suggestions on how GMime can improve, don't hesitate to fire an email off to gmime-devel-list@lists.gnome.org (preferably after subscribing to the mailing-list first).

These are my current plans:

Replace all uses of g_signals with my own event stuff. None of this needs to be public and my events are a lot more performant. [ DONE ]
Need to add a Changed event to GMimeHeaderList so that GMimeMessage can listen to changes in the toplevel mime_part's headers. When they change, we need to unset the cached header stream on the GMimeMessage. (see the "Note:" comments in message_write_to_stream and message_get_headers, while this hack works, it'd be nicer if we did it based on event callbacks)
Get rid of GMimeSession and replace it with GMimePassphraseRequestFunc or something. See GpgMe's passphrase request callback signature for ideas. [ DONE ]
Consider optionally using GpgMe so that we can support S/MIME?
Consider GCancellable and GError for GMimeStreams and GMimeParser
Add GIO-based GMimeStream and bump glib dep to 2.16 [ DONE ]
Add a g_mime_part_get_best_content_encoding()? [ DONE ]
Rename GMimeBestEncoding enum to GMimeEncodingConstraint? This might be a better name for the enum to reflect what it's actually meant for. Maybe also move it from gmime-filter-best.h to gmime-encodings.h?
How about a g_mime_part_get_best_charset()? This one could be awkward since it depends on the content being UTF-8 text
Should either rename g_mime_filter_best_encoding() to get_encoding() or else make sure that GMime.metadata 'fixes' the method name to be GetEncoding so that it will appear as a C# property getter.

Thursday, April 2, 2009

Building GMime in Visual Studio

Installing the Necessary Dependencies

First, install GNU's iconv library for Windows. You can get a nice msi installer from http://gnuwin32.sourceforge.net/packages/libiconv.htm. Unfortunately, they only offer a Win32 installer, so hopefully that's the platform you intend to target.

Next, you'll want to grab the GLib headers and libraries for Windows. The easiest way to do that is to go to http://www.gtk.org/download.html and download the All-in-One pre-built bundle for Win32. Once downloaded, extract the zip file (the docs suggest not using WinZip due to a bug) and place them wherever you want (I put mine into C:\Users\jeff\Documents).

Configuring Visual Studio

Now that the libiconv and glib headers/libraries are installed, you'll want to configure Visual Studio to know where to find those headers and libraries.

First, go to the Tools menu and select Options...:

In the Options dialog, expand the Projects and Solutions tree item and then select VC++ Directories.

Make sure Win32 is selected in the Platform option menu and then select Executable files under Show directories for:, like so:

Scroll to the bottom of the listbox and add the bin paths for your installed libiconv and Gtk+ bundle just like in the above screenshot.

After you've added the bin paths, you'll need to add the #include paths. Select Include files under Show directories for: and, like you did for the Executable paths you added above, add the paths to the include files.

Next, you'll need to ad the library paths. Under Show directories for:, select Library files and add the appropriate paths to the bottom of the list.

And that's it! You are now ready to start building GMime!

Note: I've been using Microsoft Visual C++ 2008 Express Edition which you can download for free from Microsoft.

Tuesday, March 31, 2009

GMime Ported to Windows

The other day someone asked me about GMime on Windows and I didn't have anything to tell her other than that I thought people had built it successfully on Windows but beyond that, I didn't have any sort of VS project files or anything.

Then, last night, she poked me again asking for advice about some of the problems she was having building on Windows (which were mostly removing extraneous unistd.h includes from files that didn't need them and dropping source files that required them).

After a short while of back and forth, I decided I'd just boot into Windows myself and she helped me get my system setup (installing GNU Libiconv, grabbing the Gtk+ Windows dev packages, configuring VS to know where to look for those headers/libs, etc) so that I could more easily get instant feedback as to whether my source changes fixed the compile errors.

After a few hours of #include fixage and slight reworkings of some unix-specific code, we had a successful build of GMime.dll, woohoo!

Sunday, October 5, 2008

Parallel-Installable Gtk-Docs

I recently release GMime 2.4.0 but missed the fact that the gtk-docs would install into the same location as the old 2.2 docs, namely ${prefix}/share/gtk-doc/html/gmime. Thwe problem is clear, since GMime 2.4 changed API (rather than just extending the 2.0/2.2 API), GMime 2.4 really needs to be fully parallel installable.

Thanks to Gilles Dartiguelongue from the Gentoo community for discovering this and bringing it to my attention.

As I tried to figure out how to accomplish what I needed, I discovered that Gtk-Doc needed some fixes so I did a quick hack and submitted it upstream.

Unfortunately, my original fix was broken in that I didn't realize that a simple version extension to the install directory wouldn't be enough because GNOME's devhelp program expected the .devhelp and .devhelp2 files would be named with the same version extension, or they would not be detected.

After that was pointed out to me, I looked deeper into the problem and discovered it wasn't such a terribly hard fix afterall, just a little shell scripting magic was all that was needed ;-)

Since Stefan Kost (the gtk-doc maintainer) and myself were both interested in such a solution, I figured there must be other library maintainers who would be interested in this and so thought I would blog about it.

For those interested in packaging GMime 2.4.x for the various distros, I'll be releasing a 2.4.2 release in the next day or so (going to wait a little longer just in case anyone finds any other last minute parallel-install bugs in the gmime-2.4.1.1 tarball I put together for testing; it's bad enough that I've already released a gmime-2.4.1 with nothing but parallel install fixes and that 2.4.2 will be more of the same, I'd rather not have to issue a 2.4.3).

Saturday, August 16, 2008

GMime Devel List

The GNOME admins have just approved and created a mailing list for discussing GMime development (meant both for contributors to the GMime library as well as developers using GMime).

You can subscribe to the mailing-list at http://mail.gnome.org/mailman/listinfo/gmime-devel-list.

Saturday, June 7, 2008

GMime 2.3.2

Just released GMime 2.3.2 last night which took some comments about the GMimeHeaderIter API into consideration (like allowing iters to be on the stack).

It also now has support for multipart/encrypted PGP/MIME parts that have been both signed and encrypted in a single pass (support for this exists for both encrypting and decrypting).

Many more .NET API improvements as well.

Check it out!

Updated: Just released 2.3.3 after a weekend of hacking.

Saturday, May 31, 2008

GMime 2.3.1

I've been continuing to make improvements to the API over the past few days and have just made a 2.3.1 release.

The main focus of this release has been a redesign of the header API which has been nagging at me for quite a while now. To fix it, I've done the following:

Renamed GMimeHeader to GMimeHeaderList, GMimeHeader felt more like it should be a single header (name/value pair).
Introduced a new GMimeHeader, altho it is internal only at the moment. I had originally planned on making it public (hence the rename of the old GMimeHeader struct).
Introduced a GMimeHeaderIter which can be used to iterate forward and backward over the headers, allowing you to change header values (but not names). It also allows removal of headers which makes it possible to remove any header you want, whereas the old API would only let you remove a header by name (which was a problem if there was more than 1 by the same name).
Got rid of g_mime_header_foreach() because that API sucked and has been replaced by the far more useful GMimeHeaderIter API.

The GMimeHeaderIter API looks like this:

GMimeHeaderIter *g_mime_header_iter_copy (GMimeHeaderIter *iter);
void g_mime_header_iter_free (GMimeHeaderIter *iter);

bool g_mime_header_iter_equal (GMimeHeaderIter *iter1, GMimeHeaderIter *iter2);

bool g_mime_header_iter_is_valid (GMimeHeaderIter *iter);

bool g_mime_header_iter_first (GMimeHeaderIter *iter);
bool g_mime_header_iter_last (GMimeHeaderIter *iter);

bool g_mime_header_iter_next (GMimeHeaderIter *iter);
bool g_mime_header_iter_prev (GMimeHeaderIter *iter);

gint64 g_mime_header_iter_get_offset (GMimeHeaderIter *iter);
const char *g_mime_header_iter_get_name (GMimeHeaderIter *iter);
bool g_mime_header_iter_set_value (GMimeHeaderIter *iter, const char *value);
const char *g_mime_header_iter_get_value (GMimeHeaderIter *iter);

/* if the iter is valid, removes the current header and advances 
   to the next - returns TRUE on success or FALSE otherwise */
bool g_mime_header_iter_remove (GMimeHeaderIter *iter);

Currently, the way to get a GMimeHeaderIter is to call g_mime_header_list_get_iter() which returns a newly allocated iter. I'm not sure if this is the best API or not tho... some other thoughts I've had are:

GMimeHeaderIter *g_mime_header_iter_new (void);
bool g_mime_header_list_get_iter (GMimeHeaderList *list, GMimeHeaderIter *iter);

The second function would initialize iter and return TRUE, or, if list was empty, it could return FALSE.

Another option would be to just have:

GMimeHeaderIter *g_mime_header_iter_new (GMimeHeaderList *list);

Then, if list is empty you'd just get back an invalidated iter. If, later, headers were added to list, then perhaps the iter could auto-magically become valid again. This would more-or-less work the same as the current code - except that the way to instantiate a GMimeHeaderIter is different.

To be honest, now that I've written these 2 ideas down, I think I prefer the first idea.

So... iterators. Since you can have multiple iterators for the same GMimeHeaderList, it is important to invalidate them on at least two occasions:

when the GMimeHeaderList is destroyed
when the header that an iter points to is removed; either by a call to g_mime_header_list_remove() or via a call to g_mime_header_iter_remove() on another iter instance.

You'll be pleased to note that only iters that reference a header that got removed are invalidated. This fact seems to be an argument in favor of my first idea above, as it would allow re-use of iters once they get invalidated.

Thursday, May 29, 2008

GMime 2.3.0

GMime 2.3.0 has been released which marks the first development release of GMime since 2002. A new age has begun.

This new development release gets rid of off_t from all public interfaces, replacing them with gint64 instead so that there is no API/ABI breakage when building it with Large File support. But more importantly (to me), it also fixes a number of the API inconsistencies/uglyness that I discovered while looking over the GMime-Sharp bindings.

The GMime-Sharp bindings still have a long way to go to be beautiful, but they are becoming more of a priority as I am more interested in using GMime-Sharp myself. If you are a user of the GMime-Sharp bindings, I'd love for you to try out GMime 2.3.0 and give me some feedback and how to improve the API even more. I, myself, will continue to scrutinize the API and make adjustments as I go - in the end, I hope to have the most beautiful MIME .NET API ever created.

At first, I was only going to fix GMime's public interfaces to use gint64 instead of off_t and save the rest of my changes for a "3.0" or at least a "2.6", but I decided that since I was breaking the ABI anyway - I might as well do some cleanup and I think developers worldwide will thank me after their initial pain of porting their software to the new API (for which I hope to write a script to help in their troubles).

Largely these changes are renaming of functions, sometimes shuffling them to become methods on a base class, etc.

I got rid of GMimePartEncodingType and created a GMimeContentEncoding enum instead, complete with a g_mime_content_encoding_to_string() and g_mime_content_encoding_from_string(). I also created a new state object called GMimeEncoding (may rename to GMimeEncoder?) which simplifies the base64/QuotedPrintable/UU encoder/decoder APIs a little in that it handles initializing the state variables for you as you encode or decode to any of those encodings.

As I was updating code to use these new enums/structs, I was able to simplify GMimeFilterBasic to not need its own GMimeFilterBasicType struct - instead, you pass the encoding and a bool to encode vs decode to the new g_mime_filter_basic_new() constructor. This has helped simplify a lot of code inside GMime and I'm sure it will help simply everyone elses code as well, because there is no longer a need to map GMIME_PART_ENCODING_BASE64 to GMIME_FILTER_BASIC_TYPE_BASE64_ENC or GMIME_FILTER_BASIC_TYPE_BASE64_DEC.

I've also fixed GMimeContentType and GMimeContentDisposition to notify their parent GMimeObject when they are changed so that the following C#ism becomes possible:

part.ContentType.SetParameter ("name", "fubar.txt");

Previously, you'd have to use the GMimeObject's Content-Type helper methods instead of using the Content-Type's methods directly, like so:

part.SetContentTypeParameter ("name", "fubar.txt");

There are quite a few other improvements like this that I've added to the GMime-Sharp API as well. Sure, they may be minor, but they make the API much more polished.

Wednesday, February 6, 2008

Optimizing GMime's UUEncoder

This past weekend I was talking with Andreia about how Pan is built on top of GMime and takes advantage of my awesomely speedy uuencode/uudecode routines which reminded me that I had done some performance comparisons of GMime's uuencode program vs. the one in the GNU sharutils package a number of years ago.

I had compared GMime 1.90.0 (which was a pre-release of GMime 2.0) and GNU sharutils 4.2.0 and the results were pretty impressive... GMime's uuencoder was on the order of 3 times faster than the one in sharutils and produced exactly the same results.

The uudecoder and the base64 encoder/decoder were all roughly on the order of 7 times faster than those in GNU sharutils, so all around GMime outperformed GNU sharutils by quite a bit.

Anyways, re-reading my test results got me thinking that my uuencode routines could probably be optimized a bit more as they were lagging a bit behind the base64 encoder routine and there's really no reason it should be that far off.

Well, tonight I finally got off my butt and decided to take a look and figure out why. Upon scrolling down to my uuencode_step() routine, I immediately saw why:

Each loop would collect up to 3 bytes from the input and bit shift them into a 32bit 'saved' variable (which is a state variable used for incremental uuencoding an input stream). Then, if I had successfully extracted 3 bytes from the input, I would extract them out of 'saved' into 3 unsigned char variables. At this point I would then encode them into a temporary output buffer. When this output buffer ('uubuf') grew to 60 bytes, I'd flush it to the real output buffer with a memcpy().

All of this extra copying of data around adds up after a while and really begins to impact performance.

Before making any changes, I timed how long it took the original version of my uuencode_step() function to encode linux-2.6.24.tar.gz on my system[1]. An average result over numerous runs was as follows:

[fejj@localhost ~]$ time `gmime-uuencode linux-2.6.24.tar.gz linux-2.6.24.tar.gz > /dev/null`
real    0m0.470s
user    0m0.412s
sys     0m0.052s

After my rewrite, my new results were closer to:

[fejj@localhost ~]$ time `gmime-uuencode linux-2.6.24.tar.gz linux-2.6.24.tar.gz > /dev/null`
real    0m0.291s
user    0m0.252s
sys     0m0.024s

For the sake of comparison, the best time I could manage to get from GNU sharutils 4.6.2 was as follows:

[fejj@localhost ~]$ time `uuencode linux-2.6.24.tar.gz linux-2.6.24.tar.gz > /dev/null`
real    0m1.386s
user    0m1.276s
sys     0m0.092s

The new implementation of uuencode_step() in gmime/gmime-utils.c has been committed to the gmime svn module on GNOME's subversion server, revision 1216 - this change should appear in the next release of GMime which will likely be 2.2.17.

Notes:

1. The system I tested this on was my Lenovo T61 laptop w/ a 7200 RPM harddrive running OpenSuSE 10.3 with various updates. The kernel was version 2.6.22.13-0.3-bigsmp.

From /proc/cpuinfo:

model name : Intel(R) Core(TM)2 Duo CPU T7700 @ 2.40GHz
cpu MHz : 800.000

(e.g. my cpu was scaled down at the time of testing)

2. The GMime uuencode implementation uses a GMimeStreamFs for input as well as output. This stream class is a wrapper around the POSIX I/O system functions which unfortunately has a sub-optimal need to perform an lseek() before each read() or write() call in order to make sure that the underlying file descriptor is in the expected position. This is necessary because it is possible for multiple streams to re-use the same fd.

I mention this because an obvious rebuttal to GMime's superior performance might be to suspect that GMime's uuencode implementation "cheated" by using an mmap()'d input buffer where the GNU sharutils implementation might not.

Thursday, December 27, 2007

Christmas Vacation Hacks

Evolution

There's been discussion about Evolution not handling some brokenly rfc2047 encoded message headers recently, so I ported some workaround logic from GMime to Camel yesterday. I think this should turn a lot of frowns upside down when Evolution 2.22 is released. The new rfc2047 header decoder in Evolution is now more accepting than Thunderbird's decoder[1], so I hope this will satisfy everyone.

The new rfc2047 decoder will now handle the following cases:

encoded-words embedded inside other tokens (e.g n=?iso-8859-1?q?a=EF?=ve)
encoded-words with LWSP in the encoded-text section (e.g. =?iso-8859-1?q?Look ma! there's spaces!?=)
some encoded-word tokens with unencoded special chars in the encoded-text section (such as ? marks and other specials - obviously this can't always work if a special sequence occurrs)
encoded-word tokens with no LWSP between them
no longer aborts if the encoded-text cannot be completely converted to UTF-8 from the charset provided, instead it will attempt to find a more suitable charset to use for conversion and/or replace invalid byte sequences with a '?'

This should fix all of the more common broken-mailer cases, or at least all of the ones that are worth even bothering to try and work around.

GMime

Did some digging and discovered how to document more things using gtk-doc, so GMime's documentation has been getting improved. All of the public functions in GMime now appear in the documentation (I had forgotten to add some of the newer API additions to the gmime-sections.txt file) as well as began writing documentation for each of the doc sections (which I hadn't known how to document previously).

You can find up-to-the-minute GMime documentation here as I continue to chug away at expanding it.

While porting GMime's rfc2047 decoding logic to Evolution, I discovered some areas I could improve in GMime too.

1. After having ported my GMime decoder logic to Camel, I was curious to see how Thunderbird handled broken rfc2047 encoded headers and so had myself a peek at mozilla/netwerk/mime/src/nsMIMEHeaderParamImpl.cpp and noted that my new logic in Camel would handle everything that Thunderbird would handle and more.

I also took a peek at KMail's rfc2047 header decoding logic and discovered Evolution is now more flexible than it as well (again, Evolution now handles everything that KMail's decoder handles and more).

Thursday, April 12, 2007

GMime is damn fast

Came across this in the zsh archives:

I've been using the stand-alone:
http://www.fourmilab.ch/webtools/base64/

I never realised I already had it in openssl.
Thanks for the tip.

This led me to see what else I might already have or could get in a ready made package rather than having to use a source. (I do need this util in some form on every box and previously all my boxes were SCO OpenServer and I had been using the fourmilab source for years.)

Then I decided to run a few simple time tests, since this gets used a lot in various system() commands in application code and cgi scripts etc...

I expected the simple plain base64 util to be the fastest but I was wrong:

In each case I repeated the same command many times and always got almost exactly the same results, that is, +- 0m0.004 of the numbers shown so these are representative not just a fluke or caching effects or effects of the os being busy elsewhere.

the fourmilab source:
nj4:~ # time base64 </boot/vmlinuz >/dev/null

real 0m0.074s
user 0m0.072s
sys 0m0.000s

openssl:
nj4:~ # time opsnssl base64 </boot/vmlinuz >/dev/null
-bash: opsnssl: command not found

real 0m0.001s
user 0m0.000s
sys 0m0.000s

mimencode comes in the metamail package:
nj4:~ # time mimencode </boot/vmlinuz >/dev/null

real 0m0.078s
user 0m0.076s
sys 0m0.004s

gmime-uuencode:

It actually does base64 not uuencode if you give it the -m or --base64 option.

It comes in the gmime package.

It prepends and appends some junk to the actual base64 output so it's inconvenient for me to actually use:

nj4:~ # echo this is a test |gmime-uuencode -m -
begin-base64 600 -
dGhpcyBpcyBhIHRlc3QK
====
nj4:~ #
As for the speed:
nj4:~ # time gmime-uuencode -m - </boot/vmlinuz >/dev/null

real 0m0.009s
user 0m0.008s
sys 0m0.000s

Ooenssl destroys the rest.

So even though I have the dedicated util it's actually better to use openssl.

As a side benefit, thats one less special thing to maintain on all my boxes.

The funny thing here is that `opsnssl` doesn't even exist, so his performance comparison is kinda... well, non-existent. It's still interesting, though, because `gmime-uuencode -m` can encode the file almost as fast as it takes the system to realize that opsnssl doesn't exist!

I find it amusing that at the end of his comparison section, he spells OpenSSL wrong yet again ("Ooenssl").

Good stuff ;-)

Anyways... we've probably all made goofs like this. I mostly found it amusing because of how awesome GMime is :p