Tuesday, April 11, 2017

Achievement Unlocked: MimeKit and MailKit in official Microsoft docs

Sunday, April 9, 2017

MimeKit 1.14 released

I am pleased to announce the release of MimeKit 1.14!

See below for a list of new features and bug fixes.

About MimeKit

MimeKit is a C# library which may be used for the creation and parsing of messages using the Multipurpose Internet Mail Extensions (MIME), as defined by numerous IETF specifications.

MimeKit features an extremely robust, high-performance parser designed to be able to preserve byte-for-byte information, allowing developers to re-serialize the parsed messages back to a stream exactly as the parser found them. It also features integrated DKIM-Signature, S/MIME v3.2, OpenPGP and MS-TNEF support.

Built on top of .NET, MimeKit can be used with any of the .NET languages including C#, VB.NET, F#, and more. It will also run on any platform that Mono or the new .NET Core runtime have been ported to including Windows, Linux, Mac OS, Windows Phone, Apple TV, Apple Watch, iPhone/iPad, Xbox, PlayStation, and Android devices.

Noteworthy changes in version 1.14

Added International Domain Name support for email addresses.
Added a work-around for mailers that didn't provide a disposition value in a Content-Disposition header.
Added a work-around for mailers that quote the disposition value in a Content-Disposition header.
Added automatic key retrieval functionality for the GnuPG crypto context.
Added a virtual DigestSigner property to DkimSigner so that consumers can hook into services such as Azure. (issue #296)
Fixed a bug in the MimeFilterBase.SaveRemainingInput() logic.
Preserve munged From-lines at the start of message/rfc822 parts.
Map code page 50220 to iso-2022-jp.
Format Reply-To and Sender headers as address headers when using Header.SetValue().
Fixed MimeMessage.CreateFromMailMessage() to set the MIME-Version header. (issue #290)

Installing via NuGet

The easiest way to install MimeKit is via NuGet.

In Visual Studio's Package Manager Console, simply enter the following command:

Install-Package MimeKit

Getting the Source Code

First, you'll need to clone MimeKit from my GitHub repository. To do this using the command-line version of Git, you'll need to issue the following command in your terminal:

git clone --recursive https://github.com/jstedfast/MimeKit.git

Documentation

API documentation can be found at http://mimekit.net/docs.

A copy of the XML-formatted API documentation is also included in the NuGet and/or Xamarin Component packages.

Thursday, March 19, 2015

Code Review: Microsoft's System.Net.Mail Implementation

For those reading my blog for the first time who don't know who I am, allow myself to introduce... myself.

I'm a self-proclaimed expert on the topic of email, specifically MIME, IMAP, SMTP, and POP3. I don't proclaim myself to be an expert on much, but email is something that maybe 1 or 2 dozen people in the world could probably get away with saying they know more than I do and actually back it up. I've got a lot of experience writing email software over the past 15 years and rarely do I come across mail software that does things better than I've done them. I'm also a critic of mail software design and implementation.

My latest endeavors in the email space are MimeKit and MailKit, both of which are open source and available on GitHub for your perusal should you doubt my expertise.

My point is: I think my review carries some weight, or I wouldn't be writing this.

Is that egotistical of me? Maybe a little.

I was actually just fixing a bug in MimeKit earlier, and when I went to examine Mono's System.Net.Mail.MailMessage implementation in order to figure out what the problem was with my System.Net.Mail.MailMessage to MimeKit.MimeMessage conversion, I thought, "hey, wait a minute... didn't Microsoft just recently release their BCL source code?" So I ended up taking a look, pretty quickly confirmed my suspicions, and was able to fix the bug.

When I begin looking at the source code for another mail library, I can't help but critique what I find.

MailAddress and MailAddressCollection

Parsing email addresses is probably the hardest thing to get right. It's what I would say makes or breaks a library (literally). To a casual onlooker, parsing email addresses probably seems like a trivial problem. "Just String.Split() on comma and then look for those angle bracket thingies and you're done, right?" Oh God, oh God, make the hurting stop. I need to stop here before I go into a long rant about this...

Okay, I'm back. Blood pressure has subsided.

Looking at MailAddressParser.cs (the internal parser used by MailAddressCollection), I'm actually pleasantly surprised. It actually looks pretty decent and I can tell that a lot of thought and care went into it. They actually use a tokenizer approach. Interestingly, they parse the string in reverse which is a pretty good idea, I must say. This approach probably helps simplify the parser logic a bit because parsing forward makes it difficult to know what the tokens belong to (is it the name token? or is it the local-part of an addr-spec? hard to know until I consume a few more tokens...).

For example, consider the following BNF grammar:

address         =       mailbox / group
mailbox         =       name-addr / addr-spec
name-addr       =       [display-name] angle-addr
angle-addr      =       [CFWS] "<" addr-spec ">" [CFWS] / obs-angle-addr
group           =       display-name ":" [mailbox-list / CFWS] ";"
                        [CFWS]
display-name    =       phrase
word            =       atom / quoted-string
phrase          =       1*word / obs-phrase
addr-spec       =       local-part "@" domain
local-part      =       dot-atom / quoted-string / obs-local-part
domain          =       dot-atom / domain-literal / obs-domain
obs-local-part  =       word *("." word)

Now consider the following email address: "Joe Example" <joe@example.com>

The first token you read will be "Joe Example" and you might think that that token indicates that it is the display name, but it doesn't. All you know is that you've got a 'quoted-string' token. A 'quoted-string' can be part of a 'phrase' or it can be (a part of) the 'local-part' of the address itself. You must read at least 1 more token before you'll be able to figure out what it actually is ('obs-local-part' makes things slightly more difficult). In this case, you'll get a '<' which indicates the start of an 'angle-addr', allowing you to assume that the 'quoted-string' you just got is indeed the 'display-name'.

If, however, you parse the address in reverse, things become a little simpler because you know immediately what to expect the next token to be a part of.

That's pretty cool. Kudos to the Microsoft engineers for thinking up this strategy.

Unfortunately, the parser does not handle the 'group' address type. I'll let this slide, however, partly because I'm still impressed by the approach the address parser took and also because I realize that System.Net.Mail is meant for creating and sending new messages, not parsing existing messages from the wild.

Okay, so how well does it serialize MailAddress?

Ugh. You know that face you make when you just see a guy get kicked in the nuts? Yea, that's the face I made when I saw line #227:

encodedAddress = String.Format(CultureInfo.InvariantCulture, "\"{0}\"", this.displayName);

The problem with the above code (and I'll soon be submitting a bug report about this) is that the displayName string might have embedded double quotes in it. You can't just surround it with quotes and expect it to work. This is the same mistake all those programmers make that allow SQL-injection attacks.

For an example of how this should be done, see MimeKit's MimeUtils.Quote() method.
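
To make the fix concrete, here is a minimal sketch of quoting a display name safely. It is not MimeKit's actual MimeUtils.Quote() implementation; the DisplayNameQuoter class and its Quote() method are hypothetical names used only for illustration. The idea is simply to escape every backslash and double quote before wrapping the text in quotes:

```csharp
using System.Text;

public static class DisplayNameQuoter
{
    // Hypothetical helper: wraps 'text' in double quotes, escaping any
    // embedded backslash or double quote with a leading backslash.
    public static string Quote (string text)
    {
        var quoted = new StringBuilder (text.Length + 2);

        quoted.Append ('"');
        foreach (char c in text) {
            if (c == '\\' || c == '"')
                quoted.Append ('\\');
            quoted.Append (c);
        }
        quoted.Append ('"');

        return quoted.ToString ();
    }
}
```

With this, a display name of Joe "The Hammer" Example is serialized as "Joe \"The Hammer\" Example" rather than as a quoted-string that ends prematurely at the second double quote.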

I had such high hopes... at least this is a fairly simple bug to fix. I'll probably just offer them a patch.

ContentType and ContentDisposition

Their parser is decent but it doesn't handle rfc2231 encoded parameter values, so I'm not overly impressed. It'll get the job done for simple name="value" parameter syntax, though, and it will decode the values encoded with the rfc2047 encoding scheme (which is not the right way to encode values, but it is common enough that any serious parser should handle it). The code is also pretty clean and uses a tokenizer approach, so that's a plus. I guess since this isn't really meant as a full-blown MIME parser, they can get away with this and not have it be a big deal. Fair enough.

Serialization, unsurprisingly, leaves a lot to be desired. Parameter values are, as I expected, encoded using rfc2047 syntax rather than the IETF standard rfc2231 syntax. I suppose that you could argue that this is for compatibility, but really it's just perpetuating bad practices. It also means that it can't properly fold long parameter values because the encoded value just becomes one big long encoded-word token. Yuck.
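
For comparison, here is a rough sketch of what the rfc2231 form of a parameter looks like. This is only an illustration under simplifying assumptions, not the code from System.Net.Mail or MimeKit: the Rfc2231 class and Encode() method are hypothetical, the charset is always UTF-8, the language tag is left empty, and the value is never split into continuation segments (name*0*=, name*1*=, and so on), which is the part of rfc2231 that makes folding long values possible:

```csharp
using System.Text;

public static class Rfc2231
{
    // Characters (besides SPACE, CTLs, '*', '\'' and '%') that may not
    // appear unencoded in an rfc2231 extended parameter value.
    const string TSpecials = "()<>@,;:\\\"/[]?=";

    // Hypothetical helper: returns name*=utf-8''percent-encoded-value
    public static string Encode (string name, string value)
    {
        var encoded = new StringBuilder ();

        encoded.Append (name).Append ("*=utf-8''");
        foreach (byte b in Encoding.UTF8.GetBytes (value)) {
            char c = (char) b;
            bool safe = b > 32 && b < 127 && c != '*' && c != '\'' && c != '%' && TSpecials.IndexOf (c) < 0;

            if (safe)
                encoded.Append (c);
            else
                encoded.Append ('%').Append (b.ToString ("X2"));
        }

        return encoded.ToString ();
    }
}
```

Because every encoded byte is an independent %XX triplet, a serializer is free to break the value between any two of them, which is not true of a single long rfc2047 encoded-word.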

Base64

Amusingly, Microsoft does not use their Convert.FromBase64() decoder to decode base64 in their System.Net.Mail implementation. I point this out mostly because it is the single most common problem users have with every one of the Open Source .NET mail libraries out there (other than MimeKit, of course) because Convert.FromBase64() relies on the data not having any line breaks, white space, etc in the input stream.

This should serve as a big hint to you guys writing your own .NET email libraries not to use Convert.FromBase64() ;-)

They use unsafe pointers, just like I do in MimeKit, but I'm not sure how their performance compares to MimeKit's yet. They do use a state machine, though, so rock on.
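
To give an idea of what a state-machine decoder looks like, here is a small sketch. It is neither Microsoft's nor MimeKit's implementation: the LenientBase64Decoder class is a hypothetical name, and it uses safe code and a List<byte> for brevity rather than pointers. It keeps its partial state between calls, so it can be fed arbitrary chunks of a stream, and it simply skips anything that is not in the base64 alphabet (line breaks, spaces, padding):

```csharp
using System.Collections.Generic;

public sealed class LenientBase64Decoder
{
    const string Alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    uint saved; // bits accumulated from previous characters
    int count;  // number of 6-bit groups currently held in 'saved' (0..3)

    // Decodes one chunk of input, appending the decoded bytes to 'output'.
    public void Decode (string chunk, List<byte> output)
    {
        foreach (char c in chunk) {
            int value = Alphabet.IndexOf (c);

            if (value < 0)
                continue; // skip CR, LF, spaces, '=' padding, and other noise

            saved = (saved << 6) | (uint) value;
            if (++count == 4) {
                // 4 characters = 24 bits = 3 decoded bytes
                output.Add (unchecked ((byte) (saved >> 16)));
                output.Add (unchecked ((byte) (saved >> 8)));
                output.Add (unchecked ((byte) saved));
                saved = 0;
                count = 0;
            }
        }
    }

    // Emits the bytes of a trailing partial quantum at the end of the input.
    public void Flush (List<byte> output)
    {
        if (count == 2) {
            output.Add (unchecked ((byte) (saved >> 4)));
        } else if (count == 3) {
            output.Add (unchecked ((byte) (saved >> 10)));
            output.Add (unchecked ((byte) (saved >> 2)));
        }
        saved = 0;
        count = 0;
    }
}
```

Calling Decode() once per buffer read from the network and Flush() at the end means the input never has to be collected into one big string first.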

I approve this base64 encoder/decoder implementation.

SmtpClient

One thing they do which is pretty cool is connection pooling. This is probably a pretty decent win for the types of things developers usually use System.Net.Mail's SmtpClient for (spam, anyone?).

The SASL AUTH mechanisms that they seem to support are NTLM, GSSAPI, LOGIN and WDIGEST (which apparently is some sort of IIS-specific authentication mechanism that I had never heard of until now). For those that were curious which SASL mechanisms SmtpClient supported, well, now you know.

The code is a bit hard to follow for someone not familiar with the codebase (not nearly as easy reading as the address or content-type parsers, I'm afraid), but it seems fairly well designed.

It does not appear to support PIPELINING or BINARYMIME like MailKit does, though. So, yay! Win for MailKit ;-)

They do support SMTPUTF8, so that's good.

It seems that if you set client.EnableSsl to true, it will also try STARTTLS if it isn't able to connect on the SSL port. I wasn't sure if it did that or not before, so this was something I was personally interested in knowing.

Hopefully my SmtpClient implementation review isn't too disappointing. I just don't know what to say about it, really. It's a pretty straight-forward send-command-wait-for-reply implementation and SMTP is pretty dead simple.

Conclusion

Overall the bits I was interested in were better than I expected they'd be. The parsers were pretty good (although incomplete) and the serializers were "good enough" for normal use.

Of course, it's not as good as MimeKit, but let's be honest, MimeKit sets the bar pretty high ;-)

Thursday, October 16, 2014

The Wait Is Over: MimeKit and MailKit Reach 1.0

After about a year in the making for MimeKit and nearly 8 months for MailKit, they've finally reached 1.0 status.

I started really working on MimeKit about a year ago wanting to give the .NET community a top-notch MIME parser that could handle anything the real world could throw at it. I wanted it to run on any platform that can run .NET (including mobile) and do it with remarkable speed and grace. I wanted to make it such that re-serializing the message would be a byte-for-byte copy of the original so that no data would ever be lost. This was also very important for my last goal, which was to support S/MIME and PGP out of the box.

All of these goals for MimeKit have been reached (partly thanks to the BouncyCastle project for the crypto support).

At the start of December last year, I began working on MailKit to aid in the adoption of MimeKit. It became clear that without a way to inter-operate with the various types of mail servers, .NET developers would be unlikely to adopt it.

I started off implementing an SmtpClient with support for SASL authentication, STARTTLS, and PIPELINING.

Soon after, I began working on a Pop3Client that was designed such that I could use MimeKit to parse messages on the fly, directly from the socket, without needing to read the message data line-by-line looking for a ".\r\n" sequence, concatenating the lines into a massive memory buffer before I could start to parse the message. This fact, combined with the fact that MimeKit's message parser is orders of magnitude faster than any other .NET parser I could find, makes MailKit the fastest POP3 library the world has ever seen.

After a month or so of avoiding the inevitable, I finally began working on an ImapClient which took me roughly two weeks to produce the initial prototype (compared to a single weekend for each of the other protocols). After many months of implementing dozens of the more widely used IMAP4 extensions (including the GMail extensions) and tweaking the APIs (along with bug fixing) thanks to feedback from some of the early adopters, I believe that it is finally complete enough to call 1.0.

In July, at the request of someone involved with a number of the IETF email-related specifications, I also implemented support for the new Internationalized Email standards, making MimeKit and MailKit the first - and only - .NET email libraries to support these standards.

If you want to do anything at all related to email in .NET, take a look at MimeKit and MailKit. I guarantee that you will not be disappointed.

Monday, March 10, 2014

GMime gets a Speed Boost

With all of the performance improvements I've been putting into MimeKit recently, it was about time to port some of these optimizations back to GMime.

In addition to other fixes that were in the queue, GMime 2.6.20 includes the "SIMD" optimization hack that I blogged about doing for MimeKit and I wanted to share the results. Below is a comparison of GMime 2.6.19 and 2.6.20 parsing the same 2GB mbox file on my 2011 Core-i5 iMac with the "persistent stream" option enabled on the GMimeParser:

[fejj@localhost gmime-2.6.19]$ ./gmime-mbox-parser really-big.mbox
Parsed 29792 messages in 5.15 seconds.

[fejj@localhost gmime-2.6.20]$ ./gmime-mbox-parser really-big.mbox
Parsed 29792 messages in 4.70 seconds.

That's a pretty respectable improvement. Interestingly, though, it's still not as fast as MimeKit utilizing Mono's LLVM backend:

[fejj@localhost MimeKit]$ mono --llvm ./mbox-parser.exe really-big.mbox
Parsed 29792 messages in 4.52 seconds.

Of course, to be fair, without the --llvm option, MimeKit doesn't fare quite so well:

[fejj@localhost MimeKit]$ mono ./mbox-parser.exe really-big.mbox
Parsed 29792 messages in 5.54 seconds.

I'm not sure what kind of optimizations LLVM utilizes when used from Mono vs clang (used to compile GMime via homebrew, which I suspect uses -O2), but nevertheless, it's still very impressive.

After talking with Rodrigo Kumpera from the Mono runtime team, it sounds like the --llvm option is essentially the -O2 optimizations minus a few of the options that cause problems with the Mono runtime, so effectively somewhere between -O1 and -O2.

I'd love to find out why MimeKit with the LLVM optimizer is faster than GMime compiled with clang (which also makes use of LLVM) with the same optimizations, but I think it'll be pretty hard to narrow down exactly because MimeKit isn't really a straight port of GMime (they are similar, but a lot of MimeKit is all-new in design and implementation).

Monday, February 3, 2014

Introducing MailKit, a cross-platform .NET mail-client library

Once I announced MimeKit, I knew it would only be a matter of time before I started getting asked about SMTP, IMAP, and/or POP3 support.

Let's just say,

Challenge... ACCEPTED!

I started off back in early December writing an SmtpClient so that developers using MimeKit wouldn't have to convert a MimeMessage to a System.Net.Mail.MailMessage in order to send it using System.Net.Mail.SmtpClient. This went pretty quickly because I've implemented several SMTP clients in the past. Implementing the various SASL authentication mechanisms probably took as much or more time than implementing the SMTP protocol.

The following weekend, I ended up implementing a Pop3Client. Originally, I had planned on more-or-less cloning the API we had used in Evolution, but I decided that I would take a different approach. I designed a simple IMessageSpool interface which more closely follows the limited functionality of POP3 and mbox spools instead of trying to map the Pop3Client to a Store/Folder paradigm like JavaMail and Evolution do (Evolution's mail library was loosely based on JavaMail). Mapping mbox and POP3 spools to Stores and Folders in Evolution was, to my recollection, rather awkward and I wanted to avoid that with MailKit.

At first I was loath to do it, but over the past 2 weeks I ended up writing an ImapClient as well. I'm sure Philip van Hoof will be pleased to note that I have a very nice BODYSTRUCTURE parser, although that API is not publicly exported.

Unlike the SmtpClient and Pop3Client, the ImapClient does not have all of its functionality on a single public class. Instead, ImapClient implements an IMessageStore which has a limited API, mostly meant for getting IFolders. I imagine that those who are familiar with the JavaMail and/or Evolution (Camel) APIs will recognize this design.

The IFolder interface isn't designed to be exactly like the JavaMail Folder API, though. I've been designing the interface incrementally as I implement the various IMAP extensions (I've found at least 37 of them at the time of this blog post, although I don't think I'll bother with ACL, MAILBOX-REFERRAL, or LOGIN-REFERRAL), so the API may continue to evolve as I go, but I think what I've got now will likely remain - I'll probably just be including additional APIs for the new stuff.

So far, I've implemented the following IMAP extensions: LITERAL+, NAMESPACE, CHILDREN, LOGIN-DISABLED, STARTTLS, MULTIAPPEND, UNSELECT, UIDPLUS, CONDSTORE, ESEARCH, SASL-IR, SORT, THREAD, SPECIAL-USE, MOVE, XLIST, and X-GM-EXT1. Phew, that was exhausting listing all of those!

Also newsworthy is that MimeKit is now as fast as GMime, which is pretty impressive considering that it is fully managed C# code.

Download MailKit 0.2 now and let the hacking begin!

Monday, October 7, 2013

Optimization Tips & Tricks used by MimeKit: Part 2

In my previous blog post, I talked about optimizing the most critical loop in MimeKit's MimeParser by:

Extending our read buffer by an extra byte (which later became 4 extra bytes) that I could set to '\n', allowing me to do the bounds check after the loop as opposed to in the loop, saving us roughly half the instructions.
Unrolling the loop in order to check for 4 bytes at a time for that '\n' by using some bit twiddling hacks (for 64-bit systems, we might gain a little more performance by checking 8 bytes at a time).

After implementing both of those optimizations, the time taken for MimeKit's parser to parse nearly 15,000 messages in a ~1.2 gigabyte mbox file dropped from around 10s to about 6s on my iMac with Mono 3.2.3 (32-bit). That is a massive increase in performance.

Even after both of those optimizations, that loop is still the most critical loop in the parser and the MimeParser.ScanContent() method, which contains it, is still the most critical method of the parser.

While the loop itself was a huge chunk of the time spent in that method, the next largest offender was writing the content of the MIME part into a System.IO.MemoryStream.

MemoryStream, for those that aren't familiar with C#, is just what it sounds like it is: a stream backed by a memory buffer (in C#, this happens to be a byte array). By default, a new MemoryStream starts with a buffer of about 256 bytes. As you write more to the MemoryStream, it resizes its internal memory buffer to either the minimum size needed to hold its existing content plus whatever number of bytes your latest Write() was called with or double the current internal buffer size, whichever is larger.

The performance problem here is that for MIME parts with large amounts of content, that buffer will be resized numerous times. Each time that buffer is resized, due to the way C# works, it will allocate a new buffer, zero the memory, and then copy the old content over to the new buffer. That's a lot of copying and creates a situation where the write operation can become exponentially worse as the internal buffer gets larger. Since MemoryStream contains a GetBuffer() method, its internal buffer really has to be a single contiguous block of memory. This means that there's little we could do to reduce the overhead of zeroing the new buffer every time it resizes beyond trying to come up with a different formula for calculating the next optimal buffer size.
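
If you want to watch the resizing happen, a small experiment like the following will do. This is just a sketch: the MemoryStreamGrowth class and CountResizes() method are hypothetical names, and the exact number of resizes you see depends on the growth policy of the runtime you are using:

```csharp
using System.IO;

public static class MemoryStreamGrowth
{
    // Writes at least 'total' bytes in 'chunkSize' pieces and returns how many
    // times the MemoryStream's capacity changed along the way. Each change
    // means a new buffer was allocated and the old content was copied into it.
    public static int CountResizes (int total, int chunkSize)
    {
        var chunk = new byte[chunkSize];
        int resizes = 0;

        using (var stream = new MemoryStream ()) {
            int capacity = stream.Capacity;

            for (int written = 0; written < total; written += chunkSize) {
                stream.Write (chunk, 0, chunkSize);

                if (stream.Capacity != capacity) {
                    capacity = stream.Capacity;
                    resizes++;
                }
            }
        }

        return resizes;
    }
}
```

Calling MemoryStreamGrowth.CountResizes (1 << 20, 4096) simulates writing a 1 megabyte MIME part 4096 bytes at a time. Every resize it reports copied everything written up to that point, so the later resizes are by far the most expensive ones.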

At first I decided to try the simple approach of using the MemoryStream constructor that allows specifying an initial capacity. By bumping up the initial capacity to 2048 bytes, things did improve, but only by a very disappointing amount. Larger initial capacities such as 4096 and 8192 bytes also made very little difference.

After brainstorming with my coworker and Mono runtime hacker, Rodrigo Kumpera, we decided that one way to solve this performance problem would be to write a custom memory-backed stream that didn't use a single contiguous block of memory, but instead used a list of non-contiguous memory blocks. When this stream needed to grow its internal memory storage, all it would need to do is allocate a new block of memory and append it to its internal list of blocks. This would allow for minimal overhead because only the new block would need to be zeroed and no data would need to be re-copied, ever. As it turns out, this approach would also allow me to limit the amount of unused memory used by the stream.

I dubbed this new memory-backed stream MimeKit.IO.MemoryBlockStream. As you can see, the implementation is pretty trivial (doesn't even require scary looking bit twiddling hacks like my previous optimization), but it made quite a difference in performance. By using this new memory stream, I was able to shave a full second off of the time needed to parse that mbox file I mentioned earlier, getting the total time spent down to 5s. That's starting to get pretty respectable, performance-wise.
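
For readers who just want the gist of the block-list idea, here is a stripped-down sketch. To be clear, this is not the MimeKit.IO.MemoryBlockStream source: the BlockListStream name, the 2048-byte block size, and the lack of SetLength() support are all simplifications I'm assuming for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical, simplified block-list stream (not MimeKit's MemoryBlockStream).
public sealed class BlockListStream : Stream
{
    const int BlockSize = 2048; // assumed block size for this sketch

    readonly List<byte[]> blocks = new List<byte[]> ();
    long length, position;

    public override bool CanRead => true;
    public override bool CanSeek => true;
    public override bool CanWrite => true;
    public override long Length => length;

    public override long Position {
        get { return position; }
        set { Seek (value, SeekOrigin.Begin); }
    }

    public override void Write (byte[] buffer, int offset, int count)
    {
        while (count > 0) {
            int index = (int) (position / BlockSize);
            int start = (int) (position % BlockSize);

            // Growing only allocates new blocks; existing data is never copied.
            while (blocks.Count <= index)
                blocks.Add (new byte[BlockSize]);

            int n = Math.Min (BlockSize - start, count);
            Buffer.BlockCopy (buffer, offset, blocks[index], start, n);
            position += n; offset += n; count -= n;
        }

        length = Math.Max (length, position);
    }

    public override int Read (byte[] buffer, int offset, int count)
    {
        int total = 0;

        while (count > 0 && position < length) {
            int index = (int) (position / BlockSize);
            int start = (int) (position % BlockSize);
            int n = (int) Math.Min (Math.Min (BlockSize - start, count), length - position);

            Buffer.BlockCopy (blocks[index], start, buffer, offset, n);
            position += n; offset += n; count -= n; total += n;
        }

        return total;
    }

    public override long Seek (long offset, SeekOrigin origin)
    {
        long target = origin == SeekOrigin.Begin ? offset
            : origin == SeekOrigin.Current ? position + offset
            : length + offset;

        if (target < 0)
            throw new IOException ("Cannot seek before the start of the stream.");

        return position = target;
    }

    public override void SetLength (long value) { throw new NotSupportedException (); }
    public override void Flush () { }
}
```

The trade-off is that there is no single contiguous buffer to hand out, so a GetBuffer()-style method is off the table, and the unused memory is bounded by the size of one block.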

As a comparison, let's compare the performance of MimeKit with what seems to be the 2 most popular C# MIME parsers out there (OpenPOP.NET and SharpMimeTools) and see how we do. I've been hyping up the performance of MimeKit a lot, so it had better live up to expectations, right? Let's see if it does.

Now, since none of the other C# MIME parsers I could find support parsing the Unix mbox file format, we'll write some test programs to parse the same message stream over and over (say, 20 thousand times) to compare MimeKit to OpenPOP.NET and SharpMimeTools.

Here's the test program I wrote for OpenPOP.NET:

using System;
using System.IO;
using System.Diagnostics;
using OpenPop.Mime;

namespace OpenPopParser {
    class Program
    {
        public static void Main (string[] args)
        {
            var stream = File.OpenRead (args[0]);
            var stopwatch = new Stopwatch ();

            stopwatch.Start ();
            for (int i = 0; i < 20000; i++) {
                var message = Message.Load (stream);
                stream.Position = 0;
            }
            stopwatch.Stop ();

            Console.WriteLine ("Parsed 20,000 messages in {0}", stopwatch.Elapsed);
        }
    }
}

Here's the SharpMimeTools parser I wrote for testing:

using System;
using System.IO;
using System.Diagnostics;
using anmar.SharpMimeTools;

namespace SharpMimeParser {
    class Program
    {
        public static void Main (string[] args)
        {
            var stream = File.OpenRead (args[0]);
            var stopwatch = new Stopwatch ();

            stopwatch.Start ();
            for (int i = 0; i < 20000; i++) {
                var message = new SharpMessage (stream);
                stream.Position = 0;
            }
            stopwatch.Stop ();

            Console.WriteLine ("Parsed 20,000 messages in {0}", stopwatch.Elapsed);
        }
    }
}

And here is the test program I used for MimeKit:

using System;
using System.IO;
using System.Diagnostics;
using MimeKit;

namespace MimeKitParser {
    class Program
    {
        public static void Main (string[] args)
        {
            var stream = File.OpenRead (args[0]);
            var stopwatch = new Stopwatch ();

            stopwatch.Start ();
            for (int i = 0; i < 20000; i++) {
                var parser = new MimeParser (stream, MimeFormat.Default);
                var message = parser.ParseMessage ();
                stream.Position = 0;
            }
            stopwatch.Stop ();

            Console.WriteLine ("Parsed 20,000 messages in {0}", stopwatch.Elapsed);
        }
    }
}

Note: Unfortunately, OpenPOP.NET's message parser completely failed to parse the Star Trek message I pulled out of my test suite at random (first message in the jwz.mbox.txt file included in MimeKit's UnitTests project) due to the Base64 decoder not liking some byte or another in the stream, so I had to patch OpenPOP.NET to no-op its base64 decoder (which, if anything, should make it faster).

And here are the results running on my 2011 MacBook Air:

[fejj@localhost OpenPopParser]$ mono ./OpenPopParser.exe ~/Projects/MimeKit/startrek.msg
Parsed 20,000 messages in 00:06:26.6825190

[fejj@localhost SharpMimeParser]$ mono ./SharpMimeParser.exe ~/Projects/MimeKit/startrek.msg
Parsed 20,000 messages in 00:19:30.0402064

[fejj@localhost MimeKit]$ mono ./MimeKitParser.exe ~/Projects/MimeKit/startrek.msg
Parsed 20,000 messages in 00:00:15.6159326

Whooooosh!

Not. Even. Close.

MimeKit is nearly 25x faster than OpenPOP.NET even after making its base64 decoder a no-op and 75x faster than SharpMimeTools.

Since I've been ranting against C# MIME parsers that made heavy use of regex, let me show you just how horrible regex is for parsing messages (performance-wise). There's a C# MIME parser called MIMER that is nearly pure regex, so what better library to illustrate my point? I wrote a very similar loop to the ones that I listed above, so I'm not going to bother repeating it again. Instead, I'll just skip to the results:

[fejj@localhost MimerParser]$ mono ./MimerParser.exe ~/Projects/MimeKit/startrek.msg
Parsed 20,000 messages in 00:16:51.4839129

Ouch. MimeKit is roughly 65x faster than a fully regex-based MIME parser. It's actually rather pathetic that this regex parser beats SharpMimeTools.

This is why, as a developer, it's important to understand the limitations of the tools you decide to use. Regex is great for some things but it is a terrible choice for others. As Jamie Zawinski might say,

Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.

Monday, September 30, 2013

Optimization Tips & Tricks used by MimeKit: Part 1

One of the goals of MimeKit, other than being the most robust MIME parser, is to be the fastest C# MIME parser this side of the Mississippi. Scratch that, fastest C# MIME parser in the World.

Seriously, though, I want to get MimeKit to be as fast and efficient as my C parser, GMime, which is one of the fastest (if not the fastest) MIME parsers out there right now, and I don't expect that any parser is likely to smoke GMime anytime soon, so using it as a baseline to compare against means that I have a realistic goal to set for MimeKit.

Now that you know the why, let's examine the how.

First, I'm using one of those rarely used features of C#: unsafe pointers. While that alone is not all that interesting, it's a cornerstone for one of the main techniques I've used. In C#, the fixed statement (which is how you get a pointer to a managed object) pins the object to a fixed location in memory to prevent the GC from moving that memory around while you operate on that buffer. Keep in mind, though, that telling the GC to pin a block of memory is not free, so you should not use this feature without careful consideration. If you're not careful, using pointers could actually make your code slower. Now that we've got that out of the way...

MIME is line-based, so a large part of every MIME parser is going to be searching for the next line of input. One of the reasons most MIME parsers (especially C# MIME parsers) are so slow is because they use a ReadLine() approach and most TextReaders likely use a naive algorithm for finding the end of the current line (as well as all of the extra allocating and copying into a string buffer):

    // scan for the end of the line
    while (inptr < inend && *inptr != (byte) '\n')
        inptr++;

The trick I used in GMime was to make sure that my read buffer was 1 byte larger than the max number of bytes I'd ever read from the underlying stream at a given time. This allowed me to set the first byte in the buffer beyond the bytes I just read from the stream to '\n', thus allowing for the ability to remove the inptr < inend check, opting to do the bounds check after the loop has completed instead. This nearly halves the number of instructions used per loop, making it much, much faster. So, now we have:

    // scan for the end of the line
    while (*inptr != (byte) '\n')
        inptr++;
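
For those who would like to play with the sentinel idea without compiling with unsafe code enabled, here is a self-contained sketch using array indexes instead of pointers. The SentinelScan class and FindLineEnd() method are hypothetical names, and since safe array accesses are bounds-checked by the runtime anyway, this only demonstrates the logic, not the speed-up:

```csharp
public static class SentinelScan
{
    // Returns the index of the first '\n' in buffer[start..end), or 'end' if
    // there is none. Requires buffer.Length > end so that there is one spare
    // byte just past the valid data to hold the sentinel.
    public static int FindLineEnd (byte[] buffer, int start, int end)
    {
        byte saved = buffer[end];
        int index = start;

        // plant the sentinel just past the valid data
        buffer[end] = (byte) '\n';

        // no 'index < end' test is needed: the sentinel guarantees termination
        while (buffer[index] != (byte) '\n')
            index++;

        buffer[end] = saved;

        // the bounds check happens once, after the loop: index == end
        // means the current line continues past the data we have so far
        return index;
    }
}
```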

But is that the best we can do?

Even after using this trick, it was still the hottest loop in my parser.

We've got no choice but to use a linear scan, but that doesn't mean that we can't do it faster. If we could somehow reduce the number of loops and likewise reduce the number of pointer increments, we could eliminate a bunch of the overhead of the loop. This technique is referred to as loop unrolling. Here's what brianonymous (from the ##csharp irc channel on freenode) and I came up with (with a little help from Sean Eron Anderson's bit twiddling hacks):

    uint* dword = (uint*) inptr;
    uint mask;

    do {
        mask = *dword++ ^ 0x0A0A0A0A;
        mask = ((mask - 0x01010101) & (~mask & 0x80808080));
    } while (mask == 0);

And here are the results of that optimization:

Now, keep in mind that on many architectures other than x86, in order to employ the trick above, inptr must first be 4-byte aligned (uint is 32bit) or it could cause a SIGBUS or worse, a crash. This is fairly easy to solve, though. All you need to do is increment inptr until you know that it is 4 byte aligned and then you can switch over to reading 4 bytes at a time as in the above loop. We'll also need to figure out which of those 4 bytes contained the '\n'. An easy way to solve that problem is to just linearly scan those 4 bytes using our previous single-byte-per-loop implementation starting at dword - 1. Here it is, your moment of Zen:

    // Note: we can always depend on byte[] arrays being
    // 4-byte aligned on 32bit and 64bit architectures
    int alignment = (inputIndex + 3) & ~3;
    byte* aligned = inptr + (alignment - inputIndex);
    byte* start = inptr;
    uint mask;

    // scan 1 byte at a time until inptr is 4-byte aligned
    while (inptr < aligned && *inptr != (byte) '\n')
        inptr++;

    if (inptr == aligned) {
        // -funroll-loops
        uint* dword = (uint*) inptr;

        // scan 4 bytes at a time
        do {
            mask = *dword++ ^ 0x0A0A0A0A;
            mask = ((mask - 0x01010101) & (~mask & 0x80808080));
        } while (mask == 0);

        // back up to the dword that contained the '\n' and find the exact byte
        inptr = (byte*) (dword - 1);
        while (*inptr != (byte) '\n')
            inptr++;
    }

Note: In the above code snippet, 'inputIndex' is the byte offset of 'inptr' into the byte array. Since we can safely assume that index 0 is 4-byte aligned, we can do a simple calculation to get the next multiple of 4 and then advance 'inptr' by the difference between that value and 'inputIndex' to get the next 4-byte aligned pointer.

That's great, but what does all that hex mumbo jumbo do? And why does it work?

Let's go over this 1 step at a time...

    mask = *dword++ ^ 0x0A0A0A0A;

This xor's the 4 bytes that dword points at with 0x0A0A0A0A (0x0A0A0A0A is just 4 bytes of '\n'). The xor sets every byte that is equal to 0x0A to 0 in mask. Every other byte will be non-zero.

    mask - 0x01010101

When we subtract 0x01010101 from mask, the result will be that only bytes greater than 0x80 will contain any high-order bits (and any byte that was originally 0x0A in our input will now be 0xFF).

    ~mask & 0x80808080

This inverts the value of mask, resulting in no bytes having the highest bit set except for those that had a 0 in that slot before (including the byte we're looking for). By then bitwise-and'ing it with 0x80808080, we get 0x80 for each byte that was originally 0x0A in our input or otherwise had the highest bit set after the bit inversion.

Because there's no way for any byte to have the highest bit set in both sides of the encompassing bitwise-and except for the character we're looking for (0x0A), the mask will always be 0 unless any of the bytes within were originally 0x0A, which would then break us out of the loop.
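To make the bit twiddling a little more concrete, here's a minimal sketch of the same mask computation as safe C# operating on a single 32-bit value. The class and method names are made up for illustration, and the `unchecked` block is only there so the subtraction wraps around even if the project is compiled with overflow checking enabled.

```csharp
using System;

public static class NewlineMask
{
    // Returns a non-zero value if any of the 4 bytes packed into 'word'
    // is 0x0A ('\n'), and 0 otherwise.
    public static uint Compute(uint word)
    {
        unchecked {
            // Bytes equal to '\n' become 0x00; all other bytes are non-zero.
            uint mask = word ^ 0x0A0A0A0A;

            // A 0x00 byte wraps to 0xFF on subtraction (high bit set) while
            // its inverse also has the high bit set, so the two sides of
            // the '&' only agree when the word contains a '\n'.
            return (mask - 0x01010101) & (~mask & 0x80808080);
        }
    }
}
```

For example, `NewlineMask.Compute(0x430A4241)` (the bytes 'A', 'B', '\n', 'C' as read on a little-endian machine) returns 0x00800000, while `NewlineMask.Compute(0x44434241)` ("ABCD") returns 0.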

Well, that concludes part 1 as it is time for me to go to bed so I can wake up at a reasonable time tomorrow morning.

Good night!

Saturday, September 28, 2013

MimeKit: Coming to a NuGet near you.

If, like me, you've been trapped in the invisible box of despair, bemoaning the woeful inadequacies of every .NET MIME library you've ever found on the internets, cry no more: MimeKit is here.

I've just released MimeKit v0.5 as a NuGet Package. There's still plenty of work left to do, mostly involving writing more API documentation, but I don't expect to change the API much between now and v1.0. For all the mobile MIME lovers out there, you'll be pleased to note that in addition to the .NET Framework 4.0 assembly, the NuGet package also includes assemblies built for Xamarin.Android and Xamarin.iOS. It's completely open source and licensed under the MIT/X11 license, so you can use it in any project you want - no restrictions. Once MimeKit goes v1.0, I plan on adding it to Xamarin's Component Store as well for even easier mobile development. If that doesn't turn that frown upside down, I don't know what will.

For those that don't already know, MimeKit is a really fast MIME parser that uses a real tokenizer instead of regular expressions and string.Split() to parse and decode headers. Among numerous other things, it can properly handle rfc2047 encoded-word tokens that contain quoted-printable and base64 payloads which have been improperly broken apart (i.e. a quoted-printable triplet or a base64 quartet is split between 2 or more encoded-word tokens) as well as handling cases where multibyte character sequences are split between words thanks to the state machine nature of MimeKit's rfc2047 text and phrase decoders (yes, there are 2 types of encoded-word tokens - something most other MIME parsers have failed to take notice of). With the use of MimeKit.ParserOptions, the user can specify his or her own fallback charset (in addition to UTF-8 and ISO-8859-1 that MimeKit has built in), allowing MimeKit to gracefully handle undeclared 8bit text in headers.
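To illustrate why the state machine matters when a multibyte character is split between two encoded-word tokens, here's a rough sketch using the stateful System.Text.Decoder from the .NET class library. This is not MimeKit's actual decoder; the class and method names are hypothetical, and it assumes the base64 or quoted-printable layer has already been stripped from each token and that all tokens share one charset.

```csharp
using System;
using System.Text;

public static class SplitCharacterDemo
{
    // Decodes the raw payload bytes of consecutive encoded-words, carrying
    // any incomplete multibyte sequence over into the next chunk.
    public static string DecodeChunks(Encoding charset, params byte[][] chunks)
    {
        Decoder decoder = charset.GetDecoder();
        var result = new StringBuilder();

        foreach (byte[] chunk in chunks) {
            char[] chars = new char[charset.GetMaxCharCount(chunk.Length)];

            // flush: false keeps trailing partial sequences in the decoder
            // state instead of turning them into replacement characters.
            int n = decoder.GetChars(chunk, 0, chunk.Length, chars, 0, false);
            result.Append(chars, 0, n);
        }

        // Note: bytes still pending after the last chunk are dropped here.
        return result.ToString();
    }
}
```

Decoding each chunk on its own with Encoding.GetString() would instead produce replacement characters wherever a character straddles two tokens.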

When constructing MIME messages, MimeKit provides the user with the ability to specify any character encoding available on the system for encoding each individual header (or, in the case of address headers: each individual email address). If none is specified, UTF-8 is used unless the characters will fit nicely into ISO-8859-1. MimeKit's rfc2047 and rfc2231 encoders do proper breaking of text (i.e. it avoids breaking between surrogate pairs) before the actual encoding step, thus ensuring that each encoded-word token (or parameter value) is correctly self-contained.
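Here's a minimal sketch of what breaking text without splitting a surrogate pair can look like. It is an illustration only, not MimeKit's encoder: the names are hypothetical and it measures pieces in UTF-16 code units, whereas a real rfc2047 encoder has to budget for the encoded length of each token.

```csharp
using System;
using System.Collections.Generic;

public static class SurrogateSafeSplitter
{
    // Splits 'text' into pieces of at most 'maxChars' UTF-16 code units
    // without separating a high surrogate from the low surrogate after it.
    public static List<string> Split(string text, int maxChars)
    {
        if (maxChars < 2)
            throw new ArgumentOutOfRangeException(nameof(maxChars));

        var pieces = new List<string>();
        int index = 0;

        while (index < text.Length) {
            int length = Math.Min(maxChars, text.Length - index);

            // If the cut would land in the middle of a pair, back up by one.
            if (index + length < text.Length && char.IsHighSurrogate(text[index + length - 1]))
                length--;

            pieces.Add(text.Substring(index, length));
            index += length;
        }

        return pieces;
    }
}
```

Each piece can then be converted to bytes and encoded independently, since no piece ends with half of a character.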

S/MIME support is also available in the .NET Framework 4.0 assembly (not yet supported in the Android or iOS assemblies due to the System.Security assembly being unavailable on those platforms). MimeKit supports signing, encrypting, decrypting, and verifying S/MIME message parts. For signing, you can either use the preferred multipart/signed approach or the application/[x-]pkcs7-signature mime-type, whichever you prefer.

I'd love to support PGP/MIME as well, but this is a bit more complicated as I would likely need to depend on external native libraries and programs (such as GpgME and GnuPG) which means that MimeKit would likely have to become 32bit-only (currently, libgpgme is only available for 32bit Windows).

I hope you enjoy using MimeKit as much as I have enjoyed implementing it!

Note: For those using my GMime library, fear not! I have not forgotten about you! I plan to bring many of the API and parser improvements that I've made to MimeKit back to GMime in the near future.

For those using the C# bindings, I'd highly recommend that you consider switching to MimeKit instead. I've based MimeKit's API on my GMime API, so porting to MimeKit should be fairly straightforward.

Sunday, September 15, 2013

Time for a rant on mime parsers...

Warning: Viewer discretion is advised.

Where should I begin?

I guess I should start by saying that I am obsessed with MIME and, in particular, MIME parsers. No, really. I am obsessed. Don't believe me? I've written and/or worked on several MIME parsers at this point. It started off in my college days working on Spruce which had a horrendously bad MIME parser, and so as you read farther along in my rant about shitty MIME parsers, keep in mind: I've been there, I've written a shitty MIME parser.

As a handful of people are aware, I've recently started implementing a C# MIME parser called MimeKit. As I work on this, I've been searching around on GitHub and Google to see what other MIME parsers exist out there to find out what sort of APIs they provide. I thought perhaps I'd find one that offers a well-designed API that would inspire me. Perhaps, by some miracle, I'd find one that was actually pretty good that I could just contribute to instead of writing my own from scratch (yea, wishful thinking). Instead, all I have found are poorly designed and implemented MIME parsers, many of which probably belong on the front page of the Daily WTF.

I guess I'll start with some softballs.

First, there's the fact that every single one of them was written as a System.String parser. Don't be fooled by the ones claiming to be "stream parsers", because all any of those did was to slap a TextReader on top of the byte stream and start using reader.ReadLine(). What's so bad about that, you ask? For those not familiar with MIME, I'd like for you to take a look at the raw email sources in your inboxes, particularly if you have correspondence with anyone outside of the US. Hopefully most of your friends and colleagues are using more-or-less MIME compliant email clients, but I guarantee you'll find at least a few emails with raw 8bit text.

Now, if the language they were using was C or C++, they might be able to get away with doing this because they'd technically be operating on byte arrays, but with Java and C#, a 'string' is a unicode string. Tell me: how does one get a unicode string from a raw byte array?

Bingo. You need to know the charset before you can convert those bytes into unicode characters.

To be fair, there's really no good way of handling raw 8bit text in message headers, but by using a TextReader approach, you are really limiting the possibilities.
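As an illustration of the kind of possibility you keep open by holding on to the raw bytes, here's a small sketch that tries strict UTF-8 first and only then falls back to a caller-supplied charset. It is a guess at one reasonable strategy, not a description of what any particular parser does, and the names are hypothetical.

```csharp
using System;
using System.Text;

public static class Undeclared8BitText
{
    // A UTF-8 decoder that throws on invalid input instead of silently
    // substituting replacement characters.
    static readonly Encoding StrictUtf8 = new UTF8Encoding(false, true);

    // Tries UTF-8 first; if the bytes are not valid UTF-8, decodes them
    // using the fallback charset instead.
    public static string Decode(byte[] raw, Encoding fallback)
    {
        try {
            return StrictUtf8.GetString(raw);
        } catch (DecoderFallbackException) {
            return fallback.GetString(raw);
        }
    }
}
```

Once a TextReader has already turned the bytes into a string using the wrong charset, there is nothing left to retry.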

Next up is the ReadLine() approach. One of the 2 early parsers in GMime (pan-mime-parser.c back in the version 0.7 days) used a ReadLine() approach, so I understand the thinking behind this. And really, there's nothing wrong with this approach as far as correctness goes, it's more of a "this can never be fast" complaint. Of the two early parsers in GMime, the pan-mime-parser.c backend was horribly slow compared to the in-memory parser. Of course, that's not very surprising. More surprising to me at the time was that when I wrote GMime's current generation of parser (sometime between v0.7 and v1.0), it was just as fast as the in-memory parser ever was and only ever had up to 4k in a read buffer at any given time. My point is, there are far better approaches than ReadLine() if you want your parser to be reasonably performant... and why wouldn't you want that? Your users definitely want that.

Okay, now come the more serious problems that I encountered in nearly all of the mime parser libraries I found.

I think that every single mime parser I've found so far uses the "String.Split()" approach for parsing address headers and/or for parsing parameter lists on headers such as Content-Type and Content-Disposition.

Here's an example from one C# MIME parser:

string[] emails = addressHeader.Split(',');
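To see why that is a problem, consider a header value like `"Doe, John" <john@example.com>, jane@example.com`: splitting on ',' yields three pieces instead of two addresses. The sketch below (hypothetical names, and nowhere near a complete address parser since it ignores comments, groups and route syntax) shows the bare minimum of state a scanner needs just to get past quoted strings.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class AddressListSplitter
{
    // Splits on commas that are not inside a quoted-string.
    public static List<string> Split(string header)
    {
        var addresses = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < header.Length; i++) {
            char c = header[i];

            if (inQuotes && c == '\\' && i + 1 < header.Length) {
                // quoted-pair: keep the backslash and the escaped character
                current.Append(c).Append(header[++i]);
            } else if (c == '"') {
                inQuotes = !inQuotes;
                current.Append(c);
            } else if (c == ',' && !inQuotes) {
                Flush(addresses, current);
            } else {
                current.Append(c);
            }
        }

        Flush(addresses, current);
        return addresses;
    }

    static void Flush(List<string> addresses, StringBuilder current)
    {
        string value = current.ToString().Trim();

        if (value.Length > 0)
            addresses.Add(value);

        current.Clear();
    }
}
```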

Here's how this same parser decodes encoded-word tokens:

private static void DecodeHeaders(NameValueCollection headers)
{
    ArrayList tmpKeys = new ArrayList(headers.Keys);

    foreach (string key in headers.AllKeys)
    {
        //strip qp encoding information from the header if present
        headers[key] = Regex.Replace(headers[key].ToString(), @"=\?.*?\?Q\?(.*?)\?=",
            new MatchEvaluator(MyMatchEvaluator), RegexOptions.IgnoreCase | RegexOptions.Multiline);
        headers[key] = Regex.Replace(headers[key].ToString(), @"=\?.*?\?B\?(.*?)\?=",
            new MatchEvaluator(MyMatchEvaluatorBase64), RegexOptions.IgnoreCase | RegexOptions.Multiline);
    }
}

private static string MyMatchEvaluator(Match m)
{
    return DecodeQP(m.Groups[1].Value);
}

private static string MyMatchEvaluatorBase64(Match m)
{
    System.Text.Encoding enc = System.Text.Encoding.UTF7;
    return enc.GetString(Convert.FromBase64String(m.Groups[1].Value));
}

Excuse my language, but what the fuck? It completely throws away the charset in each of those encoded-word tokens. In the case of quoted-printable tokens, it assumes they are all ASCII (actually, latin1 may work as well?) and in the case of base64 encoded-word tokens, it assumes they are all in UTF-7!?!? Where in the world did he get that idea? I can't begin to imagine his code working on any base64 encoded-word tokens in the real world. If anything is deserving of a double facepalm, this is it.
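For contrast, here's a minimal sketch of decoding a single, well-formed encoded-word token while actually honoring its charset. The names are hypothetical, it assumes the charset name is one that Encoding.GetEncoding() knows about, and it makes no attempt at the leniency a real-world parser needs (no language suffixes, no recovery from broken tokens).

```csharp
using System;
using System.IO;
using System.Text;

public static class EncodedWord
{
    // Decodes one token of the form =?charset?B?payload?= or =?charset?Q?payload?=
    public static string Decode(string token)
    {
        if (!token.StartsWith("=?", StringComparison.Ordinal) ||
            !token.EndsWith("?=", StringComparison.Ordinal))
            throw new FormatException("Not an encoded-word.");

        int charsetEnd = token.IndexOf('?', 2);
        int encodingEnd = charsetEnd < 0 ? -1 : token.IndexOf('?', charsetEnd + 1);

        if (encodingEnd != charsetEnd + 2 || encodingEnd >= token.Length - 2)
            throw new FormatException("Malformed encoded-word.");

        Encoding charset = Encoding.GetEncoding(token.Substring(2, charsetEnd - 2));
        char encoding = char.ToUpperInvariant(token[charsetEnd + 1]);
        string payload = token.Substring(encodingEnd + 1, token.Length - 2 - (encodingEnd + 1));
        byte[] raw;

        if (encoding == 'B')
            raw = Convert.FromBase64String(payload);
        else if (encoding == 'Q')
            raw = DecodeQ(payload);
        else
            throw new FormatException("Unknown encoding.");

        // The charset from the token decides how the bytes become text.
        return charset.GetString(raw);
    }

    static byte[] DecodeQ(string payload)
    {
        var bytes = new MemoryStream();

        for (int i = 0; i < payload.Length; i++) {
            char c = payload[i];

            if (c == '_') {
                // '_' means space, but only inside a "Q" payload
                bytes.WriteByte(0x20);
            } else if (c == '=' && i + 2 < payload.Length) {
                bytes.WriteByte(Convert.ToByte(payload.Substring(i + 1, 2), 16));
                i += 2;
            } else {
                bytes.WriteByte((byte) c);
            }
        }

        return bytes.ToArray();
    }
}
```

Note that the bytes are only converted to a string after the transfer encoding has been removed, which is the step the code above skips.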

I'd just like to point out that this is what this project's description states:

A small, efficient, and working mime parser library written in c#.
...
I've used several open source mime parsers before, but they all either
fail on one kind of encoding or the other, or miss some crucial
information. That's why I decided to finally have a go at the problem
myself.

I'll grant you that his MIME parser is small, but I'd have to take issue with the "efficient" and "working" adjectives. With the heavy use of string allocations and regex matching, it could hardly be considered "efficient". And as the code pointed out above illustrates, "working" is a bit of an overstatement.

Folks... this is what you get when you opt for a "lightweight" MIME parser because you think that parsers like GMime are "bloated".

On to parser #2... I like to call this the "Humpty Dumpty" approach:

public static StringDictionary parseHeaderFieldBody ( String field, String fieldbody ) {
    if ( fieldbody==null )
        return null;
    // FIXME: rewrite parseHeaderFieldBody to being regexp based.
    fieldbody = SharpMimeTools.uncommentString (fieldbody);
    StringDictionary fieldbodycol = new StringDictionary ();
    String[] words = fieldbody.Split(new Char[]{';'});
    if ( words.Length>0 ) {
        fieldbodycol.Add (field.ToLower(), words[0].ToLower().Trim());
        for (int i=1; i<words.Length; i++ ) {
            String[] param = words[i].Trim(new Char[]{' ', '\t'}).Split(new Char[]{'='}, 2);
            if ( param.Length==2 ) {
                param[0] = param[0].Trim(new Char[]{' ', '\t'});
                param[1] = param[1].Trim(new Char[]{' ', '\t'});
                if ( param[1].StartsWith("\"") && !param[1].EndsWith("\"")) {
                    do {
                        param[1] += ";" + words[++i];
                    } while ( !words[i].EndsWith("\"") && i<words.Length);
                }
                fieldbodycol.Add ( param[0], SharpMimeTools.parserfc2047Header (param[1].TrimEnd(';').Trim('\"', ' ')) );
            }
        }
    }
    return fieldbodycol;
}

I'll give this guy some credit, at least he saw that his String.Split() approach was flawed and so tried to compensate by piecing Humpty Dumpty back together again. Of course, with his String.Trim()ing, he just won't be able to put him back together again with any level of certainty. The white space in those quoted tokens may have significant meaning.

Many of the C# MIME parsers out there like to use Regex all over the place. Here's a snippet from one parser that is entirely written in Regex (yea, have fun maintaining that...):

if (m_EncodedWordPattern.RegularExpression.IsMatch(field.Body))
{
    string charset = m_CharsetPattern.RegularExpression.Match(field.Body).Value;
    string text = m_EncodedTextPattern.RegularExpression.Match(field.Body).Value;
    string encoding = m_EncodingPattern.RegularExpression.Match(field.Body).Value;

    Encoding enc = Encoding.GetEncoding(charset);

    byte[] bar;

    if (encoding.ToLower().Equals("q"))
    {
        bar = m_QPDecoder.Decode(ref text);
    }
    else
    {
        bar = m_B64decoder.Decode(ref text);
    }                    
    text = enc.GetString(bar);

    field.Body = Regex.Replace(field.Body,
        m_EncodedWordPattern.TextPattern, text);
    field.Body = field.Body.Replace('_', ' ');
}

Even if we pretend that the regex pattern strings are correct in their definitions (because they are god-awful to read and I can't be bothered to double-check them), the replacing of '_' with a space is wrong (it should only be done in the "q" case) and the Regex.Replace() is just evil. Not to mention that there could be multiple encoded-words per field.Body, which this code utterly fails to handle.

Guys. I know you love regular expressions and that they are very very useful, but they are no substitute for writing a real tokenizer. This is especially true if you want to be lenient in what you accept (and in the case of MIME, you really need to be).

A Moment of Zen