Sunday, September 15, 2013

Time for a rant on mime parsers...

Warning: Viewer discretion is advised.

Where should I begin?

I guess I should start by saying that I am obsessed with MIME and, in particular, MIME parsers. No, really. I am obsessed. Don't believe me? I've written and/or worked on several MIME parsers at this point. It started off in my college days working on Spruce, which had a horrendously bad MIME parser, so as you read further along in my rant about shitty MIME parsers, keep in mind: I've been there, I've written a shitty MIME parser.

As a handful of people are aware, I've recently started implementing a C# MIME parser called MimeKit. As I work on this, I've been searching around on GitHub and Google to see what other MIME parsers exist out there and find out what sort of APIs they provide. I thought perhaps I'd find one that offers a well-designed API that would inspire me. Perhaps, by some miracle, I'd find one that was actually pretty good and that I could just contribute to instead of writing my own from scratch (yeah, wishful thinking). Instead, all I have found are poorly designed and implemented MIME parsers, many of which probably belong on the front page of The Daily WTF.

I guess I'll start with some softballs.

First, there's the fact that every single one of them was written as a System.String parser. Don't be fooled by the ones claiming to be "stream parsers", because all any of those do is slap a TextReader on top of the byte stream and start calling reader.ReadLine(). What's so bad about that, you ask? For those not familiar with MIME, I'd like you to take a look at the raw email sources in your inbox, particularly if you have correspondence with anyone outside of the US. Hopefully most of your friends and colleagues are using more-or-less MIME-compliant email clients, but I guarantee you'll find at least a few emails with raw 8bit text.

Now, if the language they were using were C or C++, they might be able to get away with this because they'd technically be operating on byte arrays, but in Java and C#, a 'string' is a Unicode string. Tell me: how does one get a Unicode string from a raw byte array?

Bingo. You need to know the charset before you can convert those bytes into Unicode characters.

To be fair, there's really no good way of handling raw 8bit text in message headers, but by using a TextReader approach, you are really limiting the possibilities.
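To make that concrete, here's a hypothetical illustration (the byte values and charset choices are mine, not taken from any of the parsers discussed here): the same three raw bytes decode fine as latin1, but are irreversibly mangled if a TextReader has already committed to UTF-8.

```csharp
using System;
using System.Text;

class CharsetDemo
{
    static void Main()
    {
        // "été" as raw latin1 bytes -- exactly what an unencoded
        // 8bit header from a French sender might contain.
        byte[] raw = { 0xE9, 0x74, 0xE9 };

        // Decoding with the right charset recovers the text...
        string latin1 = Encoding.GetEncoding("iso-8859-1").GetString(raw);
        Console.WriteLine(latin1); // été

        // ...but a reader that committed to UTF-8 up front replaces the
        // invalid byte sequences with U+FFFD, destroying the original bytes.
        string utf8 = Encoding.UTF8.GetString(raw);
        Console.WriteLine(utf8.IndexOf('\uFFFD') >= 0); // True
    }
}
```

Once those bytes have been replaced with U+FFFD, no amount of cleverness downstream can recover them; a byte-oriented parser can defer the charset decision until it actually has to produce text.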

Next up is the ReadLine() approach. One of the two early parsers in GMime (pan-mime-parser.c, back in the version 0.7 days) used a ReadLine() approach, so I understand the thinking behind it. And really, there's nothing wrong with this approach as far as correctness goes; it's more of a "this can never be fast" complaint. Of the two early parsers in GMime, the pan-mime-parser.c backend was horribly slow compared to the in-memory parser. Of course, that's not very surprising. More surprising to me at the time was that when I wrote GMime's current generation of parser (sometime between v0.7 and v1.0), it was just as fast as the in-memory parser had ever been while only ever holding up to 4k in its read buffer at any given time. My point is, there are far better approaches than ReadLine() if you want your parser to be reasonably performant... and why wouldn't you want that? Your users definitely want that.
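To sketch what I mean (this is an illustration of the fixed-buffer idea, not GMime's actual code): instead of ReadLine(), you read into a fixed 4K byte buffer and scan it yourself, so you never pay for a per-line string allocation. A real MIME parser would track header/boundary state as it scans; here we just count lines.

```csharp
using System;
using System.IO;
using System.Text;

class BufferedScan
{
    // Read 4K at a time and scan for line boundaries ourselves, rather
    // than allocating a managed string per line via TextReader.ReadLine().
    public static int CountLines(Stream stream)
    {
        var buffer = new byte[4096];
        int lines = 0, n;

        while ((n = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < n; i++)
            {
                if (buffer[i] == (byte) '\n')
                    lines++;
            }
        }

        return lines;
    }

    static void Main()
    {
        var data = Encoding.ASCII.GetBytes("Subject: test\r\n\r\nHello.\n");
        Console.WriteLine(CountLines(new MemoryStream(data))); // 3
    }
}
```

Because the parser owns the buffer, it also stays byte-oriented the whole way through, which sidesteps the charset problem described above.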

Okay, now come the more serious problems that I encountered in nearly all of the MIME parser libraries I found.

I think that every single MIME parser I've found so far uses the "String.Split()" approach for parsing address headers and/or for parsing parameter lists on headers such as Content-Type and Content-Disposition.

Here's an example from one C# MIME parser:

string[] emails = addressHeader.Split(',');
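Why is that a problem? Commas are perfectly legal inside quoted display-names, so Split(',') happily cuts a single address in half. A quick illustration (the address is made up):

```csharp
using System;

class SplitDemo
{
    static void Main()
    {
        // One perfectly legal address whose quoted display-name
        // contains a comma:
        string addressHeader = "\"Doe, John\" <john@example.com>";
        string[] emails = addressHeader.Split(',');

        Console.WriteLine(emails.Length); // 2 -- neither half is a valid address
        Console.WriteLine(emails[0]);     // "Doe
        Console.WriteLine(emails[1]);     //  John" <john@example.com>
    }
}
```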

Here's how this same parser decodes encoded-word tokens:

private static void DecodeHeaders(NameValueCollection headers)
{
    ArrayList tmpKeys = new ArrayList(headers.Keys);

    foreach (string key in headers.AllKeys)
    {
        //strip qp encoding information from the header if present
        headers[key] = Regex.Replace(headers[key].ToString(), @"=\?.*?\?Q\?(.*?)\?=",
            new MatchEvaluator(MyMatchEvaluator), RegexOptions.IgnoreCase | RegexOptions.Multiline);
        headers[key] = Regex.Replace(headers[key].ToString(), @"=\?.*?\?B\?(.*?)\?=",
            new MatchEvaluator(MyMatchEvaluatorBase64), RegexOptions.IgnoreCase | RegexOptions.Multiline);
    }
}

private static string MyMatchEvaluator(Match m)
{
    return DecodeQP(m.Groups[1].Value);
}

private static string MyMatchEvaluatorBase64(Match m)
{
    System.Text.Encoding enc = System.Text.Encoding.UTF7;
    return enc.GetString(Convert.FromBase64String(m.Groups[1].Value));
}

Excuse my language, but what the fuck? It completely throws away the charset in each of those encoded-word tokens. In the case of quoted-printable tokens, it assumes they are all ASCII (actually, latin1 may work as well?) and in the case of base64 encoded-word tokens, it assumes they are all in UTF-7!?!? Where in the world did he get that idea? I can't begin to imagine his code working on any base64 encoded-word tokens in the real world. If anything is deserving of a double facepalm, this is it.
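For contrast, here's a minimal sketch of what a charset-respecting decoder might look like. It's still regex-based for brevity, so it ignores real-world corner cases (whitespace between adjacent encoded-words, rfc2231 language suffixes on the charset, and so on), but at least it honors the declared charset and encoding instead of hard-coding ASCII and UTF-7:

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

class EncodedWordDemo
{
    // Decode rfc2047 encoded-words, honoring the declared charset.
    public static string DecodeEncodedWords(string text)
    {
        return Regex.Replace(text,
            @"=\?(?<charset>[^?]+)\?(?<enc>[BbQq])\?(?<payload>[^?]*)\?=",
            m => {
                var charset = Encoding.GetEncoding(m.Groups["charset"].Value);
                var payload = m.Groups["payload"].Value;

                byte[] bytes = m.Groups["enc"].Value.Equals("B", StringComparison.OrdinalIgnoreCase)
                    ? Convert.FromBase64String(payload)
                    : DecodeQ(payload);

                return charset.GetString(bytes);
            });
    }

    // The "Q" encoding: '_' means space, =XX is a hex-encoded byte.
    static byte[] DecodeQ(string payload)
    {
        var bytes = new List<byte>();

        for (int i = 0; i < payload.Length; i++)
        {
            if (payload[i] == '_')
                bytes.Add((byte) ' ');
            else if (payload[i] == '=' && i + 2 < payload.Length)
            {
                bytes.Add(Convert.ToByte(payload.Substring(i + 1, 2), 16));
                i += 2;
            }
            else
                bytes.Add((byte) payload[i]);
        }

        return bytes.ToArray();
    }

    static void Main()
    {
        Console.WriteLine(DecodeEncodedWords("=?iso-8859-1?Q?=E9t=E9?=")); // été
        Console.WriteLine(DecodeEncodedWords("=?utf-8?B?w6l0w6k=?="));     // été
    }
}
```

Note that because the MatchEvaluator runs per match, multiple encoded-words in one header each get decoded with their own charset, which is exactly what the quoted code above cannot do.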

I'd just like to point out that this is what this project's description states:

A small, efficient, and working mime parser library written in c#.
...
I've used several open source mime parsers before, but they all either
fail on one kind of encoding or the other, or miss some crucial
information. That's why I decided to finally have a go at the problem
myself.

I'll grant you that his MIME parser is small, but I'd have to take issue with the "efficient" and "working" adjectives. With the heavy use of string allocations and regex matching, it could hardly be considered "efficient". And as the code quoted above illustrates, "working" is a bit of an overstatement.

Folks... this is what you get when you opt for a "lightweight" MIME parser because you think that parsers like GMime are "bloated".

On to parser #2... I like to call this the "Humpty Dumpty" approach:

public static StringDictionary parseHeaderFieldBody ( String field, String fieldbody ) {
    if ( fieldbody==null )
        return null;
    // FIXME: rewrite parseHeaderFieldBody to being regexp based.
    fieldbody = SharpMimeTools.uncommentString (fieldbody);
    StringDictionary fieldbodycol = new StringDictionary ();
    String[] words = fieldbody.Split(new Char[]{';'});
    if ( words.Length>0 ) {
        fieldbodycol.Add (field.ToLower(), words[0].ToLower().Trim());
        for (int i=1; i<words.Length; i++ ) {
            String[] param = words[i].Trim(new Char[]{' ', '\t'}).Split(new Char[]{'='}, 2);
            if ( param.Length==2 ) {
                param[0] = param[0].Trim(new Char[]{' ', '\t'});
                param[1] = param[1].Trim(new Char[]{' ', '\t'});
                if ( param[1].StartsWith("\"") && !param[1].EndsWith("\"")) {
                    do {
                        param[1] += ";" + words[++i];
                    } while ( !words[i].EndsWith("\"") && i<words.Length);
                }
                fieldbodycol.Add ( param[0], SharpMimeTools.parserfc2047Header (param[1].TrimEnd(';').Trim('\"', ' ')) );
            }
        }
    }
    return fieldbodycol;
}

I'll give this guy some credit: at least he saw that his String.Split() approach was flawed and tried to compensate by piecing Humpty Dumpty back together again. Of course, with all of his String.Trim()ing, he just won't be able to put him back together again with any level of certainty. The whitespace in those quoted tokens may have significant meaning.
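A made-up example of what gets lost along the way: a quoted parameter value containing both an embedded ';' and significant leading whitespace.

```csharp
using System;

class HumptyDumptyDemo
{
    static void Main()
    {
        // A legal Content-Disposition body; the quoted filename contains
        // a ';' and starts with a significant space (hypothetical values):
        string body = "attachment; filename=\" quarterly; final .pdf\"";

        // Splitting on ';' cuts the quoted value in half...
        string[] words = body.Split(';');
        Console.WriteLine(words.Length); // 3

        // ...and even after re-joining the halves, Trim('"', ' ') throws
        // away the leading space that was inside the quotes.
        string rejoined = words[1] + ";" + words[2];
        Console.WriteLine(rejoined.Split('=')[1].Trim('"', ' '));
        // quarterly; final .pdf  -- the leading space is gone for good
    }
}
```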

Many of the C# MIME parsers out there like to use Regex all over the place. Here's a snippet from one parser that is entirely written in Regex (yeah, have fun maintaining that...):

if (m_EncodedWordPattern.RegularExpression.IsMatch(field.Body))
{
    string charset = m_CharsetPattern.RegularExpression.Match(field.Body).Value;
    string text = m_EncodedTextPattern.RegularExpression.Match(field.Body).Value;
    string encoding = m_EncodingPattern.RegularExpression.Match(field.Body).Value;

    Encoding enc = Encoding.GetEncoding(charset);

    byte[] bar;

    if (encoding.ToLower().Equals("q"))
    {
        bar = m_QPDecoder.Decode(ref text);
    }
    else
    {
        bar = m_B64decoder.Decode(ref text);
    }                    
    text = enc.GetString(bar);

    field.Body = Regex.Replace(field.Body,
        m_EncodedWordPattern.TextPattern, text);
    field.Body = field.Body.Replace('_', ' ');
}

Let's pretend that the regex pattern strings are correct in their definitions (they are god-awful to read and I can't be bothered to double-check them). Even so, the replacing of '_' with a space is wrong (it should only be done in the "q" case), and the Regex.Replace() is just evil. Not to mention that there can be multiple encoded-words per field.Body, which this code utterly fails to handle.

Guys, I know you love regular expressions and that they are very, very useful, but they are no substitute for writing a real tokenizer. This is especially true if you want to be lenient in what you accept (and in the case of MIME, you really need to be).
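To show how little code a "real tokenizer" actually takes, here's a minimal hand-rolled one (a sketch of mine, not from any of the parsers above) that splits a parameter list on ';' while respecting quoted strings - something neither String.Split() nor a one-shot regex can do. It's still naive (no backslash escapes, no rfc2231 continuations), but it's the right foundation to build on:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

class TokenizerDemo
{
    // Walk the input character by character, tracking quote state, and
    // only treat ';' as a separator when we're outside a quoted string.
    public static List<string> SplitParams(string body)
    {
        var parts = new List<string>();
        var current = new StringBuilder();
        bool quoted = false;

        foreach (char c in body)
        {
            if (c == '"')
                quoted = !quoted;

            if (c == ';' && !quoted)
            {
                parts.Add(current.ToString().Trim());
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }

        parts.Add(current.ToString().Trim());
        return parts;
    }

    static void Main()
    {
        var parts = SplitParams("attachment; filename=\"a; b.pdf\"");
        Console.WriteLine(parts.Count); // 2
        Console.WriteLine(parts[1]);    // filename="a; b.pdf"
    }
}
```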

8 comments:

Anonymous said...

I always love a good MIME parser rant in the morning from fejj ;-)

evgeny said...

Well, after this post, it is a must to post a follow-up article on how to design and implement a good MIME parser! :-)

Anonymous said...

This way: https://github.com/jstedfast/MimeKit. I agree with fejj, although I also think that the MIME parser should or could aid with BODYSTRUCTURE parsing, and allow creating a Message structure by pouring in the data of unfetched MIME parts after having parsed a BODYSTRUCTURE or BODY. I have not found such support in either GMime or MimeKit, and I think it would aid an e-mail client or component developer a lot. I also think MimeMessage should subclass MessagePart. I'm sure that if I started working with GMime and/or MimeKit I'd have other overall design/API comments that are less related to the actual parsing - the tokenized stream parsing approach (or whatever one must call it) laid out by fejj in earlier posts makes a lot of sense. Maybe someday, if I'm ever crazy enough to get back into doing e-mail stuff.

Sandro Magi said...

My Sasa library contains a MIME parser (Sasa.Net.Mail.Message). Not sure if you were aware of it when writing this article. I have no illusions that it's fully compliant or as rigorous as you'd like it to be, so trash away! :-)

All I can say is that I've been using it for over 7 years now and have processed over 50,000 messages, including foreign messages with encoded subject lines and addresses, and the current version handles all the cases I've encountered so far.

I have sinned in parsing from a System.String, but if you could provide a specific example in which this would fail, I'd love to take a look and expand my understanding of this domain.

Unknown said...

Hi Sandro,

Parsing System.String isn't the unholiest of sins and it won't blow up or anything, it's more that you are just limiting the potential of your MIME parser to never be able to handle 8bit text in headers or Content-Transfer-Encoding: 8bit.

I don't have an example message handy here at the office, but imagine a message where you have a Japanese sender whose mail client didn't properly encode his name (raw Shift-JIS). Then you have a mixed array of German (iso-8859-1), Chinese (big5 or gb18030), Korean (euc-kr), and Russian (windows-1251 or koi8-r, probably) recipients whose names are each in their respective charset encodings, and none of them got properly rfc2047 encoded. The Subject, again, is in Shift-JIS, and the body of the message is in utf-8 with a Content-Transfer-Encoding: 8bit.

How does your parser deal with that? No matter what charset you set on the StreamReader, you lose.

Probably not a super realistic example, sure, but not impossible. A more realistic example might be an email with raw 8bit headers in one charset and the message body in another, with Content-Transfer-Encoding: 8bit.

This time you only have two charsets to deal with, but your StreamReader can only deal with one. Whichever charset you pick for your StreamReader, it may have to drop illegal byte sequences for that charset as it converts to the Unicode strings it feeds to your parser, so you'll never be able to convert back in order to try the other charset for the body.

I've only taken a cursory look at Sasa.Net.Mail.Message so far, so there might be other issues, but your DecodeWord() implementation shouldn't be replacing _'s with spaces if the base64 encoding was used - that feature is only for quoted-printable encoded payloads.

Your EncodeWord() implementation is also wrong. It needs to encode more than just non-ASCII characters - there is a list of encoded-word specials that it also needs to encode, defined in rfc2047 and rfc*822, depending on whether the value is text from an unstructured header or a phrase from a structured header field (such as email address names). It also needs to put linear whitespace between tokens. It looks to me like your encoder will produce things like this:

To: blah=?utf-8?B?ALKFWKKiwr728?=blah

That is illegal according to the rfc. encoded-words *MUST* be parseable as distinct word tokens.

You might be interested in reading my previous blog post about the corner cases you should really handle in your rfc2047 decoders.

Sandro Magi said...

Thanks for the notes, Jeff. By "recipients whose names are each in their respective charset encodings and none of them got properly rfc2047 encoded", are these malformed messages, or are they still considered well-formed?

I think I understand in principle what you mean, though. To summarize: the octet stream representing a MIME message may switch encodings on the fly, in which case no matter how you read the octets into a string using system readers, the result will be messed up unless your reader is itself MIME-aware. Yikes.

If these are well-formed messages, then I'll definitely run into them at some point. These sound like exceptional cases though. Maybe I'll try to mine a trove of Asian spam messages and see how Sasa does with those.

Re: DecodeWord feedback, I remember I added that specifically because I ran into that case, but I could have incorrectly generalized. I'll take a look again, thanks.

Re: EncodeWord feedback, I believe I am treating every encoded word as unstructured text. Didn't realize they had slightly different rules for structured headers and unstructured text, so I'll take a second look. Thanks again!

Unknown said...

They are definitely considered "malformed" :-)

> I think I understand in principle what you mean, though. To summarize: the octet stream representing a MIME message may switch encodings on the fly, in which case no matter how you read the octets into a string using system readers, the result will be messed up unless your reader is itself MIME-aware. Yikes.

Correct.

> Thanks again!

Glad to help. Feel free to take a look at MimeKit's source on GitHub if you think it might be of help.

I've mostly been mind-dumping so far, so my comments are sparse in some areas that I need to go back and clean up a bit. Feel free to poke me on GitHub or via email if you have questions.

I'd love to see more mime parsers robustificate and handle more edge cases better.

Guy G. said...

As soon as I see string.ToLower().Equals(...) I know it's a shitty parser. Most C# programmers don't know how to compare strings the right way...
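For anyone wondering what Guy means, a quick illustration (mine, not from any of the parsers above):

```csharp
using System;

class CompareDemo
{
    static void Main()
    {
        string encoding = "Q";

        // Allocates a lowercased copy first (and culture-sensitive
        // lowercasing has surprises, e.g. "I".ToLower() under the
        // Turkish locale is not "i"):
        Console.WriteLine(encoding.ToLower().Equals("q")); // True

        // Allocation-free and culture-independent:
        Console.WriteLine(encoding.Equals("q", StringComparison.OrdinalIgnoreCase)); // True
    }
}
```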

Code Snippet Licensing

All code posted to this blog is licensed under the MIT/X11 license unless otherwise stated in the post itself.