Somewhat inspired by a recent thread on the notmuch mailing-list, I thought I'd explain why decoding headers is so hard to get right. I'm sure just about every developer who has ever worked on an email client could tell you this, but I guess I'm going to be the one to do it.
Here's just a short list of the problems every developer faces when they go to implement a decoder for headers which have been (theoretically) encoded according to the rfc2047 specification:
- First off, there are technically two variations of header encoding formats specified by rfc2047 - one for phrases and one for unstructured text fields. They are very similar but you can't use the same rules for tokenizing them. I mention this because it seems that most MIME parsers miss this very subtle distinction and so, as you might imagine, do most MIME generators. Hell, most MIME generators probably never even heard of specifications to begin with it seems.
This brings us to:
- There are so many variations of how MIME headers fail to be tokenizable according to the rules of rfc2822 and rfc2047. You'll encounter fun stuff such as:
- encoded-word tokens illegally being embedded in other word tokens
- encoded-word tokens containing illegal characters in them (such as spaces, line breaks, and more) effectively making it so that a tokenizer can no longer, well, tokenize them (at least not easily)
- multi-byte character sequences being split between multiple encoded-word tokens which means that it's not possible to decode said encoded-word tokens individually
- the payloads of encoded-word tokens being split up into multiple encoded-word tokens, often splitting in a location which makes it impossible to decode the payload in isolation
You can see some examples here.
- Something that many developers seem to miss is the fact that each encoded-word token is allowed to be in different character encodings (you might have one token in UTF-8, another in ISO-8859-1 and yet another in koi8-r). Normally, this would be no big deal because you'd just decode each payload, then convert from the specified charset into UTF-8 via iconv() or something. However, due to the fun brokenness that I mentioned above in (2c) and (2d), this becomes more complicated.
If that isn't enough to make you want to throw your hands up in the air and mutter some profanities, there's more...
- Undeclared 8bit text in headers. Yep. Some mailers just didn't get the memo that they are supposed to encode non-ASCII text. So now you get to have the fun experience of mixing and matching undeclared 8bit text of God-only-knows what charset along with the content of (probably broken) encoded-words.
That said, I was able to help the notmuch developers solve this problem by letting them know about the GMIME_ENABLE_RFC2047_WORKAROUNDS flag that they could pass to g_mime_init(guint32 flags).
Any developer reading this blog post and thinking that they want to see how this is done in GMime, the source code for the rfc2047 decoder is located here. If the line numbers change in the future, just grep around for "rfc2047_token" and you should find it.
In other news... I cranked out a ton more code for MimeKit (my C# MIME parser library) yesterday. Yes, I know... I've got a serious problem with masochism having already written 2 MIME parsers and now I'm working on a third. When will the hurting stop? Never!
Oh, I guess I could point people at MimeKit's rfc2047 decoders. What you'll want to look at is MimeKit.Rfc2047.DecodePhrase(byte[] phrase) and MimeKit.Rfc2047.DecodeText(byte[] text).