A Moment of Zen: rants

Showing posts with label rants. Show all posts

Sunday, September 15, 2013

Time for a rant on mime parsers...

Warning: Viewer discretion is advised.

Where should I begin?

I guess I should start by saying that I am obsessed with MIME and, in particular, MIME parsers. No, really. I am obsessed. Don't believe me? I've written and/or worked on several MIME parsers at this point. It started off in my college days working on Spruce which had a horrendously bad MIME parser, and so as you read farther along in my rant about shitty MIME parsers, keep in mind: I've been there, I've written a shitty MIME parser.

As a handful of people are aware, I've recently started implementing a C# MIME parser called MimeKit. As I work on this, I've been searching around on GitHub and Google to see what other MIME parsers exist out there to find out what sort of APIs they provide. I thought perhaps I'll find one that offers a well-designed API that will inspire me. Perhaps, by some miracle, I'd find one that was actually pretty good that I could just contribute to instead of writing my own from scratch (yea, wishful thinking). Instead, all I have found are poorly designed and implemented MIME parsers, many probably belong on the front page of the Daily WTF.

I guess I'll start with some softballs.

First, there's the fact that every single one of them were written as System.String parsers. Don't be fooled by the ones claiming to be "stream parsers", because all any of those did was to slap a TextReader on top of the byte stream and start using reader.ReadLine(). What's so bad about that, you ask? For those not familiar with MIME, I'd like for you to take a look at the raw email sources in your inboxes particularly if you have correspondence with anyone outside of the US. Hopefully most of your friends and colleagues are using more-or-less MIME compliant email clients, but I guarantee you'll find at least a few emails with raw 8bit text.

Now, if the language they were using was C or C++, they might be able to get away with doing this because they'd technically be operating on byte arrays, but with Java and C#, a 'string' is a unicode string. Tell me: how does one get a unicode string from a raw byte array?

Bingo. You need to know the charset before you can convert those bytes into unicode characters.

To be fair, there's really no good way of handling raw 8bit text in message headers, but by using a TextReader approach, you are really limiting the possibilities.

Next up is the ReadLine() approach. One of the 2 early parsers in GMime (pan-mime-parser.c back in the version 0.7 days) used a ReadLine() approach, so I understand the thinking behind this. And really, there's nothing wrong with this approach as far as correctness goes, it's more of a "this can never be fast" complaint. Of the two early parsers in GMime, the pan-mime-parser.c backend was horribly slow compared to the in-memory parser. Of course, that's not very surprising. More surprising to me at the time was that when I wrote GMime's current generation of parser (sometime between v0.7 and v1.0), it was just as fast as the in-memory parser ever was and only ever had up to 4k in a read buffer at any given time. My point is, there are far better approaches than ReadLine() if you want your parser to be reasonably performant... and why wouldn't you want that? Your users definitely want that.

Okay, now come the more serious problems that I encountered in nearly all of the mime parser libraries I found.

I think that every single mime parser I've found so far uses the "String.Split()" approach for parsing address headers and/or for parsing parameter lists on headers such as Content-Type and Content-Disposition.

Here's an example from one C# MIME parser:

string[] emails = addressHeader.Split(',');

Here's how this same parser decodes encoded-word tokens:

private static void DecodeHeaders(NameValueCollection headers)
{
    ArrayList tmpKeys = new ArrayList(headers.Keys);

    foreach (string key in headers.AllKeys)
    {
        //strip qp encoding information from the header if present
        headers[key] = Regex.Replace(headers[key].ToString(), @"=\?.*?\?Q\?(.*?)\?=",
            new MatchEvaluator(MyMatchEvaluator), RegexOptions.IgnoreCase | RegexOptions.Multiline);
        headers[key] = Regex.Replace(headers[key].ToString(), @"=\?.*?\?B\?(.*?)\?=",
            new MatchEvaluator(MyMatchEvaluatorBase64), RegexOptions.IgnoreCase | RegexOptions.Multiline);
    }
}

private static string MyMatchEvaluator(Match m)
{
    return DecodeQP(m.Groups[1].Value);
}

private static string MyMatchEvaluatorBase64(Match m)
{
    System.Text.Encoding enc = System.Text.Encoding.UTF7;
    return enc.GetString(Convert.FromBase64String(m.Groups[1].Value));
}

Excuse my language, but what the fuck? It completely throws away the charset in each of those encoded-word tokens. In the case of quoted-printable tokens, it assumes they are all ASCII (actually, latin1 may work as well?) and in the case of base64 encoded-word tokens, it assumes they are all in UTF-7!?!? Where in the world did he get that idea? I can't begin to imagine his code working on any base64 encoded-word tokens in the real world. If anything is deserving of a double facepalm, this is it.

I'd just like to point out that this is what this project's description states:

A small, efficient, and working mime parser library written in c#.
...
I've used several open source mime parsers before, but they all either
fail on one kind of encoding or the other, or miss some crucial
information. That's why I decided to finally have a go at the problem
myself.

I'll grant you that his MIME parser is small, but I'd have to take issue with the "efficient" and "working" adjectives. With the heavy use of string allocations and regex matching, it could hardly be considered "efficient". And as the code pointed out above illustrates, "working" is a bit of an overstatement.

Folks... this is what you get when you opt for a "lightweight" MIME parser because you think that parsers like GMime are "bloated".

On to parser #2... I like to call this the "Humpty Dumpty" approach:

public static StringDictionary parseHeaderFieldBody ( String field, String fieldbody ) {
    if ( fieldbody==null )
        return null;
    // FIXME: rewrite parseHeaderFieldBody to being regexp based.
    fieldbody = SharpMimeTools.uncommentString (fieldbody);
    StringDictionary fieldbodycol = new StringDictionary ();
    String[] words = fieldbody.Split(new Char[]{';'});
    if ( words.Length>0 ) {
        fieldbodycol.Add (field.ToLower(), words[0].ToLower().Trim());
        for (int i=1; i<words.Length; i++ ) {
            String[] param = words[i].Trim(new Char[]{' ', '\t'}).Split(new Char[]{'='}, 2);
            if ( param.Length==2 ) {
                param[0] = param[0].Trim(new Char[]{' ', '\t'});
                param[1] = param[1].Trim(new Char[]{' ', '\t'});
                if ( param[1].StartsWith("\"") && !param[1].EndsWith("\"")) {
                    do {
                        param[1] += ";" + words[++i];
                    } while ( !words[i].EndsWith("\"") && i<words.Length);
                }
                fieldbodycol.Add ( param[0], SharpMimeTools.parserfc2047Header (param[1].TrimEnd(';').Trim('\"', ' ')) );
            }
        }
    }
    return fieldbodycol;
}

I'll give this guy some credit, at least he saw that his String.Split() approach was flawed and so tried to compensate by piecing Humpty Dumpty back together again. Of course, with his String.Trim()ing, he just won't be able to put him back together again with any level of certainty. The white space in those quoted tokens may have significant meaning.

Many of the C# MIME parsers out there like to use Regex all over the place. Here's a snippet from one parser that is entirely written in Regex (yea, have fun maintaining that...):

if (m_EncodedWordPattern.RegularExpression.IsMatch(field.Body))
{
    string charset = m_CharsetPattern.RegularExpression.Match(field.Body).Value;
    string text = m_EncodedTextPattern.RegularExpression.Match(field.Body).Value;
    string encoding = m_EncodingPattern.RegularExpression.Match(field.Body).Value;

    Encoding enc = Encoding.GetEncoding(charset);

    byte[] bar;

    if (encoding.ToLower().Equals("q"))
    {
        bar = m_QPDecoder.Decode(ref text);
    }
    else
    {
        bar = m_B64decoder.Decode(ref text);
    }                    
    text = enc.GetString(bar);

    field.Body = Regex.Replace(field.Body,
        m_EncodedWordPattern.TextPattern, text);
    field.Body = field.Body.Replace('_', ' ');
}

Let's pretend that the regex pattern strings are correct in their definitions (because they are god-awful to read and I can't be bothered to double-check them), the replacing of '_' with a space is wrong (it should only be done in the "q" case) and the Regex.Replace() is just evil. Not to mention that there could be multiple encoded-words per field.Body which this code utterly fails to handle.

Guys. I know you love regular expressions and that they are very very useful, but they are no substitute for writing a real tokenizer. This is especially true if you want to be lenient in what you accept (and in the case of MIME, you really need to be).

Sunday, August 11, 2013

Why decoding rfc2047-encoded headers is hard

Somewhat inspired by a recent thread on the notmuch mailing-list, I thought I'd explain why decoding headers is so hard to get right. I'm sure just about every developer who has ever worked on an email client could tell you this, but I guess I'm going to be the one to do it.

Here's just a short list of the problems every developer faces when they go to implement a decoder for headers which have been (theoretically) encoded according to the rfc2047 specification:

First off, there are technically two variations of header encoding formats specified by rfc2047 - one for phrases and one for unstructured text fields. They are very similar but you can't use the same rules for tokenizing them. I mention this because it seems that most MIME parsers miss this very subtle distinction and so, as you might imagine, do most MIME generators. Hell, most MIME generators probably never even heard of specifications to begin with it seems.

This brings us to:
There are so many variations of how MIME headers fail to be tokenizable according to the rules of rfc2822 and rfc2047. You'll encounter fun stuff such as:
1. encoded-word tokens illegally being embedded in other word tokens
2. encoded-word tokens containing illegal characters in them (such as spaces, line breaks, and more) effectively making it so that a tokenizer can no longer, well, tokenize them (at least not easily)
3. multi-byte character sequences being split between multiple encoded-word tokens which means that it's not possible to decode said encoded-word tokens individually
4. the payloads of encoded-word tokens being split up into multiple encoded-word tokens, often splitting in a location which makes it impossible to decode the payload in isolation
You can see some examples here.
Something that many developers seem to miss is the fact that each encoded-word token is allowed to be in different character encodings (you might have one token in UTF-8, another in ISO-8859-1 and yet another in koi8-r). Normally, this would be no big deal because you'd just decode each payload, then convert from the specified charset into UTF-8 via iconv() or something. However, due to the fun brokenness that I mentioned above in (2c) and (2d), this becomes more complicated.

If that isn't enough to make you want to throw your hands up in the air and mutter some profanities, there's more...
Undeclared 8bit text in headers. Yep. Some mailers just didn't get the memo that they are supposed to encode non-ASCII text. So now you get to have the fun experience of mixing and matching undeclared 8bit text of God-only-knows what charset along with the content of (probably broken) encoded-words.

That said, I was able to help the notmuch developers solve this problem by letting them know about the GMIME_ENABLE_RFC2047_WORKAROUNDS flag that they could pass to g_mime_init(guint32 flags).

Any developer reading this blog post and thinking that they want to see how this is done in GMime, the source code for the rfc2047 decoder is located here. If the line numbers change in the future, just grep around for "rfc2047_token" and you should find it.

In other news... I cranked out a ton more code for MimeKit (my C# MIME parser library) yesterday. Yes, I know... I've got a serious problem with masochism having already written 2 MIME parsers and now I'm working on a third. When will the hurting stop? Never!

Oh, I guess I could point people at MimeKit's rfc2047 decoders. What you'll want to look at is MimeKit.Rfc2047.DecodePhrase(byte[] phrase) and MimeKit.Rfc2047.DecodeText(byte[] text).

Tuesday, January 18, 2011

I am Disappoint: No Love for Froyo on Galaxy S

Based on an anonymous post on the XDA Developer Forums, the reason behind the lack of a Froyo update for Samsung Galaxy S phones in the US appears to be because Samsung is greedy.

The following quote is the entirety of the message as it appears on the forums for your convenience (with added emphasis by me).

Hello,
I’m going to step across the NDAs and explain the issues behind the Android Froyo update to Samsung Galaxy S phones in the United States. I think most of you have come to this realization yourself now: the withholding of the Froyo update is a largely political one, not a technological one: Froyo runs quite well on Galaxy S phones, as those of you that have run leaked updates may have noticed.
To explain the political situation, first, a primer on how phone firmware upgrades work for carriers. When a carrier decides to sell a phone, a contract is usually written between the phone manufacturer and the carrier. In this contract, the cost of updates (to the carrier) is usually outlined. Updates are usually broken into several types: critical updates, maintenance updates, and feature updates. Critical updates are those that resolve a critical bug in the phone, such as the phone overheating. Maintenance updates involve routine updates to resolve bugs and other issues reported by the carrier. Finally, feature updates add some new feature in software that wasn’t present before. Critical updates are usually free, maintenance updates have some maintenance fee associated with them, and feature updates are usually costly.
In the past, most phone updates would mainly consist of critical and maintenance updates. Carriers almost never want to incur the cost of a feature update because it is of little benefit to them, adds little to the device, and involves a lot of testing on the carrier end. Android has changed the playing field, however – since the Android Open Source Project is constantly being updated, and that information being made widely available to the public, there is pressure for the phone to be constantly updated with the latest version of Android. With most manufacturers, such as HTC, Motorola, etc. This is fine and considered a maintenance upgrade. Samsung, however, considers it a feature update, and requires carriers to pay a per device update fee for each incremental Android update.
Now, here’s where the politics come in: most U.S. carriers aren’t very happy with Samsung’s decision to charge for Android updates as feature updates, especially since they are essentially charging for the Android Open Source Project’s efforts, and the effort on Samsung’s end is rather minimal. As a result of perhaps, corporate collusion, all U.S. carriers have decided to refuse to pay for the Android 2.2 update, in hopes that the devaluation of the Galaxy S line will cause Samsung to drop their fees and give the update to the carriers. The situation has panned out differently in other parts of the world, but this is the situation in the United States.
Some of you might have noticed Verion’s Fascinate updated, but without 2.2 : This is a result of a maintenance agreement Samsung must honor combined with Verizon’s unwillingness to pay the update fees. In short, Android 2.2 is on hold for Galaxy S phones until the U.S. carriers and Samsung reach a consensus.
Some might wonder why I didn’t deliver this over a more legitimate news channel – the short answer: I don’t want to lose my job. I do, however, appreciate transparency, which is why I'm here.

Having bought a Samsung Galaxy S phone back in August with the expectation that it would get the Froyo update soon, I am disappoint.

This is just one more annoyance added to my growing list of annoyances about Android-based phones and my Galaxy S phone in particular.

Bitch and moan about Apple being evil all you want, but even Apple doesn't do this to their users. I will never ever buy another Samsung Android phone again. This really rubs me the wrong way.

Update: People have been commenting that Android devices have gotten better OS update support than iPhones. This is simply not the case. The iPhone 3G came out ~6 months before the Android G1 and the G1 stopped getting updates (latest update was Android 1.6) looooooooong before the iPhone 3G stopped getting updates. Apple at least kept providing updates to the 3G through iOS 4.1.x, latest update being this past fall. So even though the iPhone 3G is older than the G1, it got OS updates until long after updates stopped coming for the G1. My point is that to even compare the G1 to the iPhone 3G in terms of time supported by new OS upgrades, the G1 would have to at least have Android 2.2 (which came out how long ago?? I mean, even 2.3 is out now). In other words: Apple pushes all OS upgrades for their iPhones for at least 2 years (length of a contract) while no Android handset maker ever has - the G1 got OS upgrades for what? 6 months?

Wednesday, June 25, 2008

PulseAudio: A Solution In Search of a Problem

I upgraded the Linux OS on my laptop yesterday so that I had a more up-to-date GNOME installation such that I could actually build Evolution from SVN and hack it a bit to solve some annoyances I've been having. However, this is for another blog post.

So... PulseAudio. W. T. F.

All day yesterday I'd been wondering why my sound volume was so low even though I had cranked it all the way up to 100%. Since I was mostly focused getting work done yesterday, I didn't spend any time investigating until last night when I was forced to do so.

I don't remember the exact order of things, but I did watch an episode of Harsh Realm in Xine while also IM'ing some friends in Pidgin, but the sound was so low I could barely hear anything. So a friend on IRC suggested I kill pulseaudio and restart it as this supposedly would help fix the "wonkyness" (evidently I'm not the only one with PulseAudio problems).

This worked for a while, long enough to play the DVD iirc, but afterward I had no sound. So I thought "well, I guess this is just more PulseAudio 'wonkyness', so let me kill pulseaudio again". This time when I killed it, my sound card got flooded with every audio event that had happened over the hour or so I was running Xine. Yay. (that was a sarcastic yay in case you couldn't tell)

Audio worked again for a bit. At one point, a friend IM's me a link so I clicked it and no browser started up (which seemed a bit odd). Tried it again, still nothing. So I figured "ok, I'll just launch firefox from my gnome-terminal and see if that prints any errors". No errors, but it did just hang. Oh goodie. Ctrl-C'd and killed pulseaudio again. At this point my friend IM'd me again which made Pidgin completely lock up:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
26422 fejj      20   0 3054m 1.3g  27m D  198 33.2  18:55.68 pidgin

So the issues of memory usage are perfect for another rant blog post, but the point is that pidgin is completely locked. I can't even kill -9 it (I tried multiple times).

I had to Ctrl+Alt+F1, login as root, and `init 3`

At this point I was so frustrated that I uninstalled each and every package with pulseaudio in the name before going back to X.

Low and behold, after that my audio works flawlessly (well, the Pidgin IM audio events make some crackling which I don't recall happening before I reinstalled, but I can live with it - especially since other audio seems crisp).

So yea... the stuff that the Linux Hater says is all true. He even complained about audio under Linux a while back (couldn't remember which entry it was with a quick look over the entry names).

The point of this story is that PulseAudio appears to be a solution in search of a problem (quite common in the land of Linux afaict).

As I was bitching about this to a hacker friend of mine, he asked me "isn't PulseAudio for sound mixing or something? So multiple programs can all play audio at the same time?".

Nope, because ALSA seemed to do mixing just fine for me in the past.

A quick Google search for PulseAudio reveals that it is supposed to allow fancy stuff like redirecting sound to another host machine for playing.

...just as I suspected: a solution in search of a problem.

Seriously though, who the hell needs that feature on their desktop/laptop machines? No one, that's who. The only people who might need that are thin clients - so why install it by default? Why not just have the sysadmins for those thin-client networks set this up?

PulseAudio is a clusterfuck for basic desktop audio needs afaict.

Sigh.

Update: Apparently I offended the PulseAudio developer with my "it's a solution in search of a problem" statement. For that, I apologize. A number of people have explained what exactly it is supposed to be solving. Unfortunately, none of the problems it is solving seem to be problems I need solved (maybe my sound cards in all my machines have hardware mixing support? I don't know. Or maybe dmix is plenty good enough for me).

However, I still get the feeling that PA is not quite ready for wide desktop adoption because people on other distros are having some of the same problems I had and this is not good.

I'd expect a similar rant from people if Evolution's current IMAP provider was replaced with my IMAP4 provider. As nice as my IMAP4 provider replacement is for my usage, there are no doubt new problems that it might currently have over the old implementation that would break things for some people. (Hence why I have not pushed for it to replace the old IMAP provider until I'm sure it is ready.)

Also, this was tagged as a rant. I had every right to be frustrated because things weren't working like they have worked Just Fine(tm) for me for years. I've had to put up with far worse personal attacks on myself and software I've been involved with over the years than anything I said on this blog (and lets face it, this was not a personal attack aimed at the PA devs - it was just me venting my frustration with a piece of software), so I think some people took this post way too personally.

A Moment of Zen