Code Snippet Licensing

All code posted to this blog is licensed under the MIT/X11 license unless otherwise stated in the post itself.

Saturday, May 17, 2008

Debian Language Benchmarks - SumFile

Since someone brought it up the other day on IRC, I figured I'd take a look at some of the Java vs C# benchmarks where Java was claimed to be faster than C#.

Scrolling through the list, there were 3 tests where Java was supposedly more than 2x faster than C# on Mono.

The SumFile test looked fairly simple and I figured the bottleneck there might be related to something silly being done with StandardInput (Miguel replaced the Int.Parse() routines with my more optimal implementations a while back, so I figured that was probably ok).

The first thing I tried to do was reproduce their findings, so I grabbed the Java and C# sources from the Debian site and compiled them.

Both programs read from StandardInput an integer value per line, so I wrote a simple program to generate some input for the test:


#include <stdio.h>
#include <stdlib.h>

int main (int argc, char **argv)
{
unsigned int max, i;
int sum = 0;

max = atoi (argv[1]);

for (i = 0; i <= max; i++) {
printf ("%u\n", i);
sum += i;
}

fprintf (stderr, "master value: %d\n", sum);

return 0;
}


As you can see above, the program outputs the real sum of all the digits it is outputting to stderr so that we can compare the results of the C# and Java programs to make sure they are correct for 32bit signed integers.

Then I compiled the Java and C# versions using the same compile options mentioned on the Debian Language Shootout pages for each language:

javac -classpath '.' sumcol.java

and

gmcs sumcol.cs


I then ran each program using the following commands (using the same java and mono command-line options):


[fejj@moonlight benchmarks]$ time ./output 5000000 | java -server -Xbatch sumcol
master value: 1647668640
2147483647

real 0m4.606s
user 0m5.264s
sys 0m0.276s
[fejj@moonlight benchmarks]$ time ./output 5000000 | mono sumcol.exe
master value: 1647668640
1647668640

real 0m1.415s
user 0m2.136s
sys 0m0.228s


As you can see above, there's no way that Java is 2.3x faster than Mono... so I'm not sure where they are getting these results from. Not only that, but Java is getting the wrong answer!

My best guess is that Java does not allow integer wrapping, so I suppose that that is okay... but it's a bit odd.

For comparison's sake, here's what I get with the c implementation of sumfile:


[fejj@moonlight benchmarks]$ time ./output 5000000 | ./sumcol
master value: 1647668640
1647668640

real 0m1.043s
user 0m1.604s
sys 0m0.188s


Thinking that Java might be taking a performance hit from the bounds checking, I changed my original number-generating program to always print '1' on each line:


#include <stdio.h>
#include <stdlib.h>

int main (int argc, char **argv)
{
unsigned int i, max;
int sum = 0;

max = atoi (argv[1]);

for (i = 0; i < max; i++) {
printf ("1\n");
sum++;
}

fprintf (stderr, "master value: %d\n", sum);

return 0;
}


Running the C, Mono, and Java versions again, I get the following results:


[fejj@moonlight benchmarks]$ time ./output 5000000 | ./sumcol
master value: 5000000
5000000

real 0m0.601s
user 0m0.828s
sys 0m0.032s
[fejj@moonlight benchmarks]$ time ./output 5000000 | mono sumcol.exe
master value: 5000000
5000000

real 0m0.774s
user 0m0.876s
sys 0m0.064s
[fejj@moonlight benchmarks]$ time ./output 5000000 | java -server -Xbatch sumcol
master value: 5000000
5000000

real 0m0.751s
user 0m0.824s
sys 0m0.096s


Here we can see that Java is slightly faster than Mono when comparing the 'real' and 'user' times.

I figured that in order to get a more accurate reading, I should re-run using a larger value, so...


[fejj@moonlight benchmarks]$ time ./output 21474836 | java sumcol
master value: 21474836
21474836

real 0m2.625s
user 0m3.236s
sys 0m0.304s
[fejj@moonlight benchmarks]$ time ./output 21474836 | mono sumcol.exe
master value: 21474836
21474836

real 0m3.225s
user 0m3.712s
sys 0m0.272s


Again, Java is faster according to 'real' and 'user' times.

What have we learned?

The Java program appears to be slightly faster if and only if we do not overflow the signed integer, otherwise Mono takes the prize.

If we revert back to using the original program I wrote to generate input and re-run using our most recent 'max' value, we find:


[fejj@moonlight benchmarks]$ time ./output 21474836 | java sumcol
master value: 370655678
2147483647

real 0m19.157s
user 0m22.757s
sys 0m0.568s
[fejj@moonlight benchmarks]$ time ./output 21474836 | mono sumcol.exe
master value: 370655678
370655678

real 0m6.004s
user 0m9.513s
sys 0m0.940s


Here we see that Mono clearly outperforms Java, no matter how you slice it.

Conclusion?

You can't compare the performance of the Java and C# programs without first understanding the implications of the input provided to them.

Had the Debian Language Shootout used a different input, the performance tests might have instead shown Java getting raped by Mono for this particular test. Basically, Java just got lucky that the inputs were in their favor.

Update: Oops, I forgot to give specifics of my Mono/Java versions and the specs of my laptop:

Mono: JIT compiler version 1.9 (/trunk/ r103228)

java -version gives the following output:

java version "1.6.0_06"
Java(TM) SE Runtime Environment (build 1.6.0_06-b02)
Java HotSpot(TM) Server VM (build 10.0-b22, mixed mode)

The hardware used was my Lenovo T61 Core2Duo @ 2.4GHz

This was running on OpenSuSE 10.3 with the latest updates.

Update 2: I was pointed to http://shootout.alioth.debian.org/download/sumcol-input.txt to use as input for this SumFile test. Below I've pasted the results:


[fejj@moonlight benchmarks]$ time java -server -Xbatch sumcol < ~/sumcol-input.txt
500

real 0m0.137s
user 0m0.052s
sys 0m0.028s
[fejj@moonlight benchmarks]$ time mono sumcol.exe < ~/sumcol-input.txt
500

real 0m0.078s
user 0m0.060s
sys 0m0.012s


Sorry guys, but even with this input, Java is not 2x faster than Mono.

Update 3: Since the above comparisons were all done using the StreamTokenizer version of the Java implementation, I figured I should go back and get the run times of the original Java implementation.

Using my original implementation of output.c, each line having a value 1 larger than the previous line, we get:


[fejj@moonlight benchmarks]$ time ./output 5000000 | java -server -Xbatch sumcol
master value: 1647668640
1647668640

real 0m1.492s
user 0m2.272s
sys 0m0.168s


Interesting note: this Java implementation does not get the wrong answer...

The results for "1" on ever line gives the following results:


[fejj@moonlight benchmarks]$ time ./output 5000000 | java -server -Xbatch sumcol
master value: 5000000
5000000

real 0m0.924s
user 0m0.984s
sys 0m0.176s


Repeating the trial run that totally killed Java's performance with the Tokenizer implementation, we get:


[fejj@moonlight benchmarks]$ time ./output 21474836 | java -server -Xbatch sumcol
master value: 370655678
370655678

real 0m6.564s
user 0m9.893s
sys 0m0.688s


And finally, the results for the sumcol-input.txt, we get:


[fejj@moonlight benchmarks]$ time java -server -Xbatch sumcol < ~/sumcol-input.txt
500

real 0m0.107s
user 0m0.056s
sys 0m0.032s


Conclusion:

The original Java implementation gets much better results than the StreamTokenizer implementation which was supposedly faster according to the Benchmarks table, however, it's still not 2x faster than the Mono implementation (in fact, this version of the Java implementation is roughly on par with the C# implementation).

Update 4: One of the comments to this blog post requested I try the Java6 SumCol #3 program which includes its own buffering/tokenizer implementation.

Since this program "cheats" a bit, I figured I'd pit it against my own implementation of equal craftyness ;-)

You can find my C# implementation here.

The results are as follows:


[fejj@moonlight benchmarks]$ time ./output 21474836 | mono sumcol2.exe
master value: 370655678
370655678

real 0m4.996s
user 0m4.904s
sys 0m0.676s
[fejj@moonlight benchmarks]$ time ./output 21474836 | java -server -Xbatch sumcol3
master value: 370655678
370655678

real 0m5.117s
user 0m5.748s
sys 0m1.088s


Am I a crafty bugger or what? ;-)

Running these implementations against sumcol-input.txt, I get the following results:


[fejj@moonlight benchmarks]$ time mono sumcol2.exe < ~/sumcol-input.txt
500

real 0m0.076s
user 0m0.064s
sys 0m0.016s
[fejj@moonlight benchmarks]$ time java -server -Xbatch sumcol3 < ~/sumcol-input.txt
500

real 0m0.108s
user 0m0.064s
sys 0m0.040s


Damn I'm good! ;-)

Update 5: It's been suggested that the -server and -Xbatch args might make Java slower, so I've redone the tests (this time using sumcol-input.txt repeated 100,000 times)

Here are the results:


[fejj@moonlight benchmarks]$ time java -server sumcol < sumcol-input100000.txt
50000000

real 0m19.025s
user 0m18.657s
sys 0m0.412s
[fejj@moonlight benchmarks]$ time java sumcol < sumcol-input100000.txt 50000000

real 0m20.643s
user 0m20.269s
sys 0m0.364s
[fejj@moonlight benchmarks]$ time mono sumcol.exe < sumcol-input100000.txt
50000000

real 0m18.884s
user 0m18.245s
sys 0m0.604s


In the case of Java, it looks like using the -server flag performs a little bit better on this particular test than without.

Overall, the Java and Mono performance for this test are remarkably close.

In the end, I think I can feel pretty confidant that Mono's performance on this test is acceptable and it's not worth wasting any more time on it for now.

Update 6: Okay, one last update to show that a fully managed C# implementation can outperform the fastest C implementation submitted to the Debian Language Benchmarks:


[fejj@moonlight benchmarks]$ time mono sumcol2.exe < sumcol-input100000.txt
50000000

real 0m2.906s
user 0m2.572s
sys 0m0.308s
[fejj@moonlight benchmarks]$ time ./sumcol < sumcol-input100000.txt
50000000

real 0m13.967s
user 0m13.277s
sys 0m0.668s


Yes, ladies and gentlemen, that is C# wiping the floor with C.

Saturday, May 10, 2008

Wouldn't it be nice if...

Over the past week, I've been spending some time hacking on Evolution again because of my frustration with the current IMAP backend. This got me to wondering... why hasn't anyone stepped up to the plate and rewritten Evolution's IMAP code yet?

I think the reason can be summed up with the following 2 problems:

1. IMAP is hard
2. Coding something complicated like a multithreaded multiplexed IMAP client library in C is harder.

That got me and Michael Hutchinson wondering... wouldn't it be nice if we could write Camel provider plugins in C# or in any other managed language that we might prefer?

I think Camel, like Evolution itself, should allow developers to implement plugins in C# as well. I really think this might help lessen the burden of implementing new mail protocol backends for Camel/Evolution.

On that note, I've created a new svn module called camel-imap4 which can be built against your installed evolution-data-server devel packages.

Currently, however, it'll probably only work with e-d-s >= 2.23.x because some things (and assumptions) in Camel have changed recently.

One problem I'm having is that the symbol camel_folder_info_new() used to not exist in older versions of e-d-s, but recently that symbol was added and makes use of g_slice_alloc0(). The problem is that the way providers used to allocate CamelFolderInfo structures before was using g_new0() themselves. Why does this pose a problem? There's no guarantee that I'm aware of that you can mix and match g_malloc/g_slice_free or g_slice_alloc/g_free.

This makes it difficult for me to implement a plugin that builds and works with my installed version of Evolution (2.12) and also work with Evolution svn (2.23). This is quite unfortunate :(

While I'm at it, allow me to also propose some changes to the GChecksum API. Please, please, please make it so that we ned not allocate/free a GChecksum variable each time we need to checksum something?

I propose the ability to do the following:

GChecksum checksum;

g_checksum_init (&checksum, G_CHECKSUM_MD5);
g_checksum_update (&checksum, data, len);
g_checksum_get_digest (&checksum, digest, &len);

Then I'd like to be able to either call g_checksum_init() on checksum again or maybe have another function to clear state, maybe g_checksum_clear() which would allow me to once again use the same checksum variable for calculating the md5 of some other chunk of data.

Camel generates md5sums for a lot of data, sometimes in loops. Having to alloc/free every iteration is inefficient and tedious.

Update: It now builds and works with Evolution 2.12 (I haven't tested anything else). But the new and improved IMAP back-end for Evolution is now actually working. Whoohoo!

Monday, April 21, 2008

Interesting...

Marc Maurer makes an interesting observation about OOXML vs ODF. Aparently, not a single GSoC applicant proposed to improve upon the ODF implementation in AbiWord, while they had numerous applicants that wanted to work on OOXML support.

I wonder what the ODF vs OOXML proposal stats are for some of the other free software office applications.

Friday, March 21, 2008

Moonlight Text Rendering

For those who haven't been following Moonlight development, I'm the guy that has been implementing all the text layout and rendering.

Let me start off by saying that text is hard. Really hard. Especially layout.

Font Hinting

The aspect of text rendering that I've been thinking a lot about lately is hinting. For an example of the difference that hinting makes with text rendering, see the following screenshot:



In the screenshot above, I've posted side-by-side text renderings produced by both Moonlight and Pango (courtesy of Michael Dominic K.) for Arial, Verdana and Times New Roman fonts. You'll notice that especially at the smaller font sizes, Moonlight's text rendering gets a bit fuzzy and hard to read. On the contrary, Pango's rendering, using hinting, has improved readability over Moonlight.

Why don't I just use hinting, you ask?

The problem is that hinting changes the font metrics in such a way that as you scale to larger sizes, the metrics do not scale uniformly (you'll notice that as the font sizes increase, the width of the line gets uniformly wider using Moonlight, but not so with Pango). This causes a bit of a problem for Moonlight (and Silverlight) because:


  • with hinting enabled, scaling the font size in an animation would look jumpy

  • this is made even worse when you consider line wrapping and how changing font metrics would mean that line wrapping would also change with the changing font size


In most applications, the text in each UI element is unlikely to change size (unless the user manually changes the font settings), and so enabling the use of hinting has no adverse side effects. Silverlight (and thus Moonlight), however, are meant to do all sorts of animations which may apply matrix transforms (such as scaling) to any item in the canvas (even text!).

Since scaling has to look smooth, it's hard to use hinting because of the way it changes the metrics.

Hopefully I can find a way to improve rendering quality at the smaller sizes without getting the side effects I mentioned above, or at least have them be far less noticeable.

Line Breaking

While not thinking about the render quality of the text, the bane of my existence has been implementing layout algorithms for the 3 different TextWrapping modes used in Silverlight (Wrap, NoWrap, and WrapWithOverflow).

I've mostly been focusing on TextWrapping=Wrap lately as it's probably the hardest one to implement, especially if I want to emulate the Silverlight wrapping behavior (from what I can gather by reading the Pango source code, there are ambiguities in the Unicode specification for line-breaking, so I've been having to familiarize myself with the rules for line-breaking lately).

Going Forward

In the future, I hope to replace my text layout/rendering code with the use of Pango but in order for that to happen, Pango will need a few new features:

  • The ability to specify fonts outside of the configured font directories (e.g. a way to bypass FontConfig). From what I understand, this is already happening so I expect to be able to cross this off the list soon ;-)

  • The ability to mark the beginning/end of a sub-run of text. Silverlight's <TextBlock> element supports child <Run> elements. This normally wouldn't be a big deal, but Silverlight uses the joint between two runs as a possible break opportunity. For example, <Run Text="abcdefg"/><Run Text="hijklmn"/> might line-wrap differently than <Run Text="abcdefghijklmn"/> even though there are no spaces between the two runs. Currently, when rendering a string of text using Pango, you combine all runs into a single string to pass to a PangoLayout, so the layout engine doesn't have a way of knowing that a position mid-word should be treated as a break opportunity.

  • Pango will need a way to specify alternative text wrapping modes (so that it would be possible to emulate Silverlight's TextWrapping behavior). Apparently Owen and Behdad have already discussed the desire to have this sort of feature, so I'm looking forward to working with them on designing a suitable API.

  • A way to specify that font metrics should scale uniformly as the font size changes. Update: Behdad has notified me that this feature is actively being worked on and that I can enable it via CAIRO_HINT_METRICS_OFF.

  • The last thing that I can think of at the moment that Pango will need is a way to reset the Foreground brush for each run and each time each of those runs line-breaks. For example, the following screenshot consists of a single TextBlock with two Run elements which share their parent's font and foreground brush properties. This means that the prepare_run() virtual method for the subclassed PangoRenderer will need to know the extents of the next sub-run of text to be rendered so that it can properly setup the custom foreground brush.





In the meantime, having to write my own layout/rendering engine for text has taught me a lot more about text and fonts than I ever wanted to know and I'm continuing to learn more every day.

Update: With all of these custom changes that I'll need in order to move to Pango, I'm beginning to wonder if it's really worth it? The only thing that it could get me is the Unicode line-breaking logic, but the thing is... if I have to implement my own pluggable TextWrapping modes, then wouldn't I still essentially be implementing my own line-breaking logic? This makes me very sad :(

Tuesday, February 26, 2008

Lots of GNOME/Mono FUD Lately

It would seem that lately there are a lot of FUD-spreading trolls crawling out of the woodwork trying to frighten people into thinking that GNOME somehow depends on Mono.

Let's take a look at their most widely repeated claims:

GNOME depends on Mono

This is simply untrue... to see for yourself, try removing Mono from your Linux system using yum, zypper, or apt (or whatever) and you will plainly see that it does not remove GNOME - it may remove Tomboy, F-Spot, Banshee and/or Beagle (if you have any of them installed), but it will not remove any core components of your GNOME system.

Now that we've proven that GNOME does not depend on Mono, let's move on to their next claim:

GNOME depends on libbeagle, a Mono program

Again, false. GNOME's help browser has an optional dependency on a C-library called libbeagle which can be used to make search queries via IPC (Inter-Process-Communication) to the Beagle daemon if and only if Beagle is installed. I.E., Yelp does not depend on Beagle, it is simply able to take advantage of Beagle if it is available. As far as I'm aware, there are plans to replace libbeagle with a more generic library able to communicate with either Beagle or Tracker via IPC, depending on which one the user has installed.

NDesk-DBus is replacing DBus in GNOME

There was a big stink about this a while ago by a very angry person who didn't understand how libraries or software development in general works.

GNOME continues to depend on the C-implementation of DBus (called libdbus) and that is unlikely to change unless DBus itself (the technology, not the C-library) gets replaced.

NDesk-DBus, written by the famous Alp Toker, is a replacement for the current DBus-Sharp .NET binding around libdbus, the C-implementation. The main difference between NDesk-DBus and DBus-Sharp is that DBus-Sharp wraps libdbus while NDesk-DBus is a fully managed implementation of the DBus wire protocol.

GNOME not only does not, but it cannot, replace DBus with NDesk-DBus - any half-way competent programmer would know and understand why this is so. Native C programs cannot easily call into managed code.

Someday soon it will be practically impossible to write any app for GNOME without being forced to use MONO

Considering that in order for this to happen, core GNOME libraries would have to be rewritten in a .NET language, this is unlikely to ever happen, never mind anytime soon.

As you can plainly see, these types of claims are being made by people who do not understand the most basic concepts of software development.

GNOME is riddled with Mono applications

If by "riddled" they mean "not", then yes. ;-)

There is currently only one official GNOME application that uses Mono and that application is Tomboy. Tomboy, as useful a tool as it is, is not even arguably a core component of the GNOME desktop in that removing it will not somehow make your GNOME desktop useless.

Novell is forcing GNOME to include Mono

Ask any core GNOME developer and he or she will tell you that this is absurd. To the best of my knowledge, Novell has not even once requested that any of its Mono-based applications be accepted as core GNOME applications, never mind forced one to be accepted. I should also note that Tomboy, the only Mono application in GNOME currently, was not a Novell project at the time it as accepted for inclusion in GNOME.


This post has been brought to you by the letters G and M and the number 2.

Saturday, February 23, 2008

Speaking of Hack Week Projects...

I didn't really have any ideas for Hack Week this year, so I had started off working with Paolo and Zoltan on optimizing Mono's RegEx engine. Unfortunately 1 week wasn't enough to accomplish what I had hoped to accomplish - luckily, Paolo and Zoltan were able to improve Mono's RegEx performance significantly without the need for the hacks I had planned to implement (and in fact did some a terrific job, I'm not sure my ideas would have made much of a difference).

That said, I'd like to jot down a few ideas I just had for the next Hack Week:


  • Implement an "Evolution Plugin" project type for MonoDevelop which would setup everything you need to start writing plugins for Evolution in C#. This would include things like templating the E-Plug xml files, pulling in appropriate Evolution# bindings and the like.

  • Implement an Evolution Plugin in C# that would filter spam messages based on charset information. Most of the spam I get seems to use a russian or asian charset, so a filter like this would be extremely valuable to me.


My biggest interest is in templating out an Evolution Plugin project type for MonoDevelop because I'd really like to see more outside developers writing plugins for Evolution and I think that this would be a great way to lower the bar, so to speak.

Wednesday, February 6, 2008

Optimizing GMime's UUEncoder

This past weekend I was talking with Andreia about how Pan is built on top of GMime and takes advantage of my awesomely speedy uuencode/uudecode routines which reminded me that I had done some performance comparisons of GMime's uuencode program vs. the one in the GNU sharutils package a number of years ago.

I had compared GMime 1.90.0 (which was a pre-release of GMime 2.0) and GNU sharutils 4.2.0 and the results were pretty impressive... GMime's uuencoder was on the order of 3 times faster than the one in sharutils and produced exactly the same results.

The uudecoder and the base64 encoder/decoder were all roughly on the order of 7 times faster than those in GNU sharutils, so all around GMime outperformed GNU sharutils by quite a bit.

Anyways, re-reading my test results got me thinking that my uuencode routines could probably be optimized a bit more as they were lagging a bit behind the base64 encoder routine and there's really no reason it should be that far off.

Well, tonight I finally got off my butt and decided to take a look and figure out why. Upon scrolling down to my uuencode_step() routine, I immediately saw why:

Each loop would collect up to 3 bytes from the input and bit shift them into a 32bit 'saved' variable (which is a state variable used for incremental uuencoding an input stream). Then, if I had successfully extracted 3 bytes from the input, I would extract them out of 'saved' into 3 unsigned char variables. At this point I would then encode them into a temporary output buffer. When this output buffer ('uubuf') grew to 60 bytes, I'd flush it to the real output buffer with a memcpy().

All of this extra copying of data around adds up after a while and really begins to impact performance.

Before making any changes, I timed how long it took the original version of my uuencode_step() function to encode linux-2.6.24.tar.gz on my system[1]. An average result over numerous runs was as follows:


[fejj@localhost ~]$ time `gmime-uuencode linux-2.6.24.tar.gz linux-2.6.24.tar.gz > /dev/null`
real 0m0.470s
user 0m0.412s
sys 0m0.052s


After my rewrite, my new results were closer to:

[fejj@localhost ~]$ time `gmime-uuencode linux-2.6.24.tar.gz linux-2.6.24.tar.gz > /dev/null`
real 0m0.291s
user 0m0.252s
sys 0m0.024s


For the sake of comparison, the best time I could manage to get from GNU sharutils 4.6.2 was as follows:

[fejj@localhost ~]$ time `uuencode linux-2.6.24.tar.gz linux-2.6.24.tar.gz > /dev/null`
real 0m1.386s
user 0m1.276s
sys 0m0.092s


The new implementation of uuencode_step() in gmime/gmime-utils.c has been committed to the gmime svn module on GNOME's subversion server, revision 1216 - this change should appear in the next release of GMime which will likely be 2.2.17.


Notes:

1. The system I tested this on was my Lenovo T61 laptop w/ a 7200 RPM harddrive running OpenSuSE 10.3 with various updates. The kernel was version 2.6.22.13-0.3-bigsmp.

From /proc/cpuinfo:

model name : Intel(R) Core(TM)2 Duo CPU T7700 @ 2.40GHz
cpu MHz : 800.000

(e.g. my cpu was scaled down at the time of testing)

2. The GMime uuencode implementation uses a GMimeStreamFs for input as well as output. This stream class is a wrapper around the POSIX I/O system functions which unfortunately has a sub-optimal need to perform an lseek() before each read() or write() call in order to make sure that the underlying file descriptor is in the expected position. This is necessary because it is possible for multiple streams to re-use the same fd.

I mention this because an obvious rebuttal to GMime's superior performance might be to suspect that GMime's uuencode implementation "cheated" by using an mmap()'d input buffer where the GNU sharutils implementation might not.