Saturday, May 17, 2008

Debian Language Benchmarks - SumFile

Since someone brought it up the other day on IRC, I figured I'd take a look at some of the Java vs C# benchmarks where Java was claimed to be faster than C#.

Scrolling through the list, there were 3 tests where Java was supposedly more than 2x faster than C# on Mono.

The SumFile test looked fairly simple, and I figured the bottleneck there might be related to something silly being done with StandardInput (Miguel replaced the Int.Parse() routines with my more optimal implementations a while back, so I figured that part was probably okay).

The first thing I tried to do was reproduce their findings, so I grabbed the Java and C# sources from the Debian site and compiled them.

Both programs read an integer value per line from standard input, so I wrote a simple program to generate some input for the test:

#include <stdio.h>
#include <stdlib.h>

int main (int argc, char **argv)
{
    unsigned int max, i;
    int sum = 0;

    max = atoi (argv[1]);

    for (i = 0; i <= max; i++) {
        printf ("%u\n", i);
        sum += i;
    }

    fprintf (stderr, "master value: %d\n", sum);

    return 0;
}

As you can see above, the program writes the true sum of all the values it outputs to stderr, so that we can check whether the results of the C# and Java programs are correct for 32-bit signed integer arithmetic.
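
For reference, the shootout's C# entry isn't reproduced here, but it boils down to reading one line at a time from standard input and summing the parsed values. A minimal sketch of that approach (not the exact submitted sumcol.cs) looks something like this:

using System;

class SumCol {
    static void Main ()
    {
        string line;
        int sum = 0;

        // Read one integer per line from stdin and accumulate the total.
        while ((line = Console.In.ReadLine ()) != null)
            sum += int.Parse (line);

        Console.WriteLine (sum);
    }
}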

Then I compiled the Java and C# versions using the same compile options mentioned on the Debian Language Shootout pages for each language:

javac -classpath '.' sumcol.java

and

gmcs sumcol.cs

I then ran each program using the following commands (using the same java and mono command-line options):

[fejj@moonlight benchmarks]$ time ./output 5000000 | java -server -Xbatch sumcol
master value: 1647668640
2147483647

real    0m4.606s
user    0m5.264s
sys     0m0.276s
[fejj@moonlight benchmarks]$ time ./output 5000000 | mono sumcol.exe
master value: 1647668640
1647668640

real    0m1.415s
user    0m2.136s
sys     0m0.228s

As you can see above, there's no way that Java is 2.3x faster than Mono... so I'm not sure where they are getting these results from. Not only that, but Java is getting the wrong answer!

My best guess is that Java does not allow integer wrapping, so I suppose that's okay... but it's a bit odd.

For comparison's sake, here's what I get with the C implementation of sumfile:

[fejj@moonlight benchmarks]$ time ./output 5000000 | ./sumcol
master value: 1647668640
1647668640

real    0m1.043s
user    0m1.604s
sys     0m0.188s

Thinking that Java might be taking a performance hit from that overflow checking, I changed my original number-generating program to always print '1' on each line:

#include <stdio.h>
#include <stdlib.h>

int main (int argc, char **argv)
{
    unsigned int i, max;
    int sum = 0;

    max = atoi (argv[1]);

    for (i = 0; i < max; i++) {
        printf ("1\n");
        sum++;
    }

    fprintf (stderr, "master value: %d\n", sum);

    return 0;
}

Running the C, Mono, and Java versions again, I get the following results:

[fejj@moonlight benchmarks]$ time ./output 5000000 | ./sumcol
master value: 5000000
5000000

real    0m0.601s
user    0m0.828s
sys     0m0.032s
[fejj@moonlight benchmarks]$ time ./output 5000000 | mono sumcol.exe
master value: 5000000
5000000

real    0m0.774s
user    0m0.876s
sys     0m0.064s
[fejj@moonlight benchmarks]$ time ./output 5000000 | java -server -Xbatch sumcol
master value: 5000000
5000000

real    0m0.751s
user    0m0.824s
sys     0m0.096s

Here we can see that Java is slightly faster than Mono when comparing the 'real' and 'user' times.

I figured that in order to get a more accurate reading, I should re-run using a larger value, so...

[fejj@moonlight benchmarks]$ time ./output 21474836 | java sumcol
master value: 21474836
21474836

real    0m2.625s
user    0m3.236s
sys     0m0.304s
[fejj@moonlight benchmarks]$ time ./output 21474836 | mono sumcol.exe 
master value: 21474836
21474836

real    0m3.225s
user    0m3.712s
sys     0m0.272s

Again, Java is faster according to 'real' and 'user' times.

What have we learned?

The Java program appears to be slightly faster if and only if we do not overflow the signed integer, otherwise Mono takes the prize.

If we revert back to using the original program I wrote to generate input and re-run using our most recent 'max' value, we find:

[fejj@moonlight benchmarks]$ time ./output 21474836 | java sumcol
master value: 370655678
2147483647

real    0m19.157s
user    0m22.757s
sys     0m0.568s
[fejj@moonlight benchmarks]$ time ./output 21474836 | mono sumcol.exe
master value: 370655678
370655678

real    0m6.004s
user    0m9.513s
sys     0m0.940s

Here we see that Mono clearly outperforms Java, no matter how you slice it.

Conclusion?

You can't compare the performance of the Java and C# programs without first understanding the implications of the input provided to them.

Had the Debian Language Shootout used a different input, the performance tests might instead have shown Java being soundly beaten by Mono on this particular test. Basically, Java just got lucky that the input was in its favor.

Update: Oops, I forgot to give specifics of my Mono/Java versions and the specs of my laptop:

Mono: JIT compiler version 1.9 (/trunk/ r103228)

java -version gives the following output:

java version "1.6.0_06"
Java(TM) SE Runtime Environment (build 1.6.0_06-b02)
Java HotSpot(TM) Server VM (build 10.0-b22, mixed mode)

The hardware used was my Lenovo T61 with a Core 2 Duo @ 2.4GHz.

This was running on OpenSuSE 10.3 with the latest updates.

Update 2: I was pointed to http://shootout.alioth.debian.org/download/sumcol-input.txt to use as input for this SumFile test. Below I've pasted the results:

[fejj@moonlight benchmarks]$ time java -server -Xbatch sumcol < ~/sumcol-input.txt 
500

real    0m0.137s
user    0m0.052s
sys     0m0.028s
[fejj@moonlight benchmarks]$ time mono sumcol.exe < ~/sumcol-input.txt 
500

real    0m0.078s
user    0m0.060s
sys     0m0.012s

Sorry guys, but even with this input, Java is not 2x faster than Mono.

Update 3: Since the above comparisons were all done using the StreamTokenizer version of the Java implementation, I figured I should go back and get the run times of the original Java implementation.

Using my original implementation of output.c, where each line's value is 1 larger than the previous line's, we get:

[fejj@moonlight benchmarks]$ time ./output 5000000 | java -server -Xbatch sumcol
master value: 1647668640
1647668640

real    0m1.492s
user    0m2.272s
sys     0m0.168s

Interesting note: this Java implementation does not get the wrong answer...

Running with "1" on every line gives the following results:

[fejj@moonlight benchmarks]$ time ./output 5000000 | java -server -Xbatch sumcol
master value: 5000000
5000000

real    0m0.924s
user    0m0.984s
sys     0m0.176s

Repeating the trial run that totally killed Java's performance with the Tokenizer implementation, we get:

[fejj@moonlight benchmarks]$ time ./output 21474836 | java -server -Xbatch sumcol
master value: 370655678
370655678

real    0m6.564s
user    0m9.893s
sys     0m0.688s

And finally, for sumcol-input.txt, we get:

[fejj@moonlight benchmarks]$ time java -server -Xbatch sumcol < ~/sumcol-input.txt 
500

real    0m0.107s
user    0m0.056s
sys     0m0.032s

Conclusion:

The original Java implementation gets much better results than the StreamTokenizer implementation, which was supposedly faster according to the Benchmarks table. However, it's still not 2x faster than the Mono implementation (in fact, this version of the Java implementation is roughly on par with the C# implementation).

Update 4: One of the comments to this blog post requested I try the Java6 SumCol #3 program which includes its own buffering/tokenizer implementation.

Since this program "cheats" a bit, I figured I'd pit it against my own implementation of equal craftiness ;-)

You can find my C# implementation here.
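
The link above has the full source; roughly speaking, the "cheat" is to skip the line reader and Int32.Parse entirely, reading raw bytes from stdin and accumulating digits by hand. Here's a sketch of that general approach (not the exact sumcol2.cs, and the class name is just for illustration):

using System;
using System.IO;

class SumCol2 {
    static void Main ()
    {
        byte[] buffer = new byte[4096];
        Stream stdin = Console.OpenStandardInput ();
        int sum = 0, value = 0, n;
        bool negative = false;

        // Read raw bytes and parse the integers ourselves, one digit at a time.
        // (Assumes the input ends with a newline, as the generated files do.)
        while ((n = stdin.Read (buffer, 0, buffer.Length)) > 0) {
            for (int i = 0; i < n; i++) {
                byte c = buffer[i];

                if (c >= (byte) '0' && c <= (byte) '9') {
                    value = (value * 10) + (c - (byte) '0');
                } else if (c == (byte) '-') {
                    negative = true;
                } else if (c == (byte) '\n') {
                    sum += negative ? -value : value;
                    negative = false;
                    value = 0;
                }
            }
        }

        Console.WriteLine (sum);
    }
}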

The results are as follows:

[fejj@moonlight benchmarks]$ time ./output 21474836 | mono sumcol2.exe
master value: 370655678
370655678

real    0m4.996s
user    0m4.904s
sys     0m0.676s
[fejj@moonlight benchmarks]$ time ./output 21474836 | java -server -Xbatch sumcol3
master value: 370655678
370655678

real    0m5.117s
user    0m5.748s
sys     0m1.088s

Am I a crafty bugger or what? ;-)

Running these implementations against sumcol-input.txt, I get the following results:

[fejj@moonlight benchmarks]$ time mono sumcol2.exe < ~/sumcol-input.txt 
500

real    0m0.076s
user    0m0.064s
sys     0m0.016s
[fejj@moonlight benchmarks]$ time java -server -Xbatch sumcol3 < ~/sumcol-input.txt 
500

real    0m0.108s
user    0m0.064s
sys     0m0.040s

Damn I'm good! ;-)

Update 5: It's been suggested that the -server and -Xbatch args might make Java slower, so I've redone the tests (this time using sumcol-input.txt repeated 100,000 times).
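
(For anyone wanting to reproduce this: the big file is just sumcol-input.txt concatenated over and over. A quick throwaway program along the following lines will generate it; the file names match the ones used in the runs below.)

using System;
using System.IO;

class RepeatInput {
    static void Main ()
    {
        // Concatenate sumcol-input.txt 100,000 times into one big test file.
        byte[] data = File.ReadAllBytes ("sumcol-input.txt");

        using (FileStream output = File.Create ("sumcol-input100000.txt")) {
            for (int i = 0; i < 100000; i++)
                output.Write (data, 0, data.Length);
        }
    }
}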

Here are the results:

[fejj@moonlight benchmarks]$ time java -server sumcol < sumcol-input100000.txt 
50000000

real    0m19.025s
user    0m18.657s
sys     0m0.412s
[fejj@moonlight benchmarks]$ time java sumcol < sumcol-input100000.txt
50000000

real    0m20.643s
user    0m20.269s
sys     0m0.364s
[fejj@moonlight benchmarks]$ time mono sumcol.exe < sumcol-input100000.txt 
50000000

real    0m18.884s
user    0m18.245s
sys     0m0.604s

In the case of Java, it looks like the -server flag performs a little better on this particular test than leaving it off.

Overall, the Java and Mono performance for this test are remarkably close.

In the end, I think I can feel pretty confident that Mono's performance on this test is acceptable and it's not worth wasting any more time on it for now.

Update 6: Okay, one last update to show that a fully managed C# implementation can outperform the fastest C implementation submitted to the Debian Language Benchmarks:

[fejj@moonlight benchmarks]$ time mono sumcol2.exe < sumcol-input100000.txt 
50000000

real    0m2.906s
user    0m2.572s
sys     0m0.308s
[fejj@moonlight benchmarks]$ time ./sumcol < sumcol-input100000.txt 
50000000

real    0m13.967s
user    0m13.277s
sys     0m0.668s

Yes, ladies and gentlemen, that is C# wiping the floor with C.

79 comments:

Barry Kelly said...

What versions, respectively, of the JVM and Mono were you using? And what was the machine spec (multicore? etc.)?

Jeffrey Stedfast said...

Sorry about that, I meant to provide that info but I got distracted.

I've updated my blog with that info.


Currently I'm playing with the Mandelbrot tests and Java seems to be doing a lot better than Mono with this one. I have, however, managed to improve the performance of the C# implementation a bit by using constants and pre-calculating some stuff (like the Java version does), but Java is still about 2x faster...

I just commented out the C# code that involves I/O and the time it takes for C# to complete didn't change much, so I guess I can rule out I/O performance as the bottleneck ;-)

Anonymous said...

>> What have we learned?

That synthetic benchmarks as simple as this are pretty much useless? That you have a vendetta against Java?

Aside from that. Nothing really.

Tristan Tarrant said...

>That you have a vendetta against Java?

Indeed that's the perception I get from all these Mono guys. I remember a blog post by Miguel on the same subject, and how surprised he was that the JVM was faster than Mono in certain areas. It isn't surprising: the JVM has been around forever.

Anonymous said...

> It isn't surprising: the JVM has been around forever.
Just because something is old doesn't necessarily mean it is better or more advanced. In basically all tests Java uses (on average!) twice as much memory as mono to solve the task. Why, if Java were better because of being older?
The point is that mono (and its contributors) are putting some resources into performance (memory and speed), and mono gets used in several places (like games) where this IS a requirement. Java positions itself more towards servers. It doesn't really matter if you waste a few MB/GB of RAM there; you just put in some more...

> That synthetic benchmarks as simple as this are pretty much useless?
While you have to be careful about what they tell you, they aren't useless. If you compare mono with java, the Regex benchmark instantly pops out, where mono loses big time on both memory and perf. The reason for that is simple: the mono regex impl DOES suck performance-wise.

Anonymous said...

>>>That you have a vendetta against Java?

Buh? Java supporters run around quoting these tests as definitive answers as to why Java is better than mono? How is it a 'vendetta against java' when you point out that the benchmark is wrong? Do you have a problem with Mono actually being faster in some circumstances?

The funny thing is, I'd nearly go as far as saying that *no* micro benchmark is useful. A tiny change in parameters (as you saw here) can result in a huge swing in performance. So what have we learned from this? Nothing really, except that in that particular test, mono can be quite a lot faster than java. Big whoop.

Anonymous said...

some bench here:
http://www.kano.net/javabench/

only c++ and java....

if somebody could write in c# to compare with them....

Anonymous said...

I get the correct answer for Java.
Are you using this version of java?

http://shootout.alioth.debian.org/debian/benchmark.php?test=sumcol&lang=java&id=2

Also, according to the benchmark rules, the file only has numbers in a column. Make sure you are using the right file that looks something like this:

http://shootout.alioth.debian.org/download/sumcol-input.txt

Anonymous said...

Also, try this version of java

http://shootout.alioth.debian.org/gp4/benchmark.php?test=sumcol&lang=java&id=3

What times do you get? This version, however, breaks the benchmark rule of reading only line by line.

Jeffrey Stedfast said...

I can assure you, I have absolutely no vendetta against Java.

As you may have seen in my second comment, I started checking some of the other tests, and the Java implementation of the Mandelbrot test, for example, is still a clear victor over the C# implementation. I even tried rewriting the C# Mandelbrot test to be an exact port of the Java implementation in case that made a difference, and still Java was 2x faster.

Basically, I started looking at these because I was curious as to why Java was faster than Mono for some of these tests.

I figured I'd run the benchmarks and do a little digging for some of the easier ones, see if there was some low-hanging fruit I could fix to result in better performance.

For Novell HackWeek back in February (I think?), Paolo and Zoltan did something similar to address the performance suckage of the Regex implementation and by the end of the week, they had an implementation that was faster than all of the languages other than Tcl. Hopefully they'll finish that up and fold it into trunk before long.

So in the end, I don't see how anyone could accuse me of having a vendetta against Java. My goal is simply to improve the state of Mono's performance.

Anonymous said...

Well, I'm interested in hearing how C# on Mono performs against this version of java

http://shootout.alioth.debian.org/gp4/benchmark.php?test=sumcol&lang=java&id=3

Jeffrey Stedfast said...

Anonymous: The link you gave me to the more optimal Java implementation turns out to be the implementation I tested (I didn't realize it was the "alternative" version).

So the numbers you see in the blog are using the Tokenizer version, not the LineBuffer version of the Java implementation.

I'll test out the LineBuffer version momentarily to see if perhaps it performs better on my system than the Tokenizer version and then report back my findings.

Anonymous said...

No, Tokenizer is the right version. The other link that I gave is much faster than Tokenizer version but it was rejected because the rules are that the program must use built-in line reader.

Jeffrey Stedfast said...

I've just posted the results of the line-reader Java implementation to the main blog post (no sense hiding it in the comments, eh? ;)

This implementation seemed to be a lot better.

I'll have to be careful about which implementations I use for comparisons in the future - I missed that I was grabbing "Java6 -server #2" from the table yesterday.

Jeffrey Stedfast said...

Anonymous: ah. The link didn't work, it sent me to some other test, so I had navigated back to the table and noticed that I had tested the StreamTokenizer version instead of the line-reader version, so had assumed you wanted to point me at the StreamTokenizer version (since your post mentioned that it breaks the rules of line-reading).

Did you mean to point me at the "Java 6 -server #3" version?

I'll test that out in a sec.

Anonymous said...

Yes, this version

http://shootout.alioth.debian.org/gp4/benchmark.php?test=sumcol&lang=java&id=3

Can you try that one too :)

Jeffrey Stedfast said...

Here are the results for the 2 more interesting inputs on my machine for Java6's SumCol #3:

[fejj@moonlight benchmarks]$ time java -server -Xbatch sumcol3 < ~/sumcol-input.txt
500

real 0m0.142s
user 0m0.048s
sys 0m0.036s
[fejj@moonlight benchmarks]$ time ./output 21474836 | java -server -Xbatch sumcol3
master value: 370655678
370655678

real 0m4.814s
user 0m5.648s
sys 0m0.812s


Still slower than Mono for sumcol-input.txt, but noticeably faster for the really large input file.

It makes sense that this is faster, though, because it "cheats" by reducing the number of passes over the input data. I'm sure a similar C# implementation would show similar results.

Anonymous said...

Weird that, like on the site, the Tokenizer version is faster on my comp than the BufferedReader one.

Perhaps if you change this line:

StreamTokenizer lineTokenizer = new StreamTokenizer(System.in);

to this:

StreamTokenizer lineTokenizer = new StreamTokenizer(new BufferedInputStream(System.in));

that will make the difference on your system. You will also need the import

import java.io.BufferedInputStream;

Jeffrey Stedfast said...

I just tried your suggestion, but it didn't seem to help the Java6 SumCol #2 implementation performance at all:

[fejj@moonlight benchmarks]$ time ./output 21474836 | java -server -Xbatch sumcol2
master value: 370655678
2147483647

real 0m19.829s
user 0m24.646s
sys 0m0.312s

Jeffrey Stedfast said...

I suspect that the i/o isn't the performance bottleneck for the StreamTokenizer implementation, but rather something to do with overflowing an integer.

I don't fully understand why it's a problem with the StreamTokenizer implementation and not the Line-Reader implementation, but considering that the StreamTokenizer doesn't get the correct answer when overflowing a signed int either, I think it's gotta be something there.

Would probably need a "Java expert" to look into it to be sure (and I'm no Java expert ;)

Anonymous said...

Do you have GCJ installed on your computer?

GCJ comes with libgcj-3.4.4.jar, which has a horrible implementation of java.io.StreamTokenizer. Perhaps HotSpot finds the GCJ implementation of StreamTokenizer first (maybe if it's in the classpath or something) and uses that instead?

I have no other explanation since on my system StreamTokenizer is much faster than BufferedReader implementation.

Jeffrey Stedfast said...

[fejj@moonlight benchmarks]$ rpm -qa | grep gcj
[fejj@moonlight benchmarks]$

Doesn't look like it's installed...

Anonymous said...

[sarcasm]
"more optimal"? Hopefully soon you have the "most optimal" solution.
[/sarcasm]

eTM

Jeffrey Stedfast said...

Eureka!

I have found the bug in the StreamTokenizer version of the Java implementation.

The problem is that StreamTokenizer.nval is a double, and not an int.

If I change the line:

sum += lineTokenizer.nval;

to:

sum += (int) lineTokenizer.nval;

and then re-run the "killer" test, I get about double the performance it had previously:

[fejj@moonlight benchmarks]$ time ./output 21474836 | java -server -Xbatch sumcol2
master value: 370655678
370655678

real 0m10.625s
user 0m13.801s
sys 0m0.852s

It also now gets the correct result.

Unfortunately, it's still slower than the C# implementation running under Mono on my machine.

Anonymous said...

> That you have a vendetta against Java?

Actually, what we've learned is that the author is an honest person making an honest analysis of this particular benchmark test and providing us with honest figures (he updated the article with new information as people gave him suggestions).

He then clearly made an honest (and successful) effort to figure out why Java was broken, because his most recent comment above shows that he managed to fix the bug in the Java test and at the same time double the performance.

If he had a vendetta against Java, I doubt he would have bothered to do the research nor would he have posted a follow-up comment noting that it doubled the Java program's performance.

Certainly I've never once seen a pro-Java/anti-Mono person go to the lengths that this author has gone to in order to give such an objective and accurate analysis.

To Jeffrey Stedfast, I salute you, sir.

Anonymous said...

lol

fejj, as usual you are insane when it comes to optimizing this stuff :p

Anonymous said...

sumcol-input.txt is so small that all it does is measure start-up time. C# apparently has a faster startup time on your machine.
java client (not available on 32-bit machines) has a slightly faster startup time than java server. The shootout site used a much bigger file, i.e. sumcol-input.txt concatenated together several thousand times, according to the shootout website.

Jeffrey Stedfast said...

Yea, that's what I was figuring as well (either that or I assumed they ran the programs several times - which is what it looks like they did with some of the tests at least).

That's also why I tested with my own input which was a lot larger.

Anonymous said...

"The problem is that StreamTokenizer.nval is a double, and not an int."

nval should be double since not all tokens would be int in all applications. Making nval an int would be a bug. Why would double give the wrong answer given it's 64-bit compared to int which is 32-bit?

If you open StreamTokenizer.java (the source file), who is the author and what version is it? Mine says @version 1.47, 11/30/05 by Gosling. Also, check how many versions of StreamTokenizer you have on your computer.

Anonymous said...

"Java client (not available on 32-bit machines) "

I meant "not available on 64-bit machines". Of course java client is available on 32-bit machines.

Anonymous said...

I looked up on what Xbatch flag does:

"Using the -Xbatch flag on server processes that don't directly interact with the user might result in higher peak performance on single-processor servers. In general, using this flag hurts performance on multiprocessor machines because compilation could otherwise be off-loaded to a different processor."

Of course it probably makes no difference in this microbenchmark.

Anonymous said...

Try a test with the -server and -Xbatch removed from the Java command line. The server VM has a more expensive startup than the client VM.

Anonymous said...

server is much faster than client overall in most cases (sometimes almost twice as fast). Even in this case, with larger files server is faster.

Jeffrey Stedfast said...

nval should be double since not all tokens would be int in all applications.

Indeed. I never meant to suggest otherwise (sorry if that's what it sounded like I meant).

Making nval an int would be a bug. Why would double give the wrong answer given it's 64-bit compared to int which is 32-bit?

Okay, I guess I need to explain what was happening...

We have an integer variable, sum. We are adding to it a variable of type double named nval.

sum += nval;

The rules of adding a double to an int are that you cast the int to a double, then add, then cast back to int.

So, essentially, we have:

sum = (int) (((double) sum) + nval);

The problem comes in when the resultant double value (of sum + nval) exceeds Int32.Max. When the double is converted back to an int, Java clamps anything above Int32.Max down to Int32.Max, which is why the output was exactly 2147483647.

The solution is to cast nval to an int so that the addition follows signed int32 arithmetic.

sum += (int) nval;

Jeffrey Stedfast said...

I tried without the -Xbatch flag and it didn't seem to make any difference at all for any of the inputs I tried in my experiment.

vexorian said...

You know what remains a mystery here? Why would we use any of them? Java is still not entirely GPL. Mono is a lead pony for MS technology. Not to mention both are quite broken in the performance side of things.

Perhaps we need to stop promoting broken technology? How about we at least try not to imitate MS technology as if it was the holy grail? FOSS certainly didn't need this bunch of lame stuff in its stack. Oh and If you want speed go C.

Jeffrey Stedfast said...

Vexorian: You might find this interesting...

[fejj@moonlight benchmarks]$ time mono sumcol2.exe < sumcol-input100000.txt
50000000

real 0m2.906s
user 0m2.572s
sys 0m0.308s
[fejj@moonlight benchmarks]$ time ./sumcol < sumcol-input100000.txt
50000000

real 0m13.967s
user 0m13.277s
sys 0m0.668s


Yes, that's a fully managed C# implementation beating the pants off the C implementation.

What was that you were saying?

Anonymous said...

lol, Vexorian just got schooled.

Don't worry about Vexorian too much, he's a tool and everyone knows it. He thinks he knows everything but when you get right down to it, he knows nothing. He couldn't program his way out of a paper bag so obviously his opinions on software engineering count for nothing.

Anonymous said...

You fell victim to one of the classic blunders! The most famous is never get involved in a land war in Asia, but only slightly less well-known is this: never go in against fejj when parsers or I/O performance is on the line! Ha ha ha ha ha ha ha! Ha ha ha ha ha ha ha! Ha ha ha...

Seriously tho, fejj is a god at writing high-performance parsers.

As we can see here, he'll own you.

Anonymous said...

My biggest issue is that all of the tests include warmup time in the results. I'd be curious to know what the steady state performance of the VMs are after the code has been JIT'ed (ie, run them a few times, pause, and then run again in the same VM instance).

Warmup time biases things in interesting ways that are seldom relevant to anything that I do with a VM.

Anonymous said...

Are you the same guy that optimized System.Console I/O performance about a year or so back? Something like 20x faster iirc.

If so, I have to agree with the Princess Bride comment above: this guy knows what he's doing.

Jeffrey Stedfast said...

Anonymous: I didn't mention this in my blog, but I ran the tests multiple times in order to offer the fairest runs I could.

That said, my experiment wasn't to prove that Mono was faster than Java, but rather to find the performance bottleneck in Mono in order to fix it to be comparable to Java in what the Debian Language Shootout benchmarks were indicating.

It just so happens that the results on my machine did not match those on the Language Shootout comparison.

It might be that the Debian Language Shootout used a 1.2.x version of Mono for the tests (like they say on some pages) rather than Mono 1.9.1 (like they say on other pages) for this particular test. In that case, their Mono version may not have had Miguel's port of my fast int parsing routines (which likely made a big difference - these should have made it in before 1.9.1 afaik) and/or their version of Mono might not have had some performance fixes that my local Mono has (I'm using a later version than 1.9.1).

The only thing I care about at the end of the experiment is that Mono is not lagging far behind Java, and I think I proved that.

Jeffrey Stedfast said...

Mike: yes, altho I think I only managed to improve it by 13x, not 20x :(

Jeffrey Stedfast said...

Michael Hutchinson has just forwarded me the following link which is interesting in that it shows you can't assume that something written in C or C++ will be any faster than something written in C# running in a VM.

http://blogs.msdn.com/ricom/archive/2005/05/19/420158.aspx

That link is for you, Vexorian.

Anonymous said...

"Anonymous: I didn't mention this in my blog, but I ran the tests multiple times in order to offer the fairest runs I could."

That's not what I meant. What I mean is that the "time" command will include the time required to start up the Java interpreter and to JIT the code. That is all time that I seldom care about, but Java tends to perform poorly in benchmarks that include warmup time.

A more interesting (to me) test is what was performance like after VM startup and JIT compilation?

And, of course, I get your point that a comprehensive comparison wasn't really what you were after... fair enough. :)

Jeffrey Stedfast said...

Ah, I see what you mean. Yes, those results would also be interesting.

I think that the startup time is largely overshadowed when testing against large datasets, though.

I think that the benchmark site has a startup time test in case that interests you.

Anonymous said...

fgets_unlocked does not use buffering, hence the slower speed in C.
If you were to rewrite it with buffering it would wipe the floor with C#.

Anonymous said...

Fastest possible C implementation? Using fgets()?

Are you kidding me?

Jeffrey Stedfast said...

To the first Anonymous: I tried fgets() instead of fgets_unlocked() and it was slower, not faster (slower by about 1 second).

To the most recent Anonymous: I said it was the fastest submitted to the Language Shootout - not the fastest possible.

If you ported my C# implementation to C, it would likely be faster than my C# results, but not by much.

Point is, the C# implementation is parsing roughly 500 MB in 2.5 seconds.

The maintenance cost of writing applications in C is simply not economical in comparison to C# when you can get near C-performance in C# if you really need it (and if you can't, you can just P/Invoke for the performance-critical bits).

Jeffrey Stedfast said...

Here are the results for the fastest C implementation I could come up with:

[fejj@moonlight benchmarks]$ time ./sumcol2 < sumcol-input100000.txt
50000000

real 0m2.072s
user 0m1.800s
sys 0m0.268s


That's a port of my C# implementation over to C using POSIX I/O (i.e. int file descriptors instead of FILE pointers) and pointer arithmetic.

I tried using FILE pointers + fread() and it was actually slower than the C# implementation (took a little over 3s to complete), so don't think I didn't try ;-)

I also tested using array indexes (as in a direct port from C#), but that was slower as well.

Here are the results for the original port:

[fejj@moonlight benchmarks]$ time ./sumcol2 < sumcol-input100000.txt
50000000

real 0m3.644s
user 0m3.344s
sys 0m0.288s


I'll upload my C implementation to http://www.gnome.org/~fejj/sumcol2.c so that you can see for yourself.

Jeffrey Stedfast said...

As someone just pointed out, I could probably get a performance improvement in C# if I used pointer arithmetic instead of array indexes as well (using an unsafe { } block), so we might end up seeing that C# can match the optimized C port.

Jeffrey Stedfast said...

Oh yea, and this was with gcc -O6 (which gave better results than -O2 - but to be fair, most distros build with -O2).

Anonymous said...

Jeffrey:

You forgot to mention that if you subtract the startup overhead for the Mono implementation, that it would no doubt match the performance of the C version.

Since it is unlikely that the only task of any real program would be to sum the contents of a multi-gigabyte file (which is the only time a user might notice a difference in speed), the startup overhead becomes lost in the shuffle.

(I think this is also what the guy who brought up JVM startup overhead was implying in his comments as well - obviously the same applies for Mono too).

Jeffrey Stedfast said...

An excellent point ;)

Yea, basically anyone that claims that an I/O bound parser implemented in C# will be significantly slower than a C or C++ implementation is clueless.

That was the only real purpose of the highly optimized C# implementation comparison.

Note that I had anonymous trolls claiming that implementing a C# IMAP backend for Evolution could never come close to the performance of a C implementation - since an IMAP backend is 90% I/O bound, I think I've proven them wrong.

Anonymous said...

Let me guess, this anonymous troll wouldn't be Victor Soliz (aka Vexorian), would it?

Either way, I suspect what the anonymous troll really means is that he's too incompetent to solve these problems, and so assumes it can't be done.

Go easy on them, not everyone can be an uber programmer ;-)

That said, keep up the awesome performance work you've done - it makes my job easier being able to program in a high level language and not have to worry about performance!

Anonymous said...

DESCRIPTION
Each of these functions has the same behavior as its counterpart without the `_unlocked' suffix, except that they do not use locking (they do not set locks themselves, and do not test for the presence of locks set by others) and hence are thread-unsafe.

The anonymous guy who claimed that fgets_unlocked() is unbuffered is a moron.

Both fgets() and fgets_unlocked() are buffered.

Anonymous said...

"That's a port of my C# implementation"

It's not "your" fastest implementation. Your "C#" version is just a copy of Java#3 with slight changes from what I can see above.

Anonymous said...

You say Java's Tokenizer version is faster than C#, but it only breaks for this input, where it's slow...

[fejj@moonlight benchmarks]$ time ./output 21474836 | java sumcol
master value: 370655678
2147483647

real 0m19.157s
user 0m22.757s

Can you run this input file with these options:

java -server -Xmx128m -Xms128m -Xmn120m sumcol

Anonymous said...

Since Java, C++ and C# are object oriented programming languages, I'm missing a benchmark for some OO behaviour.

Anonymous said...

The example C code will be quite slow because of its use of fgets().

I put an optimized version here: http://rafb.net/p/sHBg7X54.html. It outperforms the C# version handily, while being significantly more compact. If C# as a language is "more economical" - and I'm not saying it isn't - then that isn't illustrated by this test.

Jeffrey Stedfast said...

Here are the results of your highly optimized implementation:

[fejj@moonlight benchmarks]$ gcc -arch 386 -funroll-all-loops -o his his.c
[fejj@moonlight benchmarks]$ time ./his < sumcol-input100000.txt
50000000

real 0m19.375s
user 0m4.692s
sys 0m0.596s
[fejj@moonlight benchmarks]$ time ./his < sumcol-input100000.txt
50000000

real 0m4.131s
user 0m3.772s
sys 0m0.352s
[fejj@moonlight benchmarks]$ time ./his < sumcol-input100000.txt
50000000

real 0m4.362s
user 0m3.808s
sys 0m0.544s


I ran it 3 times for good measure to make sure the cache was warmed up (like I did with my tests) and I'm sorry to say, but your C implementation hardly "handily outperforms" my C# implementation.

Just to be fair, I compiled your program with -O6 and tried again:

[fejj@moonlight benchmarks]$ gcc -O6 -o his his.c
[fejj@moonlight benchmarks]$ time ./his < sumcol-input100000.txt
50000000

real 0m2.855s
user 0m2.556s
sys 0m0.300s


With -O6 optimizations, it BARELY competes with the optimized C# implementation I wrote yesterday.

Here are the results for my C and C# implementations, again, for comparison:

[fejj@moonlight benchmarks]$ gcc -O6 -o sumcol2 sumcol2.c
[fejj@moonlight benchmarks]$ time ./sumcol2 < sumcol-input100000.txt
50000000

real 0m1.922s
user 0m1.684s
sys 0m0.232s
[fejj@moonlight benchmarks]$ gmcs sumcol2.cs
[fejj@moonlight benchmarks]$ time mono sumcol2.exe < sumcol-input100000.txt
50000000

real 0m2.613s
user 0m2.332s
sys 0m0.272s

Jeffrey Stedfast said...

I've just implemented an even faster C# implementation which you can find the source to at http://www.gnome.org/~fejj/sumcol3.cs

This new version uses pointers in C#.

You'll need to use gmcs -unsafe sumcol3.cs to compile.

The results are as follows:

[fejj@moonlight benchmarks]$ time mono sumcol3.exe < sumcol-input100000.txt
50000000

real 0m1.878s
user 0m1.540s
sys 0m0.332s

That is on par with the fastest C implementation on my system.
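
For the curious, the pointer version takes the same read-raw-bytes-and-parse-digits approach as sumcol2.cs, but pins the buffer and walks it with a byte pointer instead of indexing into the array. A rough sketch of that (not the exact sumcol3.cs linked above):

using System;
using System.IO;

class SumCol3 {
    static unsafe void Main ()
    {
        byte[] buffer = new byte[4096];
        Stream stdin = Console.OpenStandardInput ();
        int sum = 0, value = 0, n;
        bool negative = false;

        fixed (byte *buf = buffer) {
            // Same manual digit parsing as before, but walking a pinned
            // pointer avoids the array bounds checks on every access.
            while ((n = stdin.Read (buffer, 0, buffer.Length)) > 0) {
                byte *inptr = buf;
                byte *inend = buf + n;

                while (inptr < inend) {
                    byte c = *inptr++;

                    if (c >= (byte) '0' && c <= (byte) '9') {
                        value = (value * 10) + (c - (byte) '0');
                    } else if (c == (byte) '-') {
                        negative = true;
                    } else if (c == (byte) '\n') {
                        sum += negative ? -value : value;
                        negative = false;
                        value = 0;
                    }
                }
            }
        }

        Console.WriteLine (sum);
    }
}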

Jeffrey Stedfast said...

It's not "your" fastest implementation. Your "C#" version is just a copy of Java#3 with slight changes from what I can see above.

They are quite different, but yes, they do use similar approaches, so I can assure you that it is "my" implementation. All of my parsers are written using a similar technique to the one I wrote in the C# implementation (see GMime, Evolution, Alleyoop, etc for examples).

All of those also use the optimization trick that makes my C# version outperform the java version.

For an instant replay, see the following results:

[fejj@moonlight benchmarks]$ time java -server sumcol3 < sumcol-input100000.txt
50000000

real 0m3.887s
user 0m2.916s
sys 0m0.196s
[fejj@moonlight benchmarks]$ time mono sumcol2.exe < sumcol-input100000.txt
50000000

real 0m2.697s
user 0m2.432s
sys 0m0.260s

Anonymous said...

Did you try

java -server -Xmx128m -Xms128m -Xmn120m sumcol

for input file that breaks StreamTokenizer?

Jeffrey Stedfast said...

Can you run this input file with these options:

java -server -Xmx128m -Xms128m -Xmn120m sumcol


Sure, I'll humor you... but if you have to start tweaking runtime options away from the standard ones, then it's already a lost cause (same goes for having to compile the C programs with gcc optimization flags not commonly used by distributors... e.g. anything beyond -O2).

I ran it with those options 3 times, here are the results:

[fejj@moonlight benchmarks]$ time java -server -Xmx128m -Xms128m -Xmn120m sumcol < sumcol-input100000.txt
50000000

real 0m21.037s
user 0m20.221s
sys 0m0.892s
[fejj@moonlight benchmarks]$ time java -server -Xmx128m -Xms128m -Xmn120m sumcol < sumcol-input100000.txt
50000000

real 0m20.286s
user 0m19.949s
sys 0m0.440s
[fejj@moonlight benchmarks]$ time java -server -Xmx128m -Xms128m -Xmn120m sumcol < sumcol-input100000.txt
50000000

real 0m22.300s
user 0m21.241s
sys 0m1.124s

Doesn't seem to have made any difference; we're still seeing ~20s run times for java.

(fwiw, this was with Java6 #1, so not the tokenizer)

In case you meant the tokenizer version, the results are:

[fejj@moonlight benchmarks]$ time java -server -Xmx128m -Xms128m -Xmn120m sumcol2 < sumcol-input100000.txt
50000000

real 0m22.187s
user 0m21.797s
sys 0m0.340s
[fejj@moonlight benchmarks]$ time java -server -Xmx128m -Xms128m -Xmn120m sumcol2 < sumcol-input100000.txt
50000000

real 0m22.032s
user 0m21.717s
sys 0m0.244s
[fejj@moonlight benchmarks]$ time java -server -Xmx128m -Xms128m -Xmn120m sumcol2 < sumcol-input100000.txt
50000000

real 0m22.815s
user 0m22.021s
sys 0m0.740s

For comparison (since I don't think I ever tested Java6 #2 with this input before), the results without those VM flags are:

[fejj@moonlight benchmarks]$ time java -server sumcol2 < sumcol-input100000.txt
50000000

real 0m22.750s
user 0m22.373s
sys 0m0.300s

So, it doesn't seem like those flags make any difference.

What do they even do?

Jeffrey Stedfast said...

Ah, those flags deal with initial memory pool sizes for the GC.

I'm sensing the problem is not with memory allocation; there's not much allocation going on in these benchmark programs.

My guess is that Java's VM or class libs for streams could use some optimization (maybe Sun could take a look at Mono's implementation and do something similar to what Mono does?).

Unfortunately I don't have the time or desire to dig into why Java is so slow for stream reading (or is it the int parsing?), but maybe someone who reads this blog might get inspired to look into it and fix Java.

Jeffrey Stedfast said...

Oops, I missed the -Os compile-option for the C program submitted to me.

Allow me to recompile and retest:

[fejj@moonlight benchmarks]$ gcc -Os -funroll-all-loops -o his his.c
[fejj@moonlight benchmarks]$ time ./his < sumcol-input100000.txt
50000000

real 0m2.200s
user 0m1.808s
sys 0m0.372s


Okay, that is now almost as fast as my fastest implementation using -O6 optimizations.

Let me try mine with these same compile flags and see if that changes my program's results:

[fejj@moonlight benchmarks]$ time ./sumcol2 < sumcol-input100000.txt
50000000

real 0m2.171s
user 0m1.828s
sys 0m0.344s

These options seem to have slowed mine down slightly...

If I rebuild both of them with -O2 (which is the likely compile flag for most distros), then I get:
[fejj@moonlight benchmarks]$ gcc -O2 -o sumcol2 sumcol2.c
[fejj@moonlight benchmarks]$ gcc -O2 -o his his.c
[fejj@moonlight benchmarks]$ time ./sumcol2 < sumcol-input100000.txt
50000000

real 0m2.035s
user 0m1.664s
sys 0m0.352s
[fejj@moonlight benchmarks]$ time ./his < sumcol-input100000.txt
50000000

real 0m3.077s
user 0m2.592s
sys 0m0.428s


Anyways... interesting results. Looks like the -Os made a big difference for your program, but it still wasn't enough to outperform my faster C# implementation using pointers :p

You might be able to squeeze a little more performance out of your implementation if you used pointer arithmetic.

I think you'll also find that such a large BUFF_SIZE is not necessary: using 4096 instead of 409600 won't decrease performance at all, but it will greatly reduce memory usage. You could then also allocate it on the stack, removing the overhead of malloc()ing.

Anonymous said...

Which version of gcc are you using? What optimization flags were used to build your Mono installation?

Jeffrey Stedfast said...

[fejj@moonlight benchmarks]$ gcc --version
gcc (GCC) 4.2.1 (SUSE Linux)

[fejj@moonlight benchmarks]$ rpm -qa | grep gcc
gcc42-c++-4.2.1_20070724-17
gcc42-info-4.2.1_20070724-17
libgcc42-4.2.1_20070724-17
gcc42-4.2.1_20070724-17
gcc-info-4.2-24
gcc-4.2-24
gcc-c++-4.2-24


my Mono was built with the default CFLAGS for an OpenSuSE 10.3 system... I updated my Mono svn version a week or so ago and used the following command-line options:

./autogen.sh --prefix=/opt/mono --with-moonligh=yes && make && make install

If I examine the Makefile, the variables are defined as follows:

CFLAGS = -g -O2 -fno-strict-aliasing -Wdeclaration-after-statement -g -Wall -Wunused -Wmissing-prototypes -Wmissing-declarations -Wstrict-prototypes -Wmissing-prototypes -Wnested-externs -Wpointer-arith -Wno-cast-qual -Wcast-align -Wwrite-strings -mno-tls-direct-seg-refs
CFLAGS_FOR_BUILD = -g -O2

so it looks like, minus warning flags, the options are:

-g -O2 -fno-strict-aliasing -mno-tls-direct-seg-refs

(good thing I checked because I would have assumed straight -g -O2)

Anonymous said...

Jeffrey stedfast:
"All of those also use the optimization trick that makes my C# version outperform the java version"

Well, the Java version can be improved too. This one is a bit faster and smaller than Java#3

http://pastebin.com/f6525b4e5

Anonymous said...

Even better, the following is slightly faster than the above

http://pastebin.com/fdabb77b

Jeffrey Stedfast said...

Very nice! This implementation is definitely the best java implementation yet!

Here are the results of my 3 runs of your java implementation:

[fejj@moonlight benchmarks]$ time java -server sumcol4 < sumcol-input100000.txt
50000000

real 0m3.425s
user 0m3.172s
sys 0m0.204s
[fejj@moonlight benchmarks]$ time java -server sumcol4 < sumcol-input100000.txt
50000000

real 0m3.461s
user 0m3.168s
sys 0m0.240s
[fejj@moonlight benchmarks]$ time java -server sumcol4 < sumcol-input100000.txt
50000000

real 0m3.344s
user 0m3.172s
sys 0m0.168s


I think that what these comparisons show is that Mono's VM/class libs have a bit less abstraction over the underlying POSIX read() layer whereas Java might have much more.

I think the other major performance penalty that Java suffers is the bounds checking when compared to my C# w/pointers implementation, so it's to be expected that Java would have more overhead there.

Anyways, this has turned out to be an interesting experiment :)

Anonymous said...

Jeffrey Stedfast said
"the bounds checking when compared to my C# w/pointers "

It must be IO related. The HotSpot -server JIT compiler can eliminate the bounds checking in a loop like this if it can guarantee that the array index is within bounds.

Have a look:

http://ei.cs.vt.edu/~cs5314/presentations/Group2PLDI.pdf

Anonymous said...

Hmm. the link was broken.

http://ei.cs.vt.edu/~cs5314/presentations/Group2PLDI.pdf

In case it breaks again, it's dot (.) pdf after Group2PLDI

Jeffrey Stedfast said...

Very interesting...

I guess that means it's all in the I/O overhead? That's some pretty insane overhead :(

Anonymous said...

What about Java vs Mono on unsigned datatypes? They are quite common when dealing with internet protocols and various file formats, and since they are a pain in the ass in Java, such a test would be quite interesting.

Anonymous said...

Just wanted to mention how good of an example this is of the worst of IT.

It is funny. Typical, someone posts a broken benchmark and then the wankers that find the results convenient blindly fall for it and take it as a fact then flame whoever disagrees. Amazing.

Ah well the blog is moderated so just the author is probably going to see this comment, still I got to say I found it very funny.

It is amazing to see people buying this cool-aid of mono being faster than C, I guess if you want to believe in something you will do it, regardless of any common sense. And someone linked to a msdn thread as a proof C# is faster than C! Awesome...

Jeffrey Stedfast said...

I don't see anywhere that says that Mono is faster than C. I merely proved that I/O from C# under Mono can be as fast as C.

Jeffrey Stedfast said...

I should mention that the MSDN link only showed that Microsoft's .NET can be as fast or faster than C/C++ in certain conditions.

Code Snippet Licensing

All code posted to this blog is licensed under the MIT/X11 license unless otherwise stated in the post itself.