tag:blogger.com,1999:blog-203063759820106893.post7924015203590954586..comments2023-03-25T08:30:26.602-04:00Comments on A Moment of Zen: Debian Language Benchmarks - SumFileJeffrey Stedfasthttp://www.blogger.com/profile/12271561115384429651noreply@blogger.comBlogger79125tag:blogger.com,1999:blog-203063759820106893.post-85875873455885150212009-03-08T00:09:00.000-05:002009-03-08T00:09:00.000-05:00I should mention that the MSDN link only showed th...I should mention that the MSDN link only showed that Microsoft's .NET can be as fast or faster than C/C++ in certain conditions.Jeffrey Stedfasthttps://www.blogger.com/profile/12271561115384429651noreply@blogger.comtag:blogger.com,1999:blog-203063759820106893.post-29621549837516358452009-03-08T00:08:00.000-05:002009-03-08T00:08:00.000-05:00I don't see anywhere that says that Mono is faster...I don't see anywhere that says that Mono is faster than C. I merely proved that I/O from C# under Mono can be as fast as C.Jeffrey Stedfasthttps://www.blogger.com/profile/12271561115384429651noreply@blogger.comtag:blogger.com,1999:blog-203063759820106893.post-30521871427813433432009-03-07T22:02:00.000-05:002009-03-07T22:02:00.000-05:00Just wanted to mention how good of an example this...Just wanted to mention how good of an example this is of the worst of IT.<BR/><BR/>It is funny. Typical, someone posts a broken benchmark and then the wankers that find the results convenient blindly fall for it and take it as a fact then flame whoever disagrees. Amazing.<BR/><BR/>Ah well the blog is moderated so just the author is probably going to see this comment, still I got to say I found it very funny.<BR/><BR/>It is amazing to see people buying this cool-aid of mono being faster than C, I guess if you want to believe in something you will do it, regardless of any common sense. And someone linked to a msdn thread as a proof C# is faster than C! 
Awesome...Anonymousnoreply@blogger.comtag:blogger.com,1999:blog-203063759820106893.post-85172220576837458572008-05-29T17:22:00.000-04:002008-05-29T17:22:00.000-04:00What about Java vs Mono on unsigned datatypes. The...What about Java vs Mono on unsigned datatypes? They are quite common when dealing with internet protocols and various file formats, and as they are a pain in the ass in Java, such a test would be quite interesting.Anonymousnoreply@blogger.comtag:blogger.com,1999:blog-203063759820106893.post-52785416623093696082008-05-20T17:37:00.000-04:002008-05-20T17:37:00.000-04:00Very interesting... I guess that means it's all in...Very interesting... <BR/><BR/>I guess that means it's all in the I/O overhead? That's some pretty insane overhead :(Jeffrey Stedfasthttps://www.blogger.com/profile/12271561115384429651noreply@blogger.comtag:blogger.com,1999:blog-203063759820106893.post-2251784502769556692008-05-20T16:12:00.000-04:002008-05-20T16:12:00.000-04:00Hmm. the link was broken. http://ei.cs.vt.edu/~cs5...Hmm, the link was broken. <BR/><BR/>http://ei.cs.vt.edu/~cs5314/presentations/Group2PLDI.pdf<BR/><BR/>In case it breaks again, it's dot (.) pdf after Group2PLDIAnonymousnoreply@blogger.comtag:blogger.com,1999:blog-203063759820106893.post-14512679577538098542008-05-20T16:09:00.000-04:002008-05-20T16:09:00.000-04:00Jeffrey Stedfast said"the bounds checking when com...Jeffrey Stedfast said<BR/>"the bounds checking when compared to my C# w/pointers "<BR/><BR/>It must be I/O related. The HotSpot -server JIT compiler can eliminate the bounds checking in a loop like this if it can guarantee that the array index is within the bounds. <BR/><BR/>Have a look: <BR/><BR/>http://ei.cs.vt.edu/~cs5314/presentations/Group2PLDI.pdfAnonymousnoreply@blogger.comtag:blogger.com,1999:blog-203063759820106893.post-57187817781585963132008-05-20T14:51:00.000-04:002008-05-20T14:51:00.000-04:00Very nice! 
This implementation is definitely the best java implementation yet!<BR/><BR/>Here are the results of my 3 runs of your java implementation:<BR/><BR/>[fejj@moonlight benchmarks]$ time java -server sumcol4 < sumcol-input100000.txt <BR/>50000000<BR/><BR/>real 0m3.425s<BR/>user 0m3.172s<BR/>sys 0m0.204s<BR/>[fejj@moonlight benchmarks]$ time java -server sumcol4 < sumcol-input100000.txt <BR/>50000000<BR/><BR/>real 0m3.461s<BR/>user 0m3.168s<BR/>sys 0m0.240s<BR/>[fejj@moonlight benchmarks]$ time java -server sumcol4 < sumcol-input100000.txt <BR/>50000000<BR/><BR/>real 0m3.344s<BR/>user 0m3.172s<BR/>sys 0m0.168s<BR/><BR/><BR/>I think that what these comparisons show is that Mono's VM/class libs have a bit less abstraction over the underlying POSIX read() layer whereas Java might have much more.<BR/><BR/>I think the other major performance penalty that Java suffers is the bounds checking when compared to my C# w/pointers implementation, so it's to be expected that Java would have more overhead there.<BR/><BR/>Anyways, this has turned out to be an interesting experiment :)Jeffrey Stedfasthttps://www.blogger.com/profile/12271561115384429651noreply@blogger.comtag:blogger.com,1999:blog-203063759820106893.post-68065876491016673092008-05-20T13:04:00.000-04:002008-05-20T13:04:00.000-04:00Even better, the following is slightly faster than...Even better, the following is slightly faster than the above <BR/><BR/>http://pastebin.com/fdabb77bAnonymousnoreply@blogger.comtag:blogger.com,1999:blog-203063759820106893.post-2634454123277210022008-05-20T12:29:00.000-04:002008-05-20T12:29:00.000-04:00Jeffrey stedfast:"All of those also use the optimi...Jeffrey stedfast:<BR/>"All of those also use the optimization trick that makes my C# version outperform the java version"<BR/><BR/>Well, the Java version can be improved too. 
This one is a bit faster and smaller than Java#3<BR/><BR/>http://pastebin.com/f6525b4e5Anonymousnoreply@blogger.comtag:blogger.com,1999:blog-203063759820106893.post-35407907530320647812008-05-20T10:38:00.000-04:002008-05-20T10:38:00.000-04:00[fejj@moonlight benchmarks]$ gcc --versiongcc (GCC...[fejj@moonlight benchmarks]$ gcc --version<BR/>gcc (GCC) 4.2.1 (SUSE Linux)<BR/><BR/>[fejj@moonlight benchmarks]$ rpm -qa | grep gcc<BR/>gcc42-c++-4.2.1_20070724-17<BR/>gcc42-info-4.2.1_20070724-17<BR/>libgcc42-4.2.1_20070724-17<BR/>gcc42-4.2.1_20070724-17<BR/>gcc-info-4.2-24<BR/>gcc-4.2-24<BR/>gcc-c++-4.2-24<BR/><BR/><BR/>my Mono was built with the default CFLAGS for an OpenSuSE 10.3 system... I updated my Mono svn version a week or so ago and used the following command-line options:<BR/><BR/>./autogen.sh --prefix=/opt/mono --with-moonligh=yes && make && make install<BR/><BR/>If I examine the Makefile, the variables are defined as follows:<BR/><BR/>CFLAGS = -g -O2 -fno-strict-aliasing -Wdeclaration-after-statement -g -Wall -Wunused -Wmissing-prototypes -Wmissing-declarations -Wstrict-prototypes -Wmissing-prototypes -Wnested-externs -Wpointer-arith -Wno-cast-qual -Wcast-align -Wwrite-strings -mno-tls-direct-seg-refs<BR/>CFLAGS_FOR_BUILD = -g -O2<BR/><BR/>so it looks like, minus warning flags, the options are:<BR/><BR/>-g -O2 -fno-strict-aliasing -mno-tls-direct-seg-refs<BR/><BR/>(good thing I checked because I would have assumed straight -g -O2)Jeffrey Stedfasthttps://www.blogger.com/profile/12271561115384429651noreply@blogger.comtag:blogger.com,1999:blog-203063759820106893.post-5343488333739325882008-05-20T10:30:00.000-04:002008-05-20T10:30:00.000-04:00Which version of gcc are you using? What optimizat...Which version of gcc are you using? 
What optimization flags were used to build your Mono installation?Anonymousnoreply@blogger.comtag:blogger.com,1999:blog-203063759820106893.post-85539216530048959322008-05-20T09:55:00.000-04:002008-05-20T09:55:00.000-04:00Oops, I missed the -Os compile-option for the C pr...Oops, I missed the -Os compile-option for the C program submitted to me.<BR/><BR/>Allow me to recompile and retest:<BR/><BR/>[fejj@moonlight benchmarks]$ gcc -Os -funroll-all-loops -o his his.c<BR/>[fejj@moonlight benchmarks]$ time ./his < sumcol-input100000.txt <BR/>50000000<BR/><BR/>real 0m2.200s<BR/>user 0m1.808s<BR/>sys 0m0.372s<BR/><BR/><BR/>Okay, that is now almost as fast as my fastest implementation using -O6 optimizations.<BR/><BR/>Let me try mine with these same compile flags and see if that changes my program's results:<BR/><BR/>[fejj@moonlight benchmarks]$ time ./sumcol2 < sumcol-input100000.txt <BR/>50000000<BR/><BR/>real 0m2.171s<BR/>user 0m1.828s<BR/>sys 0m0.344s<BR/><BR/>These options seem to have slowed mine down slightly...<BR/><BR/>If I rebuild both of them with -O2 (which is the likely compile flag for most distros), then I get:<BR/>[fejj@moonlight benchmarks]$ gcc -O2 -o sumcol2 sumcol2.c<BR/>[fejj@moonlight benchmarks]$ gcc -O2 -o his his.c<BR/>[fejj@moonlight benchmarks]$ time ./sumcol2 < sumcol-input100000.txt <BR/>50000000<BR/><BR/>real 0m2.035s<BR/>user 0m1.664s<BR/>sys 0m0.352s<BR/>[fejj@moonlight benchmarks]$ time ./his < sumcol-input100000.txt <BR/>50000000<BR/><BR/>real 0m3.077s<BR/>user 0m2.592s<BR/>sys 0m0.428s<BR/><BR/><BR/>Anyways... interesting results. 
Looks like the -Os made a big difference for your program, but it still wasn't enough to outperform my faster C# implementation using pointers :p<BR/><BR/>You might be able to squeeze a little more performance out of your implementation if you used pointer arithmetic.<BR/><BR/>I think you'll also find that such a large BUFF_SIZE is not necessary; using 4096 instead of 409600 won't decrease performance at all, but it will greatly reduce memory usage. You could then also allocate it on the stack, removing the overhead cost of malloc()ing.Jeffrey Stedfasthttps://www.blogger.com/profile/12271561115384429651noreply@blogger.comtag:blogger.com,1999:blog-203063759820106893.post-84963036293576255802008-05-20T09:31:00.000-04:002008-05-20T09:31:00.000-04:00Ah, those flags deal with initial memory pool size...Ah, those flags deal with initial memory pool sizes for the GC.<BR/><BR/>I'm sensing the problem is not with memory allocation; there's not much allocation going on in these benchmark programs.<BR/><BR/>My guess is that Java's VM or class libs for streams could use some optimization (maybe Sun could take a look at Mono's implementation and do something similar to what Mono does?).<BR/><BR/>Unfortunately I don't have the time or desire to dig into why Java is so slow for stream reading (or is it the int parsing?), but maybe someone who reads this blog might get inspired to look into it and fix Java.Jeffrey Stedfasthttps://www.blogger.com/profile/12271561115384429651noreply@blogger.comtag:blogger.com,1999:blog-203063759820106893.post-1527580507157137352008-05-20T08:59:00.000-04:002008-05-20T08:59:00.000-04:00Can you run this input file with these options:java...<I>Can you run this input file with these options:<BR/><BR/>java -server -Xmx128m -Xms128m -Xmn120m sumcol</I><BR/><BR/>Sure, I'll humor you... 
but if you have to start tweaking runtime options from the standard options, then it's already a lost cause (same goes for having to compile the c programs with gcc optimization flags not commonly used by distributors... e.g. anything beyond -O2).<BR/><BR/>I ran it with those options 3 times, here are the results:<BR/><BR/>[fejj@moonlight benchmarks]$ time java -server -Xmx128m -Xms128m -Xmn120m sumcol < sumcol-input100000.txt<BR/>50000000<BR/><BR/>real 0m21.037s<BR/>user 0m20.221s<BR/>sys 0m0.892s<BR/>[fejj@moonlight benchmarks]$ time java -server -Xmx128m -Xms128m -Xmn120m sumcol < sumcol-input100000.txt<BR/>50000000<BR/><BR/>real 0m20.286s<BR/>user 0m19.949s<BR/>sys 0m0.440s<BR/>[fejj@moonlight benchmarks]$ time java -server -Xmx128m -Xms128m -Xmn120m sumcol < sumcol-input100000.txt<BR/>50000000<BR/><BR/>real 0m22.300s<BR/>user 0m21.241s<BR/>sys 0m1.124s<BR/><BR/>Doesn't seem to have made any difference, we're still seeing ~20s run times for java.<BR/><BR/>(fwiw, this was with Java6 #1, so not the tokenizer)<BR/><BR/>In case you meant the tokenizer version, the results are:<BR/><BR/>[fejj@moonlight benchmarks]$ time java -server -Xmx128m -Xms128m -Xmn120m sumcol2 < sumcol-input100000.txt<BR/>50000000<BR/><BR/>real 0m22.187s<BR/>user 0m21.797s<BR/>sys 0m0.340s<BR/>[fejj@moonlight benchmarks]$ time java -server -Xmx128m -Xms128m -Xmn120m sumcol2 < sumcol-input100000.txt<BR/>50000000<BR/><BR/>real 0m22.032s<BR/>user 0m21.717s<BR/>sys 0m0.244s<BR/>[fejj@moonlight benchmarks]$ time java -server -Xmx128m -Xms128m -Xmn120m sumcol2 < sumcol-input100000.txt<BR/>50000000<BR/><BR/>real 0m22.815s<BR/>user 0m22.021s<BR/>sys 0m0.740s<BR/><BR/>For comparison (since I don't think I ever tested Java6 #2 with this input before), the results without those VM flags are:<BR/><BR/>[fejj@moonlight benchmarks]$ time java -server sumcol2 < sumcol-input100000.txt50000000<BR/><BR/>real 0m22.750s<BR/>user 0m22.373s<BR/>sys 0m0.300s<BR/><BR/>So, it doesn't seem like those flags make any 
difference.<BR/><BR/>What do they even do?Jeffrey Stedfasthttps://www.blogger.com/profile/12271561115384429651noreply@blogger.comtag:blogger.com,1999:blog-203063759820106893.post-21395457131769048112008-05-20T08:46:00.000-04:002008-05-20T08:46:00.000-04:00Did you try java -server -Xmx128m -Xms128m -Xmn120...Did you try <BR/><BR/>java -server -Xmx128m -Xms128m -Xmn120m sumcol<BR/><BR/>for input file that breaks StreamTokenizer?Anonymousnoreply@blogger.comtag:blogger.com,1999:blog-203063759820106893.post-67930304882694419812008-05-20T08:45:00.000-04:002008-05-20T08:45:00.000-04:00It's not "your" fastest implementation. Your "C#" ...<I>It's not "your" fastest implementation. Your "C#" version is just a copy of Java#3 with slight changes from what I can see above.</I><BR/><BR/>They are quite different, but yes, they do use similar approaches, so I can assure you that it is "my" implementation. All of my parsers are written using a similar technique to the one I wrote in the C# implementation (see GMime, Evolution, Alleyoop, etc for examples).<BR/><BR/>All of those also use the optimization trick that makes my C# version outperform the java version.<BR/><BR/>For an instant replay, see the following results:<BR/><BR/>[fejj@moonlight benchmarks]$ time java -server sumcol3 < sumcol-input100000.txt <BR/>50000000<BR/><BR/>real 0m3.887s<BR/>user 0m2.916s<BR/>sys 0m0.196s<BR/>[fejj@moonlight benchmarks]$ time mono sumcol2.exe < sumcol-input100000.txt 50000000<BR/><BR/>real 0m2.697s<BR/>user 0m2.432s<BR/>sys 0m0.260sJeffrey Stedfasthttps://www.blogger.com/profile/12271561115384429651noreply@blogger.comtag:blogger.com,1999:blog-203063759820106893.post-15012872109152037192008-05-20T08:37:00.000-04:002008-05-20T08:37:00.000-04:00I've just implemented an even faster C# implementa...I've just implemented an even faster C# implementation which you can find the source to at <A HREF="http://www.gnome.org/~fejj/sumcol3.cs" 
REL="nofollow">http://www.gnome.org/~fejj/sumcol3.cs</A><BR/><BR/>This new version uses pointers in C#.<BR/><BR/>You'll need to use `gmcs -unsafe sumcol3.cs` to compile.<BR/><BR/>The results are as follows:<BR/><BR/>[fejj@moonlight benchmarks]$ time mono sumcol3.exe < sumcol-input100000.txt <BR/>50000000<BR/><BR/>real 0m1.878s<BR/>user 0m1.540s<BR/>sys 0m0.332s<BR/><BR/>That is on par with the fastest C implementation on my system.Jeffrey Stedfasthttps://www.blogger.com/profile/12271561115384429651noreply@blogger.comtag:blogger.com,1999:blog-203063759820106893.post-80481244979125902762008-05-20T08:20:00.000-04:002008-05-20T08:20:00.000-04:00Here are the results of your highly optimized impl...Here are the results of your highly optimized implementation:<BR/><BR/>[fejj@moonlight benchmarks]$ gcc -arch 386 -funroll-all-loops -o his his.c<BR/>[fejj@moonlight benchmarks]$ time ./his < sumcol-input100000.txt <BR/>50000000<BR/><BR/>real 0m19.375s<BR/>user 0m4.692s<BR/>sys 0m0.596s<BR/>[fejj@moonlight benchmarks]$ time ./his < sumcol-input100000.txt <BR/>50000000<BR/><BR/>real 0m4.131s<BR/>user 0m3.772s<BR/>sys 0m0.352s<BR/>[fejj@moonlight benchmarks]$ time ./his < sumcol-input100000.txt <BR/>50000000<BR/><BR/>real 0m4.362s<BR/>user 0m3.808s<BR/>sys 0m0.544s<BR/><BR/><BR/>I ran it 3 times for good measure to make sure the cache was warmed up (like I did with my tests) and I'm sorry to say, but your C implementation hardly "handily outperforms" my C# implementation.<BR/><BR/>Just to be fair, I compiled your program with -O6 and tried again:<BR/><BR/>[fejj@moonlight benchmarks]$ gcc -O6 -o his his.c<BR/>[fejj@moonlight benchmarks]$ time ./his < sumcol-input100000.txt <BR/>50000000<BR/><BR/>real 0m2.855s<BR/>user 0m2.556s<BR/>sys 0m0.300s<BR/><BR/><BR/>With -O6 optimizations, it BARELY competes with the optimized C# implementation I wrote yesterday.<BR/><BR/>Here are the results for my C and C# implementations, again, for comparison:<BR/><BR/>[fejj@moonlight benchmarks]$ gcc 
-O6 -o sumcol2 sumcol2.c<BR/>[fejj@moonlight benchmarks]$ time ./sumcol2 < sumcol-input100000.txt <BR/>50000000<BR/><BR/>real 0m1.922s<BR/>user 0m1.684s<BR/>sys 0m0.232s<BR/>[fejj@moonlight benchmarks]$ gmcs sumcol2.cs <BR/>[fejj@moonlight benchmarks]$ time mono sumcol2.exe < sumcol-input100000.txt <BR/>50000000<BR/><BR/>real 0m2.613s<BR/>user 0m2.332s<BR/>sys 0m0.272sJeffrey Stedfasthttps://www.blogger.com/profile/12271561115384429651noreply@blogger.comtag:blogger.com,1999:blog-203063759820106893.post-18148061362070982242008-05-20T02:40:00.000-04:002008-05-20T02:40:00.000-04:00The example C code will be quite slow because of i...The example C code will be quite slow because of its use of fgets().<BR/><BR/>I [put an optimized version here](http://rafb.net/p/sHBg7X54.html). It outperforms the C# version handily, while being significantly more compact. If C# as a language is "more economical" - and I'm not saying it isn't - then that isn't illustrated by this test.Anonymousnoreply@blogger.comtag:blogger.com,1999:blog-203063759820106893.post-44380474493174934932008-05-20T02:35:00.000-04:002008-05-20T02:35:00.000-04:00Since Java, C++ and C# are object oriented program...Since Java, C++ and C# are object-oriented programming languages, I'm missing a benchmark for some OO behaviour.Anonymousnoreply@blogger.comtag:blogger.com,1999:blog-203063759820106893.post-69767058445569598542008-05-20T00:46:00.000-04:002008-05-20T00:46:00.000-04:00You say java's Tokenizer versionis faster than C# ...You say Java's Tokenizer version<BR/>is faster than C#, but it only breaks for this input where it's slow..<BR/><BR/>[fejj@moonlight benchmarks]$ time ./output 21474836 | java sumcolmaster value: 3706556782147483647real 0m19.157suser 0m22.757s<BR/><BR/>Can you run this input file with these options:<BR/><BR/>java -server -Xmx128m -Xms128m -Xmn120m 
sumcolAnonymousnoreply@blogger.comtag:blogger.com,1999:blog-203063759820106893.post-22972020989991143712008-05-20T00:18:00.000-04:002008-05-20T00:18:00.000-04:00"That's a port of my C# implementation"It's not "y..."That's a port of my C# implementation"<BR/><BR/>It's not "your" fastest implementation. Your "C#" version is just a copy of Java#3 with slight changes from what I can see above.Anonymousnoreply@blogger.comtag:blogger.com,1999:blog-203063759820106893.post-12151095972697364572008-05-19T22:49:00.000-04:002008-05-19T22:49:00.000-04:00DESCRIPTION Each of these functions has the ...DESCRIPTION<BR/> Each of these functions has the same behavior as its counterpart with‐<BR/> out the `_unlocked' suffix, except that they do not use locking (they<BR/> do not set locks themselves, and do not test for the presence of locks<BR/> set by others) and hence are thread-unsafe.<BR/><BR/>The anonymous guy who claimed that fgets_unlocked() is unbuffered is a moron.<BR/><BR/>Both fgets() and fgets_unlocked() are buffered.Anonymousnoreply@blogger.comtag:blogger.com,1999:blog-203063759820106893.post-61753682297722594662008-05-19T20:18:00.000-04:002008-05-19T20:18:00.000-04:00Let me guess, this anonymous troll wouldn't be Vic...Let me guess, this anonymous troll wouldn't be Victor Soliz (aka Vexorian), would it?<BR/><BR/>Either way, I suspect what the anonymous troll really means is that he's too incompetent to solve these problems, and so assumes it can't be done.<BR/><BR/>Go easy on them, not everyone can be an uber programmer ;-)<BR/><BR/>That said, keep up the awesome performance work you've done - it makes my job easier being able to program in a high level language and not have to worry about performance!Anonymousnoreply@blogger.com