The typical implementation for finding the nearest power of 2 for a given value is as follows:
static uint32_t
nearest_pow (uint32_t num)
{
    uint32_t n = 1;

    while (n < num)
        n <<= 1;

    return n;
}
This implementation's performance, unfortunately, degrades as the value of num increases. Luckily there is another approach that takes constant time no matter how large the value:
static uint32_t
nearest_pow (uint32_t num)
{
    uint32_t n = num > 0 ? num - 1 : 0;

    n |= n >> 1;
    n |= n >> 2;
    n |= n >> 4;
    n |= n >> 8;
    n |= n >> 16;
    n++;

    return n;
}
A simple performance test might be:
#include <stdint.h>
#include <limits.h>

int main (int argc, char **argv)
{
    uint32_t i, n = 0;

    for (i = 0; i < INT_MAX / 10; i++)
        n += nearest_pow (i);

    return n > 0 ? 1 : 0;
}
The run-time difference between the two implementations on my AMD Athlon (/proc/cpuinfo reports AMD Athlon(TM) XP 3200+ @ 2200.141 MHz) is impressive. For performance testing, I compiled with gcc -O2 which I figure is the typical default for most packaged software on Linux distributions.
The brain-dead approach has the following results:
[fejj@serenity cvs]$ time ./pow

real    0m12.034s
user    0m11.809s
sys     0m0.032s
The bitwise implementation is insanely fast:
[fejj@serenity cvs]$ time ./pow2

real    0m1.361s
user    0m1.304s
sys     0m0.008s
Now... to be fair, if you are using small values for num, then it's possible that the brain-dead approach might be faster. Let's try the same main() for-loop again, but this time let's call nearest_pow() with a value of 1 each time. Since it is likely that the results will be far too fast to really compare, let's also bump up the number of iterations to UINT_MAX.
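A minimal sketch of what that modified test harness might look like (the constant argument of 1 and the UINT_MAX bound come from the description above; the rest is an assumption on my part):

#include <stdint.h>
#include <limits.h>

int main (int argc, char **argv)
{
    uint32_t i, n = 0;

    /* call with a constant argument of 1, and loop UINT_MAX times so
     * the runs are long enough to compare */
    for (i = 0; i < UINT_MAX; i++)
        n += nearest_pow (1);

    return n > 0 ? 1 : 0;
}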
[fejj@serenity cvs]$ time ./pow

real    0m0.003s
user    0m0.000s
sys     0m0.004s

[fejj@serenity cvs]$ time ./pow2

real    0m0.002s
user    0m0.000s
sys     0m0.000s
Unfortunately, both are still far too fast to really compare performance. Let's try bumping up the value of num to see if we can find the point at which the while-loop approach starts to fall behind the bitwise approach. To start, let's try passing the value of 2 as the num argument:
[fejj@serenity cvs]$ time ./pow

real    0m0.002s
user    0m0.000s
sys     0m0.004s

[fejj@serenity cvs]$ time ./pow2

real    0m0.002s
user    0m0.000s
sys     0m0.000s
It looks like the bitwise approach may be faster than the while-loop approach for the value of 2, but it's a bit hard to tell for sure with only UINT_MAX loops. We'd have to switch to a 64-bit i to know for certain, and I'm not sure it's that important. Let's try 3 and see what we get:
[fejj@serenity cvs]$ time ./pow

real    0m6.053s
user    0m5.968s
sys     0m0.004s

[fejj@serenity cvs]$ time ./pow2

real    0m0.003s
user    0m0.000s
sys     0m0.004s
Well, hot diggity... I think we have ourselves a winner. This suggests that for all values of num larger than 2, the performance of the while-loop approach will be outmatched by the bitwise approach and that for values less-than-or-equal to 2, the performance is nearly identical.
Update: Thanks to the anonymous commenter that noticed that my original main() program was allowing the compiler to optimize out the call to nearest_pow() in the bitwise implementation. As suggested, I updated the for-loop to accumulate the output and then used it after the loop to avoid this problem. It only seemed to change the results for the bitwise implementation in the first test, however (before the change, it reported 0.002s). Still, on my machine it is approx. 10x faster for the first test case and seems to lose no performance even in the optimal conditions for the while-loop implementation.
Update2: I was just pointed to the Linux kernel's fls() implementation for x86. Here is a new implementation using inline assembler for x86:
static uint32_t
nearest_pow (uint32_t num)
{
    int bit;

    __asm__("bsrl %1,%0\n\t"
            "jnz 1f\n\t"
            "movl $-1,%0\n"
            "1:"
            : "=r" (bit) : "rm" (num));

    return (1 << (bit + 1));
}
The original test (INT_MAX / 10 iterations, using i as the num argument) yields the following results:
[fejj@serenity cvs]$ time ./pow3

real    0m1.335s
user    0m1.296s
sys     0m0.004s
The result is only negligibly faster than the C bitwise implementation, and obviously less portable :(
Update3: A friend of mine, Stephane Delcroix, has just pointed me at a solution to this problem that he came up with the other day:
static uint32_t
nearest_pow (uint32_t num)
{
    uint32_t j, k;

    (j = num & 0xFFFF0000) || (j = num);
    (k = j & 0xFF00FF00) || (k = j);
    (j = k & 0xF0F0F0F0) || (j = k);
    (k = j & 0xCCCCCCCC) || (k = j);
    (j = k & 0xAAAAAAAA) || (j = k);

    return j << 1;
}
The results of this implementation are as follows:
[fejj@serenity cvs]$ time ./pow4

real    0m1.249s
user    0m1.204s
sys     0m0.004s
This is actually faster than both the bitwise and the assembler implementations above!
There are two things to be aware of, though:
- When num is 0, the value of 0 is returned (which may not be desirable depending on what you are doing with it)
- If num is a power of 2, then instead of returning num, this implementation will return the next higher power of 2
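If either of those properties is a problem for your use-case, a thin wrapper can compensate; here is a minimal sketch (the wrapper and its name are mine, not part of Stephane's code), assuming the behaviour described above:

/* hypothetical wrapper around Stephane's nearest_pow() */
static uint32_t
nearest_pow_ge (uint32_t num)
{
    /* map 0 to 1 instead of returning 0 */
    if (num == 0)
        return 1;

    /* a power of 2 has exactly one bit set; return it unchanged
     * rather than the next higher power of 2 */
    if ((num & (num - 1)) == 0)
        return num;

    return nearest_pow (num);
}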
24 comments:
BTW, are you aware of bithacks?
http://graphics.stanford.edu/~seander/bithacks.html
It contains a few more recipes for doing log2 and rounding to powers of 2.
Isn't the naive way to use logarithms e.g.
p = log(n)/log(2)
and then raise 2 to that power after rounding off p?
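In code, that suggestion might look roughly like the following sketch (my interpretation, not code from the comment; it rounds up rather than off so it matches the round-up-to-power-of-2 behaviour of the implementations above, and it is subject to floating-point rounding error for large values):

#include <math.h>
#include <stdint.h>

/* hypothetical log-based variant: round num up to a power of 2 */
static uint32_t
nearest_pow_log (uint32_t num)
{
    if (num <= 1)
        return 1;

    return (uint32_t) pow (2.0, ceil (log2 ((double) num)));
}

(This one also needs to be linked against the math library, -lm.)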
anonymous1: thanks for the link!
anonymous2: I just went with the GLib g_nearest_pow() approach for the naive implementation :)
Might be interesting to compare the log() approach to the 2 implementations I talked about on my blog... not sure if I really feel that motivated to try it tho ;-)
For the very last timing: Are you sure that the optimizer didn't make the whole calculation go away?
The call to nearest_pow() is getting optimized out in the second version. An easy fix is to accumulate the results in a second variable and then print it out at the end of the test. For example:
uint32_t i, result = 0;
for (i = 0; i < INT_MAX / 10; i++)
result += nearest_pow3 (i);
printf("%u\n", result);
Oh, and I ran it locally with the fixed version. It's still faster but only about 5 times faster, which for a rarely used function probably doesn't matter all that much.
Some ARM processors have a 'clz' instruction to count the leading zeros in a value in a register. Is there an equivalent instruction on X86? Multimedia extension instructions?
anonymous: good catch wrt compiler optimizing out the call to nearest_pow()! I forgot to check the assembler output of gcc to verify that -O2 didn't do that.
I've updated the article with your suggestion.
Thanks again.
You should look at the Linux implementation of ilog2. It contains a GCC-specific optimization that makes it compute the result at compile-time if the compiler knows the input at compile-time.
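The trick, roughly, is a __builtin_constant_p dispatch between a pure constant expression and a run-time helper. Here is a much-simplified sketch of the pattern (not the kernel's actual code, and only covering the low 8 bits for brevity):

#include <stdint.h>

/* run-time path: index of the highest set bit */
static inline unsigned int
runtime_ilog2 (uint32_t n)
{
    unsigned int bit = 0;

    while (n >>= 1)
        bit++;

    return bit;
}

/* compile-time path: a pure constant expression the compiler can fold.
 * Only 8 arms are shown to keep the sketch short; the kernel spells
 * out one arm per bit. */
#define const_ilog2(n)              \
    ((n) & (1u << 7) ? 7 :          \
     (n) & (1u << 6) ? 6 :          \
     (n) & (1u << 5) ? 5 :          \
     (n) & (1u << 4) ? 4 :          \
     (n) & (1u << 3) ? 3 :          \
     (n) & (1u << 2) ? 2 :          \
     (n) & (1u << 1) ? 1 : 0)

/* dispatch: constant inputs get folded at compile time, everything
 * else goes through the run-time helper */
#define ilog2(n) \
    (__builtin_constant_p (n) ? const_ilog2 (n) : runtime_ilog2 (n))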
Why not look it up and give credit?
http://graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious
I wonder if using "cmovz $-1, %0" would be at all faster than doing the branch.
> Why not look it up and give credit?
>
> http://graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious
Probably because looping a bit-shift until it's >= the input value is the most obvious solution to the problem, one that any first-year CS student could come up with?
None of the solutions presented on this blog are rocket science, I've seen them all used since back in the early-to-mid 90's and the stanford.edu site was only published in 2001.
n/m, it's 1997-2005; I misread the 2001 date on one of the entries as the year the site was published.
Even so, the bit-shift loop has probably been used since the days of the PDP-11. I wouldn't be surprised if the bitwise implementation has been in use since around the same time.
You can write more portable code, almost as efficient as the assembly, with __builtin_clzl.
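For example, a minimal sketch along those lines (my own, not from the comment), using the unsigned int variant __builtin_clz and matching the round-up-to-power-of-2 behaviour of the first two implementations:

#include <stdint.h>

/* hypothetical variant using gcc's __builtin_clz: portable across the
 * architectures gcc supports, and typically compiled down to bsr/lzcnt
 * on x86 or clz on ARM.  __builtin_clz(0) is undefined, so 0 and 1 are
 * handled up front; like the other implementations here, values above
 * 2^31 overflow. */
static uint32_t
nearest_pow_clz (uint32_t num)
{
    if (num <= 1)
        return 1;

    return 1u << (32 - __builtin_clz (num - 1));
}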
I understand the allure of this exercise, but where is this ever done often enough to be performance-critical?
I really think such code is a _bad_ thing. Your fast version is practically unreadable. Sure, it might be just a little bit faster, but is the speed gain really worth it when you introduce such hard to read code? IMHO one of the most important properties of good code is that other people can read and understand it.
The solution from update 3 is probably not the fastest, generally speaking. It is slightly slower than the shift version on my Core2.
Do not forget that || is in fact a disguised IF statement so this algorithm implies several conditional jumps.
My guess is that even if that algorithm can be the fastest when implemented as a standalone function, it is likely to be the slowest in real life situations where the function body is going to be inlined.
Unless the compiler is really clever, there will be far more optimization opportunities when inlining a sequence of shifts than for a sequence of conditional branches.
I fully understand that there are branches in Stephane's implementation, which is why I was surprised it was faster.
FWIW, gcc -O2 does inline the nearest_pow() function for all the implementations tried on this page.
That said, I would tend to agree that it's probably not the most ideal implementation. I mostly shared it because I thought it was a cool approach to the problem :)
Soren: I think that all of the C implementations on this page are quite readable. If they came with a comment about what they were supposed to do and/or a comment explaining how they worked, then there's no reason not to use these code snippets.
Different things are obvious to some people and non-obvious to others. That's what comments are for :)
Oi Jeff, Hmm, I've seen something like this somewhere before ... ;-)
http://blogs.gnome.org/zucchi/2008/03/12/unexpected-bits-of-bits/
and
http://blogs.gnome.org/zucchi/2008/03/18/spewing-bits/
Although your version looks better I think. I wonder how it'd go on Cell though; fewer instructions, but every instruction depends on the previous one, so it might not be any better.
Rounding to a power of 2 is quite a common operation in some types of code that need to run as fast as possible, e.g. memory allocators (why I was looking at it), or bitmap tables (databases, VM page tables). It *is* worth the effort to make it run as fast as possible for this type of code.
Ever stopped to consider that no CPU design would waste the silicon on it if it wasn't useful?
I know I'm jumping a bit late on the wagon, but I thought I'd share why this works. There are two ways of finding the greatest power of a number (here, 2) contained in another number.
a) divide by 2 continually until it equals 1.
b) sum the quotients of increasing *powers* of 2 till it stops making a difference, i.e.:
x/2 + x/4 + x/8 + x/16 ...
It's clear which is more optimal: the second one (since division by 2 is constant time). These are, respectively, the original method and yours.
A little maths (keep in mind that with bit shifts, we are dealing with powers of 2) will show that:
a) runs in O(log n): each op is a bit shift. You have n << 1, n << 1, ... x times, so 2^x = n, i.e. x = log n.
b) runs in O(sqrt(log n)): each op is an increasing bit shift. You have n << 1 | n << 2 | n << 3 ..., where the total (1 + 2 + 3 + ... + x) = log n. The series 1 + 2 + 3 + ... has a very well-known summation, x(x+1)/2. Let's keep it simple and call it roughly x^2. That means x^2 = log n, hence x = sqrt(log n).
Just wanted to share that. Hope it doesn't go over too many heads.
I would just like to follow up my previous comment and say that it is the most utter rubbish I've ever thought of, and I just wanted to jump the gun before somebody flames me for my stupidity.
The method works simply by efficiently filling in all the bits below the highest set bit, then adding 1 to carry over into the next bit column.
In partial defence, I just noticed that method #4 (Stephane Delcroix's) is the most beautiful implementation of a binary-tree-ish algorithm I've ever seen. Hats off to Stephane.
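To make the "filling up the bits" explanation concrete, here is a small trace of the bitwise implementation for one example input (37 is just a value I picked):

#include <stdio.h>
#include <stdint.h>

int main (void)
{
    uint32_t n = 37 - 1;   /* 0b100100: example input minus one */

    n |= n >> 1;           /* 0b110110 */
    n |= n >> 2;           /* 0b111111: every bit below the top bit is now set */
    n |= n >> 4;           /* no change for a value this small */
    n |= n >> 8;
    n |= n >> 16;
    n++;                   /* carry into the next column: 0b1000000 == 64 */

    printf ("%u\n", n);    /* prints 64, the nearest power of 2 >= 37 */

    return 0;
}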
I would have to agree about Stephane's awesomeness ;-)
This was very useful for me, thanks !
This was very useful, thanks!