Saturday, June 14, 2008

Calculating the Nearest Power of 2

The typical implementation for finding the nearest power of 2 for a given value is as follows:

static uint32_t
nearest_pow (uint32_t num)
{
    uint32_t n = 1;

    while (n < num)
        n <<= 1;

    return n;
}

This implementation's performance, unfortunately, suffers as the value of num increases. Luckily, there is another approach that runs in constant time no matter how large the value:

static uint32_t
nearest_pow (uint32_t num)
{
    uint32_t n = num > 0 ? num - 1 : 0;

    n |= n >> 1;
    n |= n >> 2;
    n |= n >> 4;
    n |= n >> 8;
    n |= n >> 16;
    n++;

    return n;
}

A simple performance test might be:

int main (int argc, char **argv)
{
    uint32_t i, n = 0;

    for (i = 0; i < INT_MAX / 10; i++)
        n += nearest_pow (i);

    return n > 0 ? 1 : 0;
}

The run-time difference between the two implementations on my AMD Athlon (/proc/cpuinfo reports AMD Athlon(TM) XP 3200+ @ 2200.141 MHz) is impressive. For performance testing, I compiled with gcc -O2, which I figure is the typical default for most packaged software on Linux distributions.

The brain-dead approach has the following results:

[fejj@serenity cvs]$ time ./pow

real 0m12.034s
user 0m11.809s
sys 0m0.032s

The bitwise implementation is insanely fast:

[fejj@serenity cvs]$ time ./pow2

real 0m1.361s
user 0m1.304s
sys 0m0.008s

Now... to be fair, if you are using small values for num, then it's possible that the brain-dead approach might be faster. Let's try the same main() for-loop again, but this time let's call nearest_pow() with a value of 1 each time. Since the results will likely be far too fast to really compare, let's also bump up the number of iterations to UINT_MAX.

[fejj@serenity cvs]$ time ./pow

real 0m0.003s
user 0m0.000s
sys 0m0.004s
[fejj@serenity cvs]$ time ./pow2

real 0m0.002s
user 0m0.000s
sys 0m0.000s

Unfortunately, both are still far too fast to really compare performance. Let's try bumping up the value of num to see if we can find the point at which the while-loop approach starts to fall behind the bitwise approach. To start, let's try passing the value of 2 as the num argument:

[fejj@serenity cvs]$ time ./pow

real 0m0.002s
user 0m0.000s
sys 0m0.004s
[fejj@serenity cvs]$ time ./pow2

real 0m0.002s
user 0m0.000s
sys 0m0.000s

It looks like the bitwise approach may be faster than the while-loop approach for the value of 2, but it's a bit hard to tell with only UINT_MAX loops. We'd have to switch to using a 64-bit i to know for sure, and I'm not sure it's that important. Let's try 3 and see what we get:

[fejj@serenity cvs]$ time ./pow

real 0m6.053s
user 0m5.968s
sys 0m0.004s
[fejj@serenity cvs]$ time ./pow2

real 0m0.003s
user 0m0.000s
sys 0m0.004s

Well, hot diggity... I think we have ourselves a winner. This suggests that for all values of num larger than 2, the while-loop approach will be outmatched by the bitwise approach, and that for values less than or equal to 2, the performance is nearly identical.

Update: Thanks to the anonymous commenter who noticed that my original main() program was allowing the compiler to optimize out the call to nearest_pow() in the bitwise implementation. As suggested, I updated the for-loop to accumulate the output and then used it after the loop to avoid this problem. It only seemed to change the results for the bitwise implementation in the first test, however (before the change, it reported 0.002s). Still, on my machine it is approx. 10x faster for the first test case and seems to lose no performance even in the optimal conditions for the while-loop implementation.

Update2: I was just pointed to the Linux kernel's fls() implementation for x86. Here is a new implementation using inline assembler for x86:

static uint32_t
nearest_pow (uint32_t num)
{
    int bit;

    __asm__("bsrl %1,%0\n\t"
            "jnz 1f\n\t"
            "movl $-1,%0\n"
            "1:" : "=r" (bit) : "rm" (num));

    return (1 << (bit + 1));
}

Running the original INT_MAX / 10 iterations with i as the num argument yields the following results:

[fejj@serenity cvs]$ time ./pow3

real 0m1.335s
user 0m1.296s
sys 0m0.004s

This is only negligibly faster than the C bitwise implementation, and obviously less portable :(

Update3: A friend of mine, Stephane Delcroix, has just pointed me at a solution to this problem that he came up with the other day:

static uint32_t
nearest_pow (uint32_t num)
{
    uint32_t j, k;
    (j = num & 0xFFFF0000) || (j = num);
    (k = j & 0xFF00FF00) || (k = j);
    (j = k & 0xF0F0F0F0) || (j = k);
    (k = j & 0xCCCCCCCC) || (k = j);
    (j = k & 0xAAAAAAAA) || (j = k);
    return j << 1;
}

The results of this implementation are as follows:

[fejj@serenity cvs]$ time ./pow4

real 0m1.249s
user 0m1.204s
sys 0m0.004s

This is actually faster than both the bitwise and the assembler implementations above!

There are two things to be aware of, though:

  • When num is 0, the value of 0 is returned (which may not be desirable depending on what you are doing with it)
  • If num is a power of 2, then instead of returning num, this implementation will return the next higher power of 2

24 comments:

Anonymous said...

BTW, are you aware of bithacks?

http://graphics.stanford.edu/~seander/bithacks.html

It contains a few more recipes for doing log2 and rounding to powers of 2.

Anonymous said...

Isn't the naive way to use logarithms e.g.
p = log(n)/log(2)
and then raise 2 that power after rounding off p?

Jeffrey Stedfast said...

anonymous1: thanks for the link!

anonymous2: I just went with the GLib g_nearest_pow() approach for the naive implementation :)

Might be interesting to compare the log() approach to the 2 implementations I talked about on my blog... not sure if I really feel that motivated to try it tho ;-)

Anonymous said...

For the very last timing: Are you sure that the optimizer didn't make the whole calculation go away?

Anonymous said...

The call to nearest_pow() is getting optimized out in the second version. Easy fix is to accumulate the results in a second variable and then print it out at the end of the test. For example:

uint32_t i, result = 0;

for (i = 0; i < INT_MAX / 10; i++)
result += nearest_pow3 (i);
printf("%u\n", result);

Anonymous said...

Oh, and I ran it locally with the fixed version. It's still faster but only about 5 times faster, which for a rarely used function probably doesn't matter all that much.

Anonymous said...

Some ARM processors have a 'clz' instruction to count the leading zeros in a value in a register. Is there an equivalent instruction on X86? Multimedia extension instructions?

Jeffrey Stedfast said...

anonymous: good catch wrt compiler optimizing out the call to nearest_pow()! I forgot to check the assembler output of gcc to verify that -O2 didn't do that.

I've updated the article with your suggestion.

Thanks again.

Anonymous said...

You should look at the Linux implementation of ilog2. It contains a GCC-specific optimization that makes it compute the result at compile-time if the compiler knows the input at compile-time.

Anonymous said...

Why not look it up and give credit?

http://graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious

Anonymous said...

I wonder if using "cmovz $-1, %0" would be at all faster than doing the branch.

Anonymous said...

> Why not look it up and give credit?
>
> http://graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious

Probably because looping a bit-shift until it's >= the input value is such an obvious solution to the problem that any first-year CS student could come up with it?

None of the solutions presented on this blog are rocket science, I've seen them all used since back in the early-to-mid 90's and the stanford.edu site was only published in 2001.

Anonymous said...

n/m, 1997-2005, I misunderstood when he dated one of the entries as 2001 to mean he published it 2001.

Even so, the bit-shift loop has probably been used since the days of the PDP-11. I wouldn't be surprised if the bitwise implementation wasn't used since around the same time.

Anonymous said...

You can write more portable code that is almost as efficient as the assembly with __builtin_clzl.

Anonymous said...

I understand the allure of this exercise, but where is this ever done often enough to be performance-critical?

Søren Hauberg said...

I really think such code is a _bad_ thing. Your fast version is practically unreadable. Sure, it might be just a little bit faster, but is the speed gain really worth it when you introduce such hard to read code? IMHO one of the most important properties of good code is that other people can read and understand it.

Anonymous said...

The solution from update 3 is probably not the fastest generally speaking. It is slightly slower than the shift version on my Core2.

Do not forget that || is in fact a disguised IF statement so this algorithm implies several conditional jumps.

My guess is that even if that algorithm can be the fastest when implemented as a standalone function, it is likely to be the slowest in real life situations where the function body is going to be inlined.

Unless the compiler is really clever, there will be far more optimization opportunities when inlining a sequence of shifts than for a sequence of conditional branches.

Jeffrey Stedfast said...

I fully understand that there are branches in Stephane's implementation, which is why I was surprised it was faster.

FWIW, gcc -O2 does inline the nearest_pow() function for all the implementations tried on this page.

That said, I would tend to agree that it's probably not the most ideal implementation. I mostly shared it because I thought it was a cool approach to the problem :)

Soren: I think that all of the C implementations on this page are quite readable. If they came with a comment about what they were supposed to do and/or a comment explaining how they worked, then there's no reason not to use these code snippets.

Different things are obvious to some people and non-obvious to others. That's what comments are for :)

NotZed said...

Oi Jeff, Hmm, I've seen something like this somewhere before ... ;-)

http://blogs.gnome.org/zucchi/2008/03/12/unexpected-bits-of-bits/


and

http://blogs.gnome.org/zucchi/2008/03/18/spewing-bits/

Although your version looks better I think. I wonder how it'd go on Cell though; fewer instructions, but every instruction depends on the previous one, so it might not be any better.

Power of 2 is quite a common operation with some types of code that needs to run as fast as possible, e.g. memory allocators (why I was looking at it), or bitmap tables (databases, vm page tables). It *is* worth the effort to make it run as fast as possible for this type of code.

Ever stopped to consider that no CPU design would waste the silicon on it if it wasn't useful?

Anonymous said...

I know I'm jumping a bit late on the wagon, but I thought I'd share why this works. There are two ways of finding out the greatest power of a number (here, 2) in say another number.
a) divide by 2 continually until it equals 1.
b) sum the quotients of increasing *powers* of 2 till it stops making a difference, i.e.:
x/2 + x/4 + x/8 + x/16 ...

It's clear which is more optimal: the second one (since division by 2 is constant time). These are, respectively, the original and your methods.

A little maths (keep in mind that with bit shifts, we are dealing with powers of 2) will show that:
a) runs in O(log n) : Each op is a bit shift. You have:
n << 1
n << 1
.... x times.
or 2^x = n || x = log n
b) runs in O(sqrt (log (n))) : Each op is an increasing bit shift. You have:
n << 1 | n << 2 | n << 3... (1+2+3....x times) = log n. The series 1+2+3... has a very well-known summation, (x)(x+1)/2.
Let's keep it simple and say it's roughly x^2. That means x^2 = log n, hence x = sqrt (log (n)).

Just wanted to share that. Hope it doesn't go over too many heads.

Anonymous said...

I would just like to follow up my previous comment and say that it is the most utter rubbish I've ever thought of, and I just wanted to jump the gun before somebody flames me for my stupidity.

The method works simply by efficiently filling up all the bits before it, and adding to move it to the next bit column.

In partial defence, I just noticed that method #4 (Stephane Delcroix's) is the most beautiful implementation of a binary tree-ish algorithm I've ever seen. Hats off to Stephane.

Jeffrey Stedfast said...

I would have to agree about Stephane's awesomeness ;-)

David Villarreal said...

This was very useful for me, thanks !


Code Snippet Licensing

All code posted to this blog is licensed under the MIT/X11 license unless otherwise stated in the post itself.