The typical implementation for finding the nearest power of 2 for a given value is as follows:

```c
static uint32_t
nearest_pow (uint32_t num)
{
    uint32_t n = 1;

    while (n < num)
        n <<= 1;

    return n;
}
```

This implementation's performance, unfortunately, suffers as the value of *num* increases. Luckily, there is another approach that takes constant time no matter how large the value:

```c
static uint32_t
nearest_pow (uint32_t num)
{
    uint32_t n = num > 0 ? num - 1 : 0;

    n |= n >> 1;
    n |= n >> 2;
    n |= n >> 4;
    n |= n >> 8;
    n |= n >> 16;
    n++;

    return n;
}
```

A simple performance test might be:

```c
int main (int argc, char **argv)
{
    uint32_t i, n = 0;

    for (i = 0; i < INT_MAX / 10; i++)
        n += nearest_pow (i);

    return n > 0 ? 1 : 0;
}
```

The run-time difference between the two implementations on my AMD Athlon (/proc/cpuinfo reports AMD Athlon(TM) XP 3200+ @ 2200.141 MHz) is impressive. For performance testing, I compiled with *gcc -O2* which I figure is the typical default for most packaged software on Linux distributions.

The brain-dead approach has the following results:

```
[fejj@serenity cvs]$ time ./pow

real    0m12.034s
user    0m11.809s
sys     0m0.032s
```

The bitwise implementation is insanely fast:

```
[fejj@serenity cvs]$ time ./pow2

real    0m1.361s
user    0m1.304s
sys     0m0.008s
```

Now... to be fair, if you are using small values for *num*, then it's possible that the brain-dead approach might be faster. Let's try the same main() for-loop again, but this time passing nearest_pow() a value of *1* each time. Since the results will likely be far too fast to really compare, let's also bump the number of iterations up to *UINT_MAX*.

```
[fejj@serenity cvs]$ time ./pow

real    0m0.003s
user    0m0.000s
sys     0m0.004s

[fejj@serenity cvs]$ time ./pow2

real    0m0.002s
user    0m0.000s
sys     0m0.000s
```

Unfortunately, both are still far too fast to really compare performance. Let's try bumping up the value of *num* to see if we can find the point at which the while-loop approach starts to fall behind the bitwise approach. To start, let's try passing the value of *2* as the *num* argument:

```
[fejj@serenity cvs]$ time ./pow

real    0m0.002s
user    0m0.000s
sys     0m0.004s

[fejj@serenity cvs]$ time ./pow2

real    0m0.002s
user    0m0.000s
sys     0m0.000s
```

It *looks* like the bitwise approach may be faster than the while-loop approach for the value of 2, but it's a bit hard to tell for sure with only *UINT_MAX* loops. We'd have to switch to using a 64-bit *i* to know for sure, and I'm not sure it's that important. Let's try *3* and see what we get:

```
[fejj@serenity cvs]$ time ./pow

real    0m6.053s
user    0m5.968s
sys     0m0.004s

[fejj@serenity cvs]$ time ./pow2

real    0m0.003s
user    0m0.000s
sys     0m0.004s
```

Well, hot diggity... I think we have ourselves a winner. This suggests that for all values of *num* larger than 2, the performance of the while-loop approach will be outmatched by the bitwise approach and that for values less-than-or-equal to 2, the performance is nearly identical.

**Update:** Thanks to the anonymous commenter who noticed that my original *main()* program was allowing the compiler to optimize out the call to nearest_pow() in the bitwise implementation. As suggested, I updated the for-loop to accumulate the output and then used it after the loop to avoid this problem. It only seemed to change the results for the bitwise implementation in the first test, however (before the change, it reported 0.002s). Still, on my machine it is approx. 10x faster for the first test case and seems to lose no performance even in the optimal conditions for the while-loop implementation.

**Update2:** I was just pointed to the Linux kernel's *fls()* implementation for x86. Here is a new implementation using inline assembler for x86:

```c
static uint32_t
nearest_pow (uint32_t num)
{
    int bit;

    __asm__ ("bsrl %1,%0\n\t"
             "jnz 1f\n\t"
             "movl $-1,%0\n"
             "1:"
             : "=r" (bit) : "rm" (num));

    return (1 << (bit + 1));
}
```

Running the original *INT_MAX / 10* iterations using *i* as the *num* argument yields the following:

```
[fejj@serenity cvs]$ time ./pow3

real    0m1.335s
user    0m1.296s
sys     0m0.004s
```

The result is only negligibly faster than the C bitwise implementation, and obviously less portable :(

**Update3:** A friend of mine, Stephane Delcroix, has just pointed me at a solution to this problem that he came up with the other day:

```c
static uint32_t
nearest_pow (uint32_t num)
{
    uint32_t j, k;

    (j = num & 0xFFFF0000) || (j = num);
    (k = j & 0xFF00FF00) || (k = j);
    (j = k & 0xF0F0F0F0) || (j = k);
    (k = j & 0xCCCCCCCC) || (k = j);
    (j = k & 0xAAAAAAAA) || (j = k);

    return j << 1;
}
```

The results of this implementation are as follows:

```
[fejj@serenity cvs]$ time ./pow4

real    0m1.249s
user    0m1.204s
sys     0m0.004s
```

This is actually *faster* than both the bitwise and the assembler implementations above!

There are two things to be aware of, though:

- When *num* is 0, the value of 0 is returned (which may not be desirable depending on what you are doing with it)
- If *num* is a power of 2, then instead of returning *num*, this implementation will return the next higher power of 2