## Saturday, June 14, 2008

### Calculating the Nearest Power of 2

The typical implementation for finding the nearest power of 2 for a given value is as follows:

```static uint32_t
nearest_pow (uint32_t num)
{
uint32_t n = 1;

while (n < num)
n <<= 1;

return n;
}
```

This implementation's performance, unfortunately, suffers as the value of num increases. Luckily there is another approach that takes a constant time no matter how large the value:

```static uint32_t
nearest_pow (uint32_t num)
{
uint32_t n = num > 0 ? num - 1 : 0;

n |= n >> 1;
n |= n >> 2;
n |= n >> 4;
n |= n >> 8;
n |= n >> 16;
n++;

return n;
}
```

A simple performance test might be:

```int main (int argc, char **argv)
{
uint32_t i, n = 0;

for (i = 0; i < INT_MAX / 10; i++)
n += nearest_pow (i);

return n > 0 ? 1 : 0;
}
```

The run-time difference between the two implementations on my AMD Athlon (/proc/cpuinfo reports AMD Athlon(TM) XP 3200+ @ 2200.141 MHz) is impressive. For performance testing, I compiled with gcc -O2 which I figure is the typical default for most packaged software on Linux distributions.

The brain-dead approach has the following results:

```[fejj@serenity cvs]\$ time ./pow

real 0m12.034s
user 0m11.809s
sys 0m0.032s
```

The bitwise implementation is insanely fast:

```[fejj@serenity cvs]\$ time ./pow2

real 0m1.361s
user 0m1.304s
sys 0m0.008s
```

Now... to be fair, the if you are using small values for num, then it's possible that the brain-dead approach might be faster. Let's try the same main() for-loop again, but this time let's request nearest_pow() with a value of 1 each time. Since it is likely that the results will be far too fast to really compare, let's also bump up the number of iterations to UINT_MAX.

```[fejj@serenity cvs]\$ time ./pow

real 0m0.003s
user 0m0.000s
sys 0m0.004s
[fejj@serenity cvs]\$ time ./pow2

real 0m0.002s
user 0m0.000s
sys 0m0.000s
```

Unfortunately, both are still far too fast to really compare performance. Let's try bumping up the value of num to see if we can find the point at which the while-loop approach starts to fall behind the bitwise approach. To start, let's try passing the value of 2 as the num argument:

```[fejj@serenity cvs]\$ time ./pow

real 0m0.002s
user 0m0.000s
sys 0m0.004s
[fejj@serenity cvs]\$ time ./pow2

real 0m0.002s
user 0m0.000s
sys 0m0.000s
```

It looks like the bitwise approach may be faster than the while-loop approach for the value of 2, but it's a bit hard to tell for sure with only UINT_MAX loops. We'd have to switch to using a 64bit i to know for sure and I'm not sure it's that important. Let's try 3 and see what we get:

```[fejj@serenity cvs]\$ time ./pow

real 0m6.053s
user 0m5.968s
sys 0m0.004s
[fejj@serenity cvs]\$ time ./pow2

real 0m0.003s
user 0m0.000s
sys 0m0.004s
```

Well, hot diggity... I think we have ourselves a winner. This suggests that for all values of num larger than 2, the performance of the while-loop approach will be outmatched by the bitwise approach and that for values less-than-or-equal to 2, the performance is nearly identical.

Update: Thanks to the anonymous commenter that noticed that my original main() program was allowing the compiler to optimize out the call to nearest_pow() in the bitwise implementation. As suggested, I updated the for-loop to accumulate the output and then used it after the loop to avoid this problem. It only seemed to change the results for the bitwise implementation in the first test, however (before the change, it reported 0.002s). Still, on my machine it is approx. 10x faster for the first test case and seems to lose no performance even in the optimal conditions for the while-loop implementation.

Update2: I was just pointed to the Linux kernel's fls() implementation for x86. Here is a new implementation using inline assembler for x86:

```static uint32_t
nearest_pow (uint32_t num)
{
int bit;

__asm__("bsrl %1,%0\n\t"
"jnz 1f\n\t"
"movl \$-1,%0\n"
"1:" : "=r" (bit) : "rm" (num));

return (1 << (bit + 1));
}
```

The results for the original INT_MAX / 10 iterations using i as the num argument yields the following results:

```[fejj@serenity cvs]\$ time ./pow3

real 0m1.335s
user 0m1.296s
sys 0m0.004s
```

The results seem negligibly faster than the C bitwise implementation and obviously less portable :(

Update3: A friend of mine, Stephane Delcroix, has just pointed me at a solution to this problem that he came up the other day:

```static uint32_t
nearest_pow (uint32_t num)
{
uint32_t j, k;
(j = num & 0xFFFF0000) || (j = num);
(k = j & 0xFF00FF00) || (k = j);
(j = k & 0xF0F0F0F0) || (j = k);
(k = j & 0xCCCCCCCC) || (k = j);
(j = k & 0xAAAAAAAA) || (j = k);
return j << 1;
}
```

The results of this implementation are as follows:

```[fejj@serenity cvs]\$ time ./pow4

real 0m1.249s
user 0m1.204s
sys 0m0.004s
```

This is actually faster than both the bitwise and the assembler implementations above!

There are two things to be aware of, though:

• When num is 0, the value of 0 is returned (which may not be desirable depending on what you are doing with it)
• If num is a power of 2, then instead of returning num, this implementation will return the next higher power of 2