How can you count the number of set bits in a 32-bit int?
(N.B. we consider ints consisting of 32 bits here because that's convenient for most current architectures. Dealing with whatever size ints you happen to have on your architecture is left as an exercise for the reader.)
The obvious bit-scanning solution is reasonably straightforward. If you are having a problem with this you probably need to start with a basic programming course!
Scan through all 32 bits and count how many are set, using a simple binary AND to test each bit in turn. Since x is local we can just shift it so that each bit moves to the end in turn.
int nbits1(unsigned int x)
{
    int i, n = 0;

    for (i = 0; i < 32; i++) {
        n += x & 1;
        x >>= 1;
    }
    return n;
}
Since we drop each bit in turn, the int will become zero when there are no more set bits left. We can stop at that point as there is nothing left to count.
int nbits2(unsigned int x)
{
    int i, n = 0;

    for (i = 0; x && i < 32; i++) {    /* stop as soon as no set bits remain */
        n += x & 1;
        x >>= 1;
    }
    return n;
}
You can use the trick that x = (x - 1) & x clears the least significant set bit. For example, with x = 101100 in binary, x - 1 = 101011 and (x - 1) & x = 101000: the lowest set bit has gone. Not only does this iterate over just the set bits, but by removing the bit-number iteration it makes the code trivially independent of the size of ints and bytes.
int nbits3(unsigned int x)
{
    int n = 0;

    while (x) {
        x &= x - 1;    /* clear the least significant set bit */
        n++;
    }
    return n;
}
The last method is the collapsing partial parallel sums method. Even if you have to look the details up on the net, it's useful to know that a loop-free method involving shift/mask/add exists.
int nbits4(unsigned int x)
{
    x = (x >> 1 & 0x55555555) + (x & 0x55555555);    /* sum adjacent bits into 2-bit fields */
    x = (x >> 2 & 0x33333333) + (x & 0x33333333);    /* sum 2-bit fields into 4-bit fields */
    x = (x >> 4 & 0x0f0f0f0f) + (x & 0x0f0f0f0f);    /* sum nibbles into bytes */
    x = (x >> 8 & 0x00ff00ff) + (x & 0x00ff00ff);    /* sum bytes into 16-bit fields */
    return (x >> 16) + (x & 0x0000ffff);             /* sum the two halves */
}
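To see the partial sums collapsing in action, here is a trace of the intermediate values for one sample input. The value 0xFF and the printing are purely my own illustration; each stage doubles the width of the fields being summed.
#include <stdio.h>

int main(void)
{
    unsigned int x = 0xFFu;    /* eight set bits, all in the low byte */

    x = (x >> 1 & 0x55555555) + (x & 0x55555555);
    printf("after 1-bit sums: 0x%08x\n", x);    /* 0x000000aa: each 2-bit field holds 2 */
    x = (x >> 2 & 0x33333333) + (x & 0x33333333);
    printf("after 2-bit sums: 0x%08x\n", x);    /* 0x00000044: each 4-bit field holds 4 */
    x = (x >> 4 & 0x0f0f0f0f) + (x & 0x0f0f0f0f);
    printf("after 4-bit sums: 0x%08x\n", x);    /* 0x00000008: the low byte holds 8 */
    x = (x >> 8 & 0x00ff00ff) + (x & 0x00ff00ff);
    printf("after 8-bit sums: 0x%08x\n", x);    /* 0x00000008: the low 16-bit field holds 8 */
    printf("total set bits: %u\n", (x >> 16) + (x & 0x0000ffff));
    return 0;
}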
So which is fastest? Our first attempt, nbits1, has fairly straightforward performance. It simply performs the same operations for each of the 32 bits, namely "and", "sum", "shift", "increment" and "compare", giving a total of 5 x 32 = 160 operations. Our last attempt, the collapsing partial parallel sums method in nbits4, has no loop and simply performs a fixed number of operations: 5 "shifts", 9 "ands", 5 "sums" and 4 "assigns", giving a total of 23 operations.
Now, all operations being equal (and on modern CPUs that isn't always too bad an assumption) that would indicate that the collapsing partial parallel sums method should be almost 7 times faster than the naïve implementation.
But it isn't. In fact it's more like 2.5 to 3 times faster. That's quite a difference. Why? Well, in the naïve case there is good separation between modifying a variable and its next use. That's good for modern pipelined, multi-ALU, processors. The collapsing partial parallel sums method is, on closer inspection, not so parallel however. Each line not only depends on the result of the line before, it also has an internal dependency that requires the result of a shift to be masked. This sort of read-after-write dependency prevents the CPU from making best use of its ability to execute instructions in parallel. Thus a significant portion of the expected performance is lost to waits for prior instructions to complete.
While, in this case, the "clever" implementation is still faster than the naïve implementation, the difference between the expected performance and the measured performance is worth bearing in mind. The simple but seemingly wasteful algorithms often perform better than expected, while the more convoluted "hand tuned" algorithms often slow down in unexpected ways. Sometimes the simple ways are actually better!
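If you want to check the figures on your own machine, a rough timing harness along the following lines will do. It assumes the four nbits functions above are compiled in the same file; the iteration count and the clock() based timing are arbitrary choices of mine, and the numbers you get will vary with compiler, options and CPU.
#include <stdio.h>
#include <time.h>

/* Run one bit-counting function over a range of values and report the
 * elapsed CPU time.  Accumulating into 'total' discourages the compiler
 * from optimising the calls away. */
static void time_it(int (*count)(unsigned int), const char *name)
{
    unsigned int i;
    unsigned long total = 0;
    clock_t start, end;

    start = clock();
    for (i = 0; i < 10000000u; i++)
        total += count(i);
    end = clock();

    printf("%s: %lu bits in %.3f s\n",
           name, total, (double)(end - start) / CLOCKS_PER_SEC);
}

int main(void)
{
    time_it(nbits1, "nbits1");
    time_it(nbits2, "nbits2");
    time_it(nbits3, "nbits3");
    time_it(nbits4, "nbits4");
    return 0;
}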
So that's nbits1 and nbits4. What of nbits2 and nbits3? After all, these are optimizations of the simple implementation so they could well challenge the unexpectedly slow nbits4, couldn't they?
nbits2 has an extra loop condition that depends on the result of an operation within the loop, namely the shifting off of the current bit. Now generally that dependency is bad, especially in a tight loop where there isn't any chance of the processor having anything else to do but wait. In this case, however, the extra test allows us to stop processing when there are no more set bits. So if the high bits are a run of zeros we're going to save a good few instructions by not scanning them. Whether that will give a net gain or loss in speed will depend on two things: the architecture and implementation of your target processor, and how much of your expected data set has runs of zeros in the high bits.
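To get a feel for how much the early exit can save, here is a small instrumented variant of nbits2 (my own illustration, not part of the original) that reports the number of loop iterations. The same four set bits cost four iterations when they sit at the bottom of the word, but a full 32 when one of them sits at the top.
#include <stdio.h>

/* As nbits2, but also reports how many loop iterations were needed. */
static int nbits2_traced(unsigned int x, int *iterations)
{
    int i, n = 0;

    for (i = 0; x && i < 32; i++) {
        n += x & 1;
        x >>= 1;
    }
    *iterations = i;
    return n;
}

int main(void)
{
    int iters;

    nbits2_traced(0x0000000Fu, &iters);
    printf("0x0000000F: %d iterations\n", iters);    /* 4: all set bits are low */
    nbits2_traced(0xF0000000u, &iters);
    printf("0xF0000000: %d iterations\n", iters);    /* 32: a high set bit forces a full scan */
    return 0;
}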
nbits3 appears, at first glance, to offer a significant improvement in that it scans only the set bits and does so without the need for a loop index to iterate over the bits. However, its simplicity is deceptive. Just about every step is dependent on the result of the immediately prior step. You can't mask x until you have the result of x - 1, and you can't test the loop condition until you have the result of the mask. For processors of yesteryear that is not an issue, but any modern pipelined, out-of-order processor is going to be spending most of its time waiting for results and its throughput will be way below what it is capable of. Like nbits2, nbits3 is going to depend on the data you give it for its speed. This time it is the total number of set bits that affects the speed rather than the position of the last set bit - but the actual processing throughput is likely to be lower due to the dependencies.
The performance of even a seemingly simple piece of code is not necessarily as obvious as you might think. The performance of modern processors is highly dependent on their ability to execute sequences of instructions pipelined, in parallel and out-of-order.
The loop-free nbits4 may be the "cleverest" solution but on modern processors it fails to deliver the performance that might have been expected. The solution with the least code, nbits3, also suffers for the same reasons. In fact, if your data represents a bit map that is filled from one end and tends to have runs of zero bits at the other, a naïve solution with an early termination check such as nbits2 may actually be the best solution!