How can you count the number of set bits in a 32-bit int?
(N.B. we consider ints consisting of 32 bits here because that's convenient for most current architectures. Dealing with whatever size ints you happen to have on your architecture is left as an exercise for the reader.)
The obvious bit-scanning solution is reasonably straightforward. If you are having a problem with this you probably need to start with a basic programming course!
Scan through all 32 bits and count how many are set, using a simple binary AND to test each bit in turn. Since x is local we can just shift it so that each bit moves to the end in turn.
int nbits1(unsigned int x)
{
    int i, n = 0;

    for (i = 0; i < 32; i++) {
        n += x & 1;
        x >>= 1;
    }
    return n;
}
Since we drop each bit in turn, the int will become zero when there are no more set bits left. We can stop at that point as there is nothing left to count.
int nbits2(unsigned int x)
{
    int i, n = 0;

    for (i = 0; x && i < 32; i++) {    /* stop as soon as no set bits remain */
        n += x & 1;
        x >>= 1;
    }
    return n;
}
You can use the trick that x = (x - 1) & x clears the least significant set bit. For example, with x = 101100 in binary, x - 1 = 101011 and (x - 1) & x = 101000: the lowest set bit has gone. Not only does this iterate over just the set bits, but by removing the bit-number iteration it makes the code trivially independent of the size of ints and bytes.
int nbits3(unsigned int x)
{
    int n = 0;

    while (x) {
        x &= x - 1;    /* clear the least significant set bit */
        n++;
    }
    return n;
}
The last method is the collapsing partial parallel sums method. Even if you have to look the details up on the net, it's useful to know that a loop-free method involving shift/mask/add exists.
int nbits4(unsigned int x)
{
    x = (x >> 1 & 0x55555555) + (x & 0x55555555);    /* sum adjacent bits into 2-bit fields */
    x = (x >> 2 & 0x33333333) + (x & 0x33333333);    /* sum 2-bit fields into 4-bit fields */
    x = (x >> 4 & 0x0f0f0f0f) + (x & 0x0f0f0f0f);    /* sum nibbles into bytes */
    x = (x >> 8 & 0x00ff00ff) + (x & 0x00ff00ff);    /* sum bytes into 16-bit fields */
    return (x >> 16) + (x & 0x0000ffff);             /* sum the two halves */
}
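To see the partial sums collapsing in action, here is a trace of the intermediate values for one sample input. The value 0xFF and the printing are purely my own illustration; each stage doubles the width of the fields being summed.
#include <stdio.h>

int main(void)
{
    unsigned int x = 0xFFu;    /* eight set bits, all in the low byte */

    x = (x >> 1 & 0x55555555) + (x & 0x55555555);
    printf("after 1-bit sums: 0x%08x\n", x);    /* 0x000000aa: each 2-bit field holds 2 */
    x = (x >> 2 & 0x33333333) + (x & 0x33333333);
    printf("after 2-bit sums: 0x%08x\n", x);    /* 0x00000044: each 4-bit field holds 4 */
    x = (x >> 4 & 0x0f0f0f0f) + (x & 0x0f0f0f0f);
    printf("after 4-bit sums: 0x%08x\n", x);    /* 0x00000008: the low byte holds 8 */
    x = (x >> 8 & 0x00ff00ff) + (x & 0x00ff00ff);
    printf("after 8-bit sums: 0x%08x\n", x);    /* 0x00000008: the low 16-bit field holds 8 */
    printf("total set bits: %u\n", (x >> 16) + (x & 0x0000ffff));
    return 0;
}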
So which is fastest? Our first attempt, nbits1, has fairly straightforward performance. It simply performs the same operations for each of the 32 bits, namely "and", "sum", "shift", "increment" and "compare", giving a total of 5 x 32 = 160 operations. Our last attempt, the collapsing partial parallel sums method in nbits4, has no loop and simply performs a fixed number of operations: 5 "shifts", 9 "ands", 5 "sums" and 4 "assigns", giving a total of 23 operations.
Now, all operations being equal (and on modern CPUs that isn't always too bad an assumption) that would indicate that the collapsing partial parallel sums method should be almost 7 times faster than the naïve implementation.
But it isn't. In fact it's more like 2.5 to 3 times faster. That's quite a difference. Why? Well, in the naïve case there is good separation between modifying a variable and its next use. That's good for modern pipelined, multi-ALU, processors. The collapsing partial parallel sums method is, on closer inspection, not so parallel however. Each line not only depends on the result of the line before, it also has an internal dependency that requires the result of a shift to be masked. This sort of read-after-write dependency prevents the CPU from making best use of its ability to execute instructions in parallel. Thus a significant portion of the expected performance is lost to waits for prior instructions to complete.
While, in this case, the "clever" implementation is still faster than the naïve implementation, the difference between the expected performance and the measured performance is worth bearing in mind. The simple but seemingly wasteful algorithms often perform better than expected, while the more convoluted "hand tuned" algorithms often slow down in unexpected ways. Sometimes the simple ways are actually better!
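If you want to check the figures on your own machine, a rough timing harness along the following lines will do. It assumes the four nbits functions above are compiled in the same file; the iteration count and the clock() based timing are arbitrary choices of mine, and the numbers you get will vary with compiler, options and CPU.
#include <stdio.h>
#include <time.h>

/* Run one bit-counting function over a range of values and report the
 * elapsed CPU time.  Accumulating into 'total' discourages the compiler
 * from optimising the calls away. */
static void time_it(int (*count)(unsigned int), const char *name)
{
    unsigned int i;
    unsigned long total = 0;
    clock_t start, end;

    start = clock();
    for (i = 0; i < 10000000u; i++)
        total += count(i);
    end = clock();

    printf("%s: %lu bits in %.3f s\n",
           name, total, (double)(end - start) / CLOCKS_PER_SEC);
}

int main(void)
{
    time_it(nbits1, "nbits1");
    time_it(nbits2, "nbits2");
    time_it(nbits3, "nbits3");
    time_it(nbits4, "nbits4");
    return 0;
}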
So that's nbits1 and nbits4. What of nbits2 and nbits3? After all, these are optimizations of the simple implementation so they could well challenge the unexpectedly slow nbits4, couldn't they?
nbits2 has an extra loop condition that depends on the result of an operation within the loop, namely the shifting off of the current bit. Now generally that dependency is bad, especially in a tight loop where there isn't any chance of the processor having anything else to do but wait. In this case, however, the extra test allows us to stop processing when there are no more set bits. So if the high bits are a run of zeros we're going to save a good few instructions by not scanning them. Whether that will give a net gain or loss in speed will depend on two things: the architecture and implementation of your target processor, and how much of your expected data set has runs of zeros in the high bits.
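To get a feel for how much the early exit can save, here is a small instrumented variant of nbits2 (my own illustration, not part of the original) that reports the number of loop iterations. The same four set bits cost four iterations when they sit at the bottom of the word, but a full 32 when one of them sits at the top.
#include <stdio.h>

/* As nbits2, but also reports how many loop iterations were needed. */
static int nbits2_traced(unsigned int x, int *iterations)
{
    int i, n = 0;

    for (i = 0; x && i < 32; i++) {
        n += x & 1;
        x >>= 1;
    }
    *iterations = i;
    return n;
}

int main(void)
{
    int iters;

    nbits2_traced(0x0000000Fu, &iters);
    printf("0x0000000F: %d iterations\n", iters);    /* 4: all set bits are low */
    nbits2_traced(0xF0000000u, &iters);
    printf("0xF0000000: %d iterations\n", iters);    /* 32: a high set bit forces a full scan */
    return 0;
}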
nbits3 appears, at first glance, to offer a significant improvement in that it scans only the set bits and does so without the need for a loop index to iterate over the bits. However, its simplicity is deceptive. Just about every step is dependent on the result of the immediately prior step. You can't mask x until you have the result of x - 1, and you can't test the loop condition until you have the result of the mask. For processors of yesteryear that is not an issue, but any modern pipelined, out-of-order processor is going to be spending most of its time waiting for results and its throughput will be way below what it is capable of. Like nbits2, nbits3 is going to depend on the data you give it for its speed. This time it is the total number of set bits that affects the speed rather than the position of the last set bit - but the actual processing throughput is likely to be lower due to the dependencies.
The performance of even a seemingly simple piece of code is not necessarily as obvious as you might think. The performance of modern processors is highly dependent on their ability to execute sequences of instructions pipelined, in parallel and out-of-order.
The loop-free nbits4 may be the "cleverest" solution but on modern processors it fails to deliver the performance that might have been expected. The solution with the least code, nbits3, also suffers for the same reasons. In fact, if your data represents a bit map that is filled from one end and tends to have runs of zero bits at the other, a naïve solution with an early termination check such as nbits2 may actually be the best solution!