This C++ gets g++ to emit very good x86 ASM (godbolt compiler explorer). I expect it will compile efficiently on other 64bit architectures, too (if there's a HW popcount for std::bitset::count
to use, otherwise that'll always be the slow part; e.g. sure to use g++ -march=nehalem
or higher, or -mpopcnt
if you don't want to enable anything else, if you can limit your code to only running on CPUs that support that x86 instruction):
#include <bitset>
int popcount_subset(std::bitset<64> A, int pos) {
int high_bits_to_eliminate = 63 - pos;
A <<= (high_bits_to_eliminate & 63); // puts A[pos] at A[63].
return (A[63]? ~0ULL : 0) & A.count(); // most efficient way: great code with gcc and clang
// see the godbolt link for some #ifdefs with other ways to do the check, like
// return A[BSET_SIZE-1] ? A.count() : 0;
}
This probably isn't optimal on 32bit architectures, so compare other alternatives if you need to make a 32bit build.
This will work for other sizes of bitset, as long as you do something about the hard-coded 63
s, and change the & 63
mask for the shift count into a more general range-check. For optimal performance with strange size bitsets, make a template function with a specialization for size <= register width
of the target machine. In that case, extract the bitset to an unsigned
type of the appropriate width, and shift to the top of the register instead of the top of the bitset.
You'd expect this to also generate ideal code for bitset<32>
, but it doesn't quite. gcc/clang still use 64bit registers on x86-64.
For large bitsets, shifting the whole thing will be slower than just popcounting the words below the one containing pos
, and using this on that word. (This is where a vectorized popcount really shines on x86 if you can assume SSSE3 but not the popcnt
insn hardware support, or for 32bit targets. AVX2 256bit pshufb
is the fastest way to do bulk popcounts, but without AVX2 I think 64bit popcnt
is pretty close to a 128-bit pshufb
implementation. See the comments for more discussion.)
If you have an array of 64-bit elements, and want to count bits below a certain position in each one separately, then you definitely should use SIMD. The shift parts of this algorithm vectorize, not just the popcnt part. Use psadbw
against an all-zero register to horizontal-sum bytes in 64-bit chunks after a pshufb
-based popcnt that produces counts for the bits in each byte separately. SSE/AVX doesn't have 64-bit arithmetic right shift, but you can use a different technique to blend on the high bit of each element.
How I came up with this:
The asm instructions you want to get the compiler to output will:
- remove the unwanted bits from the 64bit value
- test the highest of the wanted bits.
- popcount it.
- return 0 or popcount, depending on the result of the test. (Branchless or branching implementations both have advantages. If the branch is predictable, a branchless implementation tends to be slower.)
The obvious way to do 1 is to generate a mask ((1<<(pos+1)) -1
) and &
it. A more efficient way is to left-shift by 63-pos
, leaving the bits you want packed at the top of a register.
This also has the interesting side-effect of putting the bit you want to test as the top bit in the register. Testing the sign bit, rather than any other arbitrary bit, takes slightly fewer instructions. An arithmetic right shift can broadcast the sign bit to the rest of the register, allowing for more-efficient-than-usual branchless code.
Doing the popcount is a much-discussed problem, but is actually the trickier part of the puzzle. On x86, there is extremely efficient hardware support for it, but only on recent-enough hardware. On Intel CPUs, the popcnt
instruction is only available on Nehalem and newer. I forget when AMD added support.
So to use it safely, you need to either do CPU dispatching with a fallback that doesn't use popcnt
. Or, make separate binaries that do/don't depend on some CPU features.
popcount without the popcnt
instruction can be done a few ways. One uses SSSE3 pshufb
to implement a 4-bit LUT. This is most effective when used on a whole array, rather than a single 64b at a time, though. Scalar bithacks might be best here, and wouldn't require SSSE3 (and so would be compatible with ancient AMD CPUs that have 64bit but not pshufb.)
The Bitbroadcast:
(A[63]? ~0ULL : 0)
asks the compiler to broadcast the high bit to all other bit positions, allowing it to be used as an AND-mask to zero (or not) the popcount result. Note that even for large bitset sizes, it's still only masking the output of popcnt
, not the bitset itself, so ~0ULL
is fine I used ULL to make sure wasn't ever asking the compiler to broadcast the bit only to the low 32b of a register (with UL
on Windows, for example).
This broadcast can be done with an arithmetic right shift by 63, which shifts in copies of the high bit.
clang generated this code from the original version. After some prodding from Glenn about different implementations for 4, I realized that I could lead gcc towards clang's optimal solution by writing the source more like the ASM I want. The obvious ((int64_t)something) >> 63
to more directly request an arithmetic right shift would not be strictly portable, because signed right-shifts are implementation-defined as either arithmetic or logical. The standard doesn't provide any portable arithmetic right-shift operator. (It's not undefined behaviour, though.) Anyway, fortunately compilers are smart enough: gcc sees the best way once you give it enough of a hint.
This source makes great code on x86-64 and ARM64 with gcc and clang. Both simply use an arithmetic right shift on the input to popcnt (so the shift can run in parallel with the popcnt). It also compiles great on 32bit x86 with gcc, because the masking only happens to a 32bit variable (after multiple popcnt results are added). It's the rest of the function that's nasty on 32bit (when the bitset is larger than a register).
Original ternary-operator version with gcc
Compiled with gcc 5.3.0 -O3 -march=nehalem -mtune=haswell
(older gcc, like 4.9.2, also still emit this):
; the original ternary-operator version. See below for the optimal version we can coax gcc into emitting.
popcount_subset(std::bitset<64ul>, int):
; input bitset in rdi, input count in esi (SysV ABI)
mov ecx, esi ; x86 variable-count shift requires the count in cl
xor edx, edx ; edx=0
xor eax, eax ; gcc's workaround for popcnt's false dependency on the old value of dest, on Intel
not ecx ; two's complement bithack for 63-pos (in the low bits of the register)
sal rdi, cl ; rdi << ((63-pos) & 63); same insn as shl (arithmetic == logical left shift)
popcnt rdx, rdi
test rdi, rdi ; sets SF if the high bit is set.
cmovs rax, rdx ; conditional-move on the sign flag
ret
See How to prove that the C statement -x, ~x+1, and ~(x-1) yield the same results? for background on gcc's use of the -x == ~x + 1
two's complement identity. (And Which 2's complement integer operations can be used without zeroing high bits in the inputs, if only the low part of the result is wanted? which tangentially mentions that shl
masks the shift count, so we only need the low 6 bits of ecx
to hold 63 - pos
. Mostly linking that because I wrote it recently and anyone still reading this paragraph might find it interesting.)
Some of those instructions will go away when inlining. (e.g. gcc would generate the count in ecx in the first place.)
With Glenn's multiply instead of ternary operator idea (enabled by USE_mul
), gcc does
shr rdi, 63
imul eax, edi
at the end instead of xor
/ test
/ cmovs
.
mov r,r
: 1 fused-domain uop, 0 latency, no execution unit
xor
-zeroing: 1 fused-domain uop, no execution unit
not
: 1 uop for p0/p1/p5/p6, 1c latency, 1 per 0.25c throughput
shl
(aka sal
) with count in cl
: 3 uops for p0/p6: 2c latency, 1 per 2c throughput. (Agner Fog's data indicates that IvyBridge only takes 2 uops for this, strangely.)
popcnt
: 1 uop for p1, 3c latency, 1 per 1c throughput
shr r,imm
: 1 uop for p0/p6, 1c latency. 1 per 0.5c throughput.
imul r,r
: 1uop for p1, 3c latency.
- not counting the
ret
Totals:
- 9 fused-domain uops, can issue in 2.25 cycles (in theory; uop cache-line effects usually bottleneck the frontend slightly).
- 4 uops (shifts) for p0/p6. 2 uops for p1. 1 any-ALU-port uop. Can execute at one per 2c (saturating the shift ports