If you're targeting x86
most compilers will have an instrinsic for the pdep
(parallel bit deposit) instruction which directly performs the operation you want, in hardware, at a rate of 1 per cycle (3 cycles latency)1, on Intel hardware that supports it. For example, gcc offers it as the _pdep_u32
and _pdep_u64
intrinsic functions.
Unfortunately, on AMD Ryzen (the only AMD hardware that supports BMI2) this operation is very slow: one per 18 cycles. You might want to have a separate code-path to support non-Intel platforms if they are important to you.
If you aren't on x86
, you can find general purpose implementations of these options here - the specific operation you want is expand_right
- and this other section will probably be of great interest in that it specifically covers the simple case where you are dealing with word-sized elements.
In practice, if you are really dealing with 8-bit data and mask values, you might just use a precomputed lookup table - either a big 8 bit x 8 bit = 65k one which covers all {data, mask}
combinations and which gives you the answer directly, or a 256-entry one which covers all mask
values and gives you some coefficients for a simple bit-shifting calculation or a multiplication-based code.
FWIW, I'm not sure how you can do it easily with 5 rotate instructions, because it seems that the naive solution needs 1 rotate instruction for each bit, whether set or not (so for a word size of 8 bits, 7 or 8 rotate2 instructions).
1 Of course, the performance in principle depends on the hardware, but on all the mainstream Intel CPUs that implement it, it's 1 cycle throughput, 3 cycles latency (not sure about AMD).
2 Only 7 rotates because the "rotate of 0" operation for the lowest order bit can evidently be omitted.
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…