From Hacker's Delight, a nice branchless solution:
uint32_t flp2 (uint32_t x)
{
x = x | (x >> 1);
x = x | (x >> 2);
x = x | (x >> 4);
x = x | (x >> 8);
x = x | (x >> 16);
return x - (x >> 1);
}
This typically takes 12 instructions. You can do it in fewer if your CPU has a "count leading zeroes" instruction.
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…