Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
355 views
in Technique[技术] by (71.8m points)

c - GCC compiles leading zero count poorly unless Haswell specified

GCC supports the __builtin_clz(int x) builtin, which counts the number of number of leading zeros (consecutive most-significant zeros) in the argument.

Among other things0, this is great for efficiently implementing the lg(unsigned int x) function, which takes the base-2 logarithm of x, rounding down1:

/** return the base-2 log of x, where x > 0 */
unsigned lg(unsigned x) {
  return 31U - (unsigned)__builtin_clz(x);
}

This works in the straightforward way - in particular consider the case x == 1 and clz(x) == 31 - then x == 2^0 so lg(x) == 0 and 31 - 31 == 0 and we get the correct result. Higher values of x work similarly.

Assuming the builtin is efficiently implemented, this ends much better than the alternate pure C solutions.

Now as it happens, the count leading zeros operation is essentially the dual of the bsr instruction in x86. That returns the index of the most-significant 1-bit2 in the argument. So if there are 10 leading zeros, the first 1-bit is in bit 21 of the argument. In general we have 31 - clz(x) == bsr(x) and in so bsr in fact directly implements our desired lg() function, without the superfluous 31U - ... part.

In fact, you can read between the line and see that the __builtin_clz function was implemented with bsr in mind: it is defined as undefined behavior if the argument is zero, when of course the "leading zeros" operation is perfectly well-defined as 32 (or whatever the bit-size of int is) with a zero argument. So __builtin_clz was certainly implemented with the idea of being efficiently mapped to a bsr instruction on x86.

However, looking at what GCC actually does, at -O3 with otherwise default options: it adds a ton of extra junk:

lg(unsigned int):
        bsr     edi, edi
        mov     eax, 31
        xor     edi, 31
        sub     eax, edi
        ret 

The xor edi,31 line is effectively a not edi for the bottom 4 bits that actually matter, that's off-by-one3 from neg edi which turns the result of bsr into clz. Then the 31 - clz(x) is carried out.

However with -mtune=haswell, the code simplifies into exactly the expected single bsr instruction:

lg(unsigned int):
        bsr     eax, edi
        ret

Why that is the case is very unclear to me. The bsr instruction has been around for a couple decades before Haswell, and the behavior is, AFAIK, unchanged. It's not just an issue of tuning for a particular arch, since bsr + a bunch of extra instructions isn't going to be faster than a plain bsr and furthermore using -mtune=haswell still results in the slower code.

The situation for 64-bit inputs and outputs is even slightly worse: there is an extra movsx in the critical path which seems to do nothing since the result from clz will never be negative. Again, the -march=haswell variant is optimal with a single bsr instruction.

Finally, let's check the big players in the non-Windows compiler space, icc and clang. icc just does a bad job and adds redundant stuff like neg eax; add eax, 31; neg eax; add eax, 31; - wtf? clang does a good job regardless of -march.


0 Such as scanning a bitmap for the first set bit.

1 The logarithm of 0 is undefinited, and so calling our function with a 0 argument is undefined behavior.

2 Here, the LSB is the 0th bit and the MSB is the 31st.

3 Recall that -x == ~x + 1 in twos-complement.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

This looks like a known issue with gcc: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=50168


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...