Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
583 views
in Technique[技术] by (71.8m points)

performance - Checking if TWO SSE registers are not both zero without destroying them

I want to test if two SSE registers are not both zero without destroying them.

This is the code I currently have:

uint8_t *src;  // Assume it is initialized and 16-byte aligned
__m128i xmm0, xmm1, xmm2;

xmm0 = _mm_load_si128((__m128i const*)&src[i]); // Need to preserve xmm0 & xmm1
xmm1 = _mm_load_si128((__m128i const*)&src[i+16]);
xmm2 = _mm_or_si128(xmm0, xmm1);
if (!_mm_testz_si128(xmm2, xmm2)) { // Test both are not zero
}

Is this the best way (using up to SSE 4.2)?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

I learned something useful from this question. Let's first look at some scalar code

extern foo2(int x, int y);
void foo(int x, int y) {
    if((x || y)!=0) foo2(x,y);
}

Compile this like this gcc -O3 -S -masm=intel test.c and the important assembly is

 mov       eax, edi   ; edi = x, esi = y -> copy x into eax
 or        eax, esi   ; eax = x | y and set zero flag in FLAGS if zero
 jne       .L4        ; jump not zero

Now let's look at testing SIMD registers for zero. Unlike scalar code there is no SIMD FLAGS register. However, with SSE4.1 there are SIMD test instructions which can set the zero flag (and carry flag) in the scalar FLAGS register.

extern foo2(__m128i x, __m128i y);
void foo(__m128i x, __m128i y) {
    __m128i z = _mm_or_si128(x,y);
    if (!_mm_testz_si128(z,z)) foo2(x,y);
}

Compile with c99 -msse4.1 -O3 -masm=intel -S test_SSE.c and the the important assembly is

movdqa      xmm2, xmm0 ; xmm0 = x, xmm1 = y, copy x into xmm2
por         xmm2, xmm1 ; xmm2 = x | y
ptest       xmm2, xmm2 ; set zero flag if zero
jne         .L4        ; jump not zero 

Notice that this takes one more instruction because the packed bit-wise OR does not set the zero flag. Notice also that both the scalar version and the SIMD version need to use an additional register (eax in the scalar case and xmm2 in the SIMD case). So to answer your question your current solution is the best you can do.

However, if you did not have a processor with SSE4.1 or better you would have to use _mm_movemask_epi8. Another alternative which only needs SSE2 is to use _mm_movemask_epi8

extern foo2(__m128i x, __m128i y);
void foo(__m128i x, __m128i y) {
    if (_mm_movemask_epi8(_mm_or_si128(x,y))) foo2(x,y);   
}

The important assembly is

movdqa      xmm2, xmm0
por         xmm2, xmm1
pmovmskb    eax, xmm2
test        eax, eax
jne         .L4

Notice that this needs one more instruction then with the SSE4.1 ptest instruction.

Until now I have been using the pmovmaskb instruction because the latency is better on pre Sandy Bridge processors than with ptest. However, I realized this before Haswell. On Haswell the latency of pmovmaskb is worse than the latency of ptest. They both have the same throughput. But in this case this is not really important. What's important (which I did not realize before) is that pmovmaskb does not set the FLAGS register and so it requires another instruction. So now I'll be using ptest in my critical loop. Thank you for your question.

Edit: as suggested by the OP there is a way this can be done without using another SSE register.

extern foo2(__m128i x, __m128i y);
void foo(__m128i x, __m128i y) {
    if (_mm_movemask_epi8(x) | _mm_movemask_epi8(y)) foo2(x,y);    
}

The relevant assembly from GCC is:

pmovmskb    eax, xmm0
pmovmskb    edx, xmm1
or          edx, eax
jne         .L4

Instead of using another xmm register this uses two scalar registers.

Note that fewer instructions does not necessarily mean better performance. Which of these solutions is best? You have to test each of them to find out.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

1.4m articles

1.4m replys

5 comments

56.8k users

...