Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
1.0k views
in Technique[技术] by (71.8m points)

assembly - SIMD instructions for floating point equality comparison (with NaN == NaN)

Which instructions would be used for comparing two 128 bit vectors consisting of 4 * 32-bit floating point values?

Is there an instruction that considers a NaN value on both sides as equal? If not, how big would the performance impact of a workaround that provides reflexivity (i.e. NaN equals NaN) be?

I heard that ensuring reflexivity would have a significant performance impact compared with IEEE semantics, where NaN doesn't equal itself, and I'm wondering if big that impact would be.

I know that you typically want use epsilon comparisons instead of exact quality when dealing floating-point values. But this question is about exact equality comparisons, which you could for example use to eliminate duplicate values from a hash-set.

Requirements

  • +0 and -0 must compare as equal.
  • NaN must compare equal with itself.
  • Different representations of NaN should be equal, but that requirement might be sacrificed if the performance impact is too big.
  • The result should be a boolean, true if all four float elements are the same in both vectors and false if at least one element differs. Where true is represented by a scalar integer 1 and false by 0.

Test cases

(NaN, 0, 0, 0) == (NaN, 0, 0, 0) // for all representations of NaN
(-0,  0, 0, 0) == (+0,  0, 0, 0) // equal despite different bitwise representations
(1,   0, 0, 0) == (1,   0, 0, 0)
(0,   0, 0, 0) != (1,   0, 0, 0) // at least one different element => not equal 
(1,   0, 0, 0) != (0,   0, 0, 0)

My idea for implementing this

I think it might be possible to combine two NotLessThan comparisons (CMPNLTPS ?) using and to achieve the desired result. The assembler equivalent of AllTrue(!(x < y) and !(y < x)) or AllFalse((x < y) or (y > x).

Background

The background for this question is Microsoft's plan to add a Vector type to .NET. Where I'm arguing for a reflexive .Equals method and need a clearer picture of how big the performance impact of this reflexive equals over a IEEE equals would be. See Should Vector<float>.Equals be reflexive or should it follow IEEE 754 semantics? on programmers.se for the long story.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Even AVX VCMPPS (with it's greatly enhanced choice of predicates) doesn't give us a single-instruction predicate for this. You have to do at least two compares and combine the results. It's not too bad, though.

  • different NaN encodings aren't equal: effectively 2 extra insns (adding 2 uops). Without AVX: One extra movaps beyond that.

  • different NaN encodings are equal: effectively 4 extra insns (adding 4 uops). Without AVX: Two extra movaps insn

An IEEE compare-and-branch is 3 uops: cmpeqps / movmskps / test-and-branch. Intel and AMD both macro-fuse the test-and-branch into a single uop/m-op.

With AVX512: bitwise-NaN is probably just one extra instruction, since normal vector compare and branch probably uses vcmpEQ_OQps / ktest same,same / jcc, so combining two different mask regs is free (just change the args to ktest). The only cost is the extra vpcmpeqd k2, xmm0,xmm1.

AVX512 any-NaN is just two extra instructions (2x VFPCLASSPS, with the 2nd one using the result of the first as a zeromask. See below). Again, then ktest with two different args to set flag.


My best idea so far: ieee_equal || bitwise_equal

If we give up on considering different NaN encodings equal to each other:

  • Bitwise equal catches two identical NaNs.
  • IEEE equal catches the +0 == -0 case.

There are no cases where either compare gives a false positive (since ieee_equal is false when either operand is NaN: we want just equal, not equal-or-unordered. AVX vcmpps provides both options, while SSE only provides a plain equal operation.)

We want to know when all elements are equal, so we should start with inverted comparisons. It's easier to check for at least one non-zero element than to check for all elements being non-zero. (i.e. horizontal AND is hard, horizontal OR is easy (pmovmskb / test, or ptest). Taking the opposite sense of a comparison is free (jnz instead of jz).) This is the same trick that Paul R used.

; inputs in xmm0, xmm1
movaps    xmm2, xmm0    ; unneeded with 3-operand AVX instructions

cmpneqps  xmm2, xmm1    ; 0:A and B are ordered and equal.  -1:not ieee_equal.  predicate=NEQ_UQ in VEX encoding expanded notation
pcmpeqd   xmm0, xmm1    ; -1:bitwise equal  0:otherwise

; xmm0   xmm2
;   0      0   -> equal   (ieee_equal only)
;   0     -1   -> unequal (neither)
;  -1      0   -> equal   (bitwise equal and ieee_equal)
;  -1     -1   -> equal   (bitwise equal only: only happens when both are NaN)

andnps    xmm0, xmm2    ; NOT(xmm0) AND xmm2
; xmm0 elements are -1 where  (not bitwise equal) AND (not IEEE equal).
; xmm0 all-zero iff every element was bitwise or IEEE equal, or both
movmskps  eax, xmm0
test      eax, eax      ; it's too bad movmsk doesn't set EFLAGS according to the result
jz no_differences

For double-precision, ...PS and pcmpeqQ will work the same.

If the not-equal code goes on to find out which element isn't equal, a bit-scan on the movmskps result will give you the position of the first difference.

With SSE4.1 PTEST you can replace andnps/movmskps/test-and-branch with:

ptest    xmm0, xmm2   ; CF =  0 == (NOT(xmm0) AND xmm2).
jc no_differences

I expect this is the first time most people have ever seen the CF result of PTEST be useful for anything. :)

It's still three uops on Intel and AMD CPUs ( (2ptest + 1jcc) vs (pandn + movmsk + fused-test&branch)), but fewer instructions. It is more efficient if you're going to setcc or cmovcc instead of jcc, since those can't macro-fuse with test.

That makes a total of 6 uops (5 with AVX) for a reflexive compare-and-branch, vs. 3 uops for an IEEE compare-and-branch. (cmpeqps / movmskps / test-and-branch.)

PTEST has a very high latency on AMD Bulldozer-family CPUs (14c on Steamroller). They have one cluster of vector execution units shared by two integer cores. (This is their alternative to hyperthreading.) This increases the time until a branch mispredict can be detected, or the latency of a data-dependency chain (cmovcc / setcc).

PTEST sets ZF when 0==(xmm0 AND xmm2): set if no elements were both bitwise_equal AND IEEE (neq or unordered). i.e. ZF is unset if any element was bitwise_equal while also being !ieee_equal. This can only happen when a pair of elements contain bitwise-equal NaNs (but can happen when other elements are unequal).

    movaps    xmm2, xmm0
    cmpneqps  xmm2, xmm1    ; 0:A and B are ordered and equal.
    pcmpeqd   xmm0, xmm1    ; -1:bitwise equal

    ptest    xmm0, xmm2
    jc   equal_reflexive   ; other cases

...

equal_reflexive:
    setnz  dl               ; set if at least one both-nan element

There's no condition that tests CF=1 AND anything about ZF. ja tests CF=0 and ZF=1. It's unlikely that you'd only want to test that anyway, so putting a jnz in the jc branch target works fine. (And if you did only want to test equal_reflexive AND at_least_one_nan, a different setup could probably set flags appropriately).


Considering all NaNs equal, even when not bitwise equal:

This is the same idea as Paul R's answer, but with a bugfix (combine NaN check with IEEE check using AND rather than OR.)

; inputs in xmm0, xmm1
movaps      xmm2, xmm0
cmpordps    xmm2, xmm2      ; find NaNs in A.  (0: NaN.  -1: anything else).  Same as cmpeqps since src and dest are the same.
movaps      xmm3, xmm1
cmpordps    xmm3, xmm3      ; find NaNs in B
orps        xmm2, xmm3      ; 0:A and B are both NaN.  -1:anything else

cmpneqps    xmm0, xmm1      ; 0:IEEE equal (and ordered).  -1:unequal or unordered
; xmm0 AND xmm2  is zero where elements are IEEE equal, or both NaN
; xmm0   xmm2 
;   0      0     -> equal   (ieee_equal and both NaN (impossible))
;   0     -1     -> equal   (ieee_equal)
;  -1      0     -> equal   (both NaN)
;  -1     -1     -> unequal (neither equality condition)

ptest    xmm0, xmm2        ; ZF=  0 == (xmm0 AND xmm2).  Set if no differences in any element
jz   equal_reflexive
; else at least one element was unequal

;     alternative to PTEST:  andps  xmm0, xmm2 / movmskps / test / jz

So in this case we don't need PTEST's CF result after all. We do when using PCMPEQD, because it doesn't have an inverse (the way cmpunordps has cmpordps).

9 fused-domain uops for Intel SnB-family CPUs. (7 with AVX: use non-destructive 3-operand instructions to avoid the movaps.) However, pre-Skylake SnB-family CPUs can only run cmpps on p1, so this bottlenecks on the FP-add unit if throughput is a concern. Skylake runs cmpps on p0/p1.

andps has a shorter encoding than pand, and Intel CPUs from Nehalem to Broadwell can only run it on port5. That may be desirable to prevent it from stealing a p0 or p1 cycle from surrounding FP code. Otherwise pandn is probably a better choice. On AMD BD-family, andnps runs in the ivec domain anyway, so you don't avoid the bypass delay between int and FP vectors (which you might otherwise expect to manage if you use movmskps instead of ptest, in this version that only uses cmpps, not pcmpeqd). Also note that instruction ordering is chosen for human readability here. Putting the FP compare(A,B) earlier, before the ANDPS, might help the CPU get started on that a cycle sooner.

If one operand is reused, it should be possible to reuse its self-NaN-finding result. The new operand still needs its self-NaN check, and a compare against the reused operand, so we only save one movaps/cmpps.

If the vectors are in memory, at least one of them needs to be loaded with a separate load insn. The other one can just be referenced twice from memory. This sucks if it's unaligned or the addressing mode can't micro-fuse, but could be useful. If one of the operands to vcmpps is a vector known to not have any NaNs (e.g. a zeroed register), vcmpunord_qps xmm2, xmm15, [rsi] will find NaNs in [rsi].

If we don't want to use PTEST, we can get the same result by using the opposite comparisons, but combining them with the opposite logical operator (AND vs. OR).

; inputs in xmm0, xmm1
movaps      xmm2, xmm0
cmpunordps  xmm2, xmm2      ; find NaNs in A (-1:NaN  0:anything else)
movaps      xmm3, xmm1
cmpunordps  xmm3, xmm3      ; find NaNs in B
andps       xmm2, xmm3      ; xmm2 = (-1:both NaN  0:anything else)
; now in the same boat as before: xmm2 is set for elements we want to consider equal, even though they're not IEEE equal

cmpeqps     xmm0, xmm1      ; -1:ieee_equal  0:unordered or unequal
; xmm0   xmm2 
;  -1      0     -> equal   (ieee_equal)
;  -1     -1     -> equal   (ieee_equal and both NaN (impossible))
;   0      0     -> unequal (neither)
;   0     -1     -> equal   (both NaN)

orps        xmm0, xmm2      ; 0: unequal.  -1:reflexive_equal
movmskps    eax, xmm0
test        eax, eax
jnz  equal_reflexive

Other ideas: unfinished, non-viable, broken, or worse-than-the-above

The all-ones result of a true comparison is an encoding of NaN. (Try it out. Perhaps we can avoid using POR or PAND to combine results from cmpps on each operand separately?

; inputs in A:xmm0 B:xmm1
movaps      xmm2, xmm0
cmpordps    xmm2, xmm2      ; find NaNs in A.  (0: NaN.  -1: anything else).  Same as cmpeqps since src and dest are the same.
; cmpunordps wouldn't be useful: NaN stays NaN, while other values are zeroed.  (This could be useful if ORPS didn't exist)

; integer -1 (all-ones) is a NaN encoding, but all-zeros is 0.0
cmpunordps  xmm2, xmm1
; A:NaN B:0   ->  0   unord 0   -> false
; A:0   B:NaN ->  NaN unord NaN -> true

; A:0   B:0   ->  NaN unord 0   -> true
; A:NaN B:NaN ->  0   unord NaN -> true

; Desired:   0 where A and B are both NaN.

cmpordps xmm2, xmm1 just flips the final result for each case, with the "odd-man-out" still on the 1st row.

We can only get the result we want (true iff A and B are both NaN) if both inputs are inverted (NaN -> non-NaN and vice versa). This means we could use this idea for cmpordps as a replacement for pand after doing cmpordps self,self on both A and B. This isn't useful: even if


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...