To answer my third question I found a faster solution for double-double addition. I found an alternative definition in the paper Implementation of
float-float operators on graphics
hardware.
Theorem 5 (Add22 theorem) Let be ah+al and bh+bl the float-float arguments of the following
algorithm:
Add22 (ah ,al ,bh ,bl)
1 r = ah ⊕ bh
2 if | ah | ≥ | bh | then
3 s = ((( ah ? r ) ⊕ bh ) ⊕ b l ) ⊕ a l
4 e l s e
5 s = ((( bh ? r ) ⊕ ah ) ⊕ a l ) ⊕ b l
6 ( rh , r l ) = add12 ( r , s )
7 return (rh , r l)
Here is how I implemented this (pseudo-code):
static inline doubledoublen add22(doubledoublen const &a, doubledouble const &b) {
doublen aa,ab,ah,bh,al,bl;
booln mask;
aa = abs(a.hi); //_mm256_and_pd
ab = abs(b.hi);
mask = aa >= ab; //_mm256_cmple_pd
// z = select(cut,x,y) is a SIMD version of z = cut ? x : y;
ah = select(mask,a.hi,b.hi); //_mm256_blendv_pd
bh = select(mask,b.hi,a.hi);
al = select(mask,a.lo,b.lo);
bl = select(mask,b.lo,a.lo);
doublen r, s;
r = ah + bh;
s = (((ah - r) + bh) + bl ) + al;
return two_sum(r,s);
}
This definition of Add22 uses 11 additions instead of 20 but it requires some additional code to determine if |ah| >= |bh|
. Here is a discussion on how to implement SIMD minmag and maxmag functions. Fortunately, most of the additional code does not use port 1. Now only 12 instructions go to port 1 instead of 20.
Here is a throughput analysis form IACA for the new Add22
Throughput Analysis Report
--------------------------
Block Throughput: 12.05 Cycles Throughput Bottleneck: Port1
Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
---------------------------------------------------------------------------------------
| Cycles | 0.0 0.0 | 12.0 | 2.5 2.5 | 2.5 2.5 | 2.0 | 10.0 | 0.0 | 2.0 |
---------------------------------------------------------------------------------------
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | |
---------------------------------------------------------------------------------
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | | vmovapd ymm3, ymmword ptr [rip]
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | | vmovapd ymm0, ymmword ptr [rdx]
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | | vmovapd ymm4, ymmword ptr [rsi]
| 1 | | | | | | 1.0 | | | | vandpd ymm2, ymm4, ymm3
| 1 | | | | | | 1.0 | | | | vandpd ymm3, ymm0, ymm3
| 1 | | 1.0 | | | | | | | CP | vcmppd ymm2, ymm3, ymm2, 0x2
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | | vmovapd ymm3, ymmword ptr [rsi+0x20]
| 2 | | | | | | 2.0 | | | | vblendvpd ymm1, ymm0, ymm4, ymm2
| 2 | | | | | | 2.0 | | | | vblendvpd ymm4, ymm4, ymm0, ymm2
| 1 | | | 0.5 0.5 | 0.5 0.5 | | | | | | vmovapd ymm0, ymmword ptr [rdx+0x20]
| 2 | | | | | | 2.0 | | | | vblendvpd ymm5, ymm0, ymm3, ymm2
| 2 | | | | | | 2.0 | | | | vblendvpd ymm0, ymm3, ymm0, ymm2
| 1 | | 1.0 | | | | | | | CP | vaddpd ymm3, ymm1, ymm4
| 1 | | 1.0 | | | | | | | CP | vsubpd ymm2, ymm1, ymm3
| 1 | | 1.0 | | | | | | | CP | vaddpd ymm1, ymm2, ymm4
| 1 | | 1.0 | | | | | | | CP | vaddpd ymm1, ymm1, ymm0
| 1 | | 1.0 | | | | | | | CP | vaddpd ymm0, ymm1, ymm5
| 1 | | 1.0 | | | | | | | CP | vaddpd ymm2, ymm3, ymm0
| 1 | | 1.0 | | | | | | | CP | vsubpd ymm1, ymm2, ymm3
| 2^ | | | | | 1.0 | | | 1.0 | | vmovapd ymmword ptr [rdi], ymm2
| 1 | | 1.0 | | | | | | | CP | vsubpd ymm0, ymm0, ymm1
| 1 | | 1.0 | | | | | | | CP | vsubpd ymm1, ymm2, ymm1
| 1 | | 1.0 | | | | | | | CP | vsubpd ymm3, ymm3, ymm1
| 1 | | 1.0 | | | | | | | CP | vaddpd ymm0, ymm3, ymm0
| 2^ | | | | | 1.0 | | | 1.0 | | vmovapd ymmword ptr [rdi+0x20], ymm0
and here is the throughput analysis from the old
Throughput Analysis Report
--------------------------
Block Throughput: 20.00 Cycles Throughput Bottleneck: Port1
Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 |
---------------------------------------------------------------------------------------
| Cycles | 0.0 0.0 | 20.0 | 2.0 2.0 | 2.0 2.0 | 2.0 | 0.0 | 0.0 | 2.0 |
---------------------------------------------------------------------------------------
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | 6 | 7 | |
---------------------------------------------------------------------------------
| 1 | | | 1.0 1.0 | | | | | | | vmovapd ymm0, ymmword ptr [rsi]
| 1 | | | | 1.0 1.0 | | | | | | vmovapd ymm1, ymmword ptr [rdx]
| 1 | | | 1.0 1.0 | | | | | | | vmovapd ymm3, ymmword ptr [rsi+0x20]
| 1 | | 1.0 | | | | | | | CP | vaddpd ymm4, ymm0, ymm1
| 1 | | | | 1.0 1.0 | | | | | | vmovapd ymm5, ymmword ptr [rdx+0x20]
| 1 | | 1.0 | | | | | | | CP | vsubpd ymm2, ymm4, ymm0
| 1 | | 1.0 | | | | | | | CP | vsubpd ymm1, ymm1, ymm2
| 1 | | 1.0 | | | | | | | CP | vsubpd ymm2, ymm4, ymm2
| 1 | | 1.0 | | | | | | | CP | vsubpd ymm0, ymm0, ymm2
| 1 | | 1.0 | | | | | | | CP | vaddpd ymm2, ymm0, ymm1
| 1 | | 1.0 | | | | | | | CP | vaddpd ymm1, ymm3, ymm5
| 1 | | 1.0 | | | | | | | CP | vsubpd ymm6, ymm1, ymm3
| 1 | | 1.0 | | | | | | | CP | vsubpd ymm5, ymm5, ymm6
| 1 | | 1.0 | | | | | | | CP | vsubpd ymm6, ymm1, ymm6
| 1 | | 1.0 | | | | | | | CP | vaddpd ymm1, ymm2, ymm1
| 1 | | 1.0 | | | | | | | CP | vsubpd ymm3, ymm3, ymm6
| 1 | | 1.0 | | | | | | | CP | vaddpd ymm2, ymm4, ymm1
| 1 | | 1.0 | | | | | | | CP | vaddpd ymm3, ymm3, ymm5
| 1 | | 1.0 | | | | | | | CP | vsubpd ymm4, ymm2, ymm4
| 1 | | 1.0 | | | | | | | CP | vsubpd ymm1, ymm1, ymm4
| 1 | | 1.0 | | | | | | | CP | vaddpd ymm0, ymm1, ymm3
| 1 | | 1.0 | | | | | | | CP | vaddpd ymm1, ymm2, ymm0
| 1 | | 1.0 | | | | | | | CP | vsubpd ymm2, ymm1, ymm2
| 2^ | | | | | 1.0 | | | 1.0 | | vmovapd ymmword ptr [rdi], ymm1
| 1 | | 1.0 | | | | | | | CP | vsubpd ymm0, ymm0, ymm2
| 2^ | | | | | 1.0 | | | 1.0 | | vmovapd ymmword ptr [rdi+0x20], ymm0
A better solution would be if there were three operand single rounding mode instructions besides FMA. It seems to me there should be single rounding mode instructions for
a + b + c
a * b + c //FMA - this is the only one in x86 so far
a * b * c