Yes it's possible. But as of AVX2, it's unlikely to be better than the scalar approaches with MULX/ADCX/ADOX.
There's a virtually unlimited number of variations of this approach for different input/output domains. I'll only cover 3 of them, but they're easy to generalize once you know how they work.
Disclaimers:
- All solutions here assume the rounding mode is round-to-even. (A quick sanity check for this is sketched below.)
- Use of fast-math optimization flags is not recommended, as these solutions rely on strict IEEE behavior.
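A minimal way to verify that rounding-mode assumption at startup, assuming a C++ translation unit (`std::fegetround` is standard `<cfenv>`):

```cpp
#include <cfenv>
#include <cassert>

// Round-to-nearest-even is the default on x86 unless something has changed it.
void check_rounding_mode(){
    assert(std::fegetround() == FE_TONEAREST);
}
```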
Signed doubles in the range: [-2^51, 2^51]
```cpp
// A*B = L + H*2^52
// Input:  A and B are in the range [-2^51, 2^51]
// Output: L and H are in the range [-2^51, 2^51]
void mul52_signed(__m256d& L, __m256d& H, __m256d A, __m256d B){
    const __m256d ROUND = _mm256_set1_pd(30423614405477505635920876929024.);   //  3 * 2^103
    const __m256d SCALE = _mm256_set1_pd(1. / 4503599627370496);               //  1 / 2^52

    // Multiply and add normalization constant. This forces the multiply
    // to be rounded to the correct number of bits.
    H = _mm256_fmadd_pd(A, B, ROUND);

    // Undo the normalization.
    H = _mm256_sub_pd(H, ROUND);

    // Recover the bottom half of the product.
    L = _mm256_fmsub_pd(A, B, H);

    // Correct the scaling of H.
    H = _mm256_mul_pd(H, SCALE);
}
```
This is the simplest one and the only one which is competitive with the scalar approaches. The final scaling is optional depending on what you want to do with the outputs. So this can be considered only 3 instructions. But it's also the least useful since both the inputs and outputs are floating-point values.
It is absolutely critical that both FMAs stay fused, and this is where fast-math optimizations can break things. If the first FMA is broken up, then `L` is no longer guaranteed to be in the range `[-2^51, 2^51]`. If the second FMA is broken up, `L` will be completely wrong.
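To see the mechanics without worrying about vectorization or fast-math, here's a scalar sketch of the same split using `std::fma`, which is always fused. The input values are arbitrary examples within the stated range:

```cpp
#include <cmath>
#include <cstdio>

int main(){
    double A = 1234567891234567.;   // arbitrary values in [-2^51, 2^51]
    double B = -987654321987654.;
    const double ROUND = 30423614405477505635920876929024.;   // 3 * 2^103
    const double SCALE = 1. / 4503599627370496;               // 1 / 2^52

    double H = std::fma(A, B, ROUND) - ROUND;   // A*B rounded to a multiple of 2^52
    double L = std::fma(A, B, -H);              // exact remainder: A*B == H + L
    printf("H = %.0f, L = %.0f\n", H * SCALE, L);
}
```

Adding `ROUND` pushes the sum into the binade `[2^104, 2^105)` where the ulp is exactly 2^52, which is what forces the product to be rounded off at the right bit.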
Signed integers in the range: [-2^51, 2^51]
```cpp
// A*B = L + H*2^52
// Input:  A and B are in the range [-2^51, 2^51]
// Output: L and H are in the range [-2^51, 2^51]
void mul52_signed(__m256i& L, __m256i& H, __m256i A, __m256i B){
    const __m256d CONVERT_U = _mm256_set1_pd(6755399441055744);                    //  3*2^51
    const __m256d CONVERT_D = _mm256_set1_pd(30423614405477505635920876929024.);   //  3*2^103

    __m256d l, h, a, b;

    // Convert to double
    A = _mm256_add_epi64(A, _mm256_castpd_si256(CONVERT_U));
    B = _mm256_add_epi64(B, _mm256_castpd_si256(CONVERT_U));
    a = _mm256_sub_pd(_mm256_castsi256_pd(A), CONVERT_U);
    b = _mm256_sub_pd(_mm256_castsi256_pd(B), CONVERT_U);

    // Get top half. Convert H to int64.
    h = _mm256_fmadd_pd(a, b, CONVERT_D);
    H = _mm256_sub_epi64(_mm256_castpd_si256(h), _mm256_castpd_si256(CONVERT_D));

    // Undo the normalization.
    h = _mm256_sub_pd(h, CONVERT_D);

    // Recover bottom half.
    l = _mm256_fmsub_pd(a, b, h);

    // Convert L to int64.
    l = _mm256_add_pd(l, CONVERT_U);
    L = _mm256_sub_epi64(_mm256_castpd_si256(l), _mm256_castpd_si256(CONVERT_U));
}
```
Building off of the first example, we combine it with a generalized version of the fast `double <-> int64` conversion trick.

This one is more useful since you're working with integers. But even with the fast conversion trick, most of the time will be spent doing conversions. Fortunately, you can eliminate some of the input conversions if you are multiplying by the same operand multiple times.
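For reference, here's a scalar sketch of that conversion trick for values in `[-2^51, 2^51]`. The magic constant 3*2^51 has an ulp of exactly 1, so the integer lives in the low mantissa bits:

```cpp
#include <cstdint>
#include <cstring>
#include <cstdio>

int main(){
    const double MAGIC = 6755399441055744.;   // 3*2^51
    uint64_t magic_bits;
    std::memcpy(&magic_bits, &MAGIC, 8);

    // int64 -> double: add the magic bit pattern as an integer, reinterpret, subtract.
    int64_t x = -123456789;
    uint64_t bits = (uint64_t)x + magic_bits;
    double d;
    std::memcpy(&d, &bits, 8);
    d -= MAGIC;                               // d == -123456789.0

    // double -> int64: add the magic as a double, reinterpret, subtract the bit pattern.
    double y = d + MAGIC;
    uint64_t ybits;
    std::memcpy(&ybits, &y, 8);
    int64_t back = (int64_t)(ybits - magic_bits);   // back == -123456789
    printf("%f %lld\n", d, (long long)back);
}
```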
Unsigned integers in the range: [0, 2^52)
```cpp
// A*B = L + H*2^52
// Input:  A and B are in the range [0, 2^52)
// Output: L and H are in the range [0, 2^52)
void mul52_unsigned(__m256i& L, __m256i& H, __m256i A, __m256i B){
    const __m256d CONVERT_U = _mm256_set1_pd(4503599627370496);                    //  2^52
    const __m256d CONVERT_D = _mm256_set1_pd(20282409603651670423947251286016.);   //  2^104
    const __m256d CONVERT_S = _mm256_set1_pd(6755399441055744);                    //  3*2^51

    __m256d l, h, a, b;

    // Convert to double
    A = _mm256_or_si256(A, _mm256_castpd_si256(CONVERT_U));
    B = _mm256_or_si256(B, _mm256_castpd_si256(CONVERT_U));
    a = _mm256_sub_pd(_mm256_castsi256_pd(A), CONVERT_U);
    b = _mm256_sub_pd(_mm256_castsi256_pd(B), CONVERT_U);

    // Get top half. Convert H to int64.
    h = _mm256_fmadd_pd(a, b, CONVERT_D);
    H = _mm256_xor_si256(_mm256_castpd_si256(h), _mm256_castpd_si256(CONVERT_D));

    // Undo the normalization.
    h = _mm256_sub_pd(h, CONVERT_D);

    // Recover bottom half.
    l = _mm256_fmsub_pd(a, b, h);

    // Convert L to int64. L may be negative here if the product rounded up.
    l = _mm256_add_pd(l, CONVERT_S);
    L = _mm256_sub_epi64(_mm256_castpd_si256(l), _mm256_castpd_si256(CONVERT_S));

    // Make correction: if L went negative, borrow from H and wrap L into [0, 2^52).
    H = _mm256_sub_epi64(H, _mm256_srli_epi64(L, 63));
    L = _mm256_and_si256(L, _mm256_set1_epi64x(0x000fffffffffffff));
}
```
Finally we get the answer to the original question. This builds off of the signed integer solution by adjusting the conversions and adding a correction step.
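As a sanity check, here's a hypothetical smoke test for one lane, comparing against a scalar `unsigned __int128` multiply (a GCC/Clang extension; compile with `-mavx2 -mfma`, with `mul52_unsigned` from above in scope):

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstdio>

int main(){
    uint64_t a = 0x000fedcba9876543, b = 0x000123456789abcd;   // both < 2^52
    __m256i A = _mm256_set1_epi64x((int64_t)a);
    __m256i B = _mm256_set1_epi64x((int64_t)b);
    __m256i L, H;
    mul52_unsigned(L, H, A, B);

    unsigned __int128 p = (unsigned __int128)a * b;   // reference product
    uint64_t lo = (uint64_t)p & 0x000fffffffffffff;   // low 52 bits
    uint64_t hi = (uint64_t)(p >> 52);                // high 52 bits

    uint64_t lout[4], hout[4];
    _mm256_storeu_si256((__m256i*)lout, L);
    _mm256_storeu_si256((__m256i*)hout, H);
    printf("L ok: %d, H ok: %d\n", lout[0] == lo, hout[0] == hi);
}
```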
But at this point, we're at 13 instructions, half of which are high-latency, not counting the numerous `FP <-> int` bypass delays. So it's unlikely this will be winning any benchmarks. By comparison, a `64 x 64 -> 128-bit` SIMD multiply can be done in 16 instructions (14 if you pre-process the inputs).
The correction step can be omitted if the rounding mode is round-down or round-to-zero. The only instruction where this matters is `h = _mm256_fmadd_pd(a, b, CONVERT_D);`. So on AVX512, you can override the rounding for that one instruction and leave the global rounding mode alone.
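A sketch of what that override might look like (per-instruction rounding overrides are only encodable on 512-bit vectors; `convert_d` here is the 2^104 constant from above, widened to `__m512d`):

```cpp
#include <immintrin.h>

// Round-down override for just the top-half FMA. With round-down, H never
// overshoots the product, so L stays non-negative and the final correction
// step can be dropped.
__m512d top_half_round_down(__m512d a, __m512d b, __m512d convert_d){
    return _mm512_fmadd_round_pd(a, b, convert_d,
                                 _MM_FROUND_TO_NEG_INF | _MM_FROUND_NO_EXC);
}
```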
Final Thoughts:
It's worth noting that the 2^52 range of operation can be reduced by adjusting the magic constants. This may be useful for the first solution (the floating-point one) since it gives you extra mantissa bits to use for accumulation in floating-point. This lets you bypass the need to constantly convert back-and-forth between int64 and double like in the last 2 solutions.
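As a hypothetical example of such an adjustment: restricting the operands to `[-2^47, 2^47]` and splitting at 2^48 keeps `L` and `H` within about 2^47, so sums of partial products stay exact integers for roughly 2^6 accumulations (|sum| <= 2^53) before a renormalization is needed. Only the constants change relative to the first solution:

```cpp
// A*B = L + H*2^48 -- same structure as the first solution, narrower split.
// Input:  A and B are in the range [-2^47, 2^47]
// Output: L and H are in the range [-2^47, 2^47]
void mul48_signed(__m256d& L, __m256d& H, __m256d A, __m256d B){
    const __m256d ROUND = _mm256_set1_pd(1901475900342344102245054808064.);   //  3 * 2^99
    const __m256d SCALE = _mm256_set1_pd(1. / 281474976710656);               //  1 / 2^48
    H = _mm256_fmadd_pd(A, B, ROUND);
    H = _mm256_sub_pd(H, ROUND);
    L = _mm256_fmsub_pd(A, B, H);
    H = _mm256_mul_pd(H, SCALE);
}
```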
While the 3 examples here are unlikely to be better than scalar methods, AVX512 will almost certainly tip the balance. Knights Landing in particular has poor throughput for ADCX and ADOX.
And of course all of this is moot when AVX512-IFMA comes out. That reduces a full `52 x 52 -> 104-bit` product to 2 instructions and gives the accumulation for free.
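For reference, a sketch of that version using the intrinsic names from Intel's published IFMA specification (unsigned 52-bit inputs assumed):

```cpp
#include <immintrin.h>

// Requires AVX512-IFMA. A and B hold unsigned integers in [0, 2^52).
void mul52_ifma(__m512i& L, __m512i& H, __m512i A, __m512i B){
    const __m512i ZERO = _mm512_setzero_si512();
    L = _mm512_madd52lo_epu64(ZERO, A, B);   // low 52 bits of A*B, added to ZERO
    H = _mm512_madd52hi_epu64(ZERO, A, B);   // high 52 bits of A*B, added to ZERO
}
```

Passing a running sum instead of `ZERO` as the first argument is what makes the accumulation free.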