There were two problems with my muldws1_improved
function in my question. One of them is that it missed a carry when I did xl*yh + xh*yl
. This is why it failed the unit tests. But the other is that there are signed*unsigned products which require a lot more machine logic than is seen in the C code. (see my edit below). I found a better solution which is to optimized the unsigned product function muldwu1 first and then do
muldwu1(w,x,y);
w[0] -= ((x<0) ? y : 0) + ((y<0) ? x : 0);
to correct for the sign.
Here is my attempt at improving the muldwu1
using the lower word lo = x*y
(yes this function passes the unit tests from Hacker's delight).
void muldwu1_improved(uint32_t w[], uint32_t x, uint32_t y) {
uint16_t xl = x; uint16_t xh = x >> 16;
uint16_t yl = y; uint16_t yh = y >> 16;
uint32_t lo = x*y; //32x32 to 32
uint32_t t1 = xl*yh; //16x16 to 32
uint32_t t2 = xh*yl; //16x16 to 32
uint32_t t3 = xh*yh; //16x16 to 32
uint32_t t = t1 + t2;
uint32_t tl = 0xFFFF & t;
uint32_t th = t >> 16;
uint32_t loh = lo >> 16;
uint32_t cy = ((t<t1) << 16) + (loh<tl); //carry
w[1] = lo;
w[0] = t3 + th + cy;
}
This uses one less addition than the original function from Hacker's delight but it has to do two comparisons
1 mul32x32 to 32
3 mul16x16 to 32
4 add32
5 shift logical (or shuffles)
1 and
2 compare32
***********
16 operations
Edit:
I was bothered by a statement in Hacker's Delight (2nd Edition) which says in regards to the mulhs and mulhu algorithm.
The algorithm requires 16 basic RISC instructions in either the signed or unsigned version, four of which are multiplications.
I implemented the unsigned algorithm in only 16 SSE instructions but my signed version required more instructions. I figured out why and I can now answer my own question.
The reason I failed to find a better version that in Hacker's Delight is that their hypothetical RISC processor has an instruction which calculates the lower word of the product of two words. In other words, their algorithm is already optimized for this case and so it's unlikely there is a better version than the one they already have.
The reason they list an alternative is because they assume multiplication (and division) may be more expensive than other instructions and so they left the alternative as a case to optimize on.
So the C code does not hide significant machine logic. It assumes the machine can do word * word to lower word.
Why does this matter? In their algorithm they do first
u0 = u >> 16;
and later
t = u0*v1 + k;
if u = 0x80000000
u0 = 0xffff8000
. However, if your CPU can only take half word products to get a full word the upper half word of u0
is ignored and you get the wrong signed result.
In this case you should calculate the unsigned upper word and then correct using hi -= ((x<0) ? y : 0) + ((y<0) ? x : 0);
as I already stated.
The reason I am interested in this is that Intel's SIMD instruction (SSE2 through AVX2) do not have an instruction which does 64x64 to 64, they only have 32x32 to 64. That's why my signed version requires more instructions.
But AVX512 has a 64x64 to 64 instruction. Therefore with AVX512 the signed version should take the same number of instructions as the unsigned. However, since the 64x64 to 64 instruction may be much slower than the 32x32 to 64 instruction it may make more sense to do the unsigned version anyway and then correct.