- Can't we multiply rax with edi directly?
We can't imul rax, rdi
because the calling convention allows the caller to leave garbage in the high bits of RDI; only the EDI part contains the value. This is a non-issue when inlining; writing a 32-bit register does implicitly zero-extend to the full 64-bit register, so the compiler will usually not need an extra instruction to zero-extend a 32-bit value.
(zero-extending into a different register is better because of limitations on mov-elimination, if you can't avoid it).
Taking your question even more literally, no, x86 doesn't have any multiply instructions that zero-extend one of their inputs to let you multiply a 32-bit and a 64-bit register. Both inputs must be the same width.
- Why do we multiply in 64-bit mode?
(terminology: all of this code runs in 64-bit mode. You're asking why 64-bit operand-size.)
You could mul edi
to multiply EAX with EDI to get a 64-bit result split across EDX:EAX, but mul edi
is 3 uops on Intel CPUs, vs. most modern x86-64 CPUs having fast 64-bit imul
. (Although imul r64, r64
is slower on AMD Bulldozer-family, and on some low-power CPUs.) https://uops.info/ and https://agner.org/optimize/ (instruction tables and microarch PDF)
(Fun fact: mul rdi
is actually cheaper on Intel CPUs, only 2 uops. Perhaps something to do with not having to do extra splitting on the output of the integer multiply unit, like mul edi
would have to split the 64-bit low half multiplier output into EDX and EAX halves, but that happens naturally for 64x64 => 128-bit mul.)
Also the part you want is in EDX so you'd need another mov eax, edx
to deal with it. (Again, because we're looking at code for a stand-alone definition of the function, not after inlining into a caller.)
GCC 8.3 and earlier did use 32-bit mul
instead of 64-bit imul
(https://godbolt.org/z/5qj7d5). That was not crazy for -mtune=generic
when Bulldozer-family and old Silvermont CPUs were more relevant, but those CPUs are farther in the past for more recent GCC, and its generic tuning choices reflect that. Unfortunately GCC also wasted a mov
instruction copying EDI to EAX, making this way look even worse :/
# gcc8.3 -O3 (default -mtune=generic)
div3(unsigned int):
mov eax, edi # 1 uop, stupid wasted instruction
mov edx, -1431655765 # 1 uop (same 32-bit constant, just printed differently)
mul edx # 3 uops on Sandybridge-family
mov eax, edx # 1 uop
shr eax # 1 uop
ret
# total of 7 uops on SnB-family
Would only be 6 uops with mov eax, 0xAAAAAAAB
/ mul edi
, but still worse than:
# gcc9.3 -O3 (default -mtune=generic)
div3(unsigned int):
mov eax, edi # 1 uop
mov edi, 2863311531 # 1 uop
imul rax, rdi # 1 uop
shr rax, 33 # 1 uop
ret
# total 4 uops, not counting ret
Unfortunately, 64-bit 0x00000000AAAAAAAB
can't be represented as a 32-bit sign-extended immediate, so imul rax, rcx, 0xAAAAAAAB
isn't encodeable. It would mean 0xFFFFFFFFAAAAAAAB
.
- Why are we using imul instead of mul? I thought modular arithmetic would be all unsigned.
It is unsigned. Signedness of the inputs only affects the high half of the result, but imul reg, reg
doesn't produce the high half. Only the one-operand forms of mul
and imul
are full multiplies that do NxN => 2N, so only they need separate signed and unsigned versions.
Only imul
has the faster and more flexible low-half-only forms. The only thing that's signed about imul reg, reg
is that it sets OF based on signed overflow of the low half. It wasn't worth spending more opcodes and more transistors just to have a mul r,r
whose only difference from imul r,r
is the FLAGS output.
Intel's manual (https://www.felixcloutier.com/x86/imul) even points out the fact that it can be used for unsigned.
- What's up with the 33-bit rightshift at the end? I thought we can just drop the highest 32-bits.
No, there's no multiplier constant that would give the exact right answer for every possible input x
if you implemented it that way. The "as-if" optimization rule doesn't allow approximations, only implementations that produce the exact same observable behaviour for every input the program uses. Without knowing a value-range for x
other than full range of unsigned
, compilers don't have that option. (-ffast-math
only applies to floating point; if you want faster approximations for integer math, code them manually like below):
See Why does GCC use multiplication by a strange number in implementing integer division? for more about the fixed-point multiplicative inverse method compilers use for exact division by compile time constants.
For an example of this not working in the general case, see my edit to an answer on Divide by 10 using bit shifts? which proposed
// Warning: INEXACT FOR LARGE INPUTS
// this fast approximation can just use the high half,
// so on 32-bit machines it avoids one shift instruction vs. exact division
int32_t div10(int32_t dividend)
{
int64_t invDivisor = 0x1999999A;
return (int32_t) ((invDivisor * dividend) >> 32);
}
Its first wrong answer (if you loop from 0 upward) is div10(1073741829) = 107374183
when 1073741829/10
is actually 107374182. (It rounded up instead of toward 0 like C integer division is supposed to.)
From your edit, I see you were actually talking about using the low half of a multiply result, which apparently works perfectly for exact multiples all the way up to UINT_MAX.
As you say, it completely fails when the division would have a remainder, e.g. 16 * 0xaaaaaaab
= 0xaaaaaab0
when truncated to 32-bit, not 5
.
unsigned div3_exact_only(unsigned x) {
__builtin_assume(x % 3 == 0); // or an equivalent with if() __builtin_unreachable()
return x / 3;
}
Yes, if that math works out, it would be legal and optimal for compilers to implement that with 32-bit imul. They don't look for this optimization because it's rarely a known fact. IDK if it would be worth adding compiler code to even look for the optimization, in terms of compile time, not to mention compiler maintenance cost in developer time. It's not a huge difference in runtime cost, and it's rarely going to be possible. It is nice, though.
div3_exact_only:
imul eax, edi, 0xAAAAAAAB # 1 uop, 3c latency
ret
However, it is something you can do yourself in source code, at least for known type widths like uint32_t
:
uint32_t div3_exact_only(uint32_t x) {
return x * 0xaaaaaaabU;
}