Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share

assembly - MOVZX missing 32 bit register to 64 bit register

Here's the instruction which copies (converts) unsigned registers: http://www.felixcloutier.com/x86/MOVZX.html

Basically the instruction has 8->16, 8->32, 8->64, 16->32 and 16->64.

Where's the 32->64 conversion? Do I have to use the signed version for that?
If so how do you use the full 64 bits for an unsigned integer?

1 Reply


Short answer

Use mov eax, edi to zero-extend EDI into RAX if you can't already guarantee that the high bits of RDI are all zero. See: Why do x86-64 instructions on 32-bit registers zero the upper part of the full 64-bit register?

Prefer using different source/destination registers, because mov-elimination fails for mov eax,eax on both Intel and AMD CPUs. When moving to a different register you incur zero latency with no execution unit needed. (gcc apparently doesn't know this and usually zero-extends in place.) Don't spend extra instructions to make that happen, though.


Long answer

Machine-code reason why there's no encoding for movzx with a 32-bit source

summary: Every different source width for movzx and movsx needs a different opcode. The destination width is controlled by prefixes. Since mov can do the job, a new opcode for movzx dst, r/m32 would be redundant.

When designing AMD64 assembler syntax, AMD chose not to make movzx rax, edx work as a pseudo-instruction for mov eax, edx. This is probably a good thing, because knowing that writing a 32-bit register zeros the upper bytes is very important to writing efficient code for x86-64.


AMD64 did need a new opcode for sign extension with a 32-bit source operand. They named the mnemonic movsxd for some reason, instead of making it a 3rd opcode for the movsx mnemonic. Intel documents them all together in one ISA ref manual entry. They repurposed the 1-byte opcode that was ARPL in 32-bit mode, so movsxd is actually 1 byte shorter than movsx from 8 or 16-bit sources (assuming you still need a REX prefix to extend to 64-bit).

Different destination sizes use the same opcode with different operand size [1]. (66 or REX.W prefix for 16-bit or 64-bit instead of the default 32-bit.) e.g. movsx eax, bl and movsx rax, bl differ only in the REX prefix; same opcode. (movsx ax, bl is also the same, but with a 66 prefix to make the operand-size 16-bit.)

Before AMD64, there was no need for an opcode that reads a 32-bit source, because the maximum destination width was 32 bits, and "sign-extension" to the same size is just a copy. Notice that movsxd eax, eax is legal but not recommended. You can even encode it with a 66 prefix to read a 32-bit source and write a 16-bit destination [2].

Intel's manual says: "The use of MOVSXD without REX.W in 64-bit mode is discouraged. Regular MOV should be used instead of using MOVSXD without REX.W."

32->64 bit sign extension can be done with cdq to sign-extend EAX into EDX:EAX (e.g. before 32-bit idiv). This was the only way before x86-64 (other than, of course, copying and using an arithmetic right shift to broadcast the sign bit).


But AMD64 already zero-extends from 32 to 64 for free with any instruction that writes a 32-bit register. This avoids false dependencies for out-of-order execution, which is why AMD broke with the 8086 / 386 tradition of leaving upper bytes untouched when writing a partial register. (Why doesn't GCC use partial registers?)

Since each source width needs a different opcode, no prefixes can make either of the two movzx opcodes read a 32-bit source.


You do sometimes need to spend an instruction to zero-extend something. It's common in compiler output for small functions, because the x86-64 SysV and Windows x64 calling conventions allow high garbage in args and return values.

As usual, ask a compiler if you want to know how to do something in asm, especially when you don't see instructions you're looking for. I've omitted the ret at the end of each function.

Source + asm from the Godbolt compiler explorer, for the System V calling convention (args in RDI, RSI, RDX, ...):

#include <stdint.h>

uint64_t zext(uint32_t a) { return a; }
uint64_t extract_low(uint64_t a) { return a & 0xFFFFFFFF; }
    # both compile to
    mov     eax, edi

int use_as_index(int *p, unsigned a) { return p[a]; }
   # gcc
    mov     esi, esi         # missed optimization: mov same,same can't be eliminated on Intel
    mov     eax, DWORD PTR [rdi+rsi*4]

   # clang
    mov     eax, esi         # with signed int a, we'd get movsxd
    mov     eax, dword ptr [rdi + 4*rax]


uint64_t zext_load(uint32_t *p) { return *p; }
    mov     eax, DWORD PTR [rdi]

uint64_t zext_add_result(unsigned a, unsigned b) { return a+b; }
    lea     eax, [rdi+rsi]

The default address-size is 64 in x86-64. High garbage doesn't affect the low bits of addition, so this saves a byte vs. lea eax, [edi+esi] which needs a 67 address-size prefix but gives identical results for every input. Of course, add edi, esi would produce a zero-extended result in RDI.

uint64_t zext_mul_result(unsigned a, unsigned b) { return a*b; }
   # gcc8.1
    mov     eax, edi
    imul    eax, esi

   # clang6.0
    imul    edi, esi
    mov     rax, rdi    # silly: mov eax,edi would save a byte here

Intel recommends destroying the result of a mov right away when you have the choice, freeing the microarchitectural resources that mov-elimination takes up and increasing the success-rate of mov-elimination (which isn't 100% on Sandybridge-family, unlike AMD Ryzen). GCC's choice of mov / imul is best.

Also, on CPUs without mov-elimination, the mov before imul might not be on the critical path if it's the other input that's not ready yet (i.e. if the critical path goes through the input that doesn't get moved). But mov after imul depends on both inputs so it's always on the critical path.

Of course, when these functions inline, the compiler will usually know the full state of registers, unless they come from function return values. And also it doesn't need to produce the result in a specific register (RAX return value). But if your source is sloppy with mixing unsigned with size_t or uint64_t, the compiler might be forced to emit instructions to truncate 64-bit values. (Looking at compiler asm output is a good way to catch that and figure out how to tweak the source to let the compiler save instructions.)


Footnote 1: Fun fact: AT&T syntax (which uses different mnemonics like movswl (sign-extend word->long, i.e. dword) or movzbl) can infer the destination size from the register, like movzb %al, %ecx, but won't assemble movz %al, %ecx even though there's no ambiguity. So it treats movzb as its own mnemonic, with the usual operand-size suffix which can be inferred or explicit. This means each different opcode has its own mnemonic in AT&T syntax.

See also assembly cltq and movslq difference for a history lesson on the redundancy between CDQE for EAX->RAX and MOVSXD for any registers. See What does cltq do in assembly? or the GAS docs for the AT&T vs. Intel mnemonics for zero/sign-extension.

Footnote 2: Silly computer tricks with movsxd ax, [rsi]:

Assemblers refuse to assemble movsxd eax, eax or movsxd ax, eax, but it is possible to manually encode it. ndisasm doesn't even disassemble it (just db 0x63), but GNU objdump does. Actual CPUs decode it, too. I tried on Skylake just to make sure:

 ; NASM source                           ; register value after stepi in GDB
mov     rdx, 0x8081828384858687
movsxd  rax, edx                         ; RAX = 0xffffffff84858687
db 0x63, 0xc2        ;movsxd  eax, edx   ; RAX = 0x0000000084858687
xor     eax,eax                          ; RAX = 0
db 0x66, 0x63, 0xc2  ;movsxd  ax, edx    ; RAX = 0x0000000000008687

So how does the CPU handle it internally? Does it actually read 32 bits and then truncate to the operand-size? It turns out Intel's ISA reference manual does document the 16-bit form, movsxd r16, r/m32, as a valid (if discouraged) encoding.

