x86 - Why are mov ah,bh and mov al, bl together much faster than single instruction mov ax, bx?

Question

Welcome To Ask or Share your Answers For Others

x86 - Why are mov ah,bh and mov al, bl together much faster than single instruction mov ax, bx?

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

x86 - Why are mov ah,bh and mov al, bl together much faster than single instruction mov ax, bx?

I've found that

mov al, bl
mov ah, bh

is much faster than

mov ax, bx

Can anyone explain me why? I'm running on Core 2 Duo 3 Ghz, in 32-bit mode under Windows XP. Compiling using NASM and then linking with VS2010. Nasm compile command:

nasm -f coff -o triangle.o triangle.asm

Here is the main loop I'm using to render a triangle:

; some variables on stack
%define cr  DWORD [ebp-20]
%define dcr DWORD [ebp-24]
%define dcg DWORD [ebp-32]
%define dcb DWORD [ebp-40]

loop:

add esi, dcg
mov eax, esi
shr eax, 8

add edi, dcb
mov ebx, edi
shr ebx, 16
mov bh, ah

mov eax, cr
add eax, dcr
mov cr, eax

mov ah, bh  ; faster
mov al, bl
;mov ax, bx

mov DWORD [edx], eax

add edx, 4

dec ecx
jge loop

I can provide whole VS project with sources for testing.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-23T19:32:17+0000

Why is it slow
The reason using a 16-bit register is expensive as opposed to using an 8-bit register is that 16-bit register instructions are decoded in microcode. This means an extra cycle during decoding and inability to be paired whilst decoding.
Also because ax is a partial register it will take an extra cycle to execute because the top part of the register needs to be combined with the write to the lower part.
8-bit writes have special hardware put in place to speed this up, but 16-bit writes do not. Again on many processors the 16-bit instructions take 2 cycles instead of one and they do not allow pairing.

This means that instead of being able to process 12 instructions (3 per cycle) in 4 cycles, you can now only execute 1, because you have a stall when decoding the instruction into microcode and a stall when processing the microcode.

How can I make it faster?

mov al, bl
mov ah, bh

(This code takes a minimum of 2 CPU-cycles and may give a stall on the second instruction because on some (older) x86 CPU's you get a lock on EAX)
Here's what happens:

EAX is read. (cycle 1)
- The lower byte of EAX is changed (still cycle 1)
- and the full value is written back into EAX. (cycle 1)
EAX is locked for writing until the first write is fully resolved. (potential wait for multiple cycles)
The process is repeated for the high byte in EAX. (cycle 2)

On the lastest Core2 CPU's this is not so much of a problem, because extra hardware has been put in place that knows that bl and bh really never get in each other's way.

mov eax, ebx

Which moves 4 bytes at a time, that single instruction will run in 1 cpu-cycle (and can be paired with other instructions in parallel).

If you want fast code, always use the 32-bit (EAX, EBX etc) registers.
Try to avoid using the 8-bit sub-registers, unless you have to.
Never use the 16-bit registers. Even if you have to use 5 instructions in 32-bit mode, that will still be faster.
Use the movzx reg, ... (or movsx reg, ...) instructions

Speeding up the code
I see a few opportunities to speed up the code.

; some variables on stack
%define cr  DWORD [ebp-20]
%define dcr DWORD [ebp-24]
%define dcg DWORD [ebp-32]
%define dcb DWORD [ebp-40]

mov edx,cr

loop:

add esi, dcg
mov eax, esi
shr eax, 8

add edi, dcb
mov ebx, edi
shr ebx, 16   ;higher 16 bits in ebx will be empty.
mov bh, ah

;mov eax, cr   
;add eax, dcr
;mov cr, eax

add edx,dcr
mov eax,edx

and eax,0xFFFF0000  ; clear lower 16 bits in EAX
or eax,ebx          ; merge the two. 
;mov ah, bh  ; faster
;mov al, bl


mov DWORD [epb+offset+ecx*4], eax ; requires storing the data in reverse order. 
;add edx, 4

sub ecx,1  ;dec ecx does not change the carry flag, which can cause
           ;a false dependency on previous instructions which do change CF    
jge loop

Categories

x86 - Why are mov ah,bh and mov al, bl together much faster than single instruction mov ax, bx?

x86 - Why are mov ah,bh and mov al, bl together much faster than single instruction mov ax, bx?

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags