Why is it slow?
Using a 16-bit register is more expensive than using an 8-bit register because 16-bit register instructions are decoded in microcode. That costs an extra cycle during decoding, and the instruction cannot be paired while it decodes.
Also, because ax is a partial register, it takes an extra cycle to execute: the untouched top part of the register has to be combined with the write to the lower part.
8-bit writes have special hardware to speed this up, but 16-bit writes do not. Again, on many processors 16-bit instructions take 2 cycles instead of 1 and cannot be paired.
This means that instead of processing 12 instructions in 4 cycles (3 per cycle), you can now execute only 1, because you get one stall when decoding the instruction into microcode and another when processing the microcode.
How can I make it faster?
mov al, bl
mov ah, bh
(This code takes a minimum of 2 CPU cycles and may stall on the second instruction, because on some (older) x86 CPUs you get a lock on EAX.)
Here's what happens:
- EAX is read. (cycle 1)
- The lower byte of EAX is changed (still cycle 1)
- and the full value is written back into EAX. (cycle 1)
- EAX is locked for writing until the first write is fully resolved. (potential wait for multiple cycles)
- The process is repeated for the high byte in EAX. (cycle 2)
On the latest Core2 CPUs this is not so much of a problem, because extra hardware has been put in place that knows that al and ah really never get in each other's way.
mov eax, ebx
This moves 4 bytes at a time; that single instruction runs in 1 CPU cycle (and can be paired with other instructions in parallel).
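To make the difference concrete, here is a minimal C model of what each kind of write does to EAX. It is only a sketch of the data flow, not of real CPU timing, and the function names (write_al, write_ah, write_eax) are made up for illustration. The point is that the two partial writes need the old value of EAX as an input, while the full 32-bit write does not.

```c
#include <stdint.h>

/* Models `mov al, bl`: only the low byte of eax changes, so the result
   depends on the previous value of eax (a read-modify-write). */
uint32_t write_al(uint32_t eax, uint32_t ebx) {
    return (eax & 0xFFFFFF00u) | (ebx & 0x000000FFu);
}

/* Models `mov ah, bh`: only bits 8..15 of eax change; the old eax is
   again an input. */
uint32_t write_ah(uint32_t eax, uint32_t ebx) {
    return (eax & 0xFFFF00FFu) | (ebx & 0x0000FF00u);
}

/* Models `mov eax, ebx`: the old value of eax is ignored entirely,
   so there is nothing to wait for. */
uint32_t write_eax(uint32_t eax_old, uint32_t ebx) {
    (void)eax_old;
    return ebx;
}
```

Chaining write_al and write_ah forces the second call to wait for the result of the first, which is the same dependency that the two-instruction version creates on EAX.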
- If you want fast code, always use the 32-bit (EAX, EBX etc) registers.
- Try to avoid using the 8-bit sub-registers, unless you have to.
- Never use the 16-bit registers. Even if you have to use 5 instructions in 32-bit mode, that will still be faster.
- Use the movzx reg, ... (or movsx reg, ...) instructions to load an 8-bit or 16-bit value into a full 32-bit register.
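As a rough illustration of the last point, this C sketch models what `movzx eax, bl` and `movsx eax, bl` compute. The helper names movzx_byte and movsx_byte are hypothetical. In both cases the whole 32-bit destination is overwritten, so the result does not depend on what the destination held before.

```c
#include <stdint.h>

/* Models `movzx eax, bl`: take the low byte of ebx and zero-extend it.
   All 32 bits of the destination are written. */
uint32_t movzx_byte(uint32_t ebx) {
    return ebx & 0xFFu;
}

/* Models `movsx eax, bl`: take the low byte of ebx and sign-extend it.
   The XOR/subtract trick copies bit 7 into bits 8..31 using only
   well-defined unsigned arithmetic. */
uint32_t movsx_byte(uint32_t ebx) {
    uint32_t b = ebx & 0xFFu;
    return (b ^ 0x80u) - 0x80u;
}
```

For example, with ebx = 0x123456F0 the zero-extended result is 0x000000F0 and the sign-extended result is 0xFFFFFFF0.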
Speeding up the code
I see a few opportunities to speed up the code.
; some variables on stack
%define cr DWORD [ebp-20]
%define dcr DWORD [ebp-24]
%define dcg DWORD [ebp-32]
%define dcb DWORD [ebp-40]
mov edx,cr
loop:
add esi, dcg
mov eax, esi
shr eax, 8
add edi, dcb
mov ebx, edi
shr ebx, 16 ;higher 16 bits in ebx will be empty.
mov bh, ah
;mov eax, cr
;add eax, dcr
;mov cr, eax
add edx,dcr
mov eax,edx
and eax,0xFFFF0000 ; clear lower 16 bits in EAX
or eax,ebx ; merge the two.
;mov ah, bh ; faster
;mov al, bl
mov DWORD [ebp+offset+ecx*4], eax ; requires storing the data in reverse order.
;add edx, 4
sub ecx,1 ;dec ecx does not change the carry flag, which can cause
;a false dependency on previous instructions which do change CF
jge loop
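If you want to check the loop body against something easier to read, the following C function is a sketch of the value it stores on each iteration. It assumes the three channel accumulators (edx, esi and edi in the assembly) are 16.16 fixed-point values, which is how the shifts in the loop read; the name pack_pixel and its parameters are invented for this example.

```c
#include <stdint.h>

/* Mirrors one iteration of the loop, after the three `add` instructions:
     r = edx (red), g = esi (green), b = edi (blue).
   The upper 16 bits of r are kept as-is, the integer byte of g lands in
   bits 8..15 (what `mov bh, ah` does) and the integer byte of b lands in
   bits 0..7 (what is left in bl after `shr ebx, 16`). */
uint32_t pack_pixel(uint32_t r, uint32_t g, uint32_t b) {
    uint32_t low = (((g >> 16) & 0xFFu) << 8) | ((b >> 16) & 0xFFu);
    return (r & 0xFFFF0000u) | low;
}
```

With r = 0x0012ABCD, g = 0x0034FFFF and b = 0x00560001 this gives 0x00123456: the fractional lower halves are dropped and the three integer bytes end up next to each other.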