cmov is an ALU select operation that always reads both sources before checking the condition. Using a memory source doesn't change this. It's not like an ARM predicated instruction that acts like a NOP if the condition was false. cmovz eax, [mem] also unconditionally writes EAX, zero-extending into RAX regardless of the condition.
As far as most of the CPU is concerned (the out-of-order scheduler and so on), cmovcc reg, [mem] is handled exactly like adc reg, [mem]: a 3-input 1-output ALU instruction. (adc writes flags, unlike cmov, but never mind that.) The micro-fused memory source operand is a separate uop that just happens to be part of the same x86 instruction. This is how the ISA rules for it work, too.
So really, a more appropriate mnemonic for cmovz would be selectz.
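To make the "select" view concrete, here is a minimal C sketch that models what cmovz eax, [mem] does architecturally. The function names and parameters are made up for illustration; this is only a model of the semantics described above, not a description of how any CPU implements it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of `cmovz eax, [mem]` (names are illustrative only).
 * rax: old 64-bit register value, zf: the zero flag, mem: the memory operand. */
uint64_t model_cmovz_eax_mem(uint64_t rax, int zf, const uint32_t *mem)
{
    uint32_t loaded  = *mem;                  /* the load runs whether or not ZF is set */
    uint32_t old_eax = (uint32_t)rax;         /* low 32 bits of the destination */
    uint32_t result  = zf ? loaded : old_eax; /* ALU select between two ready inputs */
    return (uint64_t)result;                  /* writing EAX always zero-extends into RAX */
}

/* Usage sketch: a false condition keeps EAX but still clears the upper half of RAX. */
void demo_cmovz(void)
{
    uint32_t m = 7;
    assert(model_cmovz_eax_mem(0xFFFFFFFF00000001ULL, 0, &m) == 1);
    assert(model_cmovz_eax_mem(0xFFFFFFFF00000001ULL, 1, &m) == 7);
}
```

In this model the pointer is dereferenced even when zf is 0, so a bad pointer is a problem regardless of the condition. That mirrors the point that cmov with a memory source is not a conditional load.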
x86's only conditional loads (that don't fault on bad addresses, just potentially run slowly) are:
Normal loads protected by conditional branches. Branch mis-prediction or other mis-speculations leading to running a faulting load are handled fairly efficiently (maybe starting a page walk, but once the mis-speculation is identified, execution of the correct flow of instructions doesn't have to wait for any memory operation started by speculative execution).
If there was a TLB hit on a page you can't read, then not much more happens until a faulting load reaches retirement (known to be non-speculative and thus actually taking a #PF page-fault exception, which is unavoidably going to be slow). On some CPUs, this fast handling leads to the Meltdown attack. >.< See http://blog.stuffedcow.net/2018/05/meltdown-microarchitecture/.
rep lodsd with RCX=0 or 1. (Not fast or efficient, but microcode branches are special and can't benefit from branch prediction on Intel CPUs. See "What setup does REP do?". Andy Glew mentions microcode branch mispredictions, but I think those are different from normal branch misses because there seems to be a fixed cost.)
AVX2 vpmaskmovd/q / AVX1 vmaskmovps/pd. Faults are suppressed for elements where the mask is 0. (A mask-load with an all-0 mask even from a legal address requires a ~200 cycle microcode assist with a base+index addressing mode.) See section 12.9 CONDITIONAL SIMD PACKED LOADS AND STORES and Table C-8 in Intel's optimization manual. (On Skylake, stores to an illegal address with an all-zero mask also need an assist.)
The earlier MMX/SSE2 maskmovdqu is store-only (and has an NT hint). Only the similar AVX instruction with dword/qword (instead of byte) elements has a load form.
AVX512 masked loads
AVX2 gathers with some / all mask elements cleared.
... and maybe others I'm forgetting. Normal loads inside TSX / RTM transactions: a fault aborts the transaction instead of raising a #PF. But you can't count on a bad index faulting instead of just reading bogus data from somewhere nearby, so it's not really a conditional load. It's also not super fast.
An alternative might be to cmov an address that you use unconditionally, selecting which address to load from. e.g. if you had a 0 to load from somewhere else, that would work. But then you'd have to calculate the table indexing in a register, not using an addressing mode, so you could cmov the final address.
Or just CMOV the index and pad the table with some zero bytes at the end so you can load from table + 128.
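The address-selection idea can be sketched in C roughly like this. zero_byte and lookup_select_address are hypothetical names, and whether a compiler turns the ternary into a cmov is up to the compiler. The point is only that the load itself is unconditional and always uses a valid address.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical "a 0 to load from somewhere else". */
static const uint8_t zero_byte = 0;

/* table is assumed to hold 128 entries, one per ASCII code point. */
uint8_t lookup_select_address(const uint8_t table[128], uint32_t c)
{
    /* In asm, table + c is just an integer and can hold any value.
     * In C the index is masked so the pointer arithmetic stays in bounds. */
    const uint8_t *in_table = table + (c & 127u);

    /* Select the address (the cmov step), then load unconditionally. */
    const uint8_t *p = (c < 128u) ? in_table : &zero_byte;
    return *p;
}
```

The cost noted above shows up here too: the table address has to be computed into a pointer first, so the load can't fold the indexing into its own addressing mode.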
Or use a branch; it will probably predict well for a lot of cases. But maybe not for languages like French, where you'll find a mix of low-128 and higher Unicode code-points in common text.
Code Review
Note that [rel] only works when there's no register (other than RIP) involved in the addressing mode. RIP-relative addressing replaces one of the 2 redundant ways (in 32-bit code) to encode a [disp32]. It uses the shorter non-SIB encoding, while a ModRM+SIB can still encode an absolute [disp32] with no registers. (Useful for addresses like [fs: 16] for small offsets relative to thread-local storage with segment bases.)
If you just want to use RIP-relative addressing when possible, use default rel at the top of your file. [symbol] will be RIP-relative, but [symbol + rax] won't. Unfortunately, NASM and YASM default to default abs.
[reg + disp32] is a very efficient way to index static data in position-dependent code; just don't fool yourself into thinking that it can be RIP-relative. See "32-bit absolute addresses no longer allowed in x86-64 Linux?".
[rel ascii_flags + EDI] is also weird because you're using a 32-bit register in an addressing mode in x86-64 code. There's usually no reason to spend an address-size prefix to truncate addresses to 32-bit.
However, in this case if your table is in the low 32 bits of virtual address space, and your function arg is only specified as 32 bits (so the caller is allowed to leave garbage in the upper 32 of RDI), it is actually a win to use [disp32 + edi] instead of a mov esi,edi or something to zero-extend. If you're doing that on purpose, definitely comment why you're using a 32-bit addressing mode.
But in this case, using a cmov on the index will zero-extend to 64-bit for you.
It's also weird to use a DWORD load from a table of bytes. You'll occasionally cross a cache-line boundary and suffer extra latency.
@fuz showed a version using a RIP-relative LEA and a CMOV on the index.
In position-dependent code where 32-bit absolute addresses are ok, by all means use them to save instructions. [disp32] addressing modes are worse than RIP-relative (1 byte longer), but [reg + disp32] addressing modes are perfectly fine when position-dependent code and 32-bit absolute addresses are ok. (e.g. x86-64 Linux, but not OS X, where executables are always mapped outside the low 32 bits.) Just be aware that it's not rel.
; position-dependent version taking advantage of 32-bit absolute [reg + disp32] addressing
; not usable in shared libraries, only non-PIE executables.
ft_isprint:
    mov     eax, 128                          ; offset of dummy entry for "not ASCII"
    cmp     edi, eax                          ; check if ascii
    cmovae  edi, eax                          ; replace with 128 if outside 0..127
                                              ; cmov also zero-extends EDI into RDI
    movzx   eax, byte [ascii_flags + rdi]     ; load table entry
    and     al, flag_print                    ; mask the desired flag
    ; if the caller is only going to read / test AL anyway, might as well save bytes here
    ret
If any existing entry in your table has the same flags you want for high inputs, e.g. maybe entry 0 which you'll never see in implicit-length strings, you could still xor-zero EAX and keep your tables at 128 bytes, not 129.
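For reference, here is a rough C model of the padded-table version above. The flag value and the table contents are assumptions (the real ascii_flags table and flag_print constant aren't shown here), and the _model names are made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>

enum { FLAG_PRINT = 0x01 };              /* hypothetical bit value for flag_print */

/* 128 ASCII entries plus one zero dummy entry at index 128. */
static uint8_t ascii_flags_model[129];

/* Fill in an assumed table: mark 0x20..0x7E as printable. */
void init_ascii_flags_model(void)
{
    for (int c = 0x20; c <= 0x7E; c++)
        ascii_flags_model[c] |= FLAG_PRINT;
    /* entry 128 stays 0: "not ASCII" has no flags set */
}

int ft_isprint_model(uint32_t c)
{
    uint32_t idx = (c >= 128u) ? 128u : c;      /* cmp edi, eax / cmovae edi, eax */
    return ascii_flags_model[idx] & FLAG_PRINT; /* movzx load, then and */
}
```

Like the asm, this returns the masked flag bit rather than a normalized 0 / 1, so callers should only test it for zero vs. non-zero.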
test r32, imm32 takes more code bytes than you need. ~127 = 0xFFFFFF80 would fit in a sign-extended byte, but there is no TEST r/m32, sign-extended-imm8 encoding. There is such an encoding for cmp, though, like essentially all other immediate instructions.
You could instead check for unsigned above 127, with cmp edi, 127 / cmovbe eax, edi or cmova edi, eax. This saves 3 bytes of code-size. Or we can save 4 bytes by using cmp reg,reg against the 128 we already have in a register for the table index.
A range-check before array indexing is also more intuitive for most humans than checking high bits anyway.
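The two checks are interchangeable for an unsigned 32-bit input. A small C sketch, with hypothetical function names, that can be used to confirm this:

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors `test edi, ~127`: true if any bit above the low 7 is set. */
int high_bits_set(uint32_t c) { return (c & ~127u) != 0; }

/* Mirrors `cmp edi, 127` followed by an unsigned "above" condition. */
int above_127(uint32_t c) { return c > 127u; }
```

Both return 0 for 0..127 and 1 for everything else, so the choice between them comes down to code size and readability.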
and al, imm8 is only 2 bytes, vs. 3 bytes for and r/m32, sign-extended-imm8. It's not slower on any CPUs, as long as the caller only reads AL. On Intel CPUs before Sandybridge, reading EAX after ANDing into AL could cause a partial-register stall / slowdown. Sandybridge doesn't rename partial registers for read-modify-write operations, if I recall correctly, and IvB and later don't rename low8 partial regs at all.
You might also use mov al, [table] instead of movzx to save another code byte. An earlier mov eax, 128 already broke any false dependency on the old value of EAX, so it shouldn't have a performance downside. But movzx is not a bad idea.
When all else is equal, smaller code-size is almost always better (for instruction-cache footprint, and sometimes for packing into the uop cache). If it cost any extra uops or introduced any false dependencies, it wouldn't be worth it when optimizing for speed, though.