Consider mild code-golfing to shrink your code instead of expanding it, especially before a loop. e.g. xor eax,eax / cdq if you need two zeroed registers, or mov eax, 1 / lea ecx, [rax+1] to set registers to 1 and 2 in only 8 total bytes instead of 10. See Set all bits in CPU register to 1 efficiently for more about that, and Tips for golfing in x86/x64 machine code for more general ideas. Probably you still want to avoid false dependencies, though.
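For example, here's that setup as a sketch, with byte counts (NASM, 64-bit mode):

mov  eax, 1          ; 5 bytes: B8 imm32
lea  ecx, [rax+1]    ; 3 bytes: 8D /r + disp8, sets ecx = 2
;mov eax, 1          ; the straightforward way: 5 bytes
;mov ecx, 2          ; + 5 more = 10 total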
Or fill extra space by creating a vector constant on the fly instead of loading it from memory. (Adding more uop-cache pressure could be worse, though, for the larger loop that contains your setup + inner loop. But it avoids d-cache misses for constants, so it has an upside to compensate for running more uops.)
If you weren't already using them to load "compressed" constants, pmovsxbd, movddup, or vpbroadcastd are longer than movaps. dword / qword broadcast loads are free (no ALU uop, just a load).
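A sketch of creating a constant on the fly with the classic all-ones idiom (SSE2; ones_vec is a hypothetical .rodata constant for comparison):

pcmpeqd xmm0, xmm0   ; 4 bytes: xmm0 = all-ones, no memory access
psrld   xmm0, 31     ; 5 bytes: xmm0 = set1_epi32(1)
;movaps xmm0, [rel ones_vec]   ; 7 bytes + 16 bytes of .rodata, possible d-cache miss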
If you're worried about code alignment at all, you're probably worried about how it sits in the L1I cache or where the uop-cache boundaries are, so just counting total uops is no longer sufficient, and a few extra uops in the block before the one you care about may not be a problem at all.
But in some situations, you might really want to optimize decode throughput / uop-cache usage / total uops for the instructions before the block you want aligned.
Padding instructions, like the question asked for:
Agner Fog has a whole section on this: "10.6 Making instructions longer for the sake of alignment" in his "Optimizing subroutines in assembly language" guide. (The lea, push r/m64, and SIB ideas are from there, and I copied a sentence / phrase or two, otherwise this answer is my own work, either different ideas or written before checking Agner's guide.)
It hasn't been updated for current CPUs, though: lea eax, [rbx + dword 0] has more downsides than it used to vs mov eax, ebx, because you miss out on zero-latency / no-execution-unit mov. If it's not on the critical path, go for it though. Simple lea has fairly good throughput, and an LEA with a large addressing mode (and maybe even some segment prefixes) can be better for decode / execute throughput than mov + nop.
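Concretely, a sketch of the trade-off (byte counts as NASM encodes them):

mov eax, ebx               ; 2 bytes: 89 D8; zero latency with mov-elimination
lea eax, [rbx + dword 0]   ; 6 bytes: 8D /r + forced disp32; needs an ALU uop, 1c latency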
Use the general form instead of the short form (no ModR/M) of instructions like push reg or mov reg,imm. e.g. use 2-byte push r/m64 for push rbx. Or use an equivalent instruction that is longer, like add dst, 1 instead of inc dst, in cases where there are no perf downsides to inc so you were already using it.
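A sketch of those substitutions (64-bit mode; the general push form is emitted as raw bytes here because NASM picks the short form itself):

push rbx        ; 1 byte:  53 (short form, opcode+reg)
db 0xFF, 0xF3   ; 2 bytes: push rbx via the general push r/m64 form (FF /6)
inc ebx         ; 2 bytes: FF C3
add ebx, 1      ; 3 bytes: 83 C3 01, same result, one byte longer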
Use a SIB byte. You can get NASM to do that by using a single register as an index, like mov eax, [nosplit rbx*1] (see also), but that hurts the load-use latency vs. simply encoding mov eax, [rbx] with a SIB byte. Indexed addressing modes have other downsides on SnB-family, like un-lamination and not using port7 for stores.
So it's best to just encode base=rbx + disp0/8/32=0 using ModR/M + SIB with no index reg. (The SIB encoding for "no index" is the encoding that would otherwise mean idx=RSP.) [rsp + x] addressing modes require a SIB already (base=RSP is the escape code that means there's a SIB), and that appears all the time in compiler-generated code. So there's very good reason to expect this to be fully efficient to decode and execute (even for base registers other than RSP) now and in the future. NASM syntax can't express this, so you'd have to encode manually. GNU gas Intel syntax from objdump -d says 8b 04 23 mov eax,DWORD PTR [rbx+riz*1] for Agner Fog's example 10.20. (riz is a fictional index-zero notation that means there's a SIB with no index.) I haven't tested if GAS accepts that as input.
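If you do hand-encode it, a sketch using the bytes from Agner's example quoted above:

db 0x8B, 0x04, 0x23   ; mov eax, [rbx] with redundant SIB: ModR/M=04 (rm=100 means SIB follows), SIB=23 (index=100 = none, base=011 = rbx)
;mov eax, [rbx]       ; normal 2-byte encoding: 8B 03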
Use an imm32 and/or disp32 form of an instruction that only needed imm8 or disp0/disp8. Agner Fog's testing of Sandybridge's uop cache (microarch guide table 9.1) indicates that the actual value of an immediate / displacement is what matters, not the number of bytes used in the instruction encoding. I don't have any info on Ryzen's uop cache.
So NASM imul eax, [dword 4 + rdi], strict dword 13 (10 bytes: opcode + modrm + disp32 + imm32) would use the 32small, 32small category and take 1 entry in the uop cache, unlike if either the immediate or disp32 actually had more than 16 significant bits. (Then it would take 2 entries, and loading it from the uop cache would take an extra cycle.)
According to Agner's table, 8/16/32small are always equivalent for SnB. And addressing modes with a register are the same whether there's no displacement at all, or whether it's 32small, so mov dword [dword 0 + rdi], 123456 takes 2 entries, just like mov dword [rdi], 123456789. I hadn't realized [rdi] + full imm32 took 2 entries, but apparently that is the case on SnB.
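A sketch of what those NASM overrides do to the encoding:

imul eax, [dword 4 + rdi], strict dword 13   ; 10 bytes: 69 /r + disp32 + imm32, still 1 uop-cache entry on SnB
;imul eax, [rdi + 4], 13                     ; 4 bytes: 6B /r + disp8 + imm8, the compact form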
Use jmp / jcc rel32 instead of rel8. Ideally try to expand instructions in places that don't require longer jump encodings outside the region you're expanding. Pad after jump targets for earlier forward jumps, pad before jump targets for later backward jumps, if they're close to needing a rel32 somewhere else. i.e. try to avoid padding between a branch and its target, unless you want that branch to use a rel32 anyway.
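In NASM, the near keyword forces the rel32 encodings even when the target is in range, a sketch (.done is a hypothetical label):

jz  near .done   ; 6 bytes: 0F 84 rel32, instead of 2-byte 74 rel8
jmp near .done   ; 5 bytes: E9 rel32, instead of 2-byte EB rel8
.done: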
You might be tempted to encode mov eax, [symbol] as 6-byte a32 mov eax, [abs symbol] in 64-bit code, using an address-size prefix to use a 32-bit absolute address. But this does cause a Length-Changing-Prefix stall when it decodes on Intel CPUs. Fortunately, none of NASM/YASM / gas / clang do this code-size optimization by default if you don't explicitly specify a 32-bit address-size, instead using 7-byte mov r32, r/m32 with a ModR/M+SIB+disp32 absolute addressing mode for mov eax, [abs symbol].
In 64-bit position-dependent code, absolute addressing is a cheap way to use 1 extra byte vs. RIP-relative. But note that 32-bit absolute + immediate takes 2 cycles to fetch from the uop cache, unlike RIP-relative + imm8/16/32 which takes only 1 cycle even though it still uses 2 entries for the instruction (e.g. for a mov-store or a cmp). So cmp [abs symbol], 123 is slower to fetch from the uop cache than cmp [rel symbol], 123, even though both take 2 entries each. Without an immediate, there's no extra fetch cost for an absolute addressing mode.
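As a sketch (symbol stands for any static variable, as above; the dword is required by NASM for a memory, immediate cmp):

cmp dword [abs symbol], 123   ; 11 bytes: SIB+disp32 absolute + imm32; 2 entries, 2 cycles to fetch
cmp dword [rel symbol], 123   ; 10 bytes: RIP-relative + imm32; 2 entries, 1 cycle to fetch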
Note that PIE executables allow ASLR even for the executable, and are the default in many Linux distros, so if you can keep your code PIC without any perf downsides, that's preferable.
Use a REX prefix when you don't need one, e.g. db 0x40 / add eax, ecx.
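That is, a 1-byte padding sketch:

db 0x40        ; REX prefix with no bits set: meaningless here, just padding
add eax, ecx   ; 01 C8, encoded as 40 01 C8 with the prefix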
It's not in general safe to add prefixes like rep that current CPUs ignore, because they might mean something else in future ISA extensions.
Repeating the same prefix is sometimes possible (not with REX, though). For example, db 0x66, 0x66 / add ax, bx gives the instruction 3 operand-size prefixes, which I think is always strictly equivalent to one copy of the prefix. Up to 3 prefixes is the limit for efficient decoding on some CPUs. But this only works if you have a prefix you can use in the first place; you usually aren't using 16-bit operand-size, and generally don't want 32-bit address-size (although it's safe for accessing static data in position-dependent code).
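As a sketch (NASM already emits one 66 for the 16-bit operand size, so the db adds two more):

db 0x66, 0x66
add ax, bx     ; encodes as 66 66 66 01 D8: three operand-size prefixes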
A ds or ss prefix on an instruction that accesses memory is a no-op, and probably doesn't cause any slowdown on any current CPUs. (@prl suggested this in comments.)
In fact, Agner Fog's microarch guide uses a ds prefix on a movq [esi+ecx],mm0 in Example 7.1 ("Arranging IFETCH blocks") to tune a loop for PII/PIII (no loop buffer or uop cache), speeding it up from 3 clocks per iteration to 2.
Some CPUs (like AMD) decode slowly when instructions have more than 3 prefixes. On some CPUs, this includes the mandatory prefixes in SSE2 and especially SSSE3 / SSE4.1 instructions. In Silvermont, even the 0F escape byte counts.
AVX instructions can use a 2- or 3-byte VEX prefix. Some instructions require a 3-byte VEX prefix (2nd source is x/ymm8-15, or mandatory prefixes for SSSE3 or later). But an instruction that could have used a 2-byte prefix can always be encoded with a 3-byte VEX instead: NASM and GAS both accept a {vex3} override. If AVX512 is available, you can use 4-byte EVEX as well.
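A sketch with NASM's {vex3} override (using the 3-operand form):

vxorps xmm0, xmm0, xmm0          ; 4 bytes: C5 F8 57 C0 (2-byte VEX)
{vex3} vxorps xmm0, xmm0, xmm0   ; 5 bytes: C4 E1 78 57 C0 (forced 3-byte VEX)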
Use 64-bit operand-size for mov even when you don't need it, for example mov rax, strict dword 1 forces the 7-byte sign-extended-imm32 encoding in NASM, which would normally optimize it to 5-byte mov eax, 1.
mov eax, 1 ; 5 bytes to encode (B8 imm32)
mov rax, strict dword 1 ; 7 bytes: REX mov r/m64, sign-extended-imm32.
mov rax, strict qword 1 ; 10 bytes to encode (REX B8 imm64). movabs mnemonic for AT&T.
You could even use mov reg, 0 instead of xor reg,reg.
mov r64, imm64 fits efficiently in the uop cache when the constant is actually small (fits in 32-bit sign-extended): 1 uop-cache entry, and load-time = 1, the same as for mov r32, imm32. Decoding a giant instruction means there's probably not room in a 16-byte decode block for 3 other instructions to decode in the same cycle, unless they're all 2-byte. Possibly lengthening multiple other instructions slightly can be better than having one long instruction.
Decode penalties for extra prefixes:
- P5: prefixes prevent pairing, except for address/operand-size on PMMX only.
- PPro to PIII: There is always a penalty if an instruction has more than one prefix, typically one clock per extra prefix.