In the decoders and uop-cache, addressing mode doesn't affect micro-fusion (except that an instruction with an immediate operand can't micro-fuse a RIP-relative addressing mode).
But some combinations of uop and addressing mode can't stay micro-fused in the ROB (in the out-of-order core), so Intel SnB-family CPUs "un-laminate" when necessary, at some point before the issue/rename stage. For issue-throughput, and out-of-order window size (ROB-size), fused-domain uop count after un-lamination is what matters.
Intel's optimization manual describes un-lamination for Sandybridge in Section 2.5.2.4: Micro-op Queue and the Loop Stream Detector (LSD), but doesn't describe the changes for any later microarchitectures.
UPDATE: Now Intel manual has a detailed section to describe un-lamination for Haswell. See section 2.4.5 Unlamination. And a brief description for SandyBridge is in section 2.5.2.4.
The rules, as best I can tell from experiments on SnB, HSW, and SKL:
- SnB (and I assume also IvB): indexed addressing modes are always un-laminated, others stay micro-fused. IACA is (mostly?) correct.
- HSW, SKL: These only keep an indexed ALU instruction micro-fused if it has 2 operands and treats the dst register as read-modify-write. Here "operands" includes flags, meaning that
adc
and cmov
don't micro-fuse. Most VEX-encoded instructions also don't fuse since they generally have three operands (so paddb xmm0, [rdi+rbx]
fuses but vpaddb xmm0, xmm0, [rdi+rbx]
doesn't). Finally, the occasional 2-operand instruction where the first operand is write only, such as pabsb xmm0, [rax + rbx]
also do not fuse. IACA is wrong, applying the SnB rules.
Related: simple (non-indexed) addressing modes are the only ones that the dedicated store-address unit on port7 (Haswell and later) can handle, so it's still potentially useful to avoid indexed addressing modes for stores. (A good trick for this is to address your dst with a single register, but src with dst+(initial_src-initial_dst)
. Then you only have to increment the dst register inside a loop.)
Note that some instructions never micro-fuse at all (even in the decoders/uop-cache). e.g. shufps xmm, [mem], imm8
, or vinsertf128 ymm, ymm, [mem], imm8
, are always 2 uops on SnB through Skylake, even though their register-source versions are only 1 uop. This is typical for instructions with an imm8 control operand plus the usual dest/src1, src2 register/memory operands, but there are a few other cases. e.g. PSRLW/D/Q xmm,[mem]
(vector shift count from a memory operand) doesn't micro-fuse, and neither does PMULLD.
See also this post on Agner Fog's blog for discussion about issue throughput limits on HSW/SKL when you read lots of registers: Lots of micro-fusion with indexed addressing modes can lead to slowdowns vs. the same instructions with fewer register operands: one-register addressing modes and immediates. We don't know the cause yet, but I suspect some kind of register-read limit, maybe related to reading lots of cold registers from the PRF.
Test cases, numbers from real measurements: These all micro-fuse in the decoders, AFAIK, even if they're later un-laminated.
# store
mov [rax], edi SnB/HSW/SKL: 1 fused-domain, 2 unfused. The store-address uop can run on port7.
mov [rax+rsi], edi SnB: unlaminated. HSW/SKL: stays micro-fused. (The store-address can't use port7, though).
mov [buf +rax*4], edi SnB: unlaminated. HSW/SKL: stays micro-fused.
# normal ALU stuff
add edx, [rsp+rsi] SnB: unlaminated. HSW/SKL: stays micro-fused.
# I assume the majority of traditional/normal ALU insns are like add
Three-input instructions that HSW/SKL may have to un-laminate
vfmadd213ps xmm0,xmm0,[rel buf] HSW/SKL: stays micro-fused: 1 fused, 2 unfused.
vfmadd213ps xmm0,xmm0,[rdi] HSW/SKL: stays micro-fused
vfmadd213ps xmm0,xmm0,[0+rdi*4] HSW/SKL: un-laminated: 2 uops in fused & unfused-domains.
(So indexed addressing mode is still the condition for HSW/SKL, same as documented by Intel for SnB)
# no idea why this one-source BMI2 instruction is unlaminated
# It's different from ADD in that its destination is write-only (and it uses a VEX encoding)
blsi edi, [rdi] HSW/SKL: 1 fused-domain, 2 unfused.
blsi edi, [rdi+rsi] HSW/SKL: 2 fused & unfused-domain.
adc eax, [rdi] same as cmov r, [rdi]
cmove ebx, [rdi] Stays micro-fused. (SnB?)/HSW: 2 fused-domain, 3 unfused domain.
SKL: 1 fused-domain, 2 unfused.
# I haven't confirmed that this micro-fuses in the decoders, but I'm assuming it does since a one-register addressing mode does.
adc eax, [rdi+rsi] same as cmov r, [rdi+rsi]
cmove ebx, [rdi+rax] SnB: untested, probably 3 fused&unfused-domain.
HSW: un-laminated to 3 fused&unfused-domain.
SKL: un-laminated to 2 fused&unfused-domain.
I assume that Broadwell behaves like Skylake for adc/cmov.
It's strange that HSW un-laminates memory-source ADC and CMOV. Maybe Intel didn't get around to changing that from SnB before they hit the deadline for shipping Haswell.
Agner's insn table says cmovcc r,m
and adc r,m
don't micro-fuse at all on HSW/SKL, but that doesn't match my experiments. The cycle counts I'm measuring match up with the the fused-domain uop issue count, for a 4 uops / clock issue bottleneck. Hopefully he'll double-check that and correct the tables.
Memory-dest integer ALU:
add [rdi], eax SnB: untested (Agner says 2 fused-domain, 4 unfused-domain (load + ALU + store-address + store-data)
HSW/SKL: 2 fused-domain, 4 unfused.
add [rdi+rsi], eax SnB: untested, probably 4 fused & unfused-domain
HSW/SKL: 3 fused-domain, 4 unfused. (I don't know which uop stays fused).
HSW: About 0.95 cycles extra store-forwarding latency vs. [rdi] for the same address used repeatedly. (6.98c per iter, up from 6.04c for [rdi])
SKL: 0.02c extra latency (5.45c per iter, up from 5.43c for [rdi]), again in a tiny loop with dec ecx/jnz
adc [rdi], eax SnB: untested
HSW: 4 fused-domain, 6 unfused-domain. (same-address throughput 7.23c with dec, 7.19c with sub ecx,1)
SKL: 4 fused-domain, 6 unfused-domain. (same-address throughput ~5.25c with dec, 5.28c with sub)
adc [rdi+rsi], eax SnB: untested
HSW: 5 fused-domain, 6 unfused-domain. (same-address throughput = 7.03c)
SKL: 5 fused-domain, 6 unfused-domain. (same-address throughput = ~5.4c with sub ecx,1 for the loop branch, or 5.23c with dec ecx for the loop branch.)
Yes, that's right, adc [rdi],eax
/ dec ecx
/ jnz
runs faster than the same loop with add
instead of adc
on SKL. I didn't try using different addresses, since clearly SKL doesn't like repeated rewrites of the same address (store-forwarding latency higher than expected. See also this post about repeated store/reload to the same address being slower than expected on SKL.
Memory-destination adc
is so many uops because Intel P6-family (and apparently SnB-family) can't keep the same TLB entries for all the uops of a multi-uop instruction, so it needs an extra uop to work around the problem-case where the load and add complete, and then the store faults, but the insn can't just be restarted because CF has already been updated. Interesting series of comments from Andy Glew (@krazyglew).
Presumably fusion in the decoders and un-lamination later saves us from needing microcode ROM to produce more than 4 fused-domain uops from a single instruction for adc [base+idx], reg
.
Why SnB-family un-laminates:
Sandybridge simplified the internal uop format to save power and transistors (along with making the major change to using a physical register file, instead of keeping input / output data in the ROB). SnB-family CPUs only allow a limited number of input registers for a fused-domain uop in the out-of-order core. For SnB/IvB, that limit is 2 inputs (including flags). For HSW and later, the limit is 3 inputs for a uop. I'm not sure if memory-destination add
and adc
are taking full advantage of that, or if Intel had to get Haswell out the door with some instructions
Nehalem and earlier have a limit of 2 inputs for an unfused-domain uop, but the ROB can apparently track micro-fused uops with 3 input registers (the non-memory register operand, base, and index).
So indexed stores and ALU+load instructions can still decode efficiently (not having to be the first uop in a group), and don't take extra space in the uop cache, but otherwise the advantages of micro-fusion are essentially gone for tuning tight loops. "un-lamination" happens before the 4-fused-domain-uops-per-cycle issue/retire width out-of-order core. The fused-domain performance counters (uops_issued / uops_retired.retire_slots) count fused-domain uops after un-lamination.
Intel's description of the renamer (Section 2.3.3.1: Renamer) implies that it's the issue/rename stage which actually does the un-lamination, so uops destined for un-lamination may still be micro-fused in the 28/56/64 fused-domain uop issue queue / loop-buffer (aka the IDQ).
TODO: test this. Make a loop that should just barely fit in the loop buffer. Change something so one of the uops will be un-laminated before issuing, and see if it still runs from the loop buffer (LSD), or if all the uops are now re-fetched from the uop cache (DSB). There are perf counters to track where uops come from, so this should be easy.
Harder TODO: if un-lamination happens between reading from the uop cache and adding to the IDQ, test whether it can ever reduce uop-cache bandwidth. Or if un-lamination happens right at the issue stage, can it hurt issue throughput? (i.e. how does it handle the leftover uops after issuing the first 4.)
(See the a previous version of this answer for some guesses based on tuning some LUT code, with some notes on vpgatherdd
being about 1.7x more cycles than a pinsrw
loop.)
Experimental testing on SnB
The HSW/SKL numbers were measured on an i5-4210U and an i7-6700k. Both had HT enabled (but the system idle so the thread had the whole core to itself). I ran the same static binaries on both systems, Linux 4.10 on SKL and Linux 4.8 on HSW, using ocperf.py
. (The HSW laptop NFS-mounted my SKL desktop's /home.)
The SnB numbers were measured as described below,