No, fusion is totally separate from how one complex instruction (like `cpuid` or `lock add [mem], eax`) can decode to multiple uops.
The way the retirement stage figures out that all the uops for a single instruction have retired, and thus the instruction has retired, has nothing to do with fusion.
Macro-fusion decodes `cmp`/`jcc` or `test`/`jcc` into a single compare-and-branch uop (Intel and AMD CPUs). The rest of the pipeline sees it purely as a single uop¹ (except performance counters still count it as 2 instructions). This saves uop-cache space, and bandwidth everywhere including decode. In some code, compare-and-branch makes up a significant fraction of the total instruction mix, maybe around 25%, so choosing to look for this fusion rather than other possible fusions like `mov dst,src1` / `or dst,src2` makes sense.
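As a sketch of what the decoders look for, a macro-fusion candidate is a flag-setting instruction immediately followed by a conditional branch that reads those flags (NASM syntax; labels are made up for illustration):

```nasm
; Adjacent flag-setter + flag-consumer: a macro-fusion candidate.
    cmp  eax, 1000       ; sets flags...
    jb   .loop_again     ; ...consumed by the very next instruction
                         ; -> can decode as one cmp-and-branch uop

; Any instruction in between defeats macro-fusion, even one that
; doesn't touch flags:
    test edx, edx
    mov  ecx, 1          ; unrelated instruction separates the pair
    jnz  .some_label     ; still correct (mov writes no flags), but
                         ; test/jnz can no longer fuse
```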
Sandybridge-family can also macro-fuse some other ALU instructions with conditional branches, like `add`/`sub` or `inc`/`dec` + JCC, with some conditions. (x86_64 - Assembly - loop conditions and out of order)
Micro-fusion stores 2 uops from the same instruction together so they only take up 1 "slot" in the fused-domain parts of the pipeline. But they still have to dispatch separately to separate execution units. And in Intel Sandybridge-family, the RS (Reservation Station aka scheduler) is in the unfused domain, so they're even stored separately in the scheduler. (See Footnote 2 in my answer on Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths.)
P6 family had a fused-domain RS, as well as ROB, so micro-fusion helped increase the effective size of the out-of-order window there. But SnB-family reportedly simplified the uop format making it more compact, allowing larger RS sizes that are helpful all the time, not just for micro-fused instructions.
And Sandybridge-family will "un-laminate" indexed addressing modes under some conditions, splitting them back into 2 separate uops in their own slots before issue/rename into the out-of-order back end, so you lose the front-end issue/rename throughput benefit of micro-fusion. See Micro fusion and addressing modes.
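For example (NASM syntax; a sketch of the SnB-family behavior described above):

```nasm
; One-register addressing mode: the load+cmp pair stays micro-fused
; through issue/rename into the out-of-order back end:
    cmp  eax, [rdi]          ; 1 fused-domain uop

; Indexed addressing mode with an immediate: micro-fuses in the
; decoders but un-laminates before the issue stage, costing 2
; fused-domain slots:
    cmp  dword [rdi+rax], 1  ; 2 fused-domain uops after un-lamination
```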
Both can happen at the same time
```nasm
cmp   [rdi], eax
jnz   .target
```
The `cmp`/`jcc` can macro-fuse into a single cmp-and-branch ALU uop, and the load from `[rdi]` can micro-fuse with that uop. Failure to micro-fuse the `cmp` does not prevent macro-fusion.
The limitations here are: RIP-relative + immediate can never micro-fuse, so `cmp dword [static_data], 1` / `jnz` can macro-fuse but not micro-fuse.
A `cmp`/`jcc` on SnB-family (like `cmp [rdi+rax], edx` / `jnz`) will macro- and micro-fuse in the decoders, but the micro-fusion will un-laminate before the issue stage. (So it's 2 total uops in both the fused domain and the unfused domain: a load with an indexed addressing mode, and an ALU `cmp/jnz`.) You can verify this with perf counters by putting a `mov ecx, 1` in between the CMP and JCC vs. after them, and noting that `uops_issued.any:u` and `uops_executed.thread` both go up by 1 per loop iteration because we defeated macro-fusion. Micro-fusion behaved the same either way.
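The test loops behind those counter numbers look something like this (NASM syntax; a hand-written sketch of the experiment, not the exact code used; run it under `perf stat` with the events named above):

```nasm
; Sketch of a fusion-test loop. Move the dummy mov between the cmp
; and jnz (defeating macro-fusion) vs. leaving it before them, and
; compare uops_issued / uops_executed totals per iteration.
    mov   eax, [rdi]       ; make the compare always equal, so the
                           ; jnz under test is never taken
    mov   ebp, 100000000   ; iteration count
.loop:
    mov   ecx, 1           ; dummy instruction; relocate it between
                           ; cmp and jnz to defeat macro-fusion
    cmp   [rdi], eax       ; the pair under test
    jnz   .never_taken
    dec   ebp
    jnz   .loop            ; this dec/jnz also macro-fuses on SnB-family
.never_taken:
```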
On Skylake, `cmp dword [rdi], 0` / `jnz` can't macro-fuse (only micro-fuse). I tested with a loop that contained some dummy `mov ecx,1` instructions. Reordering so one of those `mov` instructions split up the `cmp/jcc` pair didn't change perf counters for fused-domain or unfused-domain uops.
But `cmp [rdi],eax` / `jnz` does macro- and micro-fuse. Reordering so a `mov ecx,1` instruction separates CMP from JNZ does change the perf counters (proving macro-fusion), and `uops_executed` is higher than `uops_issued` by 1 per iteration (proving micro-fusion).
`cmp [rdi+rax], eax` / `jne` only macro-fuses, not micro. (Well, actually it micro-fuses in decode but un-laminates before issue because of the indexed addressing mode, and it's not an instruction with a read-modify-write register destination like `sub eax, [rdi+rax]` that can keep indexed addressing modes micro-fused. That `sub` with an indexed addressing mode does macro- and micro-fuse on SKL, and presumably Haswell.)
(The `cmp dword [rdi],0` does micro-fuse, though: `uops_issued.any:u` is 1 lower than `uops_executed.thread`, and the loop contains no `nop` or other "eliminated" instructions, or any other memory instructions that could micro-fuse.)
Some compilers (including GCC IIRC) prefer to use a separate load instruction and then compare+branch on a register. TODO: check whether gcc and clang's choices are optimal with immediate vs. register.
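The two code-gen strategies look like this (NASM syntax; a hypothetical sketch, not actual compiler output):

```nasm
; Strategy A: compare with a memory operand. Can macro- and micro-fuse
; if the addressing mode and immediate allow it (per the rules above):
.loopA:
    cmp  [rdi], eax
    jne  .loopA

; Strategy B: separate load, then register-register compare-and-branch.
; The cmp/jcc can always macro-fuse; costs an extra uop for the load,
; but may win if the loaded value is needed again anyway:
.loopB:
    mov  ecx, [rdi]
    cmp  ecx, eax
    jne  .loopB
```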
> Micro-operations are those operations that can be executed in 1 clock cycle.
Not exactly. A uop takes 1 "slot" in the pipeline, and in the ROB and RS that track it in the out-of-order back-end.
And yes, dispatching a uop to an execution port happens in 1 clock cycle, and simple uops (e.g., integer addition) can complete execution in the same cycle. This can happen for up to 8 uops simultaneously since Haswell, increased to 10 on Sunny Cove. The actual execution might take more than 1 clock cycle, occupying the execution unit for longer (e.g. FP division).
The divider is, I think, the only execution unit on modern mainstream Intel that's not fully pipelined, but Knight's Landing has some not-fully-pipelined SIMD shuffles that are single-uop but with a (reciprocal) throughput of 2 cycles.
Footnote 1:
If `cmp [rdi], eax` / `jne` faults on the memory operand, i.e. takes a `#PF` exception, the exception return address points to before the `cmp`. So I think even exception handling can still treat it as a single thing.
Or if the branch target address is bogus, a `#PF` exception will happen after the branch has already executed, from code fetch with an updated RIP. So again, I don't think there's a way for the `cmp` to execute successfully and the `jcc` to fault, requiring an exception to be taken with RIP pointing to the JCC.
But even if that case is a possibility the CPU needs to be designed to handle, sorting that out can be deferred until the exception is actually detected. Maybe with a microcode assist, or some special-case hardware.
As far as how the `cmp/jcc` uop goes through the pipeline in the normal case, it works exactly like one long single-uop instruction that both sets flags and conditionally branches.
Surprisingly, the `loop` instruction (like `dec rcx/jnz` but without setting flags) is not a single uop on Intel CPUs. (Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?)