As Agner Fog's microarch doc explains, the stack engine handles the rsp+=8 / rsp-=8 part of push/pop / call/ret in the issue stage of the pipeline (before issuing uops into the Out-of-Order (OoO) part of the core).
So the OoO execution part of the core only has to handle the load/store part, with an address generated by the stack engine. It occasionally has to insert a uop to sync its offset from rsp when the 8-bit displacement counter overflows, or when the OoO core needs the value of rsp directly (e.g. sub rsp, 8 or mov [rsp-8], eax after a call, ret, push or pop typically causes an extra uop to be inserted on Intel CPUs. AMD CPUs apparently don't need extra sync uops).
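The bookkeeping described above can be sketched as a toy model. This is not how any real front end is implemented; it only illustrates when a sync uop would be needed. The class name `StackEngine`, its methods, the signed 8-bit range for the delta, and the simplification that a push is a single store uop are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class StackEngine:
    """Toy front-end model: rsp = (value known to the OoO core) + delta."""
    delta: int = 0                               # offset not yet applied in the OoO core
    uops: list = field(default_factory=list)     # uops issued into the OoO core

    def _sync(self):
        # Insert one ALU uop so the OoO core's rsp catches up with the front end.
        if self.delta != 0:
            self.uops.append(f"sync: add rsp, {self.delta}")
            self.delta = 0

    def _adjust(self, amount):
        # Assumption: delta is a signed 8-bit counter; sync before it would overflow.
        if not -128 <= self.delta + amount <= 127:
            self._sync()
        self.delta += amount

    def push(self, reg):
        self._adjust(-8)                         # rsp -= 8 handled in the front end
        self.uops.append(f"store [rsp{self.delta:+d}], {reg}")

    def pop(self, reg):
        self.uops.append(f"load {reg}, [rsp{self.delta:+d}]")
        self._adjust(+8)                         # rsp += 8 handled in the front end

    def explicit_rsp(self, insn):
        # The OoO core needs the real rsp value, so any pending delta is applied first.
        self._sync()
        self.uops.append(insn)


engine = StackEngine()
engine.push("rax")
engine.push("rbx")
engine.explicit_rsp("sub rsp, 8")                # forces a sync uop before it
for uop in engine.uops:
    print(uop)
```

In this model the two pushes cost only their store uops, and the extra ALU uop appears only when `sub rsp, 8` reads rsp directly or when the delta would leave the assumed 8-bit range.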
Note that Agner's instruction tables show that Pentium-M and later decode pop reg to a single uop which runs only on the load port. But Pentium II/III decodes pop eax to 2 uops; 1 ALU and 1 load, because there's no stack-engine to handle the ESP adjustment outside of the out-of-order core. Besides taking extra uops, a long chain of push/pop and call/ret creates a serial dependency on ESP so out-of-order execution has to chew through the ALU uops before a value is available for a mov ebp, esp, or an address for mov eax, [esp+16].
The P6 microarch family (PPro to Nehalem) stored the input values for a uop directly in the ROB. At issue/rename, "cold" register inputs are read from the architectural register file into the ROB (which can be a bottleneck, due to limited read ports. See register-read stalls). After executing a uop, the result is written into the ROB for other uops to read. The architectural register file is updated with values from the ROB when uops retire.
SnB-family microarchitectures (and P4) have a physical register file, so the ROB stores register numbers (i.e. a level of indirection) instead of the data directly. Re-Order Buffer is still an excellent name for that part of the CPU.
Note that SnB introduced AVX, with 256b vectors. Making every ROB entry big enough to store double-size vectors was presumably undesirable compared to only keeping them in a smaller FP register file.
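To make the "level of indirection" concrete, here is a minimal sketch of renaming onto a physical register file, where ROB entries hold only register numbers and the data lives in the PRF. It is a generic textbook-style model, not a description of Intel's actual design; `PRFRenamer` and its method names are hypothetical.

```python
class PRFRenamer:
    """Toy rename stage: the ROB stores register numbers, the PRF stores data."""

    def __init__(self, arch_regs, num_phys):
        # Initially each architectural register maps to its own physical register.
        self.rat = {reg: i for i, reg in enumerate(arch_regs)}
        self.free = list(range(len(arch_regs), num_phys))
        self.prf = [0] * num_phys    # values of any width live here, not in the ROB
        self.rob = []                # entries: (arch dest, new phys reg, old phys reg)

    def issue(self, dest, srcs):
        # Sources are looked up as register numbers; no data is copied at rename.
        src_phys = [self.rat[s] for s in srcs]
        new = self.free.pop(0)
        self.rob.append((dest, new, self.rat[dest]))
        self.rat[dest] = new
        return new, src_phys

    def writeback(self, phys, value):
        # Execution results go straight into the physical register file.
        self.prf[phys] = value

    def retire(self):
        # Retirement copies no data; it only frees the previous mapping.
        dest, new, old = self.rob.pop(0)
        self.free.append(old)
        return dest, new
```

Because a ROB entry in this model is just a few small integers, widening the vector registers grows only the register file, which matches the motivation suggested above for not storing 256-bit results in every ROB entry.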
SnB simplified the uop format to save power. This did lead to a sacrifice in uop micro-fusion capability, though: the decoders and uop cache can still micro-fuse memory operands using 2-register (indexed) addressing modes, but they're "un-laminated" into separate uops before issuing into the OoO core.