performance - How are x86 uops scheduled, exactly?

Question

Welcome To Ask or Share your Answers For Others

performance - How are x86 uops scheduled, exactly?

posted Oct 6, 2021 in Technique[技术] by 深蓝 (71.8m points)

performance - How are x86 uops scheduled, exactly?

Modern x86 CPUs break down the incoming instruction stream into micro-operations (uops¹) and then schedule these uops out-of-order as their inputs become ready. While the basic idea is clear, I'd like to know the specific details of how ready instructions are scheduled, since it impacts micro-optimization decisions.

For example, take the following toy loop²:

top:
lea eax, [ecx + 5]
popcnt eax, eax
add edi, eax
dec ecx
jnz top

this basically implements the loop (with the following correspondence: eax -> total, c -> ecx):

do {
  total += popcnt(c + 5);
} while (--c > 0);

I'm familiar with the process of optimizing any small loop by looking at the uop breakdown, dependency chain latencies and so on. In the loop above we have only one carried dependency chain: dec ecx. The first three instructions of the loop (lea, imul, add) are part of a dependency chain that starts fresh each loop.

The final dec and jne are fused. So we have a total of 4 fused-domain uops, and one only loop-carried dependency chain with a latency of 1 cycle. So based on that criteria, it seems that the loop can execute at 1 cycle/iteration.

However, we should look at the port pressure too:

The lea can execute on ports 1 and 5
The popcnt can execute on port 1
The add can execute on port 0, 1, 5 and 6
The predicted-taken jnz executes on port 6

So to get to 1 cycle / iteration, you pretty much need the following to happen:

The popcnt must execute on port 1 (the only port it can execute on)
The lea must execute on port 5 (and never on port 1)
The add must execute on port 0, and never on any of other three ports it can execute on
The jnz can only execute on port 6 anyway

That's a lot of conditions! If instructions just got scheduled randomly, you could get a much worse throughput. For example, 75% the add would go to port 1, 5 or 6, which would delay the popcnt, lea or jnz by one cycle. Similarly for the lea which can go to 2 ports, one shared with popcnt.

IACA on the other hand reports a result very close to optimal, 1.05 cycles per iteration:

Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - l.o
Binary Format - 64Bit
Architecture  - HSW
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 1.05 Cycles       Throughput Bottleneck: FrontEnd, Port0, Port1, Port5

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 1.0    0.0  | 1.0  | 0.0    0.0  | 0.0    0.0  | 0.0  | 1.0  | 0.9  | 0.0  |
---------------------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    |           |     |           |           |     | 1.0 |     |     | CP | lea eax, ptr [ecx+0x5]
|   1    |           | 1.0 |           |           |     |     |     |     | CP | popcnt eax, eax
|   1    | 0.1       |     |           |           |     | 0.1 | 0.9 |     | CP | add edi, eax
|   1    | 0.9       |     |           |           |     |     | 0.1 |     | CP | dec ecx
|   0F   |           |     |           |           |     |     |     |     |    | jnz 0xfffffffffffffff4

It pretty much reflects the necessary "ideal" scheduling I mentioned above, with a small deviation: it shows the add stealing port 5 from the lea on 1 out of 10 cycles. It also doesn't know that the fused branch is going to go to port 6 since it is predicted taken, so it puts most of the uops for the branch on port 0, and most of the uops for the add on port 6, rather than the other way around.

It's not clear if the extra 0.05 cycles that IACA reports over the optimal is the result of some deep, accurate analysis, or a less insightful consequence of the algorithm it uses, e.g., analyzing the loop over a fixed number of cycles, or just a bug or whatever. The same goes for the 0.1 fraction of a uop that it thinks will go to the non-ideal port. It is also not clear if one explains the other - I would think that mis-assigning a port 1 out of 10 times would cause a cycle count of 11/10 = 1.1 cycles per iteration, but I haven't worked out the actual downstream results - maybe the impact is less on average. Or it could just be rounding (0.05 == 0.1 to 1 decimal place).

So how do modern x86 CPUs actually schedule? In particular:

When multiple uops are ready in the reservation station, in what order are they scheduled to ports?
When a uop can go to multiple ports (like the add and lea in the example above), how is it decided which port is chosen?
If any of the answers involve a concept like oldest to choose among uops, how is it defined? Age since it was delivered to the RS? Age since it became ready? How are ties broken? Does program order ever come into it?

Results on Skylake

Let's measure some actual results on Skylake to check which answers explain the experimental evidence, so here are some real-world measured results (from perf) on my Skylake box. Confusingly, I'm going switch to using imul for my "only executes on one port" instruction, since it has many variants, including 3-argument versions that allow you to use different registers for the source(s) and destination. This is very handy when trying to construct dependency chains. It also avoids the whole "incorrect dependency on destination" that popcnt has.

Independent Instructions

Let's start by looking at the simple (?) case that the instructions are relatively independent - without any dependency chains other than trivial ones like the loop counter.

Here's a 4 uop loop (only 3 executed uops) with mild pressure. All instructions are independent (don't share any sources or destinations). The add could in principle steal the p1 needed by the imul or p6 needed by the dec:

Example 1

instr   p0 p1 p5 p6 
xor       (elim)
imul        X
add      X  X  X  X
dec               X

top:
    xor  r9, r9
    add  r8, rdx
    imul rax, rbx, 5
    dec esi
    jnz top

The results is that this executes with perfect scheduling at 1.00 cycles / iteration:

   560,709,974      uops_dispatched_port_port_0                                     ( +-  0.38% )
 1,000,026,608      uops_dispatched_port_port_1                                     ( +-  0.00% )
   439,324,609      uops_dispatched_port_port_5                                     ( +-  0.49% )
 1,000,041,224      uops_dispatched_port_port_6                                     ( +-  0.00% )
 5,000,000,110      instructions:u            #    5.00  insns per cycle          ( +-  0.00% )
 1,000,281,902      cycles:u   

                                           ( +-  0.00% )

As expected, p1 and p6 are fully utilized by the imul and dec/jnz respectively, and then the add issues roughly half and half between the remaining available ports. Note roughly - the actual ratio is 56% and 44%, and this ratio is pretty stable across runs (note the +- 0.49% variation). If I adjust the loop alignment, the split changes (53/46 for 32B alignment, more like 57/42 for 32B+4 alignment). Now, we if change nothing except the position of imul in the loop:

Example 2

top:
    imul rax, rbx, 5
    xor  r9, r9
    add  r8, rdx
    dec esi
    jnz top

Then suddenly the p0/p5 split is exactly 50%/50%, with 0.00% variation:

   500,025,758      uops_dispatched_port_port_0                                     ( +-  0.00% )
 1,000,044,901      uops_dispatched_port_port_1                                     ( +-  0.00% )
   500,038,070      uops_dispatched_port_port_5                                     ( +-  0.00% )
 1,000,066,733      uops_dispatched_port_port_6                                     ( +-  0.00% )
 5,000,000,439      instructions:u            #    5.00  insns per cycle          ( +-  0.00% )
 1,000,439,396      cycles:u                                                        ( +-  0.01% )

So that's already interesting, but it's hard to tell what's going on. Perhaps the exact behavior depends on the initial conditions at loop entry and is sensitive to ordering within the loop (e.g., because counters are used). This example shows that something more than "random" or "stupid" scheduling is going on. In particular, if you just eliminate the imul instruction from the loop, you get the following:

Example 3

   330,214,329      uops_dispatched_port_port_0                                     ( +-  0.40% )
   314,012,342      uops_dispatched_port_port_1                                     ( +-  1.77% )
   355,817,739      uops_dispatched_port_port_5                                     ( +-  1.21% )
 1,000,034,653      uops_dispatched_port_port_6                                     ( +-  0.00% )
 4,000,000,160      instructions:u            #    4.00  insns per cycle          ( +-  0.00% )
 1,000,235,522      cycles:u                                                      ( +-  0.00% )

Here, the add is now roughly evenly distributed among p0, p1 and p5 - so the presence of the imul did affect the add scheduling: it wasn't just a consequence of some "avoid port 1" rule.

Note here that total port pressure is only 3 uops/cycle, since the xor is a zeroing idiom and is eliminated in the renamer. Let's try with the max pressure of 4 uops. I expect whatever mechanism kicked in above to able to perfectly schedule this also. We only change xor r9, r9 to xor r9, r10, so it is no longer a zeroing idiom. We get the following results:

Example 4

top:
    xor  r9, r10
    add  r8, rdx
    imul rax, rbx, 5
    dec esi
    jnz top

       488,245,238      uops_dispatched_port_port_0                                     ( +-  0.50% )
     1,2

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-06T03:29:38+0000

Your questions are tough for a couple of reasons:

The answer depends a lot on the microarchitecture of the processor which can vary significantly from generation to generation.
These are fine-grained details which Intel doesn't generally release to the public.

Nevertheless, I'll try to answer...

When multiple uops are ready in the reservation station, in what order are they scheduled to ports?

It should be the oldest [see below], but your mileage may vary. The P6 microarchitecture (used in the Pentium Pro, 2 & 3) used a reservation station with five schedulers (one per execution port); the schedulers used a priority pointer as a place to start scanning for ready uops to dispatch. It was only pseudo FIFO so it's entirely possible that the oldest ready instruction was not always scheduled. In the NetBurst microarchitecture (used in Pentium 4), they ditched the unified reservation station and used two uop queues instead. These were proper collapsing priority queues so the schedulers were guaranteed to get the oldest ready instruction. The Core architecture returned to a reservation station and I would hazard an educated guess that they used the collapsing priority queue, but I can't find a source to confirm this. If somebody has a definitive answer, I'm all ears.

When a uop can go to multiple ports (like the add and lea in the example above), how is it decided which port is chosen?

That's tricky to know. The best I could find is a patent from Intel describing such a mechanism. Essentially, they keep a counter for each port that has redundant functional units. When the uops leave the front end to the reservation station, they are assigned a dispatch port. If it has to decide between multiple redundant execution units, the counters are used to distribute the work evenly. Counters are incremented and decremented as uops enter and leave the reservation station respectively.

Naturally this is just a heuristic and does not guarantee a perfect conflict-free schedule, however, I could still see it working with your toy example. The instructions which can only go to one port would ultimately influence the scheduler to dispatch the "less restricted" uops to other ports.

In any case, the presence of a patent doesn't necessarily imply that the idea was adopted (although that said, one of the authors was also a tech lead of the Pentium 4, so who knows?)

If any of the answers involve a concept like oldest to choose among uops, how is it defined? Age since it was delivered to the RS? Age since it became ready? How are ties broken? Does program order ever come into it?

Since uops are inserted into the reservation station in order, oldest here does indeed refer to time it entered the reservation station, i.e. oldest in program order.

By the way, I would take those IACA results with a grain of salt as they may not reflect the nuances of the real hardware. On Haswell, there is a hardware counter called uops_executed_port that can tell you how many cycles in your thread were uops issues to ports 0-7. Maybe you could leverage these to get a better understanding of your program?

Categories

performance - How are x86 uops scheduled, exactly?

performance - How are x86 uops scheduled, exactly?

Results on Skylake

Independent Instructions

Example 1

Example 2

Example 3

Example 4

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags