TL;DR: The DSB seems to be only able to deliver one jump uop of the inner loop every other cycle. Also DSB-MITE switches constitute up to 9% of execution time.
Introduction - Part 1: Understanding the LSD performance events
I'll first discuss when the LSD.UOPS
and LSD.CYCLES_ACTIVE
performance events occur and some peculiarities of the LSD on the IvB and SnB microarchitectures. Once we establish this foundation, we can then answer the question. To do this, we can use small pieces of code that are especially designed to determine accurately when these events occur.
According to the documentation:
LSD.UOPS
: Number of Uops delivered by the LSD.
LSD.CYCLES_ACTIVE
: Cycles Uops delivered by the LSD, but didn't come
from the decoder.
These definitions are useful, but, as you'll see later, not precise enough to answer your question. It's important to develop a better understand of these events. Some of the information presented here is not documented by Intel and it's just my best interpretation of the empirical results and some of the related patents that I went through. Although I was not able to find the specific patent that describes the LSD implementation in SnB or later microarchitectures.
Each of the following benchmarks start with a comment that contains the name of the benchmark. All numbers are normalized per iteration, unless otherwise mentioned.
; B1
----------------------------------------------------
mov rax, 100000000
.loop:
dec rax
jnz .loop
----------------------------------------------------
Metric | IvB | SnB
----------------------------------------------------
cycles | 0.90 | 1.00
LSD.UOPS | 0.99 | 1.99
LSD.CYCLES_ACTIVE | 0.49 | 0.99
CYCLE_ACTIVITY.CYCLES_NO_EXECUTE | 0.00 | 0.00
UOPS_ISSUED.STALL_CYCLES | 0.43 | 0.50
Both instructions in the loop body are mac-fused into a single uop. There is only one execution port on IvB and SnB that can execute jump instructions. Therefore, the maximum throughput should be 1c/iter. IvB is 10% faster, though, for some reason.
According to Is performance reduced when executing loops whose uop count is not a multiple of processor width?, the LSD in IvB and SnB cannot issue uops across the loop body boundaries even if there are available issue slots. Since the loop contains a single uop, we expect that the LSD will issue a single uop per cycle and that LSD.CYCLES_ACTIVE
should about equal to the total number of cycles.
On IvB, LSD.UOPS
is as expected. That is, the LSD will issue one uop per cycle. Note that since the number of cycles is equal to the number of iterations which is equal to the number of uops, we can equivalently say that the LSD issues one uop per iteration. Essentially, most of the uops that were execute were issued from the LSD. However, LSD.CYCLES_ACTIVE
is about half of the number of cycles. How is this possible? In this case, shouldn't only half of the total number of uops be issued from the LSD? I think what is happening here is that the loop is being essentially unrolled twice and two uops are being issued per cycle. Nonetheless, only a single uop can be executed per cycle yet RESOURCE_STALLS.RS
is zero, indicating that RS never gets full. However, RESOURCE_STALLS.ANY
is about half of the cycle count. Putting all of this together now, it seems that the LSD is actually issuing 2 uops every other cycle and that there is some structural limitation that is being reached every other cycle. CYCLE_ACTIVITY.CYCLES_NO_EXECUTE
confirms that there is always at least one read uop in the RS at any given cycle. The following experiments will reveal the conditions for the unrolling to happen.
On SnB, LSD.UOPS
shows that twice the total number of uops were issued from the LSD. Also LSD.CYCLES_ACTIVE
indicates the LSD was active most of the time. CYCLE_ACTIVITY.CYCLES_NO_EXECUTE
and UOPS_ISSUED.STALL_CYCLES
are as on IvB. The following experiments are helpful to understand what's happening. It seems that the measured LSD.CYCLES_ACTIVE
is equal to the real LSD.CYCLES_ACTIVE
+RESOURCE_STALLS.ANY
. Therefore, to get the real LSD.CYCLES_ACTIVE
, RESOURCE_STALLS.ANY
must be subtracted from the measured LSD.CYCLES_ACTIVE
. The same applies to LSD.CYCLES_4_UOPS
. The real LSD.UOPS
can be calculated as follows:
LSD.UOPS
measured = LSD.UOPS
real + ((LSD.UOPS
measured/LSD.CYCLES_ACTIVE
measured)*RESOURCE_STALLS.ANY
)
Thus,
LSD.UOPS
real = LSD.UOPS
measured - ((LSD.UOPS
measured/LSD.CYCLES_ACTIVE
measured) * RESOURCE_STALLS.ANY
)
= LSD.UOPS
measured * (1 - (RESOURCE_STALLS.ANY
/LSD.CYCLES_ACTIVE
measured))
For all of the benchmarks I've run on SnB (including those not shown here), these adjustments are accurate.
Note that RESOURCE_STALLS.RS
and RESOURCE_STALLS.ANY
on SnB are just like IvB. So it seems that the LSD works the same way, as far as this particular benchmark is concerned, on IvB and SnB, except that the events LSD.UOPS
and LSD.CYCLES_ACTIVE
are counted differently.
; B2
----------------------------------------------------
mov rax, 100000000
mov rbx, 0
.loop:
dec rbx
jz .loop
dec rax
jnz .loop
----------------------------------------------------
Metric | IvB | SnB
----------------------------------------------------
cycles | 1.98 | 2.00
LSD.UOPS | 1.92 | 3.99
LSD.CYCLES_ACTIVE | 0.94 | 1.99
CYCLE_ACTIVITY.CYCLES_NO_EXECUTE | 0.00 | 0.00
UOPS_ISSUED.STALL_CYCLES | 1.00 | 1.00
In B2, there are 2 uops per iteration and both are jumps. The first one is never taken, so there is still only one loop. We expect it to run at 2c/iter, which is indeed the case. LSD.UOPS
shows that most uops were issued from LSD, but LSD.CYCLES_ACTIVE
shows that the LSD was active only half of the time. This means that the loop was not unrolled. So it seems that unrolling only occurs when there is a single uop in the loop.
; B3
----------------------------------------------------
mov rax, 100000000
.loop:
dec rbx
dec rax
jnz .loop
----------------------------------------------------
Metric | IvB | SnB
----------------------------------------------------
cycles | 0.90 | 1.00
LSD.UOPS | 1.99 | 1.99
LSD.CYCLES_ACTIVE | 0.99 | 0.99
CYCLE_ACTIVITY.CYCLES_NO_EXECUTE | 0.00 | 0.00
UOPS_ISSUED.STALL_CYCLES | 0.00 | 0.00
There are also 2 uops here, but the first one is a single-cycle ALU uop that is not related to the jump uop. B3 helps us answer the following two questions:
- If the target of a jump is not a jump uop, will the
LSD.UOPS
and LSD.CYCLES_ACTIVE
still count twice on SnB?
- If the loop contains 2 uops where only one of them is a jump, will the LSD unroll the loop?
B3 shows that the answer to both question is a "No."
UOPS_ISSUED.STALL_CYCLES
suggests that the LSD will only stall one cycle if it issues two jump uops in one cycle. This never happens in B3, so there are no stalls.
; B4
----------------------------------------------------
mov rax, 100000000
.loop:
add rbx, qword [buf]
dec rax
jnz .loop
----------------------------------------------------
Metric | IvB | SnB
----------------------------------------------------
cycles | 0.90 | 1.00
LSD.UOPS | 1.99 | 2.00
LSD.CYCLES_ACTIVE | 0.99 | 1.00
CYCLE_ACTIVITY.CYCLES_NO_EXECUTE | 0.00 | 0.00
UOPS_ISSUED.STALL_CYCLES | 0.00 | 0.00
B4 has an additional twist to it; it contains 2 uops in the fused domain but 3 uops in the fused domain because the load-ALU instruction gets unfused in the RS. In the previous benchmarks, there were no micro-fused uops, only macro-fused uops. The goal here is to see how micro-fused uops are treated by the LSD.
LSD.UOPS
shows that the two uops of the load-ALU instruction have consumed a single issue slot (the fused jump uop consumes only a single slot). Also since LSD.CYCLES_ACTIVE
is equal to cycles
, no unrolling has occurred. The loop throughput is as expected.
; B5
----------------------------------------------------
mov rax, 100000000
.loop:
jmp .next
.next:
dec rax
jnz .loop
----------------------------------------------------
Metric | IvB | SnB
----------------------------------------------------
cycles | 2.00 | 2.00
LSD.UOPS | 1.91 | 3.99
LSD.CYCLES_ACTIVE | 0.96 | 1.99
CYCLE_ACTIVITY.CYCLES_NO_EXECUTE | 0.00 | 0.00
UOPS_ISSUED.STALL_CYCLES | 1.00 | 1.00
B5 is the last benchmark that we will need. It is similar to B2 in that it contains two branch uops. However, one of the jump uops in B5 is a forward unconditional jump. The results are identical to B2, indicating that it doesn't matter whether a jump uop is conditional or not. This is also case if the first jump uop is conditional and the second is not.
Introduction - Part 2: Branch prediction in the LSD
The LSD is mechanism implemented in the uop queue (IDQ) that can improve performance and reduce power consumption (consequently, heat emission is reduced). It can improve performance because some of the limitations that exist in the frontend may be relaxed in the uop queue. In particular, on SnB and IvB, both the MITE and DSB paths have a maximum throughput of 4uops/c, but in terms of bytes, it's 16B/c and 32B/c, respectively. The uop queue bandwidth is also 4uops/c, but has no limitation on the number of bytes. As long as the LSD issues uops from the uop queue, the frontend (i.e., the fetch and decode units) and even unneeded logic downstream from the IDQ can be powered down. Prior to Nehalem, the LSD was implemented in the IQ unit. Starting with Haswell, the LSD supports loops that contain uops from the MSROM. The LSD in Skylake processors is disabled because, apparently, it's buggy.
Loops usually contain at least one conditional branch. The LSD essentially monitors backward conditional branches and tries to determine a sequence of uops that constitute a loop. If the LSD