x86 - Why jnz requires 2 cycles to complete in an inner loop

Question

Welcome To Ask or Share your Answers For Others

x86 - Why jnz requires 2 cycles to complete in an inner loop

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

x86 - Why jnz requires 2 cycles to complete in an inner loop

I'm on an IvyBridge. I found the performance behavior of jnz inconsistent in inner loop and outer loop.

The following simple program has an inner loop with fixed size 16:

global _start
_start:
    mov rcx, 100000000
.loop_outer:
    mov rax,    16

.loop_inner:
    dec rax
    jnz .loop_inner

    dec rcx
    jnz .loop_outer

    xor edi, edi
    mov eax, 60
    syscall

perf tool shows the outer loop runs 32c/iter. It suggests the jnz requires 2 cycles to complete.

I then search in Agner's instruction table, conditional jump has 1-2 "reciprocal throughput", with a comment "fast if no jump".

At this point I start to believe the above behavior is somehow expected. But why does jnz in an outer loop only require 1 cycle to complete?

If I remove the .loop_inner part altogether, the outer loop runs 1c/iter. The behavior looks inconsistent.

What I am missing here?

Edit for more info:

The perf results for the above program with command:

perf stat -ecycles,branches,branch-misses,lsd.uops,uops_issued.any -r4 ./a.out

is:

 3,215,921,579      cycles                                                        ( +-  0.11% )  (79.83%)
 1,701,361,270      branches                                                      ( +-  0.02% )  (80.05%)
        19,212      branch-misses             #    0.00% of all branches          ( +- 17.72% )  (80.09%)
        31,052      lsd.uops                                                      ( +- 76.58% )  (80.09%)
 1,803,009,428      uops_issued.any                                               ( +-  0.08% )  (79.93%)

The perf result of the reference case:

global _start
_start:
    mov rcx, 100000000
.loop_outer:
    mov rax,    16
    dec rcx
    jnz .loop_outer

    xor edi, edi
    mov eax, 60
    syscall

is:

   100,978,250      cycles                                                        ( +-  0.66% )  (75.75%)
   100,606,742      branches                                                      ( +-  0.59% )  (75.74%)
         1,825      branch-misses             #    0.00% of all branches          ( +- 13.15% )  (81.22%)
   199,698,873      lsd.uops                                                      ( +-  0.07% )  (87.87%)
   200,300,606      uops_issued.any                                               ( +-  0.12% )  (79.42%)

So the cause is mostly clear: LSD stops working for some reason in the nested case. Reducing the inner loop size will slightly mitigate the slowness, but not completely.

Searching Intel "optimization manual", I found that LSD won't work if the loop contains "more than eight taken branches". This somehow explains the behavior.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-23T20:04:42+0000

TL;DR: The DSB seems to be only able to deliver one jump uop of the inner loop every other cycle. Also DSB-MITE switches constitute up to 9% of execution time.

Introduction - Part 1: Understanding the LSD performance events

I'll first discuss when the LSD.UOPS and LSD.CYCLES_ACTIVE performance events occur and some peculiarities of the LSD on the IvB and SnB microarchitectures. Once we establish this foundation, we can then answer the question. To do this, we can use small pieces of code that are especially designed to determine accurately when these events occur.

According to the documentation:

LSD.UOPS: Number of Uops delivered by the LSD.
LSD.CYCLES_ACTIVE: Cycles Uops delivered by the LSD, but didn't come from the decoder.

These definitions are useful, but, as you'll see later, not precise enough to answer your question. It's important to develop a better understand of these events. Some of the information presented here is not documented by Intel and it's just my best interpretation of the empirical results and some of the related patents that I went through. Although I was not able to find the specific patent that describes the LSD implementation in SnB or later microarchitectures.

Each of the following benchmarks start with a comment that contains the name of the benchmark. All numbers are normalized per iteration, unless otherwise mentioned.

; B1
----------------------------------------------------
    mov rax, 100000000
.loop:
    dec rax
    jnz .loop
----------------------------------------------------
Metric                             |  IvB   |  SnB
----------------------------------------------------
cycles                             |  0.90  |  1.00
LSD.UOPS                           |  0.99  |  1.99
LSD.CYCLES_ACTIVE                  |  0.49  |  0.99
CYCLE_ACTIVITY.CYCLES_NO_EXECUTE   |  0.00  |  0.00
UOPS_ISSUED.STALL_CYCLES           |  0.43  |  0.50

Both instructions in the loop body are mac-fused into a single uop. There is only one execution port on IvB and SnB that can execute jump instructions. Therefore, the maximum throughput should be 1c/iter. IvB is 10% faster, though, for some reason.

According to Is performance reduced when executing loops whose uop count is not a multiple of processor width?, the LSD in IvB and SnB cannot issue uops across the loop body boundaries even if there are available issue slots. Since the loop contains a single uop, we expect that the LSD will issue a single uop per cycle and that LSD.CYCLES_ACTIVE should about equal to the total number of cycles.

On IvB, LSD.UOPS is as expected. That is, the LSD will issue one uop per cycle. Note that since the number of cycles is equal to the number of iterations which is equal to the number of uops, we can equivalently say that the LSD issues one uop per iteration. Essentially, most of the uops that were execute were issued from the LSD. However, LSD.CYCLES_ACTIVE is about half of the number of cycles. How is this possible? In this case, shouldn't only half of the total number of uops be issued from the LSD? I think what is happening here is that the loop is being essentially unrolled twice and two uops are being issued per cycle. Nonetheless, only a single uop can be executed per cycle yet RESOURCE_STALLS.RS is zero, indicating that RS never gets full. However, RESOURCE_STALLS.ANY is about half of the cycle count. Putting all of this together now, it seems that the LSD is actually issuing 2 uops every other cycle and that there is some structural limitation that is being reached every other cycle. CYCLE_ACTIVITY.CYCLES_NO_EXECUTE confirms that there is always at least one read uop in the RS at any given cycle. The following experiments will reveal the conditions for the unrolling to happen.

On SnB, LSD.UOPS shows that twice the total number of uops were issued from the LSD. Also LSD.CYCLES_ACTIVE indicates the LSD was active most of the time. CYCLE_ACTIVITY.CYCLES_NO_EXECUTE and UOPS_ISSUED.STALL_CYCLES are as on IvB. The following experiments are helpful to understand what's happening. It seems that the measured LSD.CYCLES_ACTIVE is equal to the real LSD.CYCLES_ACTIVE+RESOURCE_STALLS.ANY. Therefore, to get the real LSD.CYCLES_ACTIVE, RESOURCE_STALLS.ANY must be subtracted from the measured LSD.CYCLES_ACTIVE. The same applies to LSD.CYCLES_4_UOPS. The real LSD.UOPS can be calculated as follows:

LSD.UOPS_measured = LSD.UOPS_real + ((LSD.UOPS_measured/LSD.CYCLES_ACTIVE_measured)*RESOURCE_STALLS.ANY)

Thus,

LSD.UOPS_real = LSD.UOPS_measured - ((LSD.UOPS_measured/LSD.CYCLES_ACTIVE_measured) * RESOURCE_STALLS.ANY)
= LSD.UOPS_measured * (1 - (RESOURCE_STALLS.ANY/LSD.CYCLES_ACTIVE_measured))

For all of the benchmarks I've run on SnB (including those not shown here), these adjustments are accurate.

Note that RESOURCE_STALLS.RS and RESOURCE_STALLS.ANY on SnB are just like IvB. So it seems that the LSD works the same way, as far as this particular benchmark is concerned, on IvB and SnB, except that the events LSD.UOPS and LSD.CYCLES_ACTIVE are counted differently.

; B2
----------------------------------------------------
    mov rax, 100000000
    mov rbx, 0
.loop:
    dec rbx
    jz .loop
    dec rax
    jnz .loop
----------------------------------------------------
Metric                             |  IvB   |  SnB
----------------------------------------------------
cycles                             |  1.98  |  2.00
LSD.UOPS                           |  1.92  |  3.99
LSD.CYCLES_ACTIVE                  |  0.94  |  1.99
CYCLE_ACTIVITY.CYCLES_NO_EXECUTE   |  0.00  |  0.00
UOPS_ISSUED.STALL_CYCLES           |  1.00  |  1.00

In B2, there are 2 uops per iteration and both are jumps. The first one is never taken, so there is still only one loop. We expect it to run at 2c/iter, which is indeed the case. LSD.UOPS shows that most uops were issued from LSD, but LSD.CYCLES_ACTIVE shows that the LSD was active only half of the time. This means that the loop was not unrolled. So it seems that unrolling only occurs when there is a single uop in the loop.

; B3
----------------------------------------------------
    mov rax, 100000000
.loop:
    dec rbx
    dec rax
    jnz .loop
----------------------------------------------------
Metric                             |  IvB   |  SnB
----------------------------------------------------
cycles                             |  0.90  |  1.00
LSD.UOPS                           |  1.99  |  1.99
LSD.CYCLES_ACTIVE                  |  0.99  |  0.99
CYCLE_ACTIVITY.CYCLES_NO_EXECUTE   |  0.00  |  0.00
UOPS_ISSUED.STALL_CYCLES           |  0.00  |  0.00

There are also 2 uops here, but the first one is a single-cycle ALU uop that is not related to the jump uop. B3 helps us answer the following two questions:

If the target of a jump is not a jump uop, will the LSD.UOPS and LSD.CYCLES_ACTIVE still count twice on SnB?
If the loop contains 2 uops where only one of them is a jump, will the LSD unroll the loop?

B3 shows that the answer to both question is a "No."

UOPS_ISSUED.STALL_CYCLES suggests that the LSD will only stall one cycle if it issues two jump uops in one cycle. This never happens in B3, so there are no stalls.

; B4
----------------------------------------------------
    mov rax, 100000000
.loop:
    add rbx, qword [buf]
    dec rax
    jnz .loop
----------------------------------------------------
Metric                             |  IvB   |  SnB
----------------------------------------------------
cycles                             |  0.90  |  1.00
LSD.UOPS                           |  1.99  |  2.00
LSD.CYCLES_ACTIVE                  |  0.99  |  1.00
CYCLE_ACTIVITY.CYCLES_NO_EXECUTE   |  0.00  |  0.00
UOPS_ISSUED.STALL_CYCLES           |  0.00  |  0.00

B4 has an additional twist to it; it contains 2 uops in the fused domain but 3 uops in the fused domain because the load-ALU instruction gets unfused in the RS. In the previous benchmarks, there were no micro-fused uops, only macro-fused uops. The goal here is to see how micro-fused uops are treated by the LSD.

LSD.UOPS shows that the two uops of the load-ALU instruction have consumed a single issue slot (the fused jump uop consumes only a single slot). Also since LSD.CYCLES_ACTIVE is equal to cycles, no unrolling has occurred. The loop throughput is as expected.

; B5
----------------------------------------------------
    mov rax, 100000000
.loop:
    jmp .next
.next:
    dec rax
    jnz .loop
----------------------------------------------------
Metric                             |  IvB   |  SnB
----------------------------------------------------
cycles                             |  2.00  |  2.00
LSD.UOPS                           |  1.91  |  3.99
LSD.CYCLES_ACTIVE                  |  0.96  |  1.99
CYCLE_ACTIVITY.CYCLES_NO_EXECUTE   |  0.00  |  0.00
UOPS_ISSUED.STALL_CYCLES           |  1.00  |  1.00

B5 is the last benchmark that we will need. It is similar to B2 in that it contains two branch uops. However, one of the jump uops in B5 is a forward unconditional jump. The results are identical to B2, indicating that it doesn't matter whether a jump uop is conditional or not. This is also case if the first jump uop is conditional and the second is not.

Introduction - Part 2: Branch prediction in the LSD

The LSD is mechanism implemented in the uop queue (IDQ) that can improve performance and reduce power consumption (consequently, heat emission is reduced). It can improve performance because some of the limitations that exist in the frontend may be relaxed in the uop queue. In particular, on SnB and IvB, both the MITE and DSB paths have a maximum throughput of 4uops/c, but in terms of bytes, it's 16B/c and 32B/c, respectively. The uop queue bandwidth is also 4uops/c, but has no limitation on the number of bytes. As long as the LSD issues uops from the uop queue, the frontend (i.e., the fetch and decode units) and even unneeded logic downstream from the IDQ can be powered down. Prior to Nehalem, the LSD was implemented in the IQ unit. Starting with Haswell, the LSD supports loops that contain uops from the MSROM. The LSD in Skylake processors is disabled because, apparently, it's buggy.

Loops usually contain at least one conditional branch. The LSD essentially monitors backward conditional branches and tries to determine a sequence of uops that constitute a loop. If the LSD

Categories

x86 - Why jnz requires 2 cycles to complete in an inner loop

x86 - Why jnz requires 2 cycles to complete in an inner loop

Edit for more info:

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Introduction - Part 1: Understanding the LSD performance events

Introduction - Part 2: Branch prediction in the LSD

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags