x86 - Can the simple decoders in recent Intel microarchitectures handle all 1-µop instructions?

深蓝 · Answer 1 · 2021-10-17T01:15:12+0000

No, there are some instructions that can only decode 1/clock

Andreas's comments indicate that xor eax,eax / setnle al seems to have a decode bottleneck of 1/clock. I found the same thing with cdq: it reads EAX and writes EDX, demonstrably runs faster from the DSB (uop cache), doesn't involve partial registers or anything else weird, and doesn't need a dep-breaking instruction.

Even better, being a single-byte instruction, it can defeat the DSB with only a short block of instructions. That leads to misleading results from testing on some CPUs, e.g. in Agner Fog's tables and on https://uops.info/, where SKX is shown as 1c throughput. https://www.uops.info/html-tp/SKX/CDQ-Measurements.html vs. https://www.uops.info/html-tp/CFL/CDQ-Measurements.html have inconsistent throughputs because of different testing methods: only the Coffee Lake test ever used a small enough unroll count (10) to not bust the DSB, finding a throughput of 0.6c. (The actual throughput is 0.5c once you account for loop overhead, fully explained by back-end port pressure, same as cqo. IDK why you'd find 0.6 instead of 0.55 with only one extra uop for p6 in the loop.)

(Zen can run this instruction with 0.25c throughput; no weird decode problems, and it's handled by every integer-ALU port.)

times 10 cdq in a dec/jnz loop can run from the uop cache, and runs at 0.5c throughput on Skylake (p06), plus loop overhead which also competes for p6.
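A rough back-end model of that loop, as a sketch rather than a measurement: n cdq uops plus one macro-fused dec/jnz uop all compete for two ports (p0 and p6), so a DSB-fed iteration should take about (n + 1) / 2 cycles. The helper name cycles_per_cdq is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: expected cycles per cdq when the loop is back-end bound.
# Assumes n single-uop cdq + 1 macro-fused dec/jnz uop sharing 2 ports (p0, p6).
cycles_per_cdq() {
    awk -v n="$1" 'BEGIN { printf "%.3f\n", (n + 1) / 2 / n }'
}

cycles_per_cdq 10   # unroll count of 10, as in the loop above
cycles_per_cdq 15   # a longer body spreads the loop overhead over more cdq
```

With an unroll count of 10 this gives 0.550 cycles per cdq: the 0.5c instruction throughput plus the share of the loop branch.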

times 20 cdq is more than 3 uop cache lines for one 32-byte block of machine code, meaning the loop can only run from legacy decode (with the top of the loop aligned). On Skylake this runs at 1 cycle per cdq. Perf counters confirm MITE delivers 1 uop per cycle, rather than groups of 3 or 4 with idle cycles between.
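The capacity arithmetic behind that can be sketched as below. It assumes the usual Sandybridge-family figure of up to 6 uops per uop-cache line (a number not stated above), the limit of 3 lines per 32-byte block, and dec/jnz counted as one macro-fused uop. The helper name fits_dsb is hypothetical.

```shell
#!/bin/sh
# Assumption: up to 6 uops per DSB line, at most 3 lines per 32-byte code block.
UOPS_PER_LINE=6
LINES_PER_BLOCK=3

# Hypothetical helper: does a loop body of n single-uop cdq plus one fused
# dec/jnz uop fit in the uop cache when it all sits in one 32-byte block?
fits_dsb() {
    uops=$(( $1 + 1 ))
    capacity=$(( UOPS_PER_LINE * LINES_PER_BLOCK ))
    if [ "$uops" -le "$capacity" ]; then
        echo "$1 cdq: $uops uops <= $capacity, can run from the DSB"
    else
        echo "$1 cdq: $uops uops > $capacity, falls back to legacy decode (MITE)"
    fi
}

fits_dsb 10
fits_dsb 20
```

Under those assumptions the 10-cdq body needs 11 of 18 available uop slots, while the 20-cdq body needs 21 and cannot be cached.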

default rel
%ifdef __YASM_VER__
    CPU Skylake AMD
%else
%use smartalign
alignmode p6, 64
%endif

global _start
_start:
    mov  ebp, 1000000000

align 64
.loop:
    ;times 10 cdq   ; 0.5c throughput
    ;times 20 cdq   ; 1c throughput, 1 MITE uop per cycle front-end

    ; times 10 cqo        ; 0.5c throughput 2-byte insn fits uop cache
    ; times 10 cdqe       ; 1c throughput data dependency
    ;times 10 cld         ; ~4c throughput, 3 uops

    dec ebp
    jnz .loop
.end:

    xor edi,edi
    mov eax,231   ; __NR_exit_group  from /usr/include/asm/unistd_64.h
    syscall       ; sys_exit_group(0)

On my Arch Linux desktop, I built this into a static executable to run under perf:

i7-6700k with epp=balance_performance (max "turbo" = 3.9GHz)
microcode revision 0xd6 (so LSD disabled, not that it matters: loops can only run from the LSD loop buffer if all their uops are in the DSB uop cache, IIRC.)

In a bash shell:
t=cdq-latency; nasm -f elf64 "$t".asm && ld -o "$t" "$t.o" && objdump -drwC -Mintel "$t" && taskset -c 3 perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,frontend_retired.dsb_miss,idq.dsb_uops,idq.mite_uops,idq.mite_cycles,idq_uops_not_delivered.core,idq_uops_not_delivered.cycles_fe_was_ok,idq.all_mite_cycles_4_uops ./"$t"

Disassembly:

0000000000401000 <_start>:
  401000:       bd 00 ca 9a 3b          mov    ebp,0x3b9aca00
  401005:       0f 1f 84 00 00 00 00 00         nop    DWORD PTR [rax+rax*1+0x0]
...
  40103d:       0f 1f 00                nop    DWORD PTR [rax]

0000000000401040 <_start.loop>:
  401040:       99                      cdq    
  401041:       99                      cdq    
  401042:       99                      cdq    
  401043:       99                      cdq    
...
  401052:       99                      cdq    
  401053:       99                      cdq             # 20 total CDQ
  401054:       ff cd                   dec    ebp
  401056:       75 e8                   jne    401040 <_start.loop>

0000000000401058 <_start.end>:
  401058:       31 ff                   xor    edi,edi
  40105a:       b8 e7 00 00 00          mov    eax,0xe7
  40105f:       0f 05                   syscall

Perf results:

 Performance counter stats for './cdq-latency':

          5,205.44 msec task-clock                #    1.000 CPUs utilized          
                 0      context-switches          #    0.000 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                 1      page-faults               #    0.000 K/sec                  
    20,124,711,776      cycles                    #    3.866 GHz                      (49.88%)
    22,015,118,295      instructions              #    1.09  insn per cycle           (59.91%)
    21,004,212,389      uops_issued.any           # 4035.049 M/sec                    (59.97%)
     1,005,872,141      frontend_retired.dsb_miss #  193.235 M/sec                    (60.03%)
                 0      idq.dsb_uops              #    0.000 K/sec                    (60.08%)
    20,997,157,414      idq.mite_uops             # 4033.694 M/sec                    (60.12%)
    19,996,447,738      idq.mite_cycles           # 3841.451 M/sec                    (40.03%)
    59,048,559,790      idq_uops_not_delivered.core # 11343.621 M/sec                   (39.97%)
       112,956,733      idq_uops_not_delivered.cycles_fe_was_ok #   21.700 M/sec                    (39.92%)
           209,490      idq.all_mite_cycles_4_uops #    0.040 M/sec                    (39.88%)

       5.206491348 seconds time elapsed

So the loop overhead (dec/jnz) happened basically for free, decoding in the same cycle as the last cdq. Counts are not exact because I used too many events in one run (with HT enabled), so perf did software multiplexing. From another run with fewer counters:

# same source, only these HW counters enabled to avoid multiplexing
          5,161.14 msec task-clock                #    1.000 CPUs utilized          

    20,107,065,550      cycles                    #    3.896 GHz                    
    20,000,134,955      idq.mite_cycles           # 3875.142 M/sec                  
    59,050,860,720      idq_uops_not_delivered.core # 11441.447 M/sec                 
        95,968,317      idq_uops_not_delivered.cycles_fe_was_ok #   18.594 M/sec

So we can see that MITE (legacy decode) was active basically every cycle, and that the front-end was basically never "ok" (i.e. it almost never delivered a full group of 4 uops and was almost never stalled by the back-end).
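Those conclusions can be re-derived from the raw counts of the un-multiplexed run. The snippet below is only a sketch: the values are copied from the output above, the iteration count comes from mov ebp, 1000000000, and the ratio helper is a made-up name.

```shell
#!/bin/sh
# Counter values copied from the un-multiplexed 20-cdq run.
ITERS=1000000000              # loop count from: mov ebp, 1000000000
CYCLES=20107065550
MITE_CYCLES=20000134955       # idq.mite_cycles
NOT_DELIVERED=59050860720     # idq_uops_not_delivered.core

# Hypothetical helper: print a/b with two decimals.
ratio() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'; }

echo "cycles per iteration:        $(ratio "$CYCLES" "$ITERS")"
echo "MITE-active cycle fraction:  $(ratio "$MITE_CYCLES" "$CYCLES")"
echo "undelivered slots per cycle: $(ratio "$NOT_DELIVERED" "$CYCLES")"
```

About 20.11 cycles per iteration for 20 cdq, and roughly 2.94 of the 4 issue slots left unfilled each cycle, is consistent with the front-end delivering about 1 uop per cycle.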

With only 10 CDQ instructions, allowing the DSB to work:

...
0000000000401040 <_start.loop>:
  401040:       99                      cdq    
  401041:       99                      cdq    
...
  401049:       99                      cdq        # 10 total CDQ insns
  40104a:       ff cd                   dec    ebp
  40104c:       75 f2                   jne    401040 <_start.loop>

 Performance counter stats for './cdq-latency' (4 runs):

          1,417.38 msec task-clock                #    1.000 CPUs utilized            ( +-  0.03% )
                 0      context-switches          #    0.000 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                 1      page-faults               #    0.001 K/sec                  
     5,511,283,047      cycles                    #    3.888 GHz                      ( +-  0.03% )  (49.83%)
    11,997,247,694      instructions              #    2.18  insn per cycle           ( +-  0.00% )  (59.99%)
    10,999,182,841      uops_issued.any           # 7760.224 M/sec                    ( +-  0.00% )  (60.17%)
           197,753      frontend_retired.dsb_miss #    0.140 M/sec                    ( +- 13.62% )  (60.21%)
    10,988,958,908      idq.dsb_uops              # 7753.010 M/sec                    ( +-  0.03% )  (60.21%)
        10,234,859      idq.mite_uops             #    7.221 M/sec                    ( +- 27.43% )  (60.21%)
         8,114,909      idq.mite_cycles           #    5.725 M/sec                    ( +- 26.11% )  (39.83%)
        40,588,332      idq_uops_not_delivered.core #   28.636 M/sec                    ( +- 21.83% )  (39.79%)
     5,502,581,002      idq_uops_not_delivered.cycles_fe_was_ok # 3882.221 M/sec                    ( +-  0.01% )  (39.79%)
            56,223      idq.all_mite_cycles_4_uops #    0.040 M/sec                    ( +-  3.32% )  (39.79%)

          1.417599 +- 0.000489 seconds time elapsed  ( +-  0.03% )

As reported by idq_uops_not_delivered.cycles_fe_was_ok, basically all the unused front-end uop slots were the fault of the back-end (port pressure on p0 / p6), not the front-end.
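The same kind of sanity check for the 10-cdq run, again only as a sketch: the values are copied from the output above (multiplexed, so approximate), and ratio is a made-up helper name.

```shell
#!/bin/sh
# Counter values copied from the 10-cdq run (perf multiplexed, so approximate).
ITERS=1000000000              # loop count from: mov ebp, 1000000000
CYCLES=5511283047
UOPS_ISSUED=10999182841       # uops_issued.any
FE_WAS_OK=5502581002          # idq_uops_not_delivered.cycles_fe_was_ok

# Hypothetical helper: print a/b with two decimals.
ratio() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'; }

echo "uops per iteration:    $(ratio "$UOPS_ISSUED" "$ITERS")"   # 10 cdq + fused dec/jnz
echo "cycles per iteration:  $(ratio "$CYCLES" "$ITERS")"
echo "uops per cycle:        $(ratio "$UOPS_ISSUED" "$CYCLES")"
echo "fe_was_ok cycle share: $(ratio "$FE_WAS_OK" "$CYCLES")"
```

Eleven uops in about 5.51 cycles is 2 uops per cycle, which is what two ports (p0 and p6) can sustain.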
