c++ - Replacing a 32-bit loop counter with 64-bit introduces crazy performance deviations with _mm_popcnt_u64 on Intel CPUs

I was looking for the fastest way to popcount large arrays of data.

I encountered a very weird effect: Changing the loop variable from unsigned to uint64_t made the performance drop by 50% on my PC.

The Benchmark

#include <iostream>
#include <chrono>
#include <cstdint>
#include <cstdlib>
#include <x86intrin.h>

int main(int argc, char* argv[]) {

    using namespace std;
    if (argc != 2) {
       cerr << "usage: array_size in MB" << endl;
       return -1;
    }

    uint64_t size = atol(argv[1])<<20;
    uint64_t* buffer = new uint64_t[size/8];
    char* charbuffer = reinterpret_cast<char*>(buffer);
    for (unsigned i=0; i<size; ++i)
        charbuffer[i] = rand()%256;

    uint64_t count,duration;
    chrono::time_point<chrono::system_clock> startP,endP;
    {
        startP = chrono::system_clock::now();
        count = 0;
        for( unsigned k = 0; k < 10000; k++){
            // Tight unrolled loop with unsigned
            for (unsigned i=0; i<size/8; i+=4) {
                count += _mm_popcnt_u64(buffer[i]);
                count += _mm_popcnt_u64(buffer[i+1]);
                count += _mm_popcnt_u64(buffer[i+2]);
                count += _mm_popcnt_u64(buffer[i+3]);
            }
        }
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
        cout << "unsigned" << count << '' << (duration/1.0E9) << " sec "
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }
    {
        startP = chrono::system_clock::now();
        count=0;
        for( unsigned k = 0; k < 10000; k++){
            // Tight unrolled loop with uint64_t
            for (uint64_t i=0;i<size/8;i+=4) {
                count += _mm_popcnt_u64(buffer[i]);
                count += _mm_popcnt_u64(buffer[i+1]);
                count += _mm_popcnt_u64(buffer[i+2]);
                count += _mm_popcnt_u64(buffer[i+3]);
            }
        }
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
        cout << "uint64_t"  << count << '' << (duration/1.0E9) << " sec "
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }

    delete[] buffer;   // buffer was allocated with new[], so free() would be undefined behaviour
}

As you see, we create a buffer of random data, with the size being x megabytes where x is read from the command line.

Afterwards, we iterate over the buffer and use an unrolled version of the x86 popcount intrinsic to perform the popcount.

To get a more precise result, we do the popcount 10,000 times.

We measure the times for the popcount.

In the first block, the inner loop variable is unsigned; in the second block, the inner loop variable is uint64_t.

I thought that this should make no difference, but the opposite is the case.

The (absolutely crazy) results

I compile it like this (g++ version: Ubuntu 4.8.2-19ubuntu1):

g++ -O3 -march=native -std=c++11 test.cpp -o test

Here are the results on my Haswell Core i7-4770K CPU @ 3.50 GHz, running test 1 (so 1 MB random data):

  • unsigned 41959360000 0.401554 sec 26.113 GB/s

  • uint64_t 41959360000 0.759822 sec 13.8003 GB/s

As you see, the throughput of the uint64_t version is only half that of the unsigned version!

The problem seems to be that different assembly gets generated, but why?

First, I thought of a compiler bug, so I tried clang++ (Ubuntu Clang version 3.4-1ubuntu3):

clang++ -O3 -march=native -std=c++11 test.cpp -o test

Result: test 1

  • unsigned 41959360000 0.398293 sec 26.3267 GB/s

  • uint64_t 41959360000 0.680954 sec 15.3986 GB/s

So, it is almost the same result and is still strange.

But now it gets super strange.

I replace the buffer size that was read from input with a constant 1, so I change:

uint64_t size = atol(argv[1]) << 20;

to

uint64_t size = 1 << 20;

Thus, the compiler now knows the buffer size at compile time.

Maybe it can add some optimizations!

Here are the numbers for g++:

  • unsigned 41959360000 0.509156 sec 20.5944 GB/s

  • uint64_t 41959360000 0.508673 sec 20.6139 GB/s

Now, both versions are equally fast.

However, the unsigned version got even slower!

It dropped from 26 to 20 GB/s; thus, replacing a non-constant by a constant value led to a deoptimization.

Seriously, I have no clue what is going on here!

But now to clang++ with the new version:

  • unsigned 41959360000 0.677009 sec 15.4884 GB/s

  • uint64_t 41959360000 0.676909 sec 15.4906 GB/s

Wait, what?

Now, both versions dropped to the slow number of 15 GB/s.

Thus, replacing a non-constant by a constant value even led to slow code in both cases for Clang!

I asked a colleague with an Ivy Bridge CPU to compile my benchmark.

He got similar results, so it does not seem to be specific to Haswell.

Because two compilers produce strange results here, it also does not seem to be a compiler bug.

We do not have an AMD CPU here, so we could only test with Intel.

More madness, please!

Take the first example (the one with atol(argv[1])) and put a static before the variable, i.e.:

static uint64_t size=atol(argv[1])<<20;

Here are my results in g++:

  • unsigned 41959360000 0.396728 sec 26.4306 GB/s

  • uint64_t 41959360000 0.509484 sec 20.5811 GB/s

Yay, yet another alternative.

We still have the fast 26 GB/s with u32, but we managed to get u64 at least from the 13 GB/s to the 20 GB/s version!

On my colleague's PC, the u64 version became even faster than the u32 version, yielding the fastest result of all.

Sadly, this only works for g++; clang++ does not seem to care about static.

My question

Can you explain these results?

Especially:

  • How can there be such a difference between u32 and u64?

  • How can replacing a non-constant by a constant buffer size trigger less optimal code?

1 Reply

Culprit: False Data Dependency (and the compiler isn't even aware of it)

On Sandy/Ivy Bridge and Haswell processors, the instruction:

popcnt  src, dest

appears to have a false dependency on the destination register dest.

Even though the instruction only writes to dest, it will still wait until dest is ready before executing.
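
For illustration (registers chosen arbitrarily), that means the second popcnt below cannot start until the first one has produced %rax, even though it completely overwrites %rax:

    popcnt %rcx, %rax    # writes %rax
    popcnt %rdx, %rax    # only writes %rax, yet falsely waits on the popcnt above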

This false dependency is (now) documented by Intel as erratum HSD146 (Haswell) and SKL029 (Skylake).

Skylake fixed this for lzcnt and tzcnt.
Cannon Lake (and Ice Lake) fixed this for popcnt.
bsf/bsr have a true output dependency: output unmodified for input=0.

(But no way to take advantage of that with intrinsics - only AMD documents it and compilers don't expose it.)

(Yes, these instructions all run on the same execution unit.)


This dependency doesn't just hold up the 4 popcnts from a single loop iteration.

It can carry across loop iterations, making it impossible for the processor to parallelize different loop iterations.

The unsigned vs. uint64_t and other tweaks don't directly affect the problem.

But they influence the register allocator, which assigns the registers to the variables.

In your case, the speeds are a direct result of what is stuck to the (false) dependency chain, depending on what the register allocator decided to do.

  • 13 GB/s has a chain: popcnt - add - popcnt - popcnt → next iteration

  • 15 GB/s has a chain: popcnt - add - popcnt - add → next iteration

  • 20 GB/s has a chain: popcnt - popcnt → next iteration

  • 26 GB/s has a chain: popcnt - popcnt → next iteration

The difference between 20 GB/s and 26 GB/s seems to be a minor artifact of the indirect addressing.

Either way, the processor starts to hit other bottlenecks once you reach this speed.


To test this, I used inline assembly to bypass the compiler and get exactly the assembly I want.

I also split up the count variable to break all other dependencies that might mess with the benchmarks.
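
As a side note, the source-level equivalent of splitting the count variable would look roughly like the sketch below. This is only an illustration of the idea, not the code that produced the numbers; those come from the hand-written assembly shown afterwards.

    // Sketch only: four independent accumulators, so the adds do not form
    // one long dependency chain across the unrolled popcnts.
    uint64_t count0 = 0, count1 = 0, count2 = 0, count3 = 0;
    for (uint64_t i = 0; i < size/8; i += 4) {
        count0 += _mm_popcnt_u64(buffer[i]);
        count1 += _mm_popcnt_u64(buffer[i+1]);
        count2 += _mm_popcnt_u64(buffer[i+2]);
        count3 += _mm_popcnt_u64(buffer[i+3]);
    }
    uint64_t count = count0 + count1 + count2 + count3;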

Here are the results:

Sandy Bridge Xeon @ 3.5 GHz: (full test code can be found at the bottom)

  • GCC 4.6.3: g++ popcnt.cpp -std=c++0x -O3 -save-temps -march=native

  • Ubuntu 12

Different Registers: 18.6195 GB/s

.L4:
    movq    (%rbx,%rax,8), %r8
    movq    8(%rbx,%rax,8), %r9
    movq    16(%rbx,%rax,8), %r10
    movq    24(%rbx,%rax,8), %r11
    addq    $4, %rax

    # Each popcnt writes a different destination register, so the false
    # output dependency never chains from one popcnt to the next.
    popcnt %r8, %r8
    add    %r8, %rdx
    popcnt %r9, %r9
    add    %r9, %rcx
    popcnt %r10, %r10
    add    %r10, %rdi
    popcnt %r11, %r11
    add    %r11, %rsi

    cmpq    $131072, %rax
    jne .L4

Same Register: 8.49272 GB/s

.L9:
    movq    (%rbx,%rdx,8), %r9
    movq    8(%rbx,%rdx,8), %r10
    movq    16(%rbx,%rdx,8), %r11
    movq    24(%rbx,%rdx,8), %rbp
    addq    $4, %rdx

    # This time reuse "rax" for all the popcnts.
    popcnt %r9, %rax
    add    %rax, %rcx
    popcnt %r10, %rax
    add    %rax, %rsi
    popcnt %r11, %rax
    add    %rax, %r8
    popcnt %rbp, %rax
    add    %rax, %rdi

    cmpq    $131072, %rdx
    jne .L9

Same Register with broken chain: 17.8869 GB/s

.L14:
    movq    (%rbx,%rdx,8), %r9
    movq    8(%rbx,%rdx,8), %r10
    movq    16(%rbx,%rdx,8), %r11
    movq    24(%rbx,%rdx,8), %rbp
    addq    $4, %rdx

    # Reuse "rax" for all the popcnts.
    xor    %rax, %rax    # Break the cross-iteration dependency by zeroing "rax".
    popcnt %r9, %rax
    add    %rax, %rcx
    popcnt %r10, %rax
    add    %rax, %rsi
    popcnt %r11, %rax
    add    %rax, %r8
    popcnt %rbp, %rax
    add    %rax, %rdi

    cmpq    $131072, %rdx
    jne .L14

So what went wrong with the compiler?

It seems that neither GCC nor Visual Studio is aware that popcnt has such a false dependency.

Nevertheless, these false dependencies aren't uncommon.

It's just a matter of whether the compiler is aware of it.

popcnt isn't exactly the most used instruction.

So it's not really a surprise that a major compiler could miss something like this.

There also appears to be no documentation anywhere that mentions this problem.

If Intel doesn't disclose it, then nobody outside will know until someone runs into it by chance.

(Update: As of version 4.9.2, GCC is aware of this false dependency and generates code to compensate for it when optimizations are enabled. Major compilers from other vendors, including Clang, MSVC, and even Intel's own ICC, are not yet aware of this microarchitectural erratum and will not emit code that compensates for it.)
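
If your compiler does not apply such a compensation, a manual workaround is possible with GCC/Clang inline assembly. The sketch below (the wrapper name is mine, not a standard API) zeroes the destination with a dependency-breaking xor before every popcnt, which is broadly the same idea GCC uses:

    #include <cstdint>

    // Sketch only: break popcnt's false output dependency by zeroing the
    // destination first; "xor reg, reg" is a zeroing idiom the renamer
    // treats as dependency-breaking.
    static inline uint64_t popcnt_no_false_dep(uint64_t x) {
        uint64_t result;
        __asm__("xor %0, %0\n\t"
                "popcnt %1, %0"
                : "=&r"(result)   // early clobber: keep result out of x's register
                : "r"(x)
                : "cc");
        return result;
    }

Routing each _mm_popcnt_u64 call through such a wrapper removes the chain no matter which register the compiler picks for the destination, at the cost of one extra xor per popcnt.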

Why does the CPU have such a false dependency?

We can speculate: it runs on the same execution unit as bsf/bsr, which do have an output dependency.

(How is POPCNT implemented in hardware?).

For those instructions, Intel documents the integer result for input=0 as "undefined" (with ZF=1), but Intel hardware actually gives a stronger guarantee to avoid breaking old software: output unmodified.

AMD documents this behaviour.
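
Only inline assembly can exploit that guarantee, since the intrinsics don't expose it. A small sketch (the function name and fallback parameter are purely illustrative) that keeps a preloaded default when the input is zero:

    #include <cstdint>

    // Sketch only: relies on bsf leaving the destination unmodified when the
    // source is zero (documented by AMD, matched by Intel hardware).
    static inline uint64_t bsf_or(uint64_t x, uint64_t fallback) {
        uint64_t result = fallback;   // survives unchanged if x == 0
        __asm__("bsf %1, %0"
                : "+r"(result)        // read-write, so the old value matters
                : "r"(x)
                : "cc");
        return result;
    }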

Presumably it was somehow inconvenient to make some of this execution unit's uops dependent on the output while others are not.

AMD processors do not appear to have this false dependency.

