Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


c - Building backward compatible binaries with newer CPU instructions support

What is the best way to implement multiple versions of the same function, one that uses specific CPU instructions if they are available (tested at run time) and one that falls back to a slower implementation if not?

For example, x86 BMI2 provides a very useful PDEP instruction. How would I write C code that tests for BMI2 availability on the executing CPU at startup and then uses one of two implementations: one that calls _pdep_u64 (available with -mbmi2), and another that does the bit manipulation "by hand" in C? Is there any built-in support for such cases? How would I make GCC compile for an older architecture while still providing access to the newer intrinsic? I suspect execution is faster if the function is invoked via a global function pointer rather than through an if/else on every call.


1 Reply


You can declare a function pointer and point it at the right implementation at program startup, using the CPUID instruction to check which features the running CPU supports.
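As a minimal sketch of that function-pointer approach, assuming GCC or Clang on x86-64: the names pdep_soft, pdep_bmi2, pdep_impl and pdep_init are made up for this illustration. The per-function target attribute lets one function use _pdep_u64 without compiling the whole file with -mbmi2, and __builtin_cpu_supports does the CPUID check. On other compilers or architectures the sketch simply keeps the portable version.

```c
#include <stdint.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define HAVE_X86_DISPATCH 1   /* hypothetical macro, local to this sketch */
#endif

/* Portable fallback: deposit the low bits of src at the positions set in mask. */
static uint64_t pdep_soft(uint64_t src, uint64_t mask) {
    uint64_t result = 0;
    for (uint64_t bit = 1; mask != 0; bit <<= 1) {
        uint64_t lowest = mask & (~mask + 1);  /* isolate lowest set bit of mask */
        if (src & bit)
            result |= lowest;
        mask &= mask - 1;                      /* clear that bit and move on */
    }
    return result;
}

#ifdef HAVE_X86_DISPATCH
/* Only this function is compiled with BMI2 enabled. */
static __attribute__((target("bmi2")))
uint64_t pdep_bmi2(uint64_t src, uint64_t mask) {
    return _pdep_u64(src, mask);
}
#endif

/* Global function pointer; starts at the safe fallback. */
static uint64_t (*pdep_impl)(uint64_t, uint64_t) = pdep_soft;

/* Call once at startup, before the first use of pdep_impl. */
static void pdep_init(void) {
#ifdef HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("bmi2"))
        pdep_impl = pdep_bmi2;
#endif
}
```

Call pdep_init() once early in main, then pdep_impl(src, mask) wherever the operation is needed. Every call is then an indirect call, with no feature test on the hot path.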

But it's better to rely on the support that many modern compilers provide. Intel's ICC has long had automatic function dispatching that selects the optimized version for each architecture. I don't know the details, but it looks like it only applies to Intel's libraries. Besides, it only dispatches to the efficient version on Intel CPUs, which is unfair to other manufacturers. Agner's CPU blog describes many patches and workarounds for that.

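Later, GCC 4.8 introduced a feature called Function Multiversioning. You write each version of the function separately and tag it with the target attribute; the compiler generates a dispatcher that picks the best match for the running CPU. In GCC this form is supported by the C++ front end on x86, and the dispatcher needs a "default" version to fall back to.

// Compile as C++ with GCC.
__attribute__ ((target ("default")))
int foo() { return 0; }   // fallback when no other version matches

__attribute__ ((target ("sse4.2")))
int foo() { return 1; }   // chosen on CPUs with SSE4.2

__attribute__ ((target ("arch=atom")))
int foo() { return 2; }   // chosen on Intel Atom

int main() {
    int (*p)() = &foo;    // a pointer to foo also goes through the dispatcher
    return foo() + p();
}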

Writing every version by hand duplicates a lot of code and is cumbersome, so GCC 6 added the target_clones attribute, which tells GCC to compile a single function body into multiple clones. For example, __attribute__((target_clones("avx2","arch=atom","default"))) void foo() {} creates three different versions of foo. More information can be found in GCC's documentation on function attributes.
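A small sketch of target_clones in plain C, under stated assumptions: sum_u32 and the CLONE_FOR_AVX2 macro are names invented for this example. The guard restricts the attribute to GCC 6 or later on x86-64 with glibc, because the clones are dispatched through ifunc; elsewhere the function is compiled once as usual. Every clone is built from the same source body, so the attribute suits code the compiler can optimize on its own (here, a loop it may vectorize) rather than code that calls intrinsics such as _pdep_u64 directly.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical guard macro: only request clones where they are known to work. */
#if defined(__x86_64__) && defined(__GLIBC__) && \
    defined(__GNUC__) && !defined(__clang__) && (__GNUC__ >= 6)
#define CLONE_FOR_AVX2 __attribute__((target_clones("avx2", "default")))
#else
#define CLONE_FOR_AVX2
#endif

/* One source body; the compiler may emit an AVX2 clone and a baseline clone
   and pick between them when the program is loaded. */
CLONE_FOR_AVX2
uint64_t sum_u32(const uint32_t *values, size_t count) {
    uint64_t total = 0;
    for (size_t i = 0; i < count; i++)
        total += values[i];
    return total;
}
```

Callers just call sum_u32(values, count); no dispatch code appears in the source.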

The syntax was later adopted by Clang and ICC. Performance can even be better than with a hand-written global function pointer, because the function symbol is resolved to the right version by the dynamic loader (on Linux, through GNU indirect functions, ifunc) instead of by your own code at run time. It's one of the reasons Intel's Clear Linux runs so fast. ICC may also create multiple versions of a single loop during auto-vectorization.

Here's an example from the article "The one with multi-versioning (Part II)", which comes with a demo. It's about popcnt rather than PDEP, but the idea is the same.

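#include <stdint.h>

// popcount64_builtin_multiarch_loop(uint64_t) is not defined in this excerpt;
// it presumably comes from the article the example is taken from.
__attribute__((target_clones("popcnt","default")))
int runPopcount64_builtin_multiarch_loop(const uint8_t* bitfield, int64_t size, int repeat) {
    int res = 0;
    const uint64_t* data = (const uint64_t*)bitfield;

    // Both clones compile this same loop; only the "popcnt" clone may use POPCNT.
    for (int r = 0; r < repeat; r++) {
        for (int64_t i = 0; i < size / 8; i++) {
            res += popcount64_builtin_multiarch_loop(data[i]);
        }
    }

    return res;
}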

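Note that PDEP and PEXT are very slow on AMD CPUs before Zen 3, where they are implemented in microcode, so checking the BMI2 feature bit alone isn't enough. The simplest policy is to enable them only on Intel.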
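One way to express that "Intel only" policy is sketched below, assuming GCC or Clang on x86-64. The function names should_use_hw_pdep and cpu_wants_hw_pdep are hypothetical; __builtin_cpu_supports and __builtin_cpu_is are the compiler builtins doing the actual CPU query. On other compilers or architectures the sketch reports false, so the software fallback would be used.

```c
#include <stdbool.h>

/* Pure policy: use hardware PDEP only when BMI2 is present on an Intel CPU. */
static bool should_use_hw_pdep(bool has_bmi2, bool is_intel) {
    return has_bmi2 && is_intel;
}

/* Queries the running CPU and applies the policy above. */
static bool cpu_wants_hw_pdep(void) {
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    __builtin_cpu_init();
    return should_use_hw_pdep(__builtin_cpu_supports("bmi2") != 0,
                              __builtin_cpu_is("intel") != 0);
#else
    return false;   /* no builtins available: stay on the fallback */
#endif
}
```

The result of cpu_wants_hw_pdep() can be checked once at startup to decide which implementation a function pointer should refer to.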

