c++ - How many clock cycles does cost AVX/SSE exponentiation on modern x86_64 CPU?

深蓝 · Answer 1 · 2021-10-23T21:38:06+0000

The x86 SIMD instruction set (i.e. not x87), at least up to AVX2, does not include SIMD exp, log, or pow, with the exception of pow(x, 0.5), which is the square root (sqrtps/sqrtpd).
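Since the square root is the one case with direct hardware support, pow(x, 0.5) on four floats takes a single instruction. Below is a minimal sketch that assumes an x86_64 target, where SSE is always available. The function name sqrt_ps_array is hypothetical. AVX provides the same operation on eight floats as _mm256_sqrt_ps.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h> /* SSE intrinsics; baseline on every x86_64 CPU */

/* Hypothetical helper: y[i] = pow(x[i], 0.5) computed four floats at a
   time with the hardware square root (SQRTPS), which is correctly
   rounded. n must be a multiple of 4 to keep the sketch short. */
void sqrt_ps_array(const float *x, float *y, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_loadu_ps(x + i);        /* unaligned load of 4 floats */
        _mm_storeu_ps(y + i, _mm_sqrt_ps(v));  /* one SQRTPS for all four */
    }
}
```

Perfect squares come back exact, e.g. an input of {4, 9, 0.25, 100} yields {2, 3, 0.5, 10}.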

There are, however, SIMD math libraries, built from SIMD instructions, that provide these functions (among others). Intel's SVML includes:

__m256 _mm256_exp_ps(__m256)
__m256 _mm256_log_ps(__m256)
__m256 _mm256_pow_ps(__m256, __m256)

which Intel disingenuously calls intrinsics when they are in fact functions made up of several instructions. SVML is closed source and expensive. However, after installing the Intel OpenCL runtime I searched for "svml" and found some SVML files in the OpenCL directories, so I think you can get SVML indirectly through Intel's OpenCL runtime.

AMD also provides a SIMD math library, LibM, which is closed source but free and has its own SIMD math functions:

__m128 amd_vrs4_expf(__m128)
__m128 amd_vrs4_logf(__m128)
__m128 amd_vrs4_powf(__m128, __m128)

Agner Fog's Vector Class Library (VCL) provides an interface to both SVML and LibM; see the file vectormath_lib.h. From it you can work out which SVML and LibM functions correspond to each other.

Agner also provides his own code for these functions, which he claims is competitive with the proprietary Intel and AMD versions. For Agner's versions, look in vectormath_exp.h, e.g. at exp_f, log_f, and pow_template_f, and then look at the generated assembly.
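To give a feel for what such a function does internally, here is a minimal sketch of the usual technique. It uses range reduction plus a polynomial, written with SSE2 intrinsics. This is not Agner's code, and the names exp_ps_sketch and exp_ps_array are hypothetical. The polynomial uses plain Taylor coefficients, whereas a production library would use minimax coefficients. Inputs are clamped to [-87, 88], so the sketch never returns 0, denormals, or infinity, and it does not handle NaN.

```c
#include <assert.h>
#include <stddef.h>
#include <emmintrin.h> /* SSE2: baseline on x86_64 */

/* Sketch: exp(x) = 2^n * exp(r), n = round(x / ln2), r = x - n*ln2,
   so |r| <= ln2/2 and a short polynomial is accurate enough. */
static __m128 exp_ps_sketch(__m128 x)
{
    /* Clamp so that 2^n stays a normal float (n in [-126, 127]). */
    x = _mm_min_ps(_mm_max_ps(x, _mm_set1_ps(-87.0f)), _mm_set1_ps(88.0f));

    /* n = round(x * log2(e)) using the default round-to-nearest mode. */
    __m128i n = _mm_cvtps_epi32(_mm_mul_ps(x, _mm_set1_ps(1.44269504f)));
    __m128 nf = _mm_cvtepi32_ps(n);

    /* r = x - n*ln2, with ln2 split into hi + lo parts (Cody-Waite)
       so that the subtraction loses less precision. */
    __m128 r = _mm_sub_ps(x, _mm_mul_ps(nf, _mm_set1_ps(0.693359375f)));
    r = _mm_sub_ps(r, _mm_mul_ps(nf, _mm_set1_ps(-2.12194440e-4f)));

    /* exp(r) via a degree-7 Taylor polynomial in Horner form. */
    __m128 p = _mm_set1_ps(1.0f / 5040.0f);
    p = _mm_add_ps(_mm_mul_ps(p, r), _mm_set1_ps(1.0f / 720.0f));
    p = _mm_add_ps(_mm_mul_ps(p, r), _mm_set1_ps(1.0f / 120.0f));
    p = _mm_add_ps(_mm_mul_ps(p, r), _mm_set1_ps(1.0f / 24.0f));
    p = _mm_add_ps(_mm_mul_ps(p, r), _mm_set1_ps(1.0f / 6.0f));
    p = _mm_add_ps(_mm_mul_ps(p, r), _mm_set1_ps(0.5f));
    p = _mm_add_ps(_mm_mul_ps(p, r), _mm_set1_ps(1.0f));
    p = _mm_add_ps(_mm_mul_ps(p, r), _mm_set1_ps(1.0f));

    /* 2^n: write the biased exponent n + 127 into bits 23..30. */
    __m128i e = _mm_slli_epi32(_mm_add_epi32(n, _mm_set1_epi32(127)), 23);
    return _mm_mul_ps(p, _mm_castsi128_ps(e));
}

/* Hypothetical array wrapper; n must be a multiple of 4. */
void exp_ps_array(const float *x, float *y, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4)
        _mm_storeu_ps(y + i, exp_ps_sketch(_mm_loadu_ps(x + i)));
}
```

Compiling this with -O2 -S and comparing the loop body against the assembly generated for exp_f gives a concrete idea of how many instructions one vector exp costs.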

You can use SVML, LibM, and Agner's own functions to time the exp and log functions. However, you should know that SVML and LibM don't play well on each other's hardware. AMD's LibM, for example, is optimized for FMA4, which Intel does not have (Intel originally planned to support FMA4 and then suddenly switched to FMA3 after AMD had already committed to FMA4). Intel's libraries, for their part, use a CPU dispatcher that has been reported to select slower code paths on non-Intel processors; I suggest you read about it.

So if you time SVML on AMD processors or LibM on Intel processors, you will likely get very different performance results (unless you manage to replace Intel's CPU dispatch function). Unlike GPU instruction sets, the x86 instruction set is publicly documented, so you can build your own exp and log functions, and that is what Agner has done.
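Since the question asks about clock cycles, one simple way to time any of these functions is to read the time-stamp counter around repeated calls and keep the best run. Below is a minimal sketch that assumes GCC or Clang on x86_64, where __rdtsc comes from <x86intrin.h>. The names ticks_per_element and copy_floats are hypothetical. copy_floats serves only as a baseline for loop and memory overhead. On modern CPUs the TSC counts reference cycles at a fixed frequency, not actual core clock cycles, so turbo and power-saving states skew the result.

```c
#include <assert.h>
#include <stddef.h>
#include <x86intrin.h> /* __rdtsc on GCC/Clang; MSVC uses <intrin.h> */

/* Hypothetical signature for the function under test:
   reads n floats from x and writes n results to y. */
typedef void (*array_fn)(const float *x, float *y, size_t n);

/* Baseline that only copies, to estimate loop and memory overhead. */
void copy_floats(const float *x, float *y, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = x[i];
}

/* Returns the best-of-reps TSC ticks per element for fn.
   Taking the minimum filters out interrupts and cold-cache runs. */
double ticks_per_element(array_fn fn, const float *x, float *y,
                         size_t n, int reps)
{
    assert(n > 0 && reps > 0);
    unsigned long long best = ~0ULL;
    for (int r = 0; r < reps; r++) {
        unsigned long long t0 = __rdtsc();
        fn(x, y, n);
        unsigned long long t1 = __rdtsc();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return (double)best / (double)n;
}
```

To time a function, pass your exp or log array function in place of copy_floats (for example a wrapper around SVML, LibM, or VCL) and subtract the baseline. For true core-cycle counts, a hardware performance counter tool such as perf is more reliable.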

Update

Glibc 2.22 (which should come out soon) has a vector math library called libmvec. Apparently it is used starting at -O1 when combined with -ffast-math and -fopenmp. I'm not sure why fast-math and OpenMP are necessary (particularly in the example below, where associative math is not needed), but it's great to finally have a SIMD math library in the GNU C library.

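//gcc ./cos.c -O1 -fopenmp -ffast-math -lm -mavx2
// With these flags GCC can replace the scalar cos() calls in the loop
// below with calls to a vectorized libmvec cos (4 doubles per call with AVX2).
#include <math.h>

int N = 3200;
double b[3200];
double a[3200];   // global, so zero-initialized: every b[i] becomes cos(0) = 1

int main (void)
{
  int i;

  #pragma omp simd   // ask the compiler to vectorize this loop
  for (i = 0; i < N; i += 1)
  {
    b[i] = cos (a[i]);
  }

  return (0);
}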
