There is a way to do this and it's not so difficult.
The main things you need to know about are the function calling conventions, the object file format, and function name mangling.
Function Calling Conventions
In 32-bit mode Windows and Unix (i.e. Linux, BSD, Mac OS X, ...) use the same function calling convention.
In 64-bit mode, however, Windows and Unix use different function calling conventions. For an object file compiled with GCC to work with MSVC in 64-bit mode, it must use the Windows calling convention. With GCC you can do this with the -mabi=ms option, for example:
g++ -c -mabi=ms -mavx -fopenmp -O3 foo.cpp
The Object File Format
The object file format on Linux is ELF and on Windows it is COFF/PE. To use an object file compiled with GCC in MSVC, it must be converted from ELF to COFF. To do this you need an object file converter; I use Agner Fog's objconv. For example, to convert from ELF64 to 64-bit COFF (PE32+) do:
objconv -fcoff64 foo.o foo.obj
Function Name Mangling
Because of function overloading, C++ compilers mangle function names, and GCC and MSVC do this differently. To get around this, declare the function with extern "C" so it is exported with an unmangled C name.
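As a minimal sketch (the shared header is hypothetical, matching the example further down), wrapping the declaration in extern "C" gives both compilers the same unmangled symbol:

// foo.h -- hypothetical shared header, included on both the GCC and the MSVC side
#ifdef __cplusplus
extern "C" {
#endif
/* Exported as the plain C symbol "inner" by both compilers,
   rather than two incompatible mangled names. */
void inner(const int n, const float *a, const float *b, float *c,
           const int stridea, const int strideb, const int stridec);
#ifdef __cplusplus
}
#endif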
More details on the calling conventions, object file formats, and function name mangling can be found in Agner Fog's calling conventions manual.
Below is a module I compiled with GCC and then used in MSVC (because GCC optimized it better). I compiled it with -mabi=ms, converted it to COFF64 with objconv, and then linked it into Visual Studio, where it ran flawlessly.
#include <immintrin.h>

// Multiply-accumulate kernel: treats c as eight 8-float vectors and adds a[i]*b to them for i = 0..n-1.
extern "C" void inner(const int n, const float *a, const float *b, float *c, const int stridea, const int strideb, const int stridec) {
    const int vec_size = 8;
    __m256 tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7;
    // Load the current 8x8 block of c into registers.
    tmp0 = _mm256_loadu_ps(&c[0*vec_size]);
    tmp1 = _mm256_loadu_ps(&c[1*vec_size]);
    tmp2 = _mm256_loadu_ps(&c[2*vec_size]);
    tmp3 = _mm256_loadu_ps(&c[3*vec_size]);
    tmp4 = _mm256_loadu_ps(&c[4*vec_size]);
    tmp5 = _mm256_loadu_ps(&c[5*vec_size]);
    tmp6 = _mm256_loadu_ps(&c[6*vec_size]);
    tmp7 = _mm256_loadu_ps(&c[7*vec_size]);
    for(int i=0; i<n; i++) {
        // Broadcast a[i] and accumulate a[i]*b into each of the eight accumulators.
        __m256 areg0 = _mm256_set1_ps(a[i]);
        __m256 breg0 = _mm256_loadu_ps(&b[vec_size*(8*i + 0)]);
        tmp0 = _mm256_add_ps(_mm256_mul_ps(areg0,breg0), tmp0);
        __m256 breg1 = _mm256_loadu_ps(&b[vec_size*(8*i + 1)]);
        tmp1 = _mm256_add_ps(_mm256_mul_ps(areg0,breg1), tmp1);
        __m256 breg2 = _mm256_loadu_ps(&b[vec_size*(8*i + 2)]);
        tmp2 = _mm256_add_ps(_mm256_mul_ps(areg0,breg2), tmp2);
        __m256 breg3 = _mm256_loadu_ps(&b[vec_size*(8*i + 3)]);
        tmp3 = _mm256_add_ps(_mm256_mul_ps(areg0,breg3), tmp3);
        __m256 breg4 = _mm256_loadu_ps(&b[vec_size*(8*i + 4)]);
        tmp4 = _mm256_add_ps(_mm256_mul_ps(areg0,breg4), tmp4);
        __m256 breg5 = _mm256_loadu_ps(&b[vec_size*(8*i + 5)]);
        tmp5 = _mm256_add_ps(_mm256_mul_ps(areg0,breg5), tmp5);
        __m256 breg6 = _mm256_loadu_ps(&b[vec_size*(8*i + 6)]);
        tmp6 = _mm256_add_ps(_mm256_mul_ps(areg0,breg6), tmp6);
        __m256 breg7 = _mm256_loadu_ps(&b[vec_size*(8*i + 7)]);
        tmp7 = _mm256_add_ps(_mm256_mul_ps(areg0,breg7), tmp7);
    }
    // Store the accumulated block back to c.
    _mm256_storeu_ps(&c[0*vec_size], tmp0);
    _mm256_storeu_ps(&c[1*vec_size], tmp1);
    _mm256_storeu_ps(&c[2*vec_size], tmp2);
    _mm256_storeu_ps(&c[3*vec_size], tmp3);
    _mm256_storeu_ps(&c[4*vec_size], tmp4);
    _mm256_storeu_ps(&c[5*vec_size], tmp5);
    _mm256_storeu_ps(&c[6*vec_size], tmp6);
    _mm256_storeu_ps(&c[7*vec_size], tmp7);
}
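For completeness, here is a minimal sketch of the MSVC side; the array sizes and the zero stride arguments are only illustrative assumptions (the kernel above never reads the stride parameters). Add the converted foo.obj to the Visual Studio project (or pass it to the linker) and declare the function with the same extern "C" signature:

// main.cpp -- built with MSVC, linked against the converted foo.obj
extern "C" void inner(const int n, const float *a, const float *b, float *c,
                      const int stridea, const int strideb, const int stridec);

int main() {
    const int n = 4;                 // illustrative sizes only
    float a[n] = {};
    float b[8 * 8 * n] = {};
    float c[8 * 8] = {};
    inner(n, a, b, c, 0, 0, 0);      // strides unused by this kernel
    return 0;
}

From a Developer Command Prompt, something like cl main.cpp foo.obj should also build and link it directly.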