This is a bug in the way Apple's LLVM-backed GCC (llvm-gcc
) transforms OpenMP regions and handles calls to the built-ins inside them. The problem can be diagnosed by examining the intermediate tree dumps (obtainable by passing -fdump-tree-all
argument to gcc
). Without OpenMP enabled the following final code representation is generated (from the test.c.016t.fap
):
main (argc, argv)
{
D.6544 = __builtin_object_size (temp, 0);
D.6545 = __builtin_object_size (temp, 0);
D.6547 = __builtin___memcpy_chk (temp, D.6546, 10, D.6545);
D.6550 = __builtin_ia32_shufpd (v_a, v_a, 1);
}
This is a C-like representation of how the compiler sees the code internally after all transformations. This is what is then gets turned into assembly instructions. (only those lines that refer to the built-ins are shown here)
With OpenMP enabled the parallel region is extracted into own function, main.omp_fn.0
:
main.omp_fn.0 (.omp_data_i)
{
void * (*<T4f6>) (void *, const <unnamed type> *, long unsigned int, long unsigned int) __builtin___memcpy_chk.21;
long unsigned int (*<T4f5>) (const <unnamed type> *, int) __builtin_object_size.20;
vector double (*<T6b5>) (vector double, vector double, int) __builtin_ia32_shufpd.23;
long unsigned int (*<T4f5>) (const <unnamed type> *, int) __builtin_object_size.19;
__builtin_object_size.19 = __builtin_object_size;
D.6587 = __builtin_object_size.19 (D.6603, 0);
__builtin_ia32_shufpd.23 = __builtin_ia32_shufpd;
D.6593 = __builtin_ia32_shufpd.23 (v_a, v_a, 1);
__builtin_object_size.20 = __builtin_object_size;
D.6588 = __builtin_object_size.20 (D.6605, 0);
__builtin___memcpy_chk.21 = __builtin___memcpy_chk;
D.6590 = __builtin___memcpy_chk.21 (D.6609, D.6589, 10, D.6588);
}
Again I have only left the code that refers to the builtins. What is apparent (but the reason for that is not immediately apparent to me) is that the OpenMP code trasnformer really insists on calling all the built-ins through function pointers. These pointer asignments:
__builtin_object_size.19 = __builtin_object_size;
__builtin_ia32_shufpd.23 = __builtin_ia32_shufpd;
__builtin_object_size.20 = __builtin_object_size;
__builtin___memcpy_chk.21 = __builtin___memcpy_chk;
generate external references to symbols which are not really symbols but rather names that get special treatment by the compiler. The linker then tries to resolve them but is unable to find any of the __builtin_*
names in any of the object files that the code is linked against. This is also observable in the assembly code that one can obtain by passing -S
to gcc
:
LBB2_1:
movapd -48(%rbp), %xmm0
movl $1, %eax
movaps %xmm0, -80(%rbp)
movaps -80(%rbp), %xmm1
movl %eax, %edi
callq ___builtin_ia32_shufpd
movapd %xmm0, -32(%rbp)
This basically is a function call that takes 3 arguments: one integer in %eax
and two XMM arguments in %xmm0
and %xmm1
, with the result being returned in %xmm0
(as per the SysV AMD64 ABI function calling convention). In contrast, the code generated without -fopenmp
is an instruction-level expansion of the intrinsic as it is supposed to happen:
LBB1_3:
movapd -64(%rbp), %xmm0
shufpd $1, %xmm0, %xmm0
movapd %xmm0, -80(%rbp)
What happens when you pass -D_FORTIFY_SOURCE=0
is that memcpy
is not replaced by the "fortified" checking version and a regular call to memcpy
is used instead. This eliminates the references to object_size
and __memcpy_chk
but cannot remove the call to the ia32_shufpd
built-in.
This is obviously a compiler bug. If you really really really must use Apple's GCC to compile the code, then an interim solution would be to move the offending code to an external function as the bug apparently only affects code that gets extracted from parallel
regions:
void func(char *temp, char *argv0)
{
__m128d v_a, v_ar;
memcpy(temp, argv0, 10);
v_ar = _mm_shuffle_pd(v_a, v_a, _MM_SHUFFLE2 (0,1));
}
int main(int argc, char *argv[])
{
char *temp;
#pragma omp parallel
{
func(temp, argv[0]);
}
}
The overhead of one additional function call is neglegible compared to the overhead of entering and exiting the parallel
region. You can use OpenMP pragmas inside func
- they will work because of the dynamic scoping of the parallel
region.
May be Apple would provide a fixed compiler in the future, may they won't, given their commitment to replacing GCC with Clang.