In theory, alignment should not matter on Intel processors since Nehalem, so your compiler should be able to produce code in which it makes no difference whether a pointer is aligned or not.
Unaligned load/store instructions have had the same performance as aligned ones on Intel processors since Nehalem. However, until AVX arrived with Sandy Bridge, an unaligned load could not be folded into another operation (e.g. used directly as the memory operand of mulps), so it missed out on micro-op fusion.
Additionally, even before AVX, having 16-byte-aligned memory could still help avoid the penalty of cache-line splits, so it would still be reasonable for a compiler to add code to run scalar iterations until the pointer is 16-byte aligned.
Since AVX there is no advantage to using the aligned load/store instructions anymore, so on that account there is no reason for a compiler to add code to make a pointer 16-byte or 32-byte aligned.
However, there is still a reason to use aligned memory with AVX: to avoid cache-line splits. Therefore, it would be reasonable for a compiler to add code to make the pointer 32-byte aligned even if it still used an unaligned load instruction.
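To make that concrete, the alignment code a compiler adds amounts to something like this hand-written peeling loop (a minimal sketch of the idea, not what any particular compiler emits; the function name scale is mine):

#include <stdint.h>

// Sketch: run scalar iterations until b reaches a 32-byte boundary, then the
// remaining (vectorizable) loop can use accesses that never split a cache line.
void scale(float * __restrict a, float * __restrict b, int n)
{
    int i = 0;
    while (i < n && ((uintptr_t)&b[i] & 31) != 0) {  // peel until b is 32-byte aligned
        b[i] = 3.14159f*a[i];
        i++;
    }
    for (; i < n; i++) {                             // main loop, b aligned from here on
        b[i] = 3.14159f*a[i];
    }
}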
So in practice some compilers produce much simpler code when they are told to assume that a pointer is aligned.
I'm not aware of a method to tell MSVC that a pointer is aligned. With GCC and Clang (since 3.6) you can use the built-in __builtin_assume_aligned. With ICC (and also GCC) you can use #pragma omp simd aligned. With ICC you can also use __assume_aligned.
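For reference, the other two hints would look roughly like this (a sketch; the function names are mine, the OpenMP variant needs OpenMP enabled, e.g. -fopenmp-simd on GCC, and __assume_aligned is ICC-specific):

// ICC (and GCC) via OpenMP 4.0: promise the vectorizer 16-byte alignment.
void foo_omp(float * __restrict a, float * __restrict b, int n)
{
    #pragma omp simd aligned(a, b : 16)
    for(int i=0; i<n; i++) {
        b[i] = 3.14159f*a[i];
    }
}

// ICC only: __assume_aligned.
void foo_icc(float * __restrict a, float * __restrict b, int n)
{
    __assume_aligned(a, 16);
    __assume_aligned(b, 16);
    for(int i=0; i<n; i++) {
        b[i] = 3.14159f*a[i];
    }
}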
For example, with GCC, compiling this simple loop
void foo(float * __restrict a, float * __restrict b, int n)
{
    //a = (float*)__builtin_assume_aligned (a, 16);
    //b = (float*)__builtin_assume_aligned (b, 16);
    for(int i=0; i<(n & (-4)); i++) {
        b[i] = 3.14159f*a[i];
    }
}
with gcc -O3 -march=nehalem -S test.c, and then running wc test.s, gives 160 lines. Whereas if I use __builtin_assume_aligned, wc test.s gives only 45 lines. When I did this with Clang, it returned 110 lines in both cases.
So with Clang, informing the compiler the arrays were aligned made no difference (in this case), but with GCC it did. Counting lines of assembly is not a sufficient metric to gauge performance, and I'm not going to post all the assembly here; I just want to illustrate that your compiler may produce very different code when it is told the arrays are aligned.
Of course, the additional overhead GCC has from not assuming the arrays are aligned may make no difference in practice. You have to test and see.
In any case, if you want to get the most from SIMD, I would not rely on the compiler to do it correctly (especially with MSVC). Your example of matrix*vector is a poor one in general (but maybe not for some special cases) since it's memory-bandwidth bound. But if you choose matrix*matrix, no compiler is going to optimize that well without a lot of help which does not conform to the C++ standard. In these cases you will need intrinsics/built-ins/assembly, in which you have explicit control of the alignment anyway.
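To illustrate what that explicit control looks like, here is a minimal SSE-intrinsics sketch (my own example, with my own function name, not code from the question): the programmer, not the compiler, picks the aligned or unaligned load/store.

#include <immintrin.h>

// Sketch: scale an array with SSE intrinsics. _mm_load_ps/_mm_store_ps require
// 16-byte aligned pointers; _mm_loadu_ps/_mm_storeu_ps work with any alignment.
// Assumes a and b are 16-byte aligned and n is a multiple of 4.
void scale4(const float * __restrict a, float * __restrict b, int n)
{
    __m128 k = _mm_set1_ps(3.14159f);
    for(int i=0; i<n; i+=4) {
        __m128 x = _mm_load_ps(&a[i]);           // aligned load
        _mm_store_ps(&b[i], _mm_mul_ps(k, x));   // multiply and aligned store
    }
}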
Edit:
The assembly from GCC contains a lot of extraneous lines which are not part of the text segment. Compiling with gcc -O3 -march=nehalem -c test.c, then disassembling with objdump -d and counting the lines in the text (code) segment, gives 108 lines without __builtin_assume_aligned and only 16 lines with it. This shows more clearly that GCC produces very different code when it assumes the arrays are aligned.
Edit:
I went ahead and tested the foo function above in MSVC 2013. It produces unaligned loads, and the code is much shorter than GCC's (I only show the main loop here):
$LL3@foo:
        movsxd  rax, r9d
        vmulps  xmm1, xmm0, XMMWORD PTR [r10+rax*4]
        vmovups XMMWORD PTR [r11+rax*4], xmm1
        lea     eax, DWORD PTR [r9+4]
        add     r9d, 8
        movsxd  rcx, eax
        vmulps  xmm1, xmm0, XMMWORD PTR [r10+rcx*4]
        vmovups XMMWORD PTR [r11+rcx*4], xmm1
        cmp     r9d, edx
        jl      SHORT $LL3@foo
This should be fine on processors since Nehalem (late 2008). But MSVC still has cleanup code for arrays whose length is not a multiple of four, even though I told the compiler it was a multiple of four (n & (-4)). At least GCC gets that right.
Since AVX can fold unaligned loads, I checked GCC with AVX to see if the code was the same.
void foo(float * __restrict a, float * __restrict b, int n)
{
    //a = (float*)__builtin_assume_aligned (a, 32);
    //b = (float*)__builtin_assume_aligned (b, 32);
    for(int i=0; i<(n & (-8)); i++) {
        b[i] = 3.14159f*a[i];
    }
}
Without __builtin_assume_aligned, GCC produces 168 lines of assembly; with it, only 17 lines.