Vectorization in GCC is enabled at -O3
. That's why at -O0
, you see only the ordinary scalar SSE2 instructions (movsd
, addsd
, etc). Using GCC 4.6.1 and your second example:
#define N 10000
#define NTIMES 100000
double a[N] __attribute__ ((aligned (16)));
double b[N] __attribute__ ((aligned (16)));
double c[N] __attribute__ ((aligned (16)));
double r[N] __attribute__ ((aligned (16)));
int
main (void)
{
int i, times;
for (times = 0; times < NTIMES; times++)
{
for (i = 0; i < N; ++i)
r[i] = (a[i] + b[i]) * c[i];
}
return 0;
}
and compiling with gcc -S -O3 -msse2 sse.c
produces for the inner loop the following instructions, which is pretty good:
.L3:
movapd a(%eax), %xmm0
addpd b(%eax), %xmm0
mulpd c(%eax), %xmm0
movapd %xmm0, r(%eax)
addl $16, %eax
cmpl $80000, %eax
jne .L3
As you can see, with the vectorization enabled GCC emits code to perform two loop iterations in parallel. It can be improved, though - this code uses the lower 128 bits of the SSE registers, but it can use the full the 256-bit YMM registers, by enabling the AVX encoding of SSE instructions (if available on the machine). So, compiling the same program with gcc -S -O3 -msse2 -mavx sse.c
gives for the inner loop:
.L3:
vmovapd a(%eax), %ymm0
vaddpd b(%eax), %ymm0, %ymm0
vmulpd c(%eax), %ymm0, %ymm0
vmovapd %ymm0, r(%eax)
addl $32, %eax
cmpl $80000, %eax
jne .L3
Note that v
in front of each instruction and that instructions use the 256-bit YMM registers, four iterations of the original loop are executed in parallel.
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…