compiler optimization - How to vectorize with gcc?

深蓝 · Answer 1 · 2021-10-23T17:44:38+0000

The original page offers details on getting gcc to automatically vectorize loops, including a few examples:

http://gcc.gnu.org/projects/tree-ssa/vectorization.html

The examples are great, but the syntax for those options has changed a bit in recent GCC releases; see:

https://gcc.gnu.org/onlinedocs/gcc/Developer-Options.html#index-fopt-info

In summary, the following options work for x86 chips with SSE2 and log the loops that have been vectorized. Current GCC produces the log with -fopt-info-vec-optimized; older releases used -ftree-vectorizer-verbose=5 instead:

gcc -O2 -ftree-vectorize -msse2 -mfpmath=sse -fopt-info-vec-optimized

Note that -msse is also a possibility, but it will only vectorize loops using floats, not doubles or ints. (SSE2 is baseline for x86-64. For 32-bit code use -mfpmath=sse as well. That's the default for 64-bit but not 32-bit.)

Modern versions of GCC (4.x and later) enable -ftree-vectorize at -O3, so just use that:

gcc -O3 -msse2 -mfpmath=sse -fopt-info-vec-optimized

(Clang enables auto-vectorization at -O2. ICC defaults to optimization enabled + fast-math.)
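
As a rough illustration of the kind of loop these flags target, here is a minimal sketch. The function name saxpy and the file name saxpy.c are hypothetical and not part of the original answer. Compiled with something like gcc -O3 -msse2 -fopt-info-vec-optimized -c saxpy.c, GCC should report the loop as vectorized, though the exact message differs between versions.

```c
#include <stddef.h>

/* Hypothetical example (not from the original answer):
 * computes y[i] = a * x[i] + y[i] for i in [0, n).
 *
 * Every iteration is independent, so the loop is a good candidate for
 * auto-vectorization. The restrict qualifiers promise that x and y do
 * not overlap, which can let the compiler skip a runtime overlap check. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Swapping -fopt-info-vec-optimized for -fopt-info-vec-missed asks GCC to list the loops it could not vectorize instead, which is often the more useful report.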

Most of the following was written by Peter Cordes, who could have just written a new answer. Over time, as compilers change, options and compiler output will change. I am not entirely sure whether it is worth tracking it in great detail here. Comments? -- Author

To also use instruction set extensions supported by the hardware you're compiling on, and tune for it, use -march=native.

Reduction loops (like the sum of an array) need OpenMP or -ffast-math so that the compiler can treat FP math as associative and vectorize. An example on the Godbolt compiler explorer, compiled with -O3 -march=native -ffast-math, includes a reduction (array sum) that stays scalar without -ffast-math. (Well, GCC 8 and later do a SIMD load and then unpack it to scalar elements, which is pointless compared with simple unrolling. The loop bottlenecks on the latency of the single addss dependency chain.)
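
The OpenMP route can be sketched as follows. The function name sum_array is hypothetical, and the sketch assumes compilation with -fopenmp-simd (or -fopenmp) in addition to -O3; without one of those flags GCC ignores the pragma and the loop remains a scalar dependency chain under strict FP rules.

```c
#include <stddef.h>

/* Hypothetical helper (not from the original answer): sum of n floats.
 *
 * The reduction clause permits the compiler to reorder the additions
 * for this one loop, so it may keep several partial sums in a vector
 * register and combine them at the end. Unlike -ffast-math, this does
 * not relax FP rules for the rest of the translation unit. */
float sum_array(const float *a, size_t n)
{
    float sum = 0.0f;
    #pragma omp simd reduction(+:sum)
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```

Because the additions may be reordered, the result can differ in the last bits from a strictly sequential sum when the inputs are not exactly representable.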

Sometimes you don't need -ffast-math: -fno-math-errno alone can help gcc inline math functions and vectorize code involving sqrt and/or rint / nearbyint.

Other useful options include -flto (link-time optimization for cross-file inlining, constant propagation, etc.) and/or profile-guided optimization: build with -fprofile-generate, do one or more test runs with realistic inputs, then rebuild with -fprofile-use. PGO enables loop unrolling for "hot" loops; in modern GCC that's off by default even at -O3.