TL;DR: Don't use the _mm256_zeroupper() intrinsic manually; compilers understand SSE/AVX transitions and emit vzeroupper where needed for you. (Including when auto-vectorizing or expanding memcpy/memset/whatever with YMM regs.)
"Some Intel processors" being all except Xeon Phi.
Xeon Phi (KNL / KNM) don't have a state optimized for running legacy SSE instructions because they're purely designed to run AVX-512. Legacy SSE instructions probably always have false dependencies merging into the destination.
On mainstream CPUs with AVX or later, there are two different mechanisms: saving dirty uppers (SnB through Haswell, and Ice Lake) or false dependencies (Skylake). See Why is this SSE code 6 times slower without VZEROUPPER on Skylake? for the two different styles of SSE/AVX penalty.
Related Q&As about the effects of asm vzeroupper (in the compiler-generated machine code):
Intrinsics in C or C++ source
You should pretty much never use _mm256_zeroupper() in C/C++ source code. Things have settled on having the compiler insert a vzeroupper instruction automatically where it might be needed, which is pretty much the only sensible way for compilers to be able to optimize functions containing intrinsics and still reliably avoid transition penalties. (Especially when considering inlining.) All the major compilers can auto-vectorize and/or inline memcpy/memset/array init with YMM registers, so they need to keep track of using vzeroupper after that.
The convention is to have the CPU in clean-uppers state when calling or returning, except when calling functions that take __m256 / __m256i / __m256d args by value (in registers or at all), or when returning such a value. The target function (callee or caller) inherently must be AVX-aware and expecting a dirty-upper state because a full YMM register is in use as part of the calling convention.
x86-64 System V passes vectors in vector regs. Windows vectorcall does, too, but the original Windows x64 convention (now named "fastcall" to distinguish it from "vectorcall") passes vector args by reference, via a hidden pointer to a copy in memory. (This optimizes for variadic functions by making every arg always fit in an 8-byte slot.) I don't know how compilers handle non-vectorcall calls on Windows: whether they assume the function probably looks at its args, or at least is still responsible for using a vzeroupper at some point even if it doesn't. Probably yes, but if you're writing your own code-gen back-end, or hand-written asm, have a look at what some compilers you care about actually do if this case is relevant for you.
Some compilers optimize by also omitting vzeroupper before returning from a function that took vector args, because clearly the caller is AVX-aware. And crucially, apparently compilers shouldn't expect that calling a function like void foo(__m256i) will leave the CPU in clean-upper state, so the caller does still need a vzeroupper after such a function, before calling printf or whatever.
Compilers have options to control vzeroupper usage
For example, GCC -mno-vzeroupper / clang -mllvm -x86-use-vzeroupper=0. (The default is -mvzeroupper, which gives the behaviour described above: using it where it might be needed.)
This is implied by -march=knl (Knights Landing) because vzeroupper isn't needed and is very slow on Xeon Phi CPUs (thus it should actively be avoided).
Or you might possibly want it if you build libc (and any other libraries you use) with -mavx -mno-vzeroupper. glibc has some hand-written asm for functions like strlen, but most of those have AVX2 versions. So as long as you're not on an AVX1-only CPU, legacy-SSE versions of string functions might not get used at all.
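To see the flag's effect concretely (a sketch assuming an x86-64 GCC is installed; zu_demo.c is a throwaway file), compile a tiny AVX function and grep the asm for vzeroupper:

```shell
command -v gcc >/dev/null 2>&1 || { echo "gcc not found; skipping"; exit 0; }

# Throwaway function that does real 256-bit work (it uses the high half),
# so the compiler can't narrow it down to 128-bit operations.
cat > /tmp/zu_demo.c <<'EOF'
#include <immintrin.h>
float sum_halves(const float *p) {
    __m256 v = _mm256_add_ps(_mm256_loadu_ps(p), _mm256_loadu_ps(p + 8));
    __m128 s = _mm_add_ps(_mm256_castps256_ps128(v),
                          _mm256_extractf128_ps(v, 1));
    return _mm_cvtss_f32(s);
}
EOF

# Default (-mvzeroupper is on): vzeroupper appears before the ret.
gcc -O2 -mavx -S -o - /tmp/zu_demo.c | grep -c vzeroupper

# With -mno-vzeroupper it is gone (grep finds nothing).
gcc -O2 -mavx -mno-vzeroupper -S -o - /tmp/zu_demo.c | grep -c vzeroupper || true
```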
For MSVC, you should definitely prefer using /arch:AVX when compiling code that uses AVX intrinsics. I think some versions of MSVC could generate code that caused transition penalties if you mixed __m128 and __m256 without /arch:AVX. But beware that that option will make even 128-bit intrinsics like _mm_add_ps use the AVX encoding (vaddps) instead of legacy SSE (addps), and will let the compiler auto-vectorize with AVX. There is an undocumented switch /d2vzeroupper to enable automatic vzeroupper generation (the default); /d2vzeroupper- disables it. See What is the /d2vzeroupper MSVC compiler optimization flag doing?