See also Set all bits in CPU register to 1 efficiently, which covers AVX, AVX2, and AVX512 zmm and k (mask) registers.
You obviously didn't even look at the asm output, which is trivial to do:
#include <immintrin.h>
__m256i all_ones(void) { return _mm256_set1_epi64x(-1); }
compiles with GCC and clang, for any -march that includes AVX2, to:
vpcmpeqd ymm0, ymm0, ymm0
ret
To get a __m256 (not __m256i) you can just cast the result:
__m256 nans = _mm256_castsi256_ps( _mm256_set1_epi32(-1) );
Without AVX2, a possible option is vcmptrueps dst, ymm0, ymm0, preferably with a cold register for the input to mitigate the false dependency.
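In intrinsics, that pattern can be expressed as a compare with an always-true predicate. A minimal sketch, assuming only AVX (the function name all_ones_ps is mine):

#include <immintrin.h>

// Sketch: all-ones without AVX2, via a compare whose predicate is always true.
// Whether this compiles to vcmptrueps or a constant load depends on the
// compiler and version; see the next paragraph.
__m256 all_ones_ps(void) {
    __m256 x = _mm256_setzero_ps();   // xor-zeroing is cheap and dependency-breaking
    return _mm256_cmp_ps(x, x, _CMP_TRUE_UQ);
}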
Recent clang (5.0 and later) does xor-zero a vector then vcmpps with a TRUE predicate if AVX2 isn't available. Older clang makes a 128-bit all-ones with vpcmpeqd xmm and uses vinsertf128. GCC loads from memory, even modern GCC 10.1 with -march=sandybridge.
As described in the vector section of Agner Fog's optimizing assembly guide, generating constants on the fly this way is cheap. It still takes a vector execution unit to generate the all-ones (unlike _mm_setzero), but it's better than any possible two-instruction sequence, and usually better than a load. See also the x86 tag wiki.
Compilers don't like to generate more complex constants on the fly, even ones that could be generated from all-ones with a simple shift. Even if you try, by writing __m128i float_signbit_mask = _mm_srli_epi32(_mm_set1_epi16(-1), 1), compilers typically do constant-propagation and put the vector in memory. This lets them fold it into a memory operand when used later, in cases where there's no loop to hoist the constant out of.
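For instance, that shift produces 0x7FFFFFFF in each element, which clears the sign bit when ANDed. A minimal sketch of using it (fabs_ps is a hypothetical name; compilers will typically turn the mask into a 16-byte load anyway):

#include <immintrin.h>

// Sketch: a constant derivable from all-ones with one shift.
// 0x7FFFFFFF per element; ANDing it clears the sign bit (i.e. fabs).
__m128 fabs_ps(__m128 x) {
    __m128i mask = _mm_srli_epi32(_mm_set1_epi16(-1), 1);  // constant-propagated to memory
    return _mm_and_ps(x, _mm_castsi128_ps(mask));
}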
And I can't seem to find a simple bitwise NOT operation in AVX?
You do that by XORing with all-ones, using vxorps (_mm256_xor_ps). Unfortunately SSE/AVX don't provide a way to do a NOT without a vector constant.
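Putting that together, a NOT for a __m256 might look like this; a minimal sketch assuming AVX2 (not_ps is a name I made up):

#include <immintrin.h>

// Sketch: bitwise NOT by XORing with an all-ones vector.
// With AVX2 the all-ones compiles to vpcmpeqd ymm, ymm, ymm.
__m256 not_ps(__m256 v) {
    __m256 ones = _mm256_castsi256_ps(_mm256_set1_epi32(-1));
    return _mm256_xor_ps(v, ones);
}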
FP vs Integer instructions and bypass delay
Intel CPUs (at least Skylake) have a weird effect where the extra bypass latency between SIMD-integer and SIMD-FP still happens long after the uop producing the register has executed. e.g. vmulps ymm1, ymm2, ymm0 could have an extra cycle of latency for the ymm2 -> ymm1 critical path if ymm0 was produced by vpcmpeqd. And this lasts until the next context switch restores FP state, if you don't otherwise overwrite ymm0.
This is not a problem for bitwise instructions like vxorps (even though the mnemonic has ps, it doesn't have bypass delay from FP or vec-int domains on Skylake, IIRC).
So normally it's safe to create a set1(-1) constant with an integer instruction, because that bit pattern is a NaN and you wouldn't normally use it with FP math instructions like mul or add.
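To illustrate that safe pattern, here's a minimal sketch (not_lt is a hypothetical name) where the integer-domain all-ones only ever feeds a bitwise vxorps, never an FP add or mul:

#include <immintrin.h>

// Sketch: invert an FP compare mask by XORing with all-ones.
// The all-ones is created in the integer domain (vpcmpeqd), but it's
// only consumed by vxorps, a bitwise op, so bypass delay isn't a concern.
__m256 not_lt(__m256 a, __m256 b) {
    __m256 lt   = _mm256_cmp_ps(a, b, _CMP_LT_OQ);       // a < b mask
    __m256 ones = _mm256_castsi256_ps(_mm256_set1_epi32(-1));
    return _mm256_xor_ps(lt, ones);   // !(a < b); equals a >= b for non-NaN inputs
}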