In SSE/AVX programming, selective copying from one vector to another based on a mask is called a blend. SSE4.1 added instructions like PBLENDVB xmm1, xmm2/m128, <XMM0>, where the implicit operand XMM0 controls which bytes of the src overwrite the corresponding bytes in the dst (the high bit of each mask byte does the selecting). (Without SSE4.1, you'd usually AND and ANDNOT the mask onto the two vectors and OR the results together; the xor trick has less instruction-level parallelism, and probably requires at least as many MOV instructions to copy registers.)
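To make the AND/ANDNOT/OR vs. xor comparison concrete, here's a minimal scalar sketch on a 64-bit integer. The function names are made up for illustration; with SSE2 the same steps would map onto _mm_and_si128, _mm_andnot_si128, _mm_or_si128 and _mm_xor_si128 on 128-bit vectors.

```c
#include <stdint.h>

/* AND/ANDNOT/OR blend: take bits from src where mask is 1, else from dst.
   The two ANDs don't depend on each other, so they can run in parallel. */
static uint64_t blend_and_or(uint64_t dst, uint64_t src, uint64_t mask)
{
    uint64_t from_src = src & mask;   /* like pand  */
    uint64_t from_dst = dst & ~mask;  /* like pandn */
    return from_src | from_dst;       /* like por   */
}

/* xor trick: same result, but every step depends on the previous one,
   so the dependency chain is three operations long. */
static uint64_t blend_xor(uint64_t dst, uint64_t src, uint64_t mask)
{
    return dst ^ ((dst ^ src) & mask);
}
```

Both functions return the same value for any inputs; the difference is only in how the operations can be scheduled.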
There's also an immediate blend instruction, pblendw, where the mask is an 8-bit immediate instead of a register. And there are 32-bit and 64-bit immediate blends (blendps, blendpd, vpblendd) and variable blends (blendvps, blendvpd).
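As a rough scalar model of the difference between an immediate blend and a variable blend, something like the following should capture the per-element selection. The helper names are hypothetical, and this only models the 128-bit pblendw and pblendvb forms; the real instructions need the immediate to be a compile-time constant.

```c
#include <stdint.h>

/* Model of pblendw: bit i of imm8 set means word i comes from src,
   otherwise word i of dst is left unchanged. */
static void blendw_model(uint16_t dst[8], const uint16_t src[8], uint8_t imm8)
{
    for (int i = 0; i < 8; i++) {
        if ((imm8 >> i) & 1)
            dst[i] = src[i];
    }
}

/* Model of pblendvb: the high bit of each mask byte selects src for that byte. */
static void blendvb_model(uint8_t dst[16], const uint8_t src[16], const uint8_t mask[16])
{
    for (int i = 0; i < 16; i++) {
        if (mask[i] & 0x80)
            dst[i] = src[i];
    }
}
```

Note that only the top bit of each mask byte matters in the variable version, so it is an element-wise blend and not a bitwise one.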
IDK if other SIMD instruction sets (NEON, AltiVec, whatever MIPS calls theirs, etc.) also call them "blends" or not.
SSE/AVX (or x86 integer instructions) don't provide anything better than the usual bitwise XOR/AND for doing bitwise (instead of element-wise) blends until AVX512F.
AVX512F can do the bitwise version of this (or any other bitwise ternary function) with a single vpternlogd or vpternlogq instruction. (The only difference between the d and q element sizes is if you use a mask register for merge-masking or zero-masking the destination, but that didn't stop Intel from making separate intrinsics even for the no-mask case:
__m512i _mm512_ternarylogic_epi32 (__m512i a, __m512i b, __m512i c, int imm8)
and the equivalent ..._epi64 version.)
The imm8 immediate byte is a truth table. Every bit of the destination is determined independently from the corresponding bits of a, b and c, by using them as a 3-bit index into the truth table, i.e. as imm8[a:b:c].
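The truth-table lookup is easy to model in plain C, which also gives a way to work out the immediate for a bitwise blend. This is a sketch of the semantics for one 32-bit element, not the intrinsic itself; ternlog32 and the TL_* constants are names invented here.

```c
#include <stdint.h>

/* Scalar model of vpternlogd on one 32-bit element: for each bit position,
   the bits of a, b and c form a 3-bit index (a is the most significant)
   into the 8-entry truth table held in imm8. */
static uint32_t ternlog32(uint32_t a, uint32_t b, uint32_t c, uint8_t imm8)
{
    uint32_t result = 0;
    for (int bit = 0; bit < 32; bit++) {
        unsigned idx = (((a >> bit) & 1) << 2)
                     | (((b >> bit) & 1) << 1)
                     |  ((c >> bit) & 1);
        result |= (uint32_t)((imm8 >> idx) & 1) << bit;
    }
    return result;
}

/* Each constant has bit idx set exactly when that input is 1 for index idx,
   so evaluating a boolean expression on them yields its truth table. */
enum { TL_A = 0xF0, TL_B = 0xCC, TL_C = 0xAA };

/* Bitwise blend "a ? b : c", with a acting as the mask: works out to 0xCA. */
enum { TL_BLEND = (TL_A & TL_B) | (~TL_A & TL_C) };
```

With that immediate, ternlog32(mask, x, y, TL_BLEND) takes bits from x where mask is 1 and from y elsewhere, which is the whole AND/ANDNOT/OR sequence in one operation.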
AVX512 will be fun to play with when it eventually appears in mainstream desktop/laptop CPUs, but that's probably a couple years away still.