x86 - What is the difference between _mm512_load_epi32 and _mm512_load_si512?

Question


深蓝 · Answer 1 · 2021-10-23T21:28:51+0000

There's no difference, it's just silly redundant naming. Use _mm512_load_si512 for clarity. Thanks, Intel. As usual, it's easier to understand the underlying asm for AVX512, and then you can see what the clumsy intrinsic naming is trying to say. Or at least you can understand how we ended up with this mess of different documentation suggesting _mm512_load_epi32 vs. _mm512_load_si512.

Almost all AVX512 instructions support merge-masking and zero-masking (e.g. vmovdqa32 can do a masked load like vmovdqa32 zmm0{k1}{z}, [rdi] to zero the vector elements where k1 has a zero bit). Masking works per element, which is why different element-size versions of things like vector loads and bitwise operations exist (e.g. vpxord vs. vpxorq).
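To make the two masking modes concrete, here is a minimal sketch in C, assuming GCC or clang on x86-64. The helper names (masked_loads, merge_vs_zero_masking_demo), the mask value 0x00FF and the fill value -1 are made up for illustration. The demo returns -1 when the CPU lacks AVX-512F, so it can be built without -mavx512f and run anywhere.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: load 16 dwords from src under mask k, once with
   merge-masking and once with zero-masking, and store both results. */
__attribute__((target("avx512f")))
static void masked_loads(const int32_t *src, __mmask16 k,
                         int32_t *merged, int32_t *zeroed)
{
    assert(src && merged && zeroed);
    __m512i fallback = _mm512_set1_epi32(-1);
    /* Merge-masking: elements whose mask bit is 0 keep the value from fallback. */
    __m512i m = _mm512_mask_loadu_epi32(fallback, k, src);
    /* Zero-masking: elements whose mask bit is 0 become 0. */
    __m512i z = _mm512_maskz_loadu_epi32(k, src);
    _mm512_storeu_si512(merged, m);
    _mm512_storeu_si512(zeroed, z);
}

/* Returns 1 if both modes behave as described, 0 if not,
   and -1 if the CPU has no AVX-512F. */
int merge_vs_zero_masking_demo(void)
{
    if (!__builtin_cpu_supports("avx512f"))
        return -1;

    int32_t src[16], merged[16], zeroed[16];
    for (int i = 0; i < 16; i++)
        src[i] = 100 + i;

    masked_loads(src, 0x00FF, merged, zeroed);   /* low 8 mask bits set */

    for (int i = 0; i < 16; i++) {
        int selected = i < 8;
        if (merged[i] != (selected ? src[i] : -1)) return 0;
        if (zeroed[i] != (selected ? src[i] : 0))  return 0;
    }
    return 1;
}
```

The mask is 16 bits wide here only because the elements are 32-bit. The same 512-bit load with 64-bit elements would take an 8-bit mask, which is the whole reason the element size appears in the instruction and intrinsic names.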

But these intrinsics are for the no-masking version. The element-size is totally irrelevant. I'm guessing _mm512_load_epi32 exists for consistency with _mm512_mask_load_epi32 (merge-masking) and _mm512_maskz_load_epi32 (zero-masking). See the docs for the vmovdqa32 asm instruction.
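A quick way to convince yourself of this is to load the same memory with both intrinsics and compare the bytes. The following is a sketch, assuming GCC or clang on x86-64. The function names are hypothetical, and the wrapper returns -1 instead of executing AVX-512 code on a CPU without AVX-512F.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdalign.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 if _mm512_load_epi32 and _mm512_load_si512
   produce the same 64 bytes from src, 0 otherwise. */
__attribute__((target("avx512f")))
static int loads_match_avx512(const int32_t *src)
{
    /* Both intrinsics are aligned loads, so src must be 64-byte aligned. */
    assert(((uintptr_t)src & 63) == 0);

    alignas(64) int32_t a[16], b[16];
    __m512i va = _mm512_load_epi32(src);   /* element-width name */
    __m512i vb = _mm512_load_si512(src);   /* whole-vector name  */
    _mm512_store_si512(a, va);
    _mm512_store_si512(b, vb);
    return memcmp(a, b, sizeof a) == 0;
}

/* Returns 1 if the loads match, 0 if they differ,
   and -1 if the CPU has no AVX-512F. */
int unmasked_loads_match(void)
{
    if (!__builtin_cpu_supports("avx512f"))
        return -1;

    alignas(64) int32_t src[16];
    for (int i = 0; i < 16; i++)
        src[i] = i * 1000 - 7;
    return loads_match_avx512(src);
}
```

Which instruction the compiler actually emits for either intrinsic (vmovdqa32, vmovdqa64, or a load folded into another instruction's memory operand) is up to the compiler. The bytes that end up in the register are the same.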

e.g. _mm512_maskz_loadu_epi64(0x55, x) zeros the odd elements for free while loading. (At least it's free if the cost of putting 0x55 into a k register can be hoisted out of a loop, and if we haven't defeated the compiler's chance to fold the load into a memory operand for an ALU instruction.)
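The 0x55 example can be checked directly. This is a minimal sketch, assuming GCC or clang on x86-64. The function names are hypothetical, and the checker returns -1 when AVX-512F is not available at run time.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: zero-masked load of 8 qwords with mask 0x55
   (binary 01010101), so only elements 0, 2, 4 and 6 are loaded. */
__attribute__((target("avx512f")))
static void maskz_load_0x55(const int64_t *src, int64_t *dst)
{
    assert(src && dst);
    __m512i v = _mm512_maskz_loadu_epi64(0x55, src);
    _mm512_storeu_si512(dst, v);
}

/* Returns 1 if the even elements were loaded and the odd ones zeroed,
   0 otherwise, and -1 if the CPU has no AVX-512F. */
int maskz_zeroes_odd_elements(void)
{
    if (!__builtin_cpu_supports("avx512f"))
        return -1;

    int64_t src[8], dst[8];
    for (int i = 0; i < 8; i++)
        src[i] = 100 + i;

    maskz_load_0x55(src, dst);

    for (int i = 0; i < 8; i++) {
        int64_t expected = (i % 2 == 0) ? src[i] : 0;
        if (dst[i] != expected)
            return 0;
    }
    return 1;
}
```

With _mm512_maskz_loadu_epi32 the same 0x55 constant would be interpreted as a 16-bit mask over dword elements, so it would keep only elements 0, 2, 4 and 6 and zero everything else, including all of the upper 8 dwords. Here the element width in the name does change the result.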

When elements are all loaded into the destination unchanged, element boundaries are meaningless. That's why AVX2 and earlier don't have different element-size versions of bitwise booleans like _mm_xor_si128 and loads/stores like _mm_load_si128.

Some compilers don't support the element-width names for unaligned unmasked loads. e.g. current gcc (at the time of writing) doesn't support _mm512_loadu_epi64, even though it has supported _mm512_load_epi64 since the first gcc version to support AVX512 intrinsics at all. (See error: '_mm512_loadu_epi64' was not declared in this scope)

There are no CPUs where the choice of vmovdqa64 vs. vmovdqa32 matters at all for efficiency, so there's zero point in trying to hint the compiler to use one or the other, regardless of the natural element width of your data.

Only FP vs. integer might matter for loads, and Intel's intrinsics already use different types (__m512 vs. __m512i) for that.
