AVX2 has lots of good stuff. For example, it has plenty of instructions that are pretty much strictly more powerful than their precursors. Take VPERMD: it lets you arbitrarily broadcast, shuffle or permute from one 256-bit vector of 32-bit values into another, with the permutation selectable at runtime [1]. Functionally, that obsoletes a whole slew of older unpack, broadcast, permute, shuffle and shift instructions [3].
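To make the VPERMD behavior concrete, here is a minimal scalar sketch in C of what the instruction computes. The function name `vpermd_model` is made up for illustration; in real code the corresponding intrinsic is `_mm256_permutevar8x32_epi32`.

```c
#include <stdint.h>

/* Scalar model of VPERMD: each of the 8 output dwords may pick ANY of the
   8 input dwords; only the low 3 bits of each index are used.
   dst must not alias src. */
void vpermd_model(uint32_t dst[8], const uint32_t src[8], const uint32_t idx[8])
{
    for (int i = 0; i < 8; i++)
        dst[i] = src[idx[i] & 7];
}
```

With indices 7,6,...,0 this reverses the vector, and with eight identical indices it broadcasts one element, which is why a single instruction can stand in for so many special-purpose ones.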
Cool beans.
So where is VPERMB? I.e., the same instruction, but working on byte-sized elements. Or, for that matter, where is VPERMW, for 16-bit elements? Having dabbled in x86 assembly for some time, it is pretty clear to me that the SSSE3 PSHUFB instruction is among the most useful instructions of all time. It can do any possible byte-wise permutation, broadcast or shuffle within a 16-byte vector. Furthermore, it can also be used to do 16 parallel 4-bit -> 8-bit table lookups [2].
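The table-lookup trick is easiest to see in a scalar sketch. The model below follows the documented 128-bit PSHUFB semantics; the function names and the choice of a nibble-popcount table are only illustrative assumptions, not anything specific from this question.

```c
#include <stdint.h>

/* Scalar model of the 128-bit PSHUFB: if the high bit of an index byte is
   set the output byte is zero, otherwise the low 4 bits select a source byte.
   dst must not alias src. */
void pshufb_model(uint8_t dst[16], const uint8_t src[16], const uint8_t idx[16])
{
    for (int i = 0; i < 16; i++)
        dst[i] = (idx[i] & 0x80) ? 0 : src[idx[i] & 0x0F];
}

/* 4-bit -> 8-bit lookup: the "source" operand holds a 16-entry table
   (here: the number of set bits in each 4-bit value) and the "shuffle mask"
   operand holds the 16 nibbles to look up. */
void nibble_popcount(uint8_t counts[16], const uint8_t nibbles[16])
{
    static const uint8_t table[16] = {0, 1, 1, 2, 1, 2, 2, 3,
                                      1, 2, 2, 3, 2, 3, 3, 4};
    pshufb_model(counts, table, nibbles);
}
```

The roles of the two operands are simply swapped relative to a shuffle: the data becomes the index, and the constant becomes the thing being indexed.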
Unfortunately, PSHUFB wasn't extended to be cross-lane in AVX2, so it is restricted to shuffling within each 128-bit lane. The VPERM instructions are able to shuffle across lanes (in fact, "perm" and "shuf" seem to be synonyms in the instruction mnemonics?), yet the 8-bit and 16-bit versions were omitted.
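The difference between the in-lane and cross-lane behavior can be sketched in scalar C. The first function models the 256-bit AVX2 VPSHUFB; the second, `vpermb256_wished`, is purely hypothetical and just spells out what a cross-lane byte permute in AVX2 would compute.

```c
#include <stdint.h>

/* Scalar model of the 256-bit AVX2 VPSHUFB: two independent 16-byte
   shuffles. An output byte can only come from its own 128-bit lane. */
void vpshufb256_model(uint8_t dst[32], const uint8_t src[32], const uint8_t idx[32])
{
    for (int i = 0; i < 32; i++) {
        int lane = i & 16;  /* 0 for the low lane, 16 for the high lane */
        dst[i] = (idx[i] & 0x80) ? 0 : src[lane + (idx[i] & 0x0F)];
    }
}

/* HYPOTHETICAL cross-lane byte permute (the "missing" instruction):
   every output byte may come from any of the 32 input bytes. */
void vpermb256_wished(uint8_t dst[32], const uint8_t src[32], const uint8_t idx[32])
{
    for (int i = 0; i < 32; i++)
        dst[i] = src[idx[i] & 31];
}
```

Reversing all 32 bytes shows the gap: the hypothetical version does it with indices 31,30,...,0, while the in-lane version can only reverse each 16-byte half separately.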
There doesn't even seem to be a good way to emulate a cross-lane byte shuffle, whereas you can easily emulate the larger-width shuffles with smaller-width ones (often, it's even free: you just need a different mask).
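As a sketch of the "different mask" point, the helper below expands four dword indices into the 16-byte mask that makes a byte shuffle behave like a dword shuffle within one 128-bit lane. The name `dword_idx_to_byte_mask` is hypothetical.

```c
#include <stdint.h>

/* Expand four dword indices (0..3) into a 16-byte PSHUFB-style mask:
   output dword i takes its 4 bytes from source dword idx[i]. */
void dword_idx_to_byte_mask(uint8_t mask[16], const uint8_t idx[4])
{
    for (int i = 0; i < 4; i++)
        for (int k = 0; k < 4; k++)
            mask[4 * i + k] = (uint8_t)(4 * (idx[i] & 3) + k);
}
```

Since the mask can be computed once ahead of time, the narrower shuffle costs nothing extra at the point of use. The reverse direction has no such trick, because a dword shuffle cannot move individual bytes.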
I have no doubt that Intel is aware of the wide and heavy use of PSHUFB, so the question naturally arises as to why the byte variant was omitted in AVX2. Is the operation intrinsically harder to implement in hardware? Are there encoding restrictions forcing its omission?
[1] By selectable at runtime, I mean that the mask that defines the shuffling behavior comes from a register. This makes the instruction an order of magnitude more flexible than the earlier variants that take an immediate shuffle mask, in the same way that add is more flexible than inc or a variable shift is more flexible than an immediate shift.
[2] Or 32 such lookups in AVX2.
[3] The older instructions are occasionally useful if they have a shorter encoding, or avoid loading a mask from memory, but functionally they are superseded.