In the special case of left-shift-by-one, you can use paddb xmm0, xmm0.
As Jester points out in comments, the best option to emulate the non-existent psrlb and psllb is to use a wider shift and then mask off any bits that crossed element boundaries. e.g.
psrlw xmm0, 2 ; doesn't matter what size (w/d/q): performance is the same for all sizes on all CPUs
pand xmm0, [mask_right2]
section .rodata
align 16
;; required mask depends on the shift count
mask_right2: times 16 db 0xff >> 2    ; 16 bytes of 0x3f
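The same shift-and-mask idea as intrinsics, a minimal sketch (the helper name srli_epi8 is mine; count must be a compile-time constant, and this header is assumed by all the C sketches below):

#include <immintrin.h>

// Emulate the missing psrlb: word-granularity shift, then clear the bits
// that crossed into each byte from its neighbour.
static inline __m128i srli_epi8(__m128i v, const int count) {
    __m128i shifted = _mm_srli_epi16(v, count);
    __m128i mask    = _mm_set1_epi8((char)(0xFFu >> count));
    return _mm_and_si128(shifted, mask);
}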
Or broadcast 0x3f into a vector register ahead of a loop some other way, like vpbroadcastd or vbroadcastss from a dword in memory, SSE3 movddup from a qword, or just a movdqa vector load. (vpbroadcastb takes an extra ALU uop, unlike dword or wider broadcasts which are just simple loads.) Or generate it on the fly with a sequence like pcmpeqd xmm0,xmm0 / psrlw xmm0, 8+2 / packuswb xmm0,xmm0. With the right choice of shift count, you can generate any 2^n - 1 pattern of bytes (repeated zeros and then repeated ones).
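As intrinsics, generating set1_epi8(0x3f) on the fly might look like this (a sketch of the pcmpeqd / psrlw / packuswb sequence above):

__m128i allones = _mm_set1_epi32(-1);             // compilers materialize this with pcmpeqd
__m128i words   = _mm_srli_epi16(allones, 8 + 2); // 0x003f in each word
__m128i mask    = _mm_packus_epi16(words, words); // 0x3f in each byte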
mov r32, imm32 / movd xmm, r32 and a shuffle is also an option, but probably won't save instruction bytes compared to the pcmpeqw / ... sequence. (Note that the register-source version of VBROADCASTSS is AVX2-only, which doesn't matter here since 256b integer shifts are also AVX2-only.)
For a variable-count vector shift, creating the mask in an integer register and broadcasting it to a vector is one option: use pshufb with an all-zero register to broadcast the low byte, or use imul eax, eax, 0x01010101 to go from a byte to a dword for movd + pshufd. You could also use the pcmpeqd method to create an all-ones vector, shift it with a variable-count psrlw xmm0, xmm1, and then pack or pshufb.
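A sketch of the variable-count version with intrinsics, combining a psrlw xmm, xmm word shift with the imul / movd / pshufd mask broadcast (the function name is mine):

// Variable-count logical right-shift of each byte.
static inline __m128i srl_epi8_var(__m128i v, unsigned count) {
    __m128i vcount  = _mm_cvtsi32_si128(count);        // shift count for psrlw xmm, xmm
    __m128i shifted = _mm_srl_epi16(v, vcount);
    unsigned m = (0xFFu >> count) * 0x01010101u;       // imul trick: byte -> dword
    __m128i mask = _mm_shuffle_epi32(_mm_cvtsi32_si128(m), 0);  // broadcast the dword
    return _mm_and_si128(shifted, mask);
}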
I don't see any similarly efficient way to emulate arithmetic right-shift (the non-existent PSRAB). The high byte of each word is handled correctly by PSRAW. Shifting the low byte of each word to the high position would let another PSRAW copy its sign bit as many times as required.
;; vpblendvb is 2 uops on Intel so this is worse throughput in loops than the pxor/paddb version
;; Latency may be the same on Skylake because this has some ILP.
; input in xmm0. Using AVX to save on mov instructions
VPSLLDQ xmm1, xmm0, 1 ; or VPSLLW xmm1, xmm0, 8; VPSLLDQ distributes one of the uops to the shuffle port
VPSRAW xmm1, xmm1, 8+2 ; shift low bytes back to final destination
VPSRAW xmm0, xmm0, 2 ; shift high bytes, leaving garbage in low bytes
VPBLENDVB xmm0, xmm1, xmm0, xmm2 ; xmm2 = alternating 0 and -1 bytes (0xFF00 per word, e.g. from pcmpeqw / psllw 8). This insn is fairly slow
There is no immediate-blend with byte granularity, because a single immediate byte can only encode 8 elements, and a 128-bit vector has 16 bytes.
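As intrinsics, the blend version might look like this for count = 2 (a sketch using the VPSLLW-by-8 alternative from the comment above; the function name is mine, and _mm_blendv_epi8 requires SSE4.1):

static inline __m128i srai_2_epi8_blendv(__m128i v) {
    __m128i lo = _mm_slli_epi16(v, 8);               // low bytes to the high position
    lo = _mm_srai_epi16(lo, 8 + 2);                  // shifted low bytes, back in the low position
    __m128i hi = _mm_srai_epi16(v, 2);               // high bytes correct, garbage in low bytes
    __m128i himask = _mm_set1_epi16((short)0xFF00);  // -1 in high bytes, 0 in low bytes
    return _mm_blendv_epi8(lo, hi, himask);          // high bytes from hi, low bytes from lo
}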
Without VPBLENDVB (possibly better even when it's available, if generating or loading a constant for it is slow):
;; Needs no vector constants, but it's inefficient vs. versions with constants:
;; probably worse than the PXOR/PADDB version below, if its 2 constants are cheap to load
VPSLLDQ xmm1, xmm0, 1 ; or VPSLLW 8
VPSRAW xmm1, xmm1, n ; low bytes in the wrong place
VPSRAW xmm0, xmm0, 8+n ; shift high bytes all the way to the bottom of the element
VPSLLW xmm0, xmm0, 8 ; high bytes back in place, with zero in the low byte. (VPSLLDQ can't work: PSRAW 8+n leaves garbage we need to clear)
VPSRLW xmm1, xmm1, 8 ; shift low bytes into place, leaving zero in the high byte. (VPSRLDQ 1 could do this, if we started with VPSLLW instead of VPSLLDQ)
VPOR xmm0, xmm0, xmm1
Using PAND/PANDN/POR with a constant (alternating 0/-1 bytes) in a register would also work (with far less pressure on the shift port) for doing a byte-blend, and is a better choice if you have to do this in a loop.
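A sketch of that AND/ANDN/OR byte-blend as intrinsics (the helper name is mine; blendmask holds the alternating 0 / -1 bytes):

static inline __m128i byte_blend(__m128i lo, __m128i hi, __m128i blendmask) {
    return _mm_or_si128(_mm_and_si128(blendmask, hi),     // bytes where mask = -1
                        _mm_andnot_si128(blendmask, lo)); // bytes where mask = 0
}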
Sign-extending a narrow value into the rest of a byte:
Assuming each byte is zero-extended, e.g. after unpacking nibbles to bytes with AND + shift/AND. (Works for any field width, just adjust the constants.)
Flip the high zeros and the sign bit with XOR, then add 1 at the sign-bit position. That restores the correct sign bit, and either clears the high bits via carry propagation (if the sign bit became 0 and carried out) or leaves them set (if it became 1 and didn't carry).
; hoist the constants out of a loop if you're looping, of course.
; input in XMM0, upper bits of each byte already zeroed
pxor xmm0, [const_0xf8] ; 1111 s'xxx
paddb xmm0, [const_0x08] ; 0000 0xxx or 1111 1xxx
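The same trick as intrinsics for a 4-bit field (a sketch; the function name is mine, and the constants change with the field width as noted above):

static inline __m128i sext4_epi8(__m128i v) {
    v = _mm_xor_si128(v, _mm_set1_epi8((char)0xf8));   // 1111 s'xxx
    return _mm_add_epi8(v, _mm_set1_epi8(0x08));       // 0000 0xxx or 1111 1xxx
}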
Using that to emulate the missing psrab
This is still possible with only 2 constants from memory, and is most likely the best option for a loop, especially if you have registers to spare to hoist the loads of those constants. (0xf0 can be used with vpandn to isolate a low nibble if you also need that.)
psrld xmm0, 4 ; ???? sxxx (s = sign bit, xxx = lower bits)
por xmm0, xmm5 ; set1_epi8(0xf0) ; 1111 sxxx
pxor xmm0, xmm6 ; set1_epi8(0x08) ; 1111 s'xxx
paddb xmm0, xmm6 ; set1_epi8(0x08) ; 0000 0xxx or 1111 1xxx
I don't think we can avoid using 2 separate boolean operations: we need PXOR to counter PADDB or PSUBB flipping the sign bit, but only POR can set bits regardless of their old value.
We could isolate the sign bit and left-shift it before adding or subtracting (pand + pslld + paddb), but that would be worse, especially without AVX 3-operand instructions to avoid movdqa, and it would be more total instructions including the POR we'd still need.
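Putting the whole two-constant version together as intrinsics (a sketch; the function name is mine, and a compiler will hoist the set1 constants out of a loop):

static inline __m128i srai_4_epi8_2const(__m128i v) {
    v = _mm_srli_epi32(v, 4);                          // ???? sxxx
    v = _mm_or_si128(v, _mm_set1_epi8((char)0xf0));    // 1111 sxxx
    v = _mm_xor_si128(v, _mm_set1_epi8(0x08));         // 1111 s'xxx
    return _mm_add_epi8(v, _mm_set1_epi8(0x08));       // 0000 0xxx or 1111 1xxx
}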
Upsides:
- Simple instructions that can run on any vector ALU port.
- Fewer uops on Intel than the vpblendvb version.
Downsides:
- No ILP (instruction-level parallelism), so maybe not better latency than the vpblendvb version, especially on AMD Zen / Zen 2 where vpblendvb is a single-uop instruction with only 1c latency.
- Needs 2 vector constants.
Sign-extension for fields <= 4 bits using PSHUFB table lookup
Instead of pxor / paddb, use pshufb to look up a new value for each byte, based on the low 4 bits. Unfortunately, pshufb zeros a lane if the selector has its high bit set, so we can't use it on raw psrld results that might have shifted in a non-zero high bit.
const __m128i sext_lut = _mm_setr_epi8( 0, 1, 2, 3, 4, 5, 6, 7,
-8, -7, -6, -5, -4, -3, -2, -1);
return _mm_shuffle_epi8(sext_lut, v);
With AVX for 3-operand non-destructive instructions, this can be a single instruction reusing a lookup table in a register. Without AVX, it'll need a movdqa to copy the LUT.
Shifting with this:
#include <immintrin.h>

__m128i srai_4_epi8(__m128i v) {
    v = _mm_srli_epi32(v, 4);                   // each byte's low 4 bits = its original high nibble
    v = _mm_and_si128(v, _mm_set1_epi8(0x0f));  // clear the bits shifted in from the neighbouring byte
    const __m128i sext_lut = _mm_setr_epi8( 0,  1,  2,  3,  4,  5,  6,  7,
                                           -8, -7, -6, -5, -4, -3, -2, -1);
    return _mm_shuffle_epi8(sext_lut, v);
}