From different inputs, I gathered these solutions. The key to crossing the inter-lane barrier is the align instruction, _mm256_alignr_epi8.
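Left shift of the full 256-bit register by N bytes (the native _mm256_slli_si256(A, N) only shifts within each 128-bit lane):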
0 < N < 16
_mm256_alignr_epi8(A, _mm256_permute2x128_si256(A, A, _MM_SHUFFLE(0, 0, 2, 0)), 16 - N)
N = 16
_mm256_permute2x128_si256(A, A, _MM_SHUFFLE(0, 0, 2, 0))
16 < N < 32
_mm256_slli_si256(_mm256_permute2x128_si256(A, A, _MM_SHUFFLE(0, 0, 2, 0)), N - 16)
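Right shift of the full 256-bit register by N bytes (the native _mm256_srli_si256(A, N) only shifts within each 128-bit lane):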
0 < N < 16
_mm256_alignr_epi8(_mm256_permute2x128_si256(A, A, _MM_SHUFFLE(2, 0, 0, 1)), A, N)
N = 16
_mm256_permute2x128_si256(A, A, _MM_SHUFFLE(2, 0, 0, 1))
16 < N < 32
_mm256_srli_si256(_mm256_permute2x128_si256(A, A, _MM_SHUFFLE(2, 0, 0, 1)), N - 16)
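The expressions above can be sanity-checked against a plain byte-array shift. The following is only a sketch: the macro and function names (SHL256_LT16, SHR256_GT16, check_shifts, and so on) are made up for illustration, N has to be a compile-time constant, and the target attribute and __builtin_cpu_supports assume GCC or Clang on x86.

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical wrappers around the expressions from the answer.
   N must be a compile-time constant in the stated range. */
#define PERM_LO_TO_HI(A) _mm256_permute2x128_si256((A), (A), _MM_SHUFFLE(0, 0, 2, 0))
#define PERM_HI_TO_LO(A) _mm256_permute2x128_si256((A), (A), _MM_SHUFFLE(2, 0, 0, 1))

#define SHL256_LT16(A, N) _mm256_alignr_epi8((A), PERM_LO_TO_HI(A), 16 - (N)) /* 0 < N < 16  */
#define SHL256_GT16(A, N) _mm256_slli_si256(PERM_LO_TO_HI(A), (N) - 16)       /* 16 < N < 32 */
#define SHR256_LT16(A, N) _mm256_alignr_epi8(PERM_HI_TO_LO(A), (A), (N))      /* 0 < N < 16  */
#define SHR256_GT16(A, N) _mm256_srli_si256(PERM_HI_TO_LO(A), (N) - 16)       /* 16 < N < 32 */

/* Scalar reference: treat the 32 bytes as one little-endian integer. */
static void ref_shl(const uint8_t in[32], uint8_t out[32], int n) {
    memset(out, 0, 32);
    memcpy(out + n, in, (size_t)(32 - n));
}

static void ref_shr(const uint8_t in[32], uint8_t out[32], int n) {
    memset(out, 0, 32);
    memcpy(out, in + n, (size_t)(32 - n));
}

/* Store a SIMD result and compare it with the scalar reference. */
#define CHECK(EXPR, REF, N)                                  \
    do {                                                     \
        _mm256_storeu_si256((__m256i *)got, (EXPR));         \
        REF(in, want, (N));                                  \
        if (memcmp(got, want, 32) != 0) return 0;            \
    } while (0)

/* Compiled for AVX2 regardless of the global compiler flags (GCC/Clang). */
__attribute__((target("avx2")))
static int shifts_match_reference(void) {
    uint8_t in[32], got[32], want[32];
    for (int i = 0; i < 32; i++) in[i] = (uint8_t)(i + 1);
    __m256i a = _mm256_loadu_si256((const __m256i *)in);

    CHECK(SHL256_LT16(a, 5), ref_shl, 5);
    CHECK(SHL256_GT16(a, 20), ref_shl, 20);
    CHECK(SHR256_LT16(a, 5), ref_shr, 5);
    CHECK(SHR256_GT16(a, 20), ref_shr, 20);
    return 1;
}

/* Returns 1 if all shifts match, 0 on mismatch, -1 if the CPU lacks AVX2. */
int check_shifts(void) {
    if (!__builtin_cpu_supports("avx2")) return -1;
    return shifts_match_reference();
}
```

Calling check_shifts() from main and testing for a nonzero result is enough to exercise all four variants. The N = 16 case is just the permute on its own, so it is not wrapped separately here.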