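You can abuse PSADBW to calculate small horizontal sums quickly.
PSADBW computes the sum of absolute differences of unsigned bytes over each 8-byte half of its operands, so running it against an all-zero register simply adds up the bytes in each half.
Something like this (not tested):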
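pxor xmm0, xmm0
psadbw xmm0, [a + 0]  ; sums of a[0..7] and a[8..15], one per 64-bit half (a must be 16-byte aligned)
pxor xmm1, xmm1
psadbw xmm1, [a + 16] ; sums of a[16..23] and a[24..31]
paddw xmm0, xmm1      ; add the partial sums half by half
pshufd xmm1, xmm0, 2  ; copy the high half's sum into the low dword of xmm1
paddw xmm0, xmm1 ; low word in xmm0 is the total sum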
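To make the data flow in the assembly easier to follow, here is a plain scalar model of it. This is only a sketch: psadbw_model and sum_32_model are hypothetical names made up for illustration, and the model returns the two per-half sums directly instead of packing them into a 128-bit register.

```cpp
#include <array>
#include <cstdint>
#include <cstdlib>

// Scalar model of PSADBW (hypothetical helper): for each 8-byte half,
// sum |x[i] - y[i]| over unsigned bytes. The real instruction stores each
// sum in the low 16 bits of its 64-bit half and zeroes the remaining bits.
std::array<uint16_t, 2> psadbw_model(const uint8_t x[16], const uint8_t y[16])
{
    std::array<uint16_t, 2> halves{0, 0};
    for (int i = 0; i < 16; ++i) {
        int diff = std::abs(static_cast<int>(x[i]) - static_cast<int>(y[i]));
        halves[i / 8] = static_cast<uint16_t>(halves[i / 8] + diff);
    }
    return halves;
}

// Mirrors the assembly: two PSADBWs against zero, then add the partial sums.
uint16_t sum_32_model(const uint8_t a[32])
{
    const uint8_t zero[16] = {};
    std::array<uint16_t, 2> lo = psadbw_model(zero, a);       // a[0..15]
    std::array<uint16_t, 2> hi = psadbw_model(zero, a + 16);  // a[16..31]
    return static_cast<uint16_t>(lo[0] + lo[1] + hi[0] + hi[1]);
}
```

The largest possible total is 32 * 255 = 8160, so the 16-bit adds cannot overflow.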
Attempted intrinsics version:
I never use intrinsics so this code probably makes no sense whatsoever. The disassembly looked OK though.
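#include <cstdint>
#include <emmintrin.h> // SSE2 intrinsics

// a must be 16-byte aligned; use _mm_loadu_si128 if that is not guaranteed.
uint16_t sum_32(const uint8_t a[32])
{
    // All-zero register to take the absolute differences against.
    __m128i zero = _mm_setzero_si128();
    __m128i sum0 = _mm_sad_epu8(
        zero,
        _mm_load_si128(reinterpret_cast<const __m128i*>(a)));
    __m128i sum1 = _mm_sad_epu8(
        zero,
        _mm_load_si128(reinterpret_cast<const __m128i*>(&a[16])));
    __m128i sum2 = _mm_add_epi16(sum0, sum1);
    __m128i totalsum = _mm_add_epi16(sum2, _mm_shuffle_epi32(sum2, 2));
    // The total sits in the low 16 bits of the low dword.
    return static_cast<uint16_t>(_mm_cvtsi128_si32(totalsum));
}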
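Since none of this was tested, one cheap sanity check is to compare the PSADBW result against a plain loop. The sketch below assumes an x86 target with SSE2; sum_32_sad, sum_32_scalar and sums_agree are hypothetical names, and it uses unaligned loads so the test buffer needs no particular alignment.

```cpp
#include <cstdint>
#include <emmintrin.h> // SSE2 intrinsics

// Same PSADBW technique, but with unaligned loads (hypothetical name).
uint16_t sum_32_sad(const uint8_t* a)
{
    const __m128i zero = _mm_setzero_si128();
    __m128i s0 = _mm_sad_epu8(
        zero, _mm_loadu_si128(reinterpret_cast<const __m128i*>(a)));
    __m128i s1 = _mm_sad_epu8(
        zero, _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + 16)));
    __m128i s = _mm_add_epi16(s0, s1);
    // Fold the high 64-bit half's partial sum onto the low half.
    s = _mm_add_epi16(s, _mm_shuffle_epi32(s, 2));
    return static_cast<uint16_t>(_mm_cvtsi128_si32(s));
}

// Straightforward reference implementation.
uint16_t sum_32_scalar(const uint8_t* a)
{
    unsigned total = 0;
    for (int i = 0; i < 32; ++i)
        total += a[i];
    return static_cast<uint16_t>(total);
}

// True when both implementations agree on the given 32 bytes.
bool sums_agree(const uint8_t* a)
{
    return sum_32_sad(a) == sum_32_scalar(a);
}
```

Calling sums_agree on a few buffers, including all zeros and all 255s, covers both extremes of the possible range.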