Suppose I want to add two buffers and store the result. Both buffers are already allocated 16byte aligned. I found two examples how to do that.
The first one is using _mm_load to read the data from the buffer into an SSE register, does the add operation and stores back to the result register. Until now I would have done it like that.
void _add( uint16_t * dst, uint16_t const * src, size_t n )
{
for( uint16_t const * end( dst + n ); dst != end; dst+=8, src+=8 )
{
__m128i _s = _mm_load_si128( (__m128i*) src );
__m128i _d = _mm_load_si128( (__m128i*) dst );
_d = _mm_add_epi16( _d, _s );
_mm_store_si128( (__m128i*) dst, _d );
}
}
The second example just did the add operations directly on the memory addresses without load/store operation. Both seam to work fine.
void _add( uint16_t * dst, uint16_t const * src, size_t n )
{
for( uint16_t const * end( dst + n ); dst != end; dst+=8, src+=8 )
{
*(__m128i*) dst = _mm_add_epi16( *(__m128i*) dst, *(__m128i*) src );
}
}
So the question is if the 2nd example is correct or may have any side effects and when to use load/store is mandatory.
Thanks.
See Question&Answers more detail:
os 与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…