With Paul R's solution and with my previous solution
the difference between the rounded floating point and the original integer is less than or equal to
0.75 ULP (Unit in the Last Place). In these methods
at two places rounding may occur: in _mm_cvtepi32_ps and
in _mm_add_ps. This leads to results that are not as accurate as possible for some inputs.
For example, with Paul R's method 0x2000003=33554435 is converted to 33554432.0, but 33554436.0
also exists as a float, which would have been better here.
My previous solution suffers from similar inaccuracies.
Such inaccurate results may also occur with compiler generated code, see here.
Following the approach of gcc (see Peter Cordes' answer to that other SO question), an accurate conversion within 0.5 ULP is obtained:
inline __m128 _mm_cvtepu32_ps(const __m128i v)
{
__m128i msk_lo = _mm_set1_epi32(0xFFFF);
__m128 cnst65536f= _mm_set1_ps(65536.0f);
__m128i v_lo = _mm_and_si128(v,msk_lo); /* extract the 16 lowest significant bits of v */
__m128i v_hi = _mm_srli_epi32(v,16); /* 16 most significant bits of v */
__m128 v_lo_flt = _mm_cvtepi32_ps(v_lo); /* No rounding */
__m128 v_hi_flt = _mm_cvtepi32_ps(v_hi); /* No rounding */
v_hi_flt = _mm_mul_ps(cnst65536f,v_hi_flt); /* No rounding */
return _mm_add_ps(v_hi_flt,v_lo_flt); /* Rounding may occur here, mul and add may fuse to fma for haswell and newer */
} /* _mm_add_ps is guaranteed to give results with an error of at most 0.5 ULP */
Note that other high bits/low bits partitions are possible as long as _mm_cvt_ps can convert
both pieces to floats without rounding.
For example, a partition with 20 high bits and 12 low bits will work equally well.
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…