c++ - How to deal with excess precision in floating-point computations?

Question

Welcome To Ask or Share your Answers For Others

c++ - How to deal with excess precision in floating-point computations?

posted Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

c++ - How to deal with excess precision in floating-point computations?

In my numerical simulation I have code similar to the following snippet

double x;
do {
  x = /* some computation */;
} while (x <= 0.0);
/* some algorithm that requires x to be (precisely) larger than 0 */

With certain compilers (e.g. gcc) on certain platforms (e.g. linux, x87 math) it is possible that x is computed in higher than double precision ("with excess precision"). (Update: When I talk of precision here, I mean precision /and/ range.) Under these circumstances it is conceivable that the comparison (x <= 0) returns false even though the next time x is rounded down to double precision it becomes 0. (And there's no guarantee that x isn't rounded down at an arbitrary point in time.)

Is there any way to perform this comparison that

is portable,
works in code that gets inlined,
has no performance impact and
doesn't exclude some arbitrary range (0, eps)?

I tried to use (x < std::numeric_limits<double>::denorm_min()) but that seemed to significantly slow down the loop when working with SSE2 math. (I know that denormals can slow down a computation, but I didn't expect them to be slower to just move around and compare.)

Update: An alternative is to use volatile to force x into memory before the comparison, e.g. by writing

} while (*((volatile double*)&x) <= 0.0);

However, depending on the application and the optimizations applied by the compiler, this solution can introduce a noticeable overhead too.

Update: The problem with any tolerance is that it's quite arbitrary, i.e. it depends on the specific application or context. I'd prefer to just do the comparison without excess precision, so that I don't have to make any additional assumptions or introduce some arbitrary epsilons into the documentation of my library functions.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-17T03:06:52+0000

As Arkadiy stated in the comments, an explicit cast ((double)x) <= 0.0 should work - at least according to the standard.

C99:TC3, 5.2.4.2.2 §8:

Except for assignment and cast (which remove all extra range and precision), the values of operations with floating operands and values subject to the usual arithmetic conversions and of floating constants are evaluated to a format whose range and precision may be greater than required by the type. [...]

If you're using GCC on x86, you can use the flags -mpc32, -mpc64 and -mpc80 to set the precision of floating-point operations to single, double and extended double precision.

Categories

c++ - How to deal with excess precision in floating-point computations?

c++ - How to deal with excess precision in floating-point computations?

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags