floating point - How to weigh up calculation error

posted Jan 31, 2022 in Technique[技术] by 深蓝 (71.8m points)

Consider the following example. There is an image in which the user can select a rectangular area (a part of the image). The image is displayed at some scale. Then we change the scale and need to recalculate the coordinates of the selection. Take the width:

newSelectionWidth = round(oldSelectionWidth / oldScale * newScale)

where oldScale = oldDisplayImageWidth / realImageWidth and newScale = newDisplayImageWidth / realImageWidth. All the values except the scales are integers.
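As a minimal sketch, the formula can be written in JavaScript as follows. The function names scaleOf and rescaleSelectionWidth and the sample widths are illustrative assumptions, not part of the original question; Math.round stands in for round.

```javascript
// Hypothetical helper: a scale is the displayed width divided by the real width.
function scaleOf(displayImageWidth, realImageWidth) {
  return displayImageWidth / realImageWidth;
}

// Hypothetical helper: the formula from the question, evaluated left to right.
function rescaleSelectionWidth(oldSelectionWidth, oldScale, newScale) {
  return Math.round(oldSelectionWidth / oldScale * newScale);
}

// Sample values chosen only for illustration.
const realImageWidth = 4000;
const oldScale = scaleOf(800, realImageWidth);   // image shown 800 px wide
const newScale = scaleOf(1200, realImageWidth);  // image shown 1200 px wide

// Selecting the whole displayed image, then half of it.
const fullWidth = rescaleSelectionWidth(800, oldScale, newScale);
const halfWidth = rescaleSelectionWidth(400, oldScale, newScale);
console.log(fullWidth, halfWidth);
```

Neither scale (0.2 and 0.3) is exactly representable as a double, which is why the question of accumulated rounding error arises at all.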

The question is: how can one prove that newSelectionWidth = newDisplayImageWidth, given oldSelectionWidth = oldDisplayImageWidth, for any values of oldDisplayImageWidth, newDisplayImageWidth and realImageWidth? Or under what conditions does this not hold?

I have thought about the answer too, and this is what I've come up with; it may be inaccurate and/or incomplete.

All numbers in JavaScript are double-precision numbers. Generally, this gives us a maximum error of about 10⁻¹⁶ (machine epsilon). This means that, in order to have an error of 0.5 or more, (1) we need to perform 0.5 / 10⁻¹⁶ = 5·10¹⁵ operations. The other source of error is calculation with numbers that are too large (|value| > 1.7976931348623157·10³⁰⁸) or too small (|value| < 2.2250738585072014·10⁻³⁰⁸). This means that (2) if somewhere in the course of the calculation we get a number that is too large or too small, e.g. because oldDisplayImageWidth / realImageWidth > 1.7976931348623157·10³⁰⁸ or the like, then the error might exceed 0.5. Given that we're talking about displaying images on today's monitors, all these conditions are extremely unlikely.
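The constants quoted above can be inspected directly in JavaScript. This is only a sketch for orientation; the variable names are illustrative. Note that the lower limit quoted above is the smallest positive normal double, which is not the same value as Number.MIN_VALUE.

```javascript
// Machine epsilon: the gap between 1 and the next larger double (2^-52).
const epsilon = Number.EPSILON;

// Largest finite double.
const largest = Number.MAX_VALUE;

// Smallest positive normal double, written as the literal quoted in the text.
// Below this magnitude, doubles become subnormal and lose precision gradually.
const smallestNormal = 2.2250738585072014e-308;

// Number.MIN_VALUE is the smallest positive subnormal double, far below smallestNormal.
const smallestSubnormal = Number.MIN_VALUE;

// Half an epsilon added to 1 is lost to rounding; a whole epsilon is not.
const halfEpsilonLost = 1 + epsilon / 2 === 1;
const fullEpsilonKept = 1 + epsilon > 1;

console.log(epsilon, largest, smallestNormal, smallestSubnormal);
console.log(halfEpsilonLost, fullEpsilonKept);
```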

1 Reply

深蓝 · Answer 1 · 2022-01-31T07:05:57+0000

If newDisplayImageWidth is less than 1125899906842624 (2⁵⁰) and the other integers are positive and do not exceed 53 bits, then newSelectionWidth equals newDisplayImageWidth. A proof follows.

Notation:

I will use the term double to name the floating-point type being used, IEEE-754 64-bit binary.
Text in code style represents computed values, while plain text represents mathematical values. Thus 1/3 is exactly one-third, while 1./3. is the result of dividing 1 by 3 in floating-point arithmetic.

I assume:

The widths are positive integers not wider than the double significand (53 bits).
The divisions oldDisplayImageWidth / realImageWidth and newDisplayImageWidth / realImageWidth are performed in double arithmetic with the operands converted to double.

The limits on the integers ensure that conversion to double is exact and that overflow and underflow are not encountered during the operations used in this problem.
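In JavaScript, the 53-bit assumption corresponds to what Number.isSafeInteger tests. The sketch below illustrates this; isValidWidth is a hypothetical helper name, not something from the answer.

```javascript
// Hypothetical check for the assumption: a positive integer of at most 53 bits.
function isValidWidth(w) {
  return Number.isSafeInteger(w) && w > 0;
}

// The largest 53-bit integer, 2^53 - 1.
const maxWidth = Number.MAX_SAFE_INTEGER;

// Conversion is exact: the double holds precisely this integer.
const roundTrips = BigInt(maxWidth) === 9007199254740991n;

// Beyond 53 bits, neighbouring integers start to collapse into one double.
const collapses = 2 ** 53 + 1 === 2 ** 53;

// The extreme quotients of two valid widths stay far inside the normal range,
// so neither overflow nor underflow can occur in the scale computations.
const largestQuotient = maxWidth / 1;
const smallestQuotient = 1 / maxWidth;

console.log(isValidWidth(maxWidth), roundTrips, collapses);
console.log(largestQuotient, smallestQuotient);
```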

Consider oldScale, which is a double set to the computed quotient oldDisplayImageWidth / realImageWidth. The maximum error in a single floating-point operation in round-to-nearest mode is half an ULP (because every mathematical number is no farther than half an ULP from a representable number). Thus, oldScale equals the exact quotient oldDisplayImageWidth / realImageWidth · (1+e₀), where e₀ represents the relative error and is at most half a double epsilon in magnitude. (The double epsilon is 2⁻⁵², so |e₀| ≤ 2⁻⁵³.)

Similarly, newScale is the exact quotient newDisplayImageWidth / realImageWidth · (1+e₁), where e₁ is some error with |e₁| ≤ 2⁻⁵³.
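The 2⁻⁵³ bound on the relative error of a single division can be spot-checked with exact BigInt arithmetic. This is a sketch under stated assumptions: relativeErrorWithinBound is a hypothetical helper, and scaling by 2¹⁰⁵ assumes both widths are positive integers of at most 53 bits, so that the scaled quotient is an exact integer.

```javascript
const TWO_105 = 2 ** 105;

// Returns true if the double quotient a / b differs from the exact quotient
// by a relative error of at most 2^-53.
function relativeErrorWithinBound(a, b) {
  const q = a / b;                      // computed (rounded) quotient
  const Q = BigInt(q * TWO_105);        // exact: q equals Q / 2^105
  const A = BigInt(a);
  const B = BigInt(b);
  // (q - a/b) / (a/b) equals (Q*B - A*2^105) / (A*2^105).
  let diff = Q * B - A * (1n << 105n);
  if (diff < 0n) diff = -diff;
  // |diff| / (A * 2^105) <= 2^-53, rewritten without fractions.
  return diff * (1n << 53n) <= A * (1n << 105n);
}

// Check every pair of small widths.
let allWithinBound = true;
for (let a = 1; a <= 200; a++) {
  for (let b = 1; b <= 200; b++) {
    if (!relativeErrorWithinBound(a, b)) allWithinBound = false;
  }
}
console.log(allWithinBound);
```

A passing check over a small range illustrates the bound; it does not replace the argument from round-to-nearest given above.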

Then the computed oldSelectionWidth / oldScale is the exact quotient oldSelectionWidth / oldScale · (1+e₂), again for some |e₂| ≤ 2⁻⁵³, and the computed oldSelectionWidth / oldScale * newScale is oldSelectionWidth / oldScale · (1+e₂) · newScale · (1+e₃), for some |e₃| ≤ 2⁻⁵³. Note that this is the argument passed to round.

Now substitute the expressions we have for oldScale and newScale. This yields oldSelectionWidth / (oldDisplayImageWidth / realImageWidth · (1+e₀)) · (1+e₂) · (newDisplayImageWidth / realImageWidth · (1+e₁)) · (1+e₃). The realImageWidth terms cancel, and we can rearrange the others to produce oldSelectionWidth · newDisplayImageWidth / oldDisplayImageWidth · (1+e₁) · (1+e₂) · (1+e₃) / (1+e₀).

We are given that oldSelectionWidth equals oldDisplayImageWidth, so those cancel, and the argument to round is exactly: newDisplayImageWidth · (1+e₁) · (1+e₂) · (1+e₃) / (1+e₀).

Consider the combined error terms minus one (this is the relative error in the final value): (1+e₁) · (1+e₂) · (1+e₃) / (1+e₀) − 1. This expression has greatest magnitude when e₀ is −2⁻⁵³ and the others are +2⁻⁵³. Then it is slightly greater than 2 ULP, that is, slightly greater than 2⁻⁵¹ (at most 324518553658426753804753784799233 / 730750818665451377972204001751459814038961127424). If newDisplayImageWidth is less than 1125899906842624, then newDisplayImageWidth times this relative error is less than ½. Therefore, newDisplayImageWidth · (1+e₁) · (1+e₂) · (1+e₃) / (1+e₀) would be within ½ of newDisplayImageWidth.
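The worst-case fraction and the ½ threshold can be reproduced with exact BigInt arithmetic. In this sketch, p stands for 2⁵³ (so each error term has magnitude 1/p), and the names worstNum, worstDen and errorBelowHalf are illustrative assumptions.

```javascript
const p = 1n << 53n; // 2^53, the reciprocal of the per-operation error bound

// (1 + 1/p)^3 / (1 - 1/p) - 1 as an exact fraction worstNum / worstDen:
//   (p + 1)^3 / (p^2 (p - 1)) - 1 = ((p + 1)^3 - p^2 (p - 1)) / (p^2 (p - 1))
const worstNum = (p + 1n) ** 3n - p * p * (p - 1n);
const worstDen = p * p * (p - 1n);

// width * worstNum / worstDen < 1/2, rewritten without fractions.
function errorBelowHalf(width) {
  return 2n * width * worstNum < worstDen;
}

const limit = 1n << 50n; // 1125899906842624

console.log(worstNum, worstDen);
console.log(errorBelowHalf(limit - 1n), errorBelowHalf(limit));
```

The bound stays below ½ for every width under 2⁵⁰ and stops doing so at exactly 2⁵⁰. This only shows where this particular worst-case bound runs out; it does not show that the rounding actually fails at that width.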

Since newDisplayImageWidth is an integer, if the argument to round is less than ½ away from newDisplayImageWidth, then the result is newDisplayImageWidth.

Therefore, if newDisplayImageWidth is less than 1125899906842624, then newSelectionWidth equals newDisplayImageWidth.

(The above proves that 1125899906842624 is a sufficient limit, but it may not be necessary. A more involved analysis may be able to prove that certain combinations of errors are impossible, so that the maximum combined error is less than the one used above. This would relax the limit, allowing larger values of newDisplayImageWidth.)
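As an empirical spot check (not a substitute for the proof), the claim can be tested by brute force over small widths, plus one width just under the limit. The function name rescaleFullSelection and the tested ranges are illustrative assumptions.

```javascript
// Rescales a selection that covers the whole displayed image,
// i.e. oldSelectionWidth equals oldDisplayImageWidth.
function rescaleFullSelection(oldDisplayImageWidth, newDisplayImageWidth, realImageWidth) {
  const oldScale = oldDisplayImageWidth / realImageWidth;
  const newScale = newDisplayImageWidth / realImageWidth;
  return Math.round(oldDisplayImageWidth / oldScale * newScale);
}

// Try every combination of small positive widths and count the failures.
const maxTested = 60;
let mismatches = 0;
for (let real = 1; real <= maxTested; real++) {
  for (let oldW = 1; oldW <= maxTested; oldW++) {
    for (let newW = 1; newW <= maxTested; newW++) {
      if (rescaleFullSelection(oldW, newW, real) !== newW) mismatches++;
    }
  }
}

// One case just under the proven limit of 2^50.
const nearLimit = 2 ** 50 - 1;
const nearLimitResult = rescaleFullSelection(3, nearLimit, 7);

console.log(mismatches, nearLimitResult === nearLimit);
```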
