
c++ - What is a subnormal floating point number?

The isnormal() reference page says:

Determines if the given floating point number arg is normal, i.e. is neither zero, subnormal, infinite, nor NaN.

It is clear what it means for a number to be zero, infinite, or NaN. But the page also says subnormal. When is a number subnormal?


1 Reply


IEEE 754 basics

First, let's review the basics of how IEEE 754 numbers are organized.

We'll focus on single precision (32-bit), but everything can be immediately generalized to other precisions.

The format is:

  • 1 bit: sign
  • 8 bits: exponent
  • 23 bits: fraction

Or if you like pictures:

[Image: the 32-bit layout, 1 sign bit | 8 exponent bits | 23 fraction bits.]

The sign is simple: 0 is positive, and 1 is negative, end of story.

The exponent is 8 bits long, and so it ranges from 0 to 255.

The exponent is called biased because you have to subtract the bias of 127 from it to get the actual power of 2, e.g.:

  0 == special case: zero or subnormal, explained below
  1 == 2 ^ -126
    ...
125 == 2 ^ -2
126 == 2 ^ -1
127 == 2 ^  0
128 == 2 ^  1
129 == 2 ^  2
    ...
254 == 2 ^ 127
255 == special case: infinity and NaN
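To make the bias concrete, here is a minimal sketch that pulls out the raw exponent field and checks it against the table above (raw_exponent is just an illustrative helper name of mine, assuming the usual 32-bit float layout):

#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative helper: extract the raw 8-bit exponent field of a float. */
static uint32_t raw_exponent(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits); /* reinterpret the bytes safely */
    return (bits >> 23) & 0xFF;
}

int main(void) {
    assert(raw_exponent(0.5f) == 126); /* 2 ^ -1 */
    assert(raw_exponent(1.0f) == 127); /* 2 ^  0 */
    assert(raw_exponent(2.0f) == 128); /* 2 ^  1 */
    assert(raw_exponent(4.0f) == 129); /* 2 ^  2 */
    return 0;
}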

The leading bit convention

(What follows is a fictitious hypothetical narrative, not based on any actual historical research.)

While designing IEEE 754, engineers noticed that all numbers, except 0.0, have a 1 in binary as the first significant digit. E.g.:

25.0   == (binary) 11001 == 1.1001 * 2^4
 0.625 == (binary) 0.101 == 1.01   * 2^-1

both start with that annoying 1. part.

Therefore, it would be wasteful to let that digit take up one bit of precision in almost every single number.

For this reason, they created the "leading bit convention":

always assume that the number starts with one

But then how to deal with 0.0? Well, they decided to create an exception:

  • if the exponent is 0
  • and the fraction is 0
  • then the number represents plus or minus 0.0

so that the bytes 00 00 00 00 also represent 0.0, which looks good.
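If you want to see the convention in action, here is a small sketch; the bit patterns 0x41C80000 and 0x3F200000 are just the encodings of the two examples above worked out by hand (float_bits is an illustrative helper of mine, assuming IEEE 754 floats):

#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative helper: the raw 32 bits of a float. */
static uint32_t float_bits(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

int main(void) {
    /* 25.0 = 1.1001b * 2^4: sign 0, exponent 4 + 127 = 131, fraction 1001 then 19 zeroes. */
    assert(float_bits(25.0f) == 0x41C80000);
    /* 0.625 = 1.01b * 2^-1: sign 0, exponent -1 + 127 = 126, fraction 01 then 21 zeroes. */
    assert(float_bits(0.625f) == 0x3F200000);
    /* And the all-zero bit pattern is +0.0, thanks to the exception above. */
    assert(float_bits(0.0f) == 0x00000000);
    return 0;
}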

If we only considered these rules, then the smallest non-zero number that can be represented would be:

  • exponent: 0
  • fraction: 1

which, due to the leading bit convention, looks something like this when written with a hexadecimal fraction:

1.000002 * 2 ^ (-127)

where the hex fraction .000002 stands for 22 zero bits followed by a 1.

We cannot take fraction = 0, otherwise that number would be 0.0.

But then the engineers, who also had a keen aesthetic sense, thought: isn't that ugly? That we jump straight from 0.0 to something that is not even a proper power of 2? Couldn't we represent even smaller numbers somehow? (OK, it was a bit more serious than "ugly": people were actually getting bad results for their computations; see "How subnormals improve computations" below.)

Subnormal numbers

The engineers scratched their heads for a while, and came back, as usual, with another good idea. What if we create a new rule:

If the exponent is 0, then:

  • the leading bit becomes 0
  • the exponent is interpreted as -126 (not -127, as it would be without this exception)

Such numbers are called subnormal numbers (or denormal numbers, which is a synonym).
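Put differently, a subnormal with fraction field f encodes the value f * 2^-23 * 2^-126 = f * 2^-149. A quick sketch of that arithmetic, using the standard ldexpf and hex float literals; the constants chosen are just the boundary cases discussed below (link with -lm on glibc):

#include <assert.h>
#include <math.h> /* ldexpf */

int main(void) {
    /* A subnormal with fraction field f has value f * 2^-149. */
    assert(ldexpf(1.0f, -149) == 0x1p-149f);            /* f = 1: smallest positive subnormal */
    assert(ldexpf(0x400000, -149) == 0x1p-127f);        /* f = 0x400000: 0.5 * 2^-126 */
    assert(ldexpf(0x7FFFFF, -149) == 0x0.FFFFFEp-126f); /* f = 0x7FFFFF: largest subnormal */
    return 0;
}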

This rule immediately implies that the number such that:

  • exponent: 0
  • fraction: 0

is still 0.0, which is kind of elegant as it means one less rule to keep track of.

So 0.0 is actually a subnormal number according to our definition!
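Strictly speaking, though, C's classification macros put zero in a category of its own: isnormal is false for both zero and subnormals, but fpclassify tells them apart. A quick check, assuming IEEE 754 floats:

#include <assert.h>
#include <math.h> /* fpclassify, isnormal */

int main(void) {
    assert(!isnormal(0.0f));      /* zero is not normal... */
    assert(!isnormal(0x1p-149f)); /* ...and neither is a subnormal */
    assert(fpclassify(0.0f) == FP_ZERO); /* but zero gets its own class */
    assert(fpclassify(0x1p-149f) == FP_SUBNORMAL);
    assert(fpclassify(1.0f) == FP_NORMAL);
    return 0;
}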

With this new rule then, the smallest non-subnormal number is:

  • exponent: 1 (0 would be subnormal)
  • fraction: 0

which represents:

1.0 * 2 ^ (-126)
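That smallest normal number is exactly what <float.h> calls FLT_MIN, so we can check it directly (assuming IEEE 754 single precision):

#include <assert.h>
#include <float.h> /* FLT_MIN */
#include <math.h>  /* isnormal */

int main(void) {
    assert(FLT_MIN == 0x1p-126f); /* smallest positive normal float: 1.0 * 2^-126 */
    assert(isnormal(FLT_MIN));
    return 0;
}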

Then, the largest subnormal number is:

  • exponent: 0
  • fraction: 0x7FFFFF (all 23 bits set to 1)

which equals:

0.FFFFFE * 2 ^ (-126)

where the hex fraction .FFFFFE stands for 23 one bits to the right of the dot.

This is pretty close to the smallest non-subnormal number, which sounds sane.

And the smallest non-zero subnormal number is:

  • exponent: 0
  • fraction: 1

which equals:

0.000002 * 2 ^ (-126)

which also looks pretty close to 0.0!
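Both boundary values are easy to reach from C: one nextafterf step below FLT_MIN is the largest subnormal, one step above zero is the smallest, and C11 also exposes the latter as FLT_TRUE_MIN. A sketch, assuming IEEE 754 floats (link with -lm on glibc):

#include <assert.h>
#include <float.h> /* FLT_MIN, FLT_TRUE_MIN */
#include <math.h>  /* isnormal, nextafterf */

int main(void) {
    /* Largest subnormal: one step below the smallest normal. */
    float largest_subnormal = nextafterf(FLT_MIN, 0.0f);
    assert(largest_subnormal == 0x0.FFFFFEp-126f);
    assert(!isnormal(largest_subnormal));

    /* Smallest positive subnormal: one step above zero. */
    float smallest_subnormal = nextafterf(0.0f, 1.0f);
    assert(smallest_subnormal == 0x1p-149f);
    assert(smallest_subnormal == FLT_TRUE_MIN); /* C11 */
    assert(!isnormal(smallest_subnormal));
    return 0;
}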

Unable to find any sensible way to represent numbers smaller than that, the engineers were happy, and went back to viewing cat pictures online, or whatever it is that they did in the 70s instead.

As you can see, subnormal numbers trade precision for the ability to represent values closer to zero.

As the most extreme example, the smallest non-zero subnormal:

0.000002 * 2 ^ (-126)

has essentially a precision of a single bit instead of the usual 24 bits (23 fraction bits plus the implicit leading one). For example, if we divide it by two:

0.000002 * 2 ^ (-126) / 2

we actually reach 0.0 exactly!
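That collapse to zero is easy to observe directly (assuming the default round-to-nearest mode):

#include <assert.h>

int main(void) {
    float smallest_subnormal = 0x1p-149f; /* 0.000002 * 2^-126 */
    float half = smallest_subnormal / 2.0f;
    assert(half == 0.0f); /* the exact result 2^-150 rounds to zero */
    return 0;
}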

Visualization

It is always a good idea to have a geometric intuition about what we learn, so here goes.

If we plot IEEE 754 floating point numbers on a line for each given exponent, it looks something like this:

          +---+-------+---------------+-------------------------------+
exponent  |126|  127  |      128      |              129              |
          +---+-------+---------------+-------------------------------+
          |   |       |               |                               |
          v   v       v               v                               v
          -------------------------------------------------------------
floats    ***** * * * *   *   *   *   *       *       *       *       *
          -------------------------------------------------------------
          ^   ^       ^               ^                               ^
          |   |       |               |                               |
          0.5 1.0     2.0             4.0                             8.0

From that we can see that:

  • for each exponent, there is no overlap between the represented numbers
  • for each exponent, we have the same number of floating point numbers, 2^23 (represented here by 4 *s)
  • within each exponent, points are equally spaced
  • larger exponents cover larger ranges, but with points more spread out
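nextafterf makes those spacing claims easy to verify: within the exponent-127 range [1, 2) every step is 2^-23, and moving up one exponent doubles the step (assuming IEEE 754 floats; link with -lm on glibc):

#include <assert.h>
#include <math.h> /* nextafterf */

int main(void) {
    /* Within [1.0, 2.0) (exponent field 127), consecutive floats are 2^-23 apart. */
    assert(nextafterf(1.0f, 2.0f) - 1.0f == 0x1p-23f);
    assert(nextafterf(1.5f, 2.0f) - 1.5f == 0x1p-23f);

    /* Within [2.0, 4.0) (exponent field 128), the spacing doubles to 2^-22. */
    assert(nextafterf(2.0f, 4.0f) - 2.0f == 0x1p-22f);
    return 0;
}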

Now, let's bring that down all the way to exponent 0.

Without subnormals, it would hypothetically look like:

          +---+---+-------+---------------+-------------------------------+
exponent  | ? | 0 |   1   |       2       |               3               |
          +---+---+-------+---------------+-------------------------------+
          |   |   |       |               |                               |
          v   v   v       v               v                               v
          -----------------------------------------------------------------
floats    *    **** * * * *   *   *   *   *       *       *       *       *
          -----------------------------------------------------------------
          ^   ^   ^       ^               ^                               ^
          |   |   |       |               |                               |
          0   |   2^-126  2^-125          2^-124                          2^-123
              |
              2^-127

With subnormals, it looks like this:

          +-------+-------+---------------+-------------------------------+
exponent  |   0   |   1   |       2       |               3               |
          +-------+-------+---------------+-------------------------------+
          |       |       |               |                               |
          v       v       v               v                               v
          -----------------------------------------------------------------
floats    * * * * * * * * *   *   *   *   *       *       *       *       *
          -----------------------------------------------------------------
          ^   ^   ^       ^               ^                               ^
          |   |   |       |               |                               |
          0   |   2^-126  2^-125          2^-124                          2^-123
              |
              2^-127

By comparing the two graphs, we see that:

  • subnormals double the length of the range of exponent 0, from [2^-127, 2^-126) to [0, 2^-126)

    The spacing between consecutive floats is uniform across that entire subnormal range [0, 2^-126).

  • the range [2^-127, 2^-126) has half the number of points that it would have without subnormals.

    Half of those points go to fill the other half of the range.

  • the range [0, 2^-127) has some points with subnormals, but none without.

    This lack of points in [0, 2^-127) is not very elegant, and is the main reason for subnormals to exist!

  • since the points are equally spaced:

    • the range [2^-128, 2^-127) has half as many points as [2^-127, 2^-126)
    • the range [2^-129, 2^-128) has half as many points as [2^-128, 2^-127)
    • and so on

    This is what we mean when we say that subnormals are a tradeoff between range and precision.
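The same nextafterf trick confirms the picture around zero: the step between consecutive floats is a constant 2^-149 all the way through [0, 2^-126), stays at 2^-149 in the first normal range [2^-126, 2^-125), and only then starts doubling (assuming IEEE 754 floats; link with -lm on glibc):

#include <assert.h>
#include <math.h> /* nextafterf */

int main(void) {
    /* Constant 2^-149 step throughout the subnormal range [0, 2^-126)... */
    assert(nextafterf(0.0f, 1.0f) - 0.0f == 0x1p-149f);
    assert(nextafterf(0x1p-127f, 1.0f) - 0x1p-127f == 0x1p-149f);

    /* ...still 2^-149 in the first normal range [2^-126, 2^-125)... */
    assert(nextafterf(0x1p-126f, 1.0f) - 0x1p-126f == 0x1p-149f);

    /* ...and then it doubles, e.g. 2^-148 in [2^-125, 2^-124). */
    assert(nextafterf(0x1p-125f, 1.0f) - 0x1p-125f == 0x1p-148f);
    return 0;
}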

Runnable C example

Now let's play with some actual code to verify our theory.

On almost all current desktop machines, the C float type represents single precision IEEE 754 floating point numbers.

This is in particular the case for my Ubuntu 18.04 amd64 Lenovo P51 laptop.

Under that assumption, all assertions in the following program pass:

subnormal.c

#if __STDC_VERSION__ < 201112L
#error C11 required
#endif

#ifndef __STDC_IEC_559__
#error IEEE 754 not implemented
#endif

#include <assert.h>
#include <float.h> /* FLT_HAS_SUBNORM */
#include <inttypes.h>
#include <math.h> /* isnormal */
#include <stdlib.h>
#include <stdio.h>
#include <string.h> /* memcpy */

#if FLT_HAS_SUBNORM != 1
#error float does not have subnormal numbers
#endif

typedef struct {
    uint32_t sign, exponent, fraction;
} Float32;

Float32 float32_from_float(float f) {
    uint32_t bytes;
    Float32 float32;
    /* Copy the raw bits out of the float; avoids undefined pointer type punning. */
    memcpy(&bytes, &f, sizeof bytes);
    float32.fraction = bytes & 0x007FFFFF;
    bytes >>= 23;
    float32.exponent = bytes & 0x000000FF;
    bytes >>= 8;
    float32.sign = bytes & 0x00000001;
    return float32;
}

float float_from_bytes(
    uint32_t sign,
    uint32_t exponent,
    uint32_t fraction
) {
    uint32_t bytes;
    float f;
    bytes = 0;
    bytes |= sign;
    bytes <<= 8;
    bytes |= exponent;
    bytes <<= 23;
    bytes |= fraction;
    /* Copy the assembled bits back into a float. */
    memcpy(&f, &bytes, sizeof f);
    return f;
}

int float32_equal(
    float f,
    uint32_t sign,
    uint32_t exponent,
    uint32_t fraction
) {
    Float32 float32;
    float32 = float32_from_float(f);
    return
        (float32.sign     == sign) &&
        (float32.exponent == exponent) &&
        (float32.fraction == fraction)
    ;
}

void float32_print(float f) {
    Float32 float32 = float32_from_float(f);
    printf(
        "sign = %" PRIu32 ", exponent = %" PRIu32 ", fraction = 0x%06" PRIX32 "\n",
        float32.sign,
        float32.exponent,
        float32.fraction
    );
}

int main(void) {
    /* Normal numbers: implicit leading 1, biased exponent. */
    assert(float32_equal(0.5f, 0, 126, 0));
    assert(float32_equal(1.0f, 0, 127, 0));
    assert(float32_equal(2.0f, 0, 128, 0));
    assert(isnormal(0.5f));

    /* Smallest normal number: exponent 1, fraction 0, i.e. 1.0 * 2^-126. */
    float smallest_normal = float_from_bytes(0, 1, 0);
    assert(smallest_normal == 0x1p-126f);
    assert(smallest_normal == FLT_MIN);
    assert(isnormal(smallest_normal));

    /* Largest subnormal: exponent 0, all 23 fraction bits set. */
    float largest_subnormal = float_from_bytes(0, 0, 0x7FFFFF);
    assert(largest_subnormal == 0x0.FFFFFEp-126f);
    assert(!isnormal(largest_subnormal));

    /* Smallest non-zero subnormal: exponent 0, fraction 1, i.e. 2^-149. */
    float smallest_subnormal = float_from_bytes(0, 0, 1);
    assert(smallest_subnormal == 0x1p-149f);
    assert(!isnormal(smallest_subnormal));

    /* Dividing the smallest subnormal by two underflows to exactly zero. */
    float zero = smallest_subnormal / 2.0f;
    assert(zero == 0.0f);
    assert(float32_equal(zero, 0, 0, 0));

    float32_print(smallest_subnormal);
    float32_print(largest_subnormal);
    return EXIT_SUCCESS;
}
