An in-depth look at how C and C++ store floating-point numbers in memory

  • 2020-04-02 00:48:01
  • OfStack

All data is stored in memory in binary form. For example, the short value 1156 has the binary representation 00000100 10000100. On an Intel CPU it is stored as 10000100 (low address) 00000100 (high address), because Intel CPUs are little-endian. But how are floating-point numbers stored in memory? Mainstream C/C++ compilers use the standard floating-point format defined by IEEE (IEEE 754), which is binary scientific notation.
In binary scientific notation a value is written S=M*2^N and is made up of three parts: sign bit, exponent (N), and mantissa (M). A float occupies 32 bits: 1 sign bit, 8 exponent bits, and 23 mantissa bits. A double occupies 64 bits: 1 sign bit, 11 exponent bits, and 52 mantissa bits.
              Sign bit      Exponent      Mantissa
float         31            30-23         22-0
double        63            62-52         51-0
Sign bit: 0 for positive, 1 for negative
Exponent: the exponent is stored in biased (offset) form so that both positive and negative exponents can be represented. For float the bias is 127: the 8-bit field holds 0 to 255, the values 0 and 255 are reserved (for zero and subnormal numbers, and for infinity and NaN), so normalized numbers have true exponents from -126 to 127. For double the bias is 1023 and the range is -1022 to 1023. For example, if the true exponent of a float is 2, the stored value is 2+127=129, which is 10000001.
Mantissa: the significant digits, stored as the fractional binary digits (the bits after the binary point). Because the integer part of M is always 1 for a normalized number, that 1 is not stored.
Here are some examples:
Converting the float value 125.5 to the standard floating-point format:
The binary representation of 125 is 1111101, and the fractional part 0.5 is 0.1 in binary, so 125.5 is 1111101.1. Because the integer part of the mantissa must be 1, this is normalized to 1.1111011*2^6. The exponent is 6; adding the bias 127 gives 133, which is 10000101. Dropping the integer part 1 from the mantissa leaves 1111011, which is padded with zeros on the right to 23 bits: 11110110000000000000000.
Its binary representation is therefore
0 10000101 11110110000000000000000, and on a little-endian machine it is stored in memory as:
00000000     Lower address
00000000
11111011
01000010     High address
Going the other way, take the binary form 0 10000101 11110110000000000000000 and compute the floating-point value.
The sign bit is 0, so the number is positive. The exponent is 133-127=6, and the stored mantissa is 11110110000000000000000, so the actual mantissa is 1.1111011. The value is therefore
1.1111011*2^6. Moving the binary point 6 places to the right gives 1111101.1; 1111101 is 125 in decimal and 0.1 is 1*2^(-1)=0.5, so the value is 125.5.
Similarly, convert the float value 0.5 to binary form.
The binary form of 0.5 is 0.1. Because the integer part must be 1, the binary point is moved one place to the right, giving 1.0*2^(-1). The exponent is -1+127=126, which is 01111110. Removing the integer part from the mantissa 1.0 leaves 0, which is padded to 23 bits: 00000000000000000000000. Its binary form is therefore
0, 01111110, 00000000000000000000000
From this analysis, the largest value a float can represent is 1.11111111111111111111111*2^127, which is approximately 3.4*10^38.
The situation is similar for double, except that the exponent is 11 bits, the bias is 1023, and the mantissa is 52 bits.

Test program:

 
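#include <cstdio>
using namespace std;
int main(int argc, char *argv[])
{
    float a = 125.5f;
    // View the four bytes of a through a char pointer, lowest address first.
    // Plain char is signed on typical x86 compilers, so the byte 0xFB prints as -5.
    char *p = (char *)&a;
    printf("%d\n", *p);
    printf("%d\n", *(p + 1));
    printf("%d\n", *(p + 2));
    printf("%d\n", *(p + 3));
    return 0;
}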

The output result is:
0
0
-5
66

As shown above, the float 125.5 is stored in memory as follows:
00000000     Lower address
00000000
11111011
01000010     High address
Therefore, the bytes pointed to by p and p+1 both hold the binary pattern for the decimal integer 0.
The byte pointed to by p+2 holds 11111011. Because p is a char pointer and char is a signed type on this platform (the signedness of plain char is implementation-defined), the leading 1 is a sign bit and the number is negative; integers are stored in two's complement, so its value is -5.
The byte pointed to by p+3 holds 01000010, a positive number whose value is 66. The program output confirms the analysis.
