The goal of this story

Understand Python decimal representations
Know the error due to the representation of decimal numbers and the error of calculation

Story to add 10 times

Adding 1 10 times gives more than 10 (or equal to 10).

`int_sum.py`


x = 0
for i in range(10):
    x += 1
print(x < 10)
print(x == 10)
print(type(x))

It's natural, isn't it?

% python int_sum.py
False
True
<type 'int'>

So what if you add 0.1 10 times?

`float_sum.py`


x = 0.0
for i in range(10):
    x += 0.1
print(x < 1)
print(type(x))

Not more than 1 ...

% python float_sum.py
True
<type 'float'>

Numerical values are represented by binary numbers on a computer, but this happens because decimal numbers are represented by finite-digit binary numbers.

Detailed explanation http://docs.python.jp/3/tutorial/floatingpoint.html However, let's take a look at the expression method of "floating point numbers", which is the premise of the explanation.

How to represent numbers in binary

Before getting into the main subject, let's review how to represent numbers in binary.

To represent numbers in decimal, use numbers from 0 to 9 and powers of 10, for example.

2015
= 2*1000 + 0*100  + 1*10   + 5*1
= 2*10^3 + 0*10^2 + 1*10^1 + 5*10^0

I wrote like this.

Similarly, in binary numbers, the numbers 0 to 1 and the power of 2

11001011
= 1*2^7 + 1*2^6 + 0*2^5 + 0*2^4 + 1*2^3 + 0*2^2 + 1*2^1 + 1*2^0

Represents a number like. In addition, since binary notation has many digits and is difficult to write, it is often written in a compact form using hexadecimal numbers when explaining. For example, the binary number 11001011 is written as 0xcb in the hexadecimal number.

Floating point representation

Let's analyze what the decimal representation actually looks like using a simple C program.

`float.c`


#include <stdio.h>

int main(int argc, char *argv[])
{
  double z;

  if (argc != 2) {
    fprintf(stderr, "usage: %s str\n", argv[0]);
    return -1;
  }
  if (sscanf(argv[1], "%lf", &z) != 1) {
    fprintf(stderr, "parse failed\n");
    return -1;
  }
  printf("double: %f\n", z);
  printf("hex: %016llx\n", *(unsigned long long *)&z);
  return 0;
}

In this program

Read the number given as an argument as a decimal
Print the read value as a decimal decimal
Print the bit representation in hexadecimal

I am doing that.

% gcc -o float float.c

If you compile with, give an argument and execute it, it will be as follows.

% ./float 1
double: 1.000000
hex: 3ff0000000000000
% ./float -1
double: -1.000000
hex: bff0000000000000
% ./float 0.5
double: 0.500000
hex: 3fe0000000000000
% ./float -0.5
double: -0.500000
hex: bfe0000000000000
% ./float 2
double: 2.000000
hex: 4000000000000000

Comparing the hexadecimal notation of 1 and -1, 0.5 and -0.5, it seems that the difference in sign corresponds to the difference between 3 and b at the beginning. When expressed in binary numbers

0011
1011

So you can expect a positive number if the most significant bit is 0 and a negative number if it is 1. If it is correct

4 (hexadecimal) = 0100 (binary)
c (hexadecimal) = 1100 (hexadecimal)

So -2 should be c000000000000000. Let's actually try it.

% ./float -2
double: -2.000000
hex: c000000000000000

It seems that the expectation was correct. Next, if we focus on the part excluding the sign bit

2 (2^1): 4000000000000000
1 (2^0): 3ff0000000000000
0.5 (2^(-1)): 3fe0000000000000

And, when doubled, it seems that the first 3 characters (11-bit value excluding the sign bit) are incremented by 1. Therefore, 4 (2 ^ 2) can be expected to be 4010000000000000, but when you actually check it,

% ./float 4
double: 4.000000
hex: 4010000000000000

It became. In other words, it was found that the 11 bits following the sign bit represent the exponent (the number multiplied by 2).

So what do the remaining 13 characters (52 bits) represent? Given the binary number 1.1, that is, 1 + 1/2 = 1.5

% ./float 1.5
double: 1.500000
hex: 3ff8000000000000

0x8 (1000) appears at the top. Binary 1.11 That is, 1 + 1/2 + 1/4 = 1.75

% ./float 1.75
double: 1.750000
hex: 3ffc000000000000

In this case it starts at 0xc (1100). Then the binary number 1.111, that is, 1 + 1/2 + 1/4 + 1/8 = 1.875

% ./float 1.875
double: 1.875000
hex: 3ffe000000000000

And if you summarize up to here

1.0000: 0000
1.1000: 1000
1.1100: 1100
1.1110: 1110

Comparing the left and right, it seems that the value on the left is the value on the right, with the 1 above the decimal point omitted. So what happens with 1.00001? If you give 1 + 1/32 = 1.03125 as an argument

% ./float 1.03125
double: 1.031250
hex: 3ff0800000000000

It became 0x08 (0000 1000).

1.00001: 00001

Again, this can be explained by the omission of the 1 above the decimal point. In the case of the power of 2 that was seen at the beginning, the lower 52 bits were 0, but if the power of 2 is expressed in binary, the most significant power is 1 and the rest is 0, so that is also explained. Is attached.

From what we have observed so far, the expression of decimal numbers can be expressed as an expression.

(-1)^(x >> 63) * 2^((0x7ff & (x >> 52)) - 0x3ff) * ((x & 0x000fffffffffffff) * 2^(-52) + 1)

It will be. Note that 0 cannot be expressed in this form, but it plays an important role when multiplying, so

% ./float 0
double: 0.000000
hex: 0000000000000000

And all the bits are set to 0.

Summary

The most significant bit represents the sign
The following 11 bits represent the exponent part
The mantissa is represented by 53 bits, which is the remaining 52 bits plus the implicit most significant bit.
0 means that all bits are 0

Decimal numbers were expressed in the form of.

addition

Let's print the value in the middle by the calculation of adding 0.1 10 times, which we saw at the beginning.

`float_sum.c`


#include <stdio.h>

int main()
{
  double z = 0.0;
  for (int i = 0; i < 10; i++)
  {
    z += 0.1;
    printf("hex: %016llx\n", *(unsigned long long *)&z);
  }
  return 0;
}

% gcc -o float_sum float_sum.c
% ./float_sum
hex: 3fb999999999999a
hex: 3fc999999999999a
hex: 3fd3333333333334
hex: 3fd999999999999a
hex: 3fe0000000000000
hex: 3fe3333333333333
hex: 3fe6666666666666
hex: 3fe9999999999999
hex: 3feccccccccccccc
hex: 3fefffffffffffff

First, the first line should represent the decimal number 0.1,

2^((0x3fb - 0x3ff) * (0x1999999999999a * 2^(-52))
= 0x1999999999999a * 2^(-56)
= 7205759403792794.0 / 72057594037927936.0

It is not possible to accurately represent 0.1 (it is a little larger than 0.1). When the decimal number 0.1 is expressed as a binary number, it becomes a recurring decimal number, and in the hexadecimal number, 9 continues infinitely, but it shows that there is an error in expressing it in a finite number of digits.

Next, let's see how to calculate addition. Floating-point numbers were represented by exponents and mantissa, but their addition is in decimal.

10 * 3 + 100 * 5 = 100 * (0.3 + 5) = 100 * 5.3

Do something similar to. In fact, if you write a decimal in the form (exponent, mantissa),

3fb999999999999a + 3fb999999999999a
= (3fb, 0x1999999999999a) + (3fb, 0x1999999999999a)
= (3fb, 0x1999999999999a + 0x1999999999999a)
= (3fb, 0x33333333333334)
= (3fc, 0x1999999999999a)
= 3fc999999999999a

3fc999999999999a + 3fb999999999999a
= (3fc, 0x1999999999999a) + (3fb, 0x1999999999999a)
= (3fc, 0x1999999999999a) + (3fc, 0xccccccccccccd)
= (3fc, 0x1999999999999a + 0xccccccccccccd)
= (3fc, 0x26666666666667)
= (3fd, 0x13333333333334)
= 3fd3333333333334

Here is the rounding rule for mantissa shifts

Truncate if the most significant bit of the shifted out bit is 0
Round up if the most significant bit of the shifted out bit is 1 and the bit other than the shifted out most significant bit has 1.
If the most significant bit of the shifted out bit is 1 and there is no 1 in the bits other than the shifted out most significant bit
Round up if the least significant bit of the remaining number is 1.
Truncate if the least significant bit of the remaining number is 0

I'm using

0x33333333333334 = ...00110100
1 bit shift →...1010 = 0x1999999999999a

0x999999999999a = ...10011010
1 bit shift →...1101 = 0xccccccccccccd

0x26666666666667 = ...01100111
1 bit shift →...0100 = 0x13333333333334

Again, we need to round to represent the exponent with 53 bits, and we can see that there is an additional error. Due to these errors, adding 0.1 10 times will result in less than 1.

Python decimal representation

Now that you understand the representation of floating point numbers using a C program, Finally, let's make sure that Python actually uses the same expression.

`float_sum_hex.py`


x = 0.0
for i in range(10):
    x += 0.1
    print(x.hex())

% python float_sum_hex.py
0x1.999999999999ap-4
0x1.999999999999ap-3
0x1.3333333333334p-2
0x1.999999999999ap-2
0x1.0000000000000p-1
0x1.3333333333333p-1
0x1.6666666666666p-1
0x1.9999999999999p-1
0x1.cccccccccccccp-1
0x1.fffffffffffffp-1

The writing method is slightly different, but the same values as those seen in C language appear in each.

In fact, if you look at the CPython implementation, the Python float value is represented by a C double.

`cpython/Include/floatobject.h`


typedef struct {
    PyObject_HEAD
    double ob_fval;
} PyFloatObject;

Summary

Read the decimal representation of Python (and C) (IEEE 754) based on some values
Python float (C double) is represented by sign (1), exponent (11), mantissa (52 + 1) using 64 bits.
A fixed number of bits can represent a number with a certain degree of precision.
Due to an error, it may happen that even if 0.1 is added 10 times, it does not become 1.
Error caused by expressing 0.1 as a finite digit binary number
Error that occurs when performing digit alignment for addition
To accurately represent decimal numbers → decimal (http://docs.python.jp/3/library/decimal.html)

Python decimals

The goal of this story

Story to add 10 times

int_sum.py

float_sum.py

How to represent numbers in binary

Floating point representation

float.c

addition

float_sum.c

Python decimal representation

float_sum_hex.py

cpython/Include/floatobject.h

Summary

`int_sum.py`

`float_sum.py`

`float.c`

`float_sum.c`

`float_sum_hex.py`

`cpython/Include/floatobject.h`