# Introduction

Recently (November 2018), Togetter summary [It was written in BASIC 30 years ago that "If you add 0.01 10,000 times, it will be 100.003", so when I tried it in the current environment, the result was the same](https //togetter.com/li/1289806) became a hot topic.

For example, in a similar story, adding 0.01 100 times in Ruby (x86_64-Linux) does not result in exactly 1, and there is a slight error (see the following example). The situation will be similar for other programming languages.

#### Ruby(x86_64-Linux)With 0.When 01 is added 100 times


$irb irb(main):001:0> 100.times.reduce(0){|s,_| s+0.01 } => 1.0000000000000007  In the summary of the case, there is an opinion that "in this era ...", but ** what is floating point ** For those who know it, even if it is Atari Mae's knowledge, otherwise ** It looks like mere decimal data **, and if you are not good at it, you often get the impression that "there is a gap in the calculation!". … Originally, I think that you should acquire knowledge in introductory books (because it is a type of data that is standard in most languages), but is there a proper explanation? I also think, so I decided to write a rough article. # What is a floating point? ## Fixed point and floating point In a nutshell, floating point is not a decimal number that the average person would imagine, but ** numerical data that assumes that an error will occur in the calculation **. Conversely, ** when using floating point, be aware that it is an approximate calculation, and consider the accuracy of the result **. With that alone, you may be wondering why you aren't doing that cool, but it's ** computing with finite resources ** the fate of a computer, and the point of where to compromise. Is the difference. The opposite of floating point is ** fixed point **, which is generally integer data. This can always be handled in increments of 1, and as long as you handle integers, there will be no error, but ** the range of numbers that can be handled is not very wide **. • For example, a 64-bit integer has 19 digits as a decimal number in the range of -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807. Floating point, on the other hand, can handle a very wide range of numbers ** at the cost of error. From nano-level numbers close to 0, Avogadro constant (recently revised definition) and space-scale numbers are also available. • For example, a 64-bit floating point number called double precision covers a range of about$ 10 ^ {-308} $at the closest point to 0, about$ 10 ^ {308} $at the largest point, and about 600 decimal digits. .. However, since the precision is about 16 digits in decimal, the calculation result that does not fit in this number of digits will be lost and will become an error. And in most programming languages, it's common to handle decimal data in floating point. ## Why decimals in floating point? That said, some people may think this. "No, I don't need such nano or space. Even though I handle about 0.01 at most, please forgive me for errors." As a matter of fact, if you only deal with decimal numbers that appear in currencies and tax rates, that feeling is plausible. But then, how many decimal places should a computer cover? And here, you should have already learned in elementary school. For example, ** Pi should be infinite decimal ** (3.14 is just an approximation). Even if you don't bring out the pi, the irrational number$ \ sqrt {2} $that Pythagoras tried to seal the existence is also an infinite decimal number, so ** how many digits should I look at **, this is It always comes with decimal calculations. In the end, since it cannot be expressed as a finite value, we have to stop it somewhere, and we have to consider the calculation accuracy and error. As with so-called scientific computing, this problem is unavoidable in statistics, graphic processing, etc. • It is also a way of thinking based on "approximate numbers" in the elementary school range and "significant figures" in the scientific field. So, this is just my guess, but it seems that integers are fixed-point and decimals are floating-point. ## If you still hate errors That said, there may be times when you don't want to think about errors. If so, there should be a language or library that can handle decimal points without error. For example, in legacy and notorious COBOL, the format packed decimal allows for error-free calculations with decimal numbers in the specified number of digits. It was only said that it was for paperwork. In other languages, for example, you have the option of using BigDecimal for Java, decimal for C #, and BigDecimal for Ruby (or Rational for integers divided by integers). Or, for example, if you are talking about multiplying the consumption tax rate by 8% (though floating point does not cause much error to that extent), you can simply replace it with an integer arithmetic. Like (amount) * 8/100 instead of (amount) * 0.08. Alternatively, instead of expressing$ 1.50 as $1.50, you could devise a unit of 150 cents so that it fits in the range of integers. The point is, ** Don't treat it as a decimal in the program just because it's a decimal in the real world, think first **. # Mechanism of error ## Binary representation However, on the other hand, you may be wondering, "Why does an error occur with a non-trivial number of 0.01?" It is still easy to understand that the accuracy that can be expressed by adding a very large number and a small number, such as the following, did not fit. (Ruby (x86_64-Linux) treats it as double precision floating point, so the precision is about 16 digits in decimal) #### An example of an error that does not fit within the accuracy in Ruby $ irb
irb(main):001:0> 100_000_000 + 0.000_000_01 #Fits in accuracy
=> 100000000.00000001
irb(main):002:0> 1000_000_000 + 0.000_000_001 #Do not fit
=> 1000000000.0


However, adding 0.01 100 times to 1 does not mean that there is a difference in size ...?

In this case of "adding 0.01 100 times does not become 1", the computer internally treats floating point numbers as binary numbers, and ** 0.01 becomes an infinite (recurring) decimal number in binary numbers **. It is the cause.

• As you commented, there is also a decimal number-based floating point number, which is not applicable when using that.

Note that decimal decimals and binary decimals have the following correspondence: $\begin{eqnarray} 0.5&(10)=&0.1&(2) \\\\ 0.25&(10)=&0.01&(2) \\\\ ~ &~&\vdots&~ \\\\ 0.015625&(10)=&0.000001&(2) \\\\ 0.0078125&(10)=&0.0000001&(2) \\\\ 0.00390625&(10)=&0.00000001&(2) \\\\ 0.001953125&(10)=&0.000000001&(2) \\\\ ~ &~&\vdots&~ \\\\ \end{eqnarray}$

Then, you can see (calculate) that 0.01 is a binary recurring decimal with a 20-digit cycle as follows. $0.01(10)=0.000000101000111101011100001010001111010111\cdots(2)$ Therefore, on a computer, the digits are cut off in the middle and saved. This truncated digit affects as an error as the calculation is repeated.

Conversely, ** binary sharpness ** (becomes a finite decimal number), a fraction whose denominator is a power of 2 (for example, $\ frac 1 {128} = 0.0078125 (10) = 0.000000001 (2)$ ) Does not create an error. ** It is necessary to be careful to handle it on the assumption that there is an error, but it is not always the case that an error will occur **.

Even if it is said to be censored, it does not necessarily mean that the result will be reduced (truncated) because rounding is performed. The policy depends on the processing system, such as rounding up or moving closer (processing such as rounding).

## Demonstration of error and binary representation

Now. Floating point is defined by the standard IEEE754 today.

• The decimal floating point number received in the comment is also incorporated in the 2008 version of IEEE754-2008, but we will proceed with the binary-based story below.

The most major is 64-bit double precision, but in addition to that, 32-bit single precision, 128-bit quadruple precision, and in machine learning that is popular these days, the precision is lower than single precision (instead, the amount of calculation depending on the hardware). 16bit half precision (which can earn) is also used.

For the internal representation, [Understanding Floating Point Numbers](/ tobira-code / items / 9a000264260a7ef8ca30) is detailed, but in any case, it is based on the following binary representation. $(-1)^s\cdot 1.fff\cdots ff(2)\cdot 2^{e-bias}$

So, I created a C program to see what happens as a binary number based on the internal representation of floating point numbers, and to see how errors are created by hand. This time, with single precision (because it is hard to see if there are many digits). In the environment of Windows10 / WSL + Ubuntu16 + gcc (x86_64) that I tried, the float type is equivalent.

• I think that will be the case in an intel environment.

Here's how it compiles and runs. By giving 100 as an input, the binary display of $\ frac 1 {100} = 0.01$ is output, and then the state of long division is output. (On the right side, there is also a binary display when the equivalent multiplication is calculated.) Finally, the result of the addition is also output as a decimal number.

#### float.c Compile and run


$gcc -std=c99 -o float float.c$ ./float <<< 100
in single precision calculation:
1.0/100 = 0.01(10) =  1.01000111101011100001010(2) * 2^-7

0.000000101000111101011100001010  0.000000101000111101011100001010 (1/100=0.01)
+ 0.000000101000111101011100001010
---------------------------------
0.00000101000111101011100001010   0.00000101000111101011100001010 (2/100=0.02)
+ 0.000000101000111101011100001010
---------------------------------
0.00000111101011100001010001111   0.00000111101011100001010001111 (3/100=0.03)
+ 0.000000101000111101011100001010
---------------------------------
0.0000101000111101011100001010    0.0000101000111101011100001010 (4/100=0.04)
+ 0.000000101000111101011100001010
---------------------------------
0.0000110011001100110011001100    0.0000110011001100110011001101 (5/100=0.05)
…(Omission)
0.111110101110000100111101        0.111110101110000101001000 (98/100=0.98)
+ 0.000000101000111101011100001010
---------------------------------
0.111111010111000010011001        0.111111010111000010100100 (99/100=0.99)
+ 0.000000101000111101011100001010
---------------------------------
0.111111111111111111110101        1.00000000000000000000000 (100/100=1)

1.0/100+1.0/100+...+1.0/100(100times) = 0.99999934


For single precision, the precision is 24 binary digits. Therefore, each time the addition is repeated, the digit difference between the total held and the number to be added will appear, and the error of the cut-off will accumulate. (You can see that the error has already appeared in the last digit at the 5th time) As a result, if you add 0.01 100 times in single precision, it will be 0.99999934, which is slightly less than 1.

## Demo source

Below is the demo source (C99). I think it is also good to paste it around ideone and execute it.

#### float.c


#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <assert.h>

void test(int n);

int main(void) {
int n=0;
int ret=scanf("%d",&n);
if ( ret!=1||n<=0 )
n=100;
test(n);
}

void binform_s1(char *dst,double d);
int binform_s2(char *dst,double d);

void test(int n) {
const float t=1.0/n;

char buf[256];
binform_s1(buf,t);
printf("in single precision calculation:\n"
" 1.0/%d = %g(10) = %s\n\n",n,t,buf);

float s=0;
int c1=binform_s2(buf,t);
char line[256];
memset(line,'-',c1);
line[c1]='\0';

for ( int i=1; i<=n; i++ ) {
s+=t;
if ( i>1 )
printf("+%s\n"
" %s\n",buf,line);
char buf2[256],buf3[256];
int c2=binform_s2(buf2,s);
(void)binform_s2(buf3,(double)i/n);
printf(" %s%*s %s (%d/%d=%g)\n",buf2,c1-c2,"",buf3,i,n,(double)i/n);
}
printf("\n"
" 1.0/%d+1.0/%d+...+1.0/%d(%dtimes) = %.8f\n",n,n,n,n,s);
}

//Below, auxiliary functions
//Integer type by type punning(bit string)Conversion to
static inline int32_t f2i(double d) {
// type punning
union {
float f;
int32_t i;
} u = { .f=d };
return u.i;
}

//Single-precision to binary representation string(mantissa*2^index)Generate a
void binform_s1(char *dst,double d) {
int32_t x=f2i(d);
sprintf(dst,"%c1.%23s(2) * 2^%d",
x>>31?'-':' ',"",((x>>23)&255)-127);
for ( int i=22; i>=0; i-- )
dst[25-i]='0'+(x>>i&1);
}

//Single-precision to binary representation string( 1.xx ... or 0.Generate xx…)
int binform_s2(char *dst,double d) {
int32_t x=f2i(d);
int r=((x>>23)&255)-127;
// support only small floats
assert(r<=0);
dst[0]=x>>31?'-':' ';
memset(dst+1,'0',1-r);
dst[2]='.';
dst[r<0?2-r:1]='1';
for ( int i=22; i>=0; i-- )
dst[25-r-i]='0'+((x>>i)&1);
dst[26-r]='\0';
return 26-r;
}


# Summary

Finally, it is a summary.

• Floating point is numerical data that assumes the occurrence of an error.
• Instead, it can handle a much wider range of numbers than fixed-point numbers.
• In most programming languages, decimals are treated as floating point
• If you don't want the error due to floating point, use the appropriate library data type and think about handling it with integer manipulation.
• Floating point is internally represented as a binary decimal
• Even if it is a decimal number that is humanly sharp (as a decimal number), it will be an infinite decimal number in decimal except for some parts, and the digits will be cut off, causing an error.