[RUBY] What is a floating point?


Recently (November 2018), Togetter summary [It was written in BASIC 30 years ago that "If you add 0.01 10,000 times, it will be 100.003", so when I tried it in the current environment, the result was the same](https //togetter.com/li/1289806) became a hot topic.

For example, in a similar story, adding 0.01 100 times in Ruby (x86_64-Linux) does not result in exactly 1, and there is a slight error (see the following example). The situation will be similar for other programming languages.

Ruby(x86_64-Linux)With 0.When 01 is added 100 times

$ irb
irb(main):001:0> 100.times.reduce(0){|s,_| s+0.01 }
=> 1.0000000000000007

In the summary of the case, there is an opinion that "in this era ...", but ** what is floating point ** For those who know it, even if it is Atari Mae's knowledge, otherwise ** It looks like mere decimal data **, and if you are not good at it, you often get the impression that "there is a gap in the calculation!".

… Originally, I think that you should acquire knowledge in introductory books (because it is a type of data that is standard in most languages), but is there a proper explanation? I also think, so I decided to write a rough article.

What is a floating point?

Fixed point and floating point

In a nutshell, floating point is not a decimal number that the average person would imagine, but ** numerical data that assumes that an error will occur in the calculation **. Conversely, ** when using floating point, be aware that it is an approximate calculation, and consider the accuracy of the result **.

With that alone, you may be wondering why you aren't doing that cool, but it's ** computing with finite resources ** the fate of a computer, and the point of where to compromise. Is the difference.

The opposite of floating point is ** fixed point **, which is generally integer data. This can always be handled in increments of 1, and as long as you handle integers, there will be no error, but ** the range of numbers that can be handled is not very wide **.

Floating point, on the other hand, can handle a very wide range of numbers ** at the cost of error. From nano-level numbers close to 0, Avogadro constant (recently revised definition) and space-scale numbers are also available.

And in most programming languages, it's common to handle decimal data in floating point.

Why decimals in floating point?

That said, some people may think this. "No, I don't need such nano or space. Even though I handle about 0.01 at most, please forgive me for errors."

As a matter of fact, if you only deal with decimal numbers that appear in currencies and tax rates, that feeling is plausible.

But then, how many decimal places should a computer cover? And here, you should have already learned in elementary school. For example, ** Pi should be infinite decimal ** (3.14 is just an approximation). Even if you don't bring out the pi, the irrational number $ \ sqrt {2} $ that Pythagoras tried to seal the existence is also an infinite decimal number, so ** how many digits should I look at **, this is It always comes with decimal calculations. In the end, since it cannot be expressed as a finite value, we have to stop it somewhere, and we have to consider the calculation accuracy and error. As with so-called scientific computing, this problem is unavoidable in statistics, graphic processing, etc.

So, this is just my guess, but it seems that integers are fixed-point and decimals are floating-point.

If you still hate errors

That said, there may be times when you don't want to think about errors. If so, there should be a language or library that can handle decimal points without error.

For example, in legacy and notorious COBOL, the format packed decimal allows for error-free calculations with decimal numbers in the specified number of digits. It was only said that it was for paperwork. In other languages, for example, you have the option of using BigDecimal for Java, decimal for C #, and BigDecimal for Ruby (or Rational for integers divided by integers).

Or, for example, if you are talking about multiplying the consumption tax rate by 8% (though floating point does not cause much error to that extent), you can simply replace it with an integer arithmetic. Like (amount) * 8/100 instead of (amount) * 0.08. Alternatively, instead of expressing $ 1.50 as $ 1.50, you could devise a unit of 150 cents so that it fits in the range of integers.

The point is, ** Don't treat it as a decimal in the program just because it's a decimal in the real world, think first **.

Mechanism of error

Binary representation

However, on the other hand, you may be wondering, "Why does an error occur with a non-trivial number of 0.01?"

It is still easy to understand that the accuracy that can be expressed by adding a very large number and a small number, such as the following, did not fit. (Ruby (x86_64-Linux) treats it as double precision floating point, so the precision is about 16 digits in decimal)

An example of an error that does not fit within the accuracy in Ruby

$ irb
irb(main):001:0> 100_000_000 + 0.000_000_01 #Fits in accuracy
=> 100000000.00000001
irb(main):002:0> 1000_000_000 + 0.000_000_001 #Do not fit
=> 1000000000.0

However, adding 0.01 100 times to 1 does not mean that there is a difference in size ...?

In this case of "adding 0.01 100 times does not become 1", the computer internally treats floating point numbers as binary numbers, and ** 0.01 becomes an infinite (recurring) decimal number in binary numbers **. It is the cause.

Note that decimal decimals and binary decimals have the following correspondence: $ \begin{eqnarray} 0.5&(10)=&0.1&(2) \\\\ 0.25&(10)=&0.01&(2) \\\\ ~ &~&\vdots&~ \\\\ 0.015625&(10)=&0.000001&(2) \\\\ 0.0078125&(10)=&0.0000001&(2) \\\\ 0.00390625&(10)=&0.00000001&(2) \\\\ 0.001953125&(10)=&0.000000001&(2) \\\\ ~ &~&\vdots&~ \\\\ \end{eqnarray} $

Then, you can see (calculate) that 0.01 is a binary recurring decimal with a 20-digit cycle as follows. $ 0.01(10)=0.000000101000111101011100001010001111010111\cdots(2) $ Therefore, on a computer, the digits are cut off in the middle and saved. This truncated digit affects as an error as the calculation is repeated.

Conversely, ** binary sharpness ** (becomes a finite decimal number), a fraction whose denominator is a power of 2 (for example, $ \ frac 1 {128} = 0.0078125 (10) = 0.000000001 (2) $ ) Does not create an error. ** It is necessary to be careful to handle it on the assumption that there is an error, but it is not always the case that an error will occur **.

Even if it is said to be censored, it does not necessarily mean that the result will be reduced (truncated) because rounding is performed. The policy depends on the processing system, such as rounding up or moving closer (processing such as rounding).

Demonstration of error and binary representation

Now. Floating point is defined by the standard IEEE754 today.

The most major is 64-bit double precision, but in addition to that, 32-bit single precision, 128-bit quadruple precision, and in machine learning that is popular these days, the precision is lower than single precision (instead, the amount of calculation depending on the hardware). 16bit half precision (which can earn) is also used.

For the internal representation, [Understanding Floating Point Numbers](/ tobira-code / items / 9a000264260a7ef8ca30) is detailed, but in any case, it is based on the following binary representation. $ (-1)^s\cdot 1.fff\cdots ff(2)\cdot 2^{e-bias} $

So, I created a C program to see what happens as a binary number based on the internal representation of floating point numbers, and to see how errors are created by hand. This time, with single precision (because it is hard to see if there are many digits). In the environment of Windows10 / WSL + Ubuntu16 + gcc (x86_64) that I tried, the float type is equivalent.

Here's how it compiles and runs. By giving 100 as an input, the binary display of $ \ frac 1 {100} = 0.01 $ is output, and then the state of long division is output. (On the right side, there is also a binary display when the equivalent multiplication is calculated.) Finally, the result of the addition is also output as a decimal number.

float.c Compile and run

$ gcc -std=c99 -o float float.c
$ ./float <<< 100
in single precision calculation:
 1.0/100 = 0.01(10) =  1.01000111101011100001010(2) * 2^-7

continuous binary column additions

  0.000000101000111101011100001010  0.000000101000111101011100001010 (1/100=0.01)
+ 0.000000101000111101011100001010
  0.00000101000111101011100001010   0.00000101000111101011100001010 (2/100=0.02)
+ 0.000000101000111101011100001010
  0.00000111101011100001010001111   0.00000111101011100001010001111 (3/100=0.03)
+ 0.000000101000111101011100001010
  0.0000101000111101011100001010    0.0000101000111101011100001010 (4/100=0.04)
+ 0.000000101000111101011100001010
  0.0000110011001100110011001100    0.0000110011001100110011001101 (5/100=0.05)
  0.111110101110000100111101        0.111110101110000101001000 (98/100=0.98)
+ 0.000000101000111101011100001010
  0.111111010111000010011001        0.111111010111000010100100 (99/100=0.99)
+ 0.000000101000111101011100001010
  0.111111111111111111110101        1.00000000000000000000000 (100/100=1)

 1.0/100+1.0/100+...+1.0/100(100times) = 0.99999934

For single precision, the precision is 24 binary digits. Therefore, each time the addition is repeated, the digit difference between the total held and the number to be added will appear, and the error of the cut-off will accumulate. (You can see that the error has already appeared in the last digit at the 5th time) As a result, if you add 0.01 100 times in single precision, it will be 0.99999934, which is slightly less than 1.

Demo source

Below is the demo source (C99). I think it is also good to paste it around ideone and execute it.


#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <assert.h>

void test(int n);

int main(void) {
    int n=0;
    int ret=scanf("%d",&n);
    if ( ret!=1||n<=0 )

void binform_s1(char *dst,double d);
int binform_s2(char *dst,double d);

void test(int n) {
    const float t=1.0/n;

    char buf[256];
    printf("in single precision calculation:\n"
           " 1.0/%d = %g(10) = %s\n\n",n,t,buf);

    float s=0;
    int c1=binform_s2(buf,t);
    char line[256];

    printf("continuous binary column additions\n\n");

    for ( int i=1; i<=n; i++ ) {
        if ( i>1 )
                   " %s\n",buf,line);
        char buf2[256],buf3[256];
        int c2=binform_s2(buf2,s);
        printf(" %s%*s %s (%d/%d=%g)\n",buf2,c1-c2,"",buf3,i,n,(double)i/n);
           " 1.0/%d+1.0/%d+...+1.0/%d(%dtimes) = %.8f\n",n,n,n,n,s);

//Below, auxiliary functions
//Integer type by type punning(bit string)Conversion to
static inline int32_t f2i(double d) {
    // type punning
    union {
        float f;
        int32_t i;
    } u = { .f=d };
    return u.i;

//Single-precision to binary representation string(mantissa*2^index)Generate a
void binform_s1(char *dst,double d) {
    int32_t x=f2i(d);
    sprintf(dst,"%c1.%23s(2) * 2^%d",
            x>>31?'-':' ',"",((x>>23)&255)-127);
    for ( int i=22; i>=0; i-- )

//Single-precision to binary representation string( 1.xx ... or 0.Generate xx…)
int binform_s2(char *dst,double d) {
    int32_t x=f2i(d);
    int r=((x>>23)&255)-127;
    // support only small floats
    dst[0]=x>>31?'-':' ';
    for ( int i=22; i>=0; i-- )
    return 26-r;


Finally, it is a summary.

Recommended Posts

What is a floating point?
What is a constructor?
What is a stream
What is a Servlet?
What is a wrapper class?
What is a boolean type?
What is a meaningful comment?
What is a jar file?
What is a Java collection?
What is a lambda expression?
What is a fa⁉ enum?
What is a snippet in programming?
What is a column Boolean type?
What is a reference type variable?
What is a lambda expression (Java)
[Swift] What is "inheriting a class"?
What is a Ruby 2D array?
What is Cubby
What is null? ]
What is a Spring Boot .original file?
What is java
What is Keycloak
What is maven?
What is Jackson?
[For programming beginners] What is a method?
What is Docker
What is self
What is Jenkins
What is ArgumentMatcher?
What is IM-Juggling?
What is a class in Java language (1 /?)
What is params
What is SLF4J?
What is a class in Java language (2 /?)
What is Facade? ??
What is Java <>?
What is Gradle?
What is POJO
What is Java
What is centOS
What is RubyGem?
[Rails] What is a dot (.) Or a colon (:)?
What is programming?
What is before_action?
What is Docker
What is a request scope? Image commentary
What is Byte?
What is Tomcat
Introduction to Recursive Functions: What is a Recursive Function?
What is Maven Assembly?
What is `docker-compose up`?
What is vue cli
What is an interface?
What is Ruby's self?
What is hard coding?
What is Ruby's attr_accessor?
What is Java Encapsulation?
What is permission denied?
What is instance control?
What is an initializer?
What is Spring Tools 4