Update 1 (2020-01-25): typo fix, IEEE745 -> IEEE 754

In [1]: from datetime import datetime  
In [2]: (datetime(2020, 1, 11) - datetime(2018, 12, 13)).days                           
Out[2]: 394

This post explains how to handle nan in Python. Below, nan as a concept is written as NaN.

Disclaimer: This post was written for the justInCase Advent Calendar 2018 and was posted about 400 days late. Despite the long maturation period, the content is not especially rich.

Summary

- NaN in Python follows IEEE 754 NaN, but there are some pitfalls.
- Note the existence of Decimal('nan'), pd.NaT, and numpy.datetime64('NaT'), which are not float nan.
- The nan objects available from the numpy and pandas modules and math.nan are all the same kind of float NaN. You can use any of them (but it is better to pick one for readability).
- Note that pandas' isna(...) returns True, treating the value as missing, not only for nan but also for None, pd.NaT, and so on.
- pandas 1.0.0 introduces pd.NA as a missing value. It is desirable to use pd.NA instead of nan as the missing value in the future. I will write about this in another article.

https://mobile.twitter.com/jorisvdbossche/status/1208476049690046465
https://dev.pandas.io/docs/whatsnew/v1.0.0.html#experimental-na-scalar-to-denote-missing-values

[Verification environment is described at the end](#Verification environment)

Handling of NaN in IEEE 754

Please refer to the previous article, "What is NaN? NaN Zoya (R)".

Quiet NaN propagates through ordinary numerical operations, but what do you think the following two expressions should return? In fact, the handling of NaN in min and max changed between IEEE 754-2008 and IEEE 754-2019. I will explain that in another article.

min(1.0, float('nan'))
max(1.0, float('nan'))
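
As a side note on Python itself rather than on IEEE 754: the builtin min and max are implemented with ordinary comparisons, and every ordering comparison with NaN is False, so the result depends on the argument order. A minimal sketch (the variable names are only illustrative):

```python
import math

nan = float('nan')

# The builtins keep the first argument unless a later one compares
# strictly smaller (min) or larger (max). Every comparison with NaN
# is False, so the first argument always wins.
min_nan_last = min(1.0, nan)
max_nan_last = max(1.0, nan)
min_nan_first = min(nan, 1.0)
max_nan_first = max(nan, 1.0)

print(min_nan_last, max_nan_last, min_nan_first, max_nan_first)
# 1.0 1.0 nan nan
```

So NaN neither propagates reliably nor is ignored reliably here. This is a property of the Python builtins, not of either edition of the standard.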

How to call NaN in Python

Python has no NaN literal. float('nan'), which needs no import, and np.nan, when numpy is already in use, tend to be used the most.

import math
import decimal
import numpy as np
import pandas as pd

float('nan')
math.nan
0.0 * math.inf
math.inf / math.inf
# 0.0 / 0.0 raises ZeroDivisionError in Python. Many languages, such as C, R, and Julia, return NaN.
np.nan
np.core.numeric.NaN
pd.np.nan

All of these are float objects. nan is not a singleton, but the objects referenced through numpy and pandas are one and the same object.

nans = [float('nan'), math.nan, 0 * math.inf, math.inf / math.inf, np.nan, np.core.numeric.NaN, pd.np.nan]

import pprint
pprint.pprint([(type(n), id(n)) for n in nans])
# [(<class 'float'>, 4544450768),
#  (<class 'float'>, 4321186672),
#  (<class 'float'>, 4544450704),
#  (<class 'float'>, 4544450832),
#  (<class 'float'>, 4320345936),
#  (<class 'float'>, 4320345936),
#  (<class 'float'>, 4320345936)]
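
The expressions listed above can be checked with nothing but the standard library. The sketch below also adds math.inf - math.inf, which is not in the list above but yields NaN in the same way, and confirms that 0.0 / 0.0 raises instead of returning NaN:

```python
import math

# Each expression below yields a float NaN.
nan_values = [
    float('nan'),
    math.nan,
    0.0 * math.inf,
    math.inf / math.inf,
    math.inf - math.inf,  # extra example, not in the list above
]
all_are_nan = all(isinstance(v, float) and math.isnan(v) for v in nan_values)

# Unlike C, R, or Julia, Python raises for 0.0 / 0.0.
try:
    0.0 / 0.0
    zero_division_raised = False
except ZeroDivisionError:
    zero_division_raised = True

# NaN never equals anything, not even the very same object.
same_object_equal = math.nan == math.nan

print(all_are_nan, zero_division_raised, same_object_equal)
# True True False
```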

float('nan') is an immutable float object, so it is hashable and can be used as a dictionary key. Strangely, though, a dictionary accepts multiple nan keys, and unless you bind the key to a variable in advance you cannot retrieve the value again. This is presumably because every equality comparison with NaN is False, that is, float('nan') == float('nan') -> False, so a freshly created nan never matches a stored key.

>>> d = {float('nan'): 1, float('nan'): 2, float('nan'): 3}
>>> d
{nan: 1, nan: 2, nan: 3}
>>> d[float('nan')]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: nan
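
The lookup does succeed when the very same object is reused, because a dict lookup tests identity (is) before equality (==). A small sketch of this, where nan_key is a name introduced only for illustration:

```python
nan_key = float('nan')               # keep a reference to the key object
d = {nan_key: 1, float('nan'): 2}    # the second key is a different NaN object

value_by_same_object = d[nan_key]    # found through the identity check
fresh_nan_found = float('nan') in d  # a brand-new NaN object never matches
number_of_keys = len(d)              # both NaN keys are kept

print(value_by_same_object, fresh_nan_found, number_of_keys)
# 1 False 2
```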

Note that there are also objects with NaN-like properties that are not of the float class. In particular, pd.NaT and np.datetime64("NaT") belong to different classes from each other.

decimal.Decimal('nan')
pd.NaT
np.datetime64("NaT")

# >>> type(decimal.Decimal('nan'))
# <class 'decimal.Decimal'>

# >>> type(pd.NaT)
# <class 'pandas._libs.tslibs.nattype.NaTType'>

# >>> type(np.datetime64("NaT"))
# <class 'numpy.datetime64'>

Therefore, care is needed when using np.isnat: it raises an error for pd.NaT.

>>> np.isnat(pd.NaT)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: ufunc 'isnat' is only defined for datetime and timedelta.

>>> np.isnat(np.datetime64("NaT"))
True

NaN check

math.isnan
np.isnan
pd.isna

The implementations behind math.isnan and the NaN check of Decimal are shown below.
https://github.com/python/cpython/blob/e42b705188271da108de42b55d9344642170aa2b/Include/pymath.h#L88-L103
https://github.com/python/cpython/blob/34fd4c20198dea6ab2fe8dc6d32d744d9bde868d/Lib/_pydecimal.py#L713-L726

/* Py_IS_NAN(X)
 * Return 1 if float or double arg is a NaN, else 0.
 * Caution:
 *     X is evaluated more than once.
 *     This may not work on all platforms.  Each platform has *some*
 *     way to spell this, though -- override in pyconfig.h if you have
 *     a platform where it doesn't work.
 * Note: PC/pyconfig.h defines Py_IS_NAN as _isnan
 */
#ifndef Py_IS_NAN
#if defined HAVE_DECL_ISNAN && HAVE_DECL_ISNAN == 1
#define Py_IS_NAN(X) isnan(X)
#else
#define Py_IS_NAN(X) ((X) != (X))
#endif
#endif
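
The fallback branch of the macro, (X) != (X), relies on NaN being the only float value that is not equal to itself. The same idea in Python, as a sketch (is_nan_by_self_comparison is a hypothetical helper, not part of any library):

```python
import math

def is_nan_by_self_comparison(x: float) -> bool:
    """Mirror the C fallback ((X) != (X)): only NaN differs from itself."""
    return x != x

samples = [math.nan, float('nan'), 0.0, -1.5, math.inf, -math.inf]

# The self-comparison trick agrees with math.isnan for every float sample.
agrees_with_math_isnan = all(
    is_nan_by_self_comparison(x) == math.isnan(x) for x in samples
)
print(agrees_with_math_isnan)
# True
```

In ordinary code math.isnan is the clearer choice. The sketch is only meant to show why the macro works.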

def _isnan(self):
    """Returns whether the number is not actually one.
    0 if a number
    1 if NaN
    2 if sNaN
    """
    if self._is_special:
        exp = self._exp
        if exp == 'n':
            return 1
        elif exp == 'N':
            return 2
    return 0
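
The private _isnan method distinguishes quiet NaN from signaling NaN. The public Decimal API exposes the same distinction through is_nan, is_qnan, and is_snan. A sketch of how the two kinds behave, including what happens when they are passed to math.isnan:

```python
import math
from decimal import Decimal

quiet = Decimal('NaN')
signaling = Decimal('sNaN')

# is_nan() is True for both kinds; is_qnan() and is_snan() tell them apart.
quiet_flags = (quiet.is_nan(), quiet.is_qnan(), quiet.is_snan())
signaling_flags = (signaling.is_nan(), signaling.is_qnan(), signaling.is_snan())

# math.isnan converts its argument to float first. A quiet Decimal NaN
# converts fine, but a signaling NaN cannot be converted to float.
quiet_via_math = math.isnan(quiet)
try:
    math.isnan(signaling)
    signaling_conversion_failed = False
except ValueError:
    signaling_conversion_failed = True

print(quiet_flags, signaling_flags, quiet_via_math, signaling_conversion_failed)
# (True, True, False) (True, False, True) True True
```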

Note that pandas' isna function (and its alias isnull) returns True, treating the value as missing, for None and pd.NaT as well as for float nan. As a side note, if pandas.options.mode.use_inf_as_na = True is set, np.inf is also treated as a missing value.

>>> pd.isna(math.nan)
True
>>> pd.isna(None)
True

>>> pd.isna(math.inf)
False
>>> pd.options.mode.use_inf_as_na = True
>>> pd.isna(math.inf)
True

About the pandas functions and methods

pd.isna, the top-level pandas function, takes a scalar or an array-like as its argument; for an array-like it returns an array of bools of the same size. The DataFrame method pd.DataFrame.isna, on the other hand, takes a DataFrame and returns a DataFrame.

pd.isna # for scalar or array-like
pd.DataFrame.isna # for DataFrame

Array-like here specifically means the following types. (https://github.com/pandas-dev/pandas/blob/v0.25.3/pandas/core/dtypes/missing.py#L136-L147)

ABCSeries,
np.ndarray,
ABCIndexClass,
ABCExtensionArray,
ABCDatetimeArray,
ABCTimedeltaArray,

Note that pd.isna and pd.isnull are exactly the same function, so either may be used (though it is better to pick one for readability).

# https://github.com/pandas-dev/pandas/blob/v0.25.3/pandas/core/dtypes/missing.py#L125
>>> id(pd.isnull)
4770964688
>>> id(pd.isna)
4770964688

Summary of the is* functions

If you want to avoid unexpected errors, pd.isna is the safe choice, but be careful: it misses Decimal('nan').

	math.nan	decimal.Decimal('nan')	np.datetime64("NaT")	pd.NaT	math.inf	None
math.isnan	True	True	error	error	False	error
decimal.Decimal.is_nan	error	True	error	error	error	error
np.isnan	True	error	True	error	False	error
pd.isna	True	False	True	True	False	True
np.isnat	error	error	True	error	error	error
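
When only the standard library is available, one way to avoid the errors in the table is to dispatch on type so that each value reaches a checker that accepts it. A minimal sketch, assuming only float, Decimal, and "anything else" need to be handled (is_nan_like is a hypothetical helper name, not a pandas or numpy function):

```python
import math
from decimal import Decimal

def is_nan_like(value) -> bool:
    """Return True for float NaN and Decimal NaN, False for anything else."""
    if isinstance(value, Decimal):
        return value.is_nan()     # covers both quiet and signaling NaN
    if isinstance(value, float):
        return math.isnan(value)
    return False                  # None, ints, strings, ... are not NaN here

results = [is_nan_like(v) for v in (math.nan, Decimal('nan'), math.inf, None, 1)]
print(results)
# [True, True, False, False, False]
```

Unlike pd.isna, this sketch does not treat None as missing. Whether that is desirable depends on the use case.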

Other

Check the binary representation. You can see that it is a quiet NaN.

>>> import struct
>>> xs = struct.pack('>d', math.nan)
>>> xs
b'\x7f\xf8\x00\x00\x00\x00\x00\x00'
>>> xn = struct.unpack('>Q', xs)[0]
>>> xn
9221120237041090560
>>> bin(xn)
'0b111111111111000000000000000000000000000000000000000000000000000'
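
To make the "quiet" part visible, the 64 bits can be split into the sign, exponent, and fraction fields of a double. The sketch below assumes the common convention, used on x86 and ARM, that a set most-significant fraction bit marks a quiet NaN:

```python
import math
import struct

# Reinterpret the 8 bytes of the double as an unsigned 64-bit integer.
bits = struct.unpack('>Q', struct.pack('>d', math.nan))[0]

sign = bits >> 63                  # 1 bit
exponent = (bits >> 52) & 0x7FF    # 11 bits; all ones for NaN and inf
fraction = bits & ((1 << 52) - 1)  # 52 bits; non-zero means NaN, zero means inf
quiet_bit = fraction >> 51         # top fraction bit; 1 means quiet NaN

print(hex(bits), sign, hex(exponent), hex(fraction), quiet_bit)
```

With the bit pattern shown above (0x7ff8000000000000), the exponent is 0x7ff, the fraction is non-zero, and the quiet bit is 1.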

Summary (repost)

- NaN in Python follows IEEE 754 NaN, but there are some pitfalls.
- Note the existence of Decimal('nan'), pd.NaT, and numpy.datetime64('NaT'), which are not float nan.
- The nan objects available from the numpy and pandas modules and math.nan are all the same kind of float NaN. You can use any of them (but it is better to pick one for readability).
- Note that pandas' isna(...) returns True, treating the value as missing, not only for nan but also for None, pd.NaT, and so on.
- pandas 1.0.0 introduces pd.NA as a missing value. It is desirable to use pd.NA instead of nan as the missing value in the future. I will write about this in another article.

https://mobile.twitter.com/jorisvdbossche/status/1208476049690046465
https://dev.pandas.io/docs/whatsnew/v1.0.0.html#experimental-na-scalar-to-denote-missing-values

Finally, if you love this kind of geeky story, please come visit us at justInCase. https://www.wantedly.com/companies/justincase

That's all.

Verification environment

$ uname -a
Darwin MacBook-Pro-3.local 18.7.0 Darwin Kernel Version 18.7.0: Sat Oct 12 00:02:19 PDT 2019; root:xnu-4903.278.12~1/RELEASE_X86_64 x86_64

$ python
Python 3.7.4 (default, Nov 17 2019, 08:06:12) 
[Clang 10.0.1 (clang-1001.0.46.4)] on darwin

$ pip list | grep -e numpy -e pandas
numpy                    1.18.0     
pandas                   0.25.3

What is NaN? NaN Zoya (Python) (394 days late)