update1 2020-01-25: typo fix IEEE745 -> IEEE 754
In [1]: from datetime import datetime
In [2]: (datetime(2020, 1, 11) - datetime(2018, 12, 13)).days
Out[2]: 394
This post explains how to handle nan in Python. Below, NaN (capitalized this way) denotes the concept, as opposed to the Python value nan.
Disclaimer: This post was written for the justInCase Advent Calendar 2018 and is being published about 400 days late. Despite the long maturation period, the content is not especially rich.
- NaN in Python follows IEEE 754 NaN, but there are some pitfalls.
- Note the existence of Decimal('nan'), pd.NaT, and numpy.datetime64('NaT'), which are not float nan.
- The nan reachable from the numpy and pandas modules is an ordinary float NaN, just like math.nan, so you can use any of them. (It is better to stick to one for readability.)
- Note that pandas' isna(...) returns True, as a missing value, not only for nan but also for None, pd.NaT, etc.
- pandas 1.0.0 introduces pd.NA as its missing value. Going forward it will be preferable to use pd.NA instead of nan as the missing value. I will write about this in another article.
[Verification environment is described at the end](#Verification environment)
For what NaN is in the first place, please refer to the previous article, What is NaN? NaN Zoya (R).
Note that a quiet NaN generally propagates through numerical operations, but what do you think the following two expressions should return? In fact, the handling of NaN in min and max changed between IEEE 754-2008 and IEEE 754-2019. That is explained in another article.
min(1.0, float('nan'))
max(1.0, float('nan'))
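As a point of reference for plain Python (not for what IEEE 754 prescribes), the built-in min and max in CPython pick their result using only < and > comparisons, so a NaN argument makes the result depend on argument order. A minimal sketch; the variable names are only for illustration:

```python
import math

nan = float('nan')

# Every ordering comparison with NaN is False, so with two arguments the
# built-ins keep whichever argument they saw first.
first_is_number = (min(1.0, nan), max(1.0, nan))
first_is_nan = (min(nan, 1.0), max(nan, 1.0))

print(first_is_number)  # (1.0, 1.0)
print(first_is_nan)     # (nan, nan)
```

In other words, the built-ins neither propagate NaN consistently nor ignore it consistently.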
Python has no literal for NaN. In practice, float('nan'), which needs no module import, and np.nan, when numpy is already imported, tend to be used the most.
import math
import decimal
import numpy as np
import pandas as pd
float('nan')
math.nan
0.0 * math.inf
math.inf / math.inf
# 0.0 / 0.0 raises ZeroDivisionError in Python. Many languages, such as C, R, and Julia, return NaN
np.nan
np.core.numeric.NaN
pd.np.nan
All of these are float objects. NaN is not a singleton, but the objects referenced through numpy and pandas are one and the same object.
nans = [float('nan'), math.nan, 0 * math.inf, math.inf / math.inf, np.nan, np.core.numeric.NaN, pd.np.nan]
import pprint
pprint.pprint([(type(n), id(n)) for n in nans])
# [(<class 'float'>, 4544450768),
# (<class 'float'>, 4321186672),
# (<class 'float'>, 4544450704),
# (<class 'float'>, 4544450832),
# (<class 'float'>, 4320345936),
# (<class 'float'>, 4320345936),
# (<class 'float'>, 4320345936)]
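The id values above can be cross-checked with identity and equality tests. A minimal sketch using only the standard library, assuming CPython, where each float('nan') call builds a new object:

```python
import math

a = float('nan')
b = float('nan')

print(a is b)                # False: two separate float objects
print(math.nan is math.nan)  # True: the same module attribute both times
print(math.nan == math.nan)  # False: NaN never compares equal, even to itself
print(a in [a])              # True: containment checks identity before equality
```

The last line shows why identity matters in practice: containers treat the very same NaN object as a match even though == says otherwise.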
float('nan') is an immutable object of the float class, so it is hashable and can be used as a dictionary key. Strangely, though, a dictionary accepts several NaN keys, and unless you bind the key to a variable beforehand you cannot retrieve the value again. This is presumably because NaN never compares equal to anything, including itself: float('nan') == float('nan') -> False.
>>> d = {float('nan'): 1, float('nan'): 2, float('nan'): 3}
>>> d
{nan: 1, nan: 2, nan: 3}
>>> d[float('nan')]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: nan
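Conversely, binding the NaN to a name first makes the lookup work, because a dict accepts a key that is the very same object before it falls back to ==. A short sketch (the variable names are illustrative):

```python
key = float('nan')
d = {key: 1}

print(d[key])             # 1: same object, so the lookup succeeds
print(float('nan') in d)  # False: a fresh NaN is a different object and never equal
d[key] = 2                # overwrites the existing entry instead of adding one
print(len(d))             # 1
```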
Note that there are objects with NaN-like properties that are not of the float class. In particular, pd.NaT and np.datetime64("NaT") belong to different classes from each other.
decimal.Decimal('nan')
pd.NaT
np.datetime64("NaT")
# >>> type(decimal.Decimal('nan'))
# <class 'decimal.Decimal'>
# >>> type(pd.NaT)
# <class 'pandas._libs.tslibs.nattype.NaTType'>
# >>> type(np.datetime64("NaT"))
# <class 'numpy.datetime64'>
Therefore, care is needed when using np.isnat: it accepts np.datetime64("NaT") but raises TypeError for pd.NaT.
>>> np.isnat(pd.NaT)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: ufunc 'isnat' is only defined for datetime and timedelta.
>>> np.isnat(np.datetime64("NaT"))
True
math.isnan
np.isnan
pd.isna
The implementation behind math.isnan is around here. The second link is the corresponding NaN check inside Decimal.
https://github.com/python/cpython/blob/e42b705188271da108de42b55d9344642170aa2b/Include/pymath.h#L88-L103
https://github.com/python/cpython/blob/34fd4c20198dea6ab2fe8dc6d32d744d9bde868d/Lib/_pydecimal.py#L713-L726
/* Py_IS_NAN(X)
* Return 1 if float or double arg is a NaN, else 0.
* Caution:
* X is evaluated more than once.
* This may not work on all platforms. Each platform has *some*
* way to spell this, though -- override in pyconfig.h if you have
* a platform where it doesn't work.
* Note: PC/pyconfig.h defines Py_IS_NAN as _isnan
*/
#ifndef Py_IS_NAN
#if defined HAVE_DECL_ISNAN && HAVE_DECL_ISNAN == 1
#define Py_IS_NAN(X) isnan(X)
#else
#define Py_IS_NAN(X) ((X) != (X))
#endif
#endif
def _isnan(self):
    """Returns whether the number is not actually one.

    0 if a number
    1 if NaN
    2 if sNaN
    """
    if self._is_special:
        exp = self._exp
        if exp == 'n':
            return 1
        elif exp == 'N':
            return 2
    return 0
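The fallback definition ((X) != (X)) in the C macro relies on NaN being the only value that is not equal to itself. The same idea can be sketched in Python. Here is_nan_by_self_comparison is a hypothetical helper written for illustration, not part of any library. On the Decimal side, the public methods is_qnan() and is_snan() expose the quiet/signaling distinction that the private _isnan encodes as 1 and 2.

```python
import math
from decimal import Decimal

def is_nan_by_self_comparison(x):
    """Hypothetical helper mirroring the C fallback ((X) != (X))."""
    return x != x

print(is_nan_by_self_comparison(math.nan))        # True
print(is_nan_by_self_comparison(1.0))             # False
print(is_nan_by_self_comparison(Decimal('nan')))  # True: a quiet Decimal NaN is also unequal to itself

# Decimal distinguishes quiet and signaling NaN through public methods.
print(Decimal('nan').is_qnan(), Decimal('nan').is_snan())    # True False
print(Decimal('sNaN').is_qnan(), Decimal('sNaN').is_snan())  # False True
```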
Note that pandas' isna function (and its alias isnull) returns True, as missing values, for None and pd.NaT as well as for float nan.
As a tip, if pandas.options.mode.use_inf_as_na = True is set, np.inf is also judged to be a missing value.
>>> pd.isna(math.nan)
True
>>> pd.isna(None)
True
>>> pd.isna(math.inf)
False
>>> pd.options.mode.use_inf_as_na = True
>>> pd.isna(math.inf)
True
The module-level function pd.isna takes a scalar or an array-like as its argument and returns a bool, or an array of bools with the same shape as the argument. The method pd.DataFrame.isna, on the other hand, operates on a DataFrame and returns a DataFrame.
pd.isna # for scalar or array-like
pd.DataFrame.isna # for DataFrame
Here, array-like specifically refers to the following objects. (https://github.com/pandas-dev/pandas/blob/v0.25.3/pandas/core/dtypes/missing.py#L136-L147)
ABCSeries,
np.ndarray,
ABCIndexClass,
ABCExtensionArray,
ABCDatetimeArray,
ABCTimedeltaArray,
Note that pd.isna and pd.isnull are exactly the same object, so either may be used (though it is better to stick to one for readability).
# https://github.com/pandas-dev/pandas/blob/v0.25.3/pandas/core/dtypes/missing.py#L125
>>> id(pd.isnull)
4770964688
>>> id(pd.isna)
4770964688
If you want to avoid unexpected errors, pd.isna is the safe choice, but be careful: it fails to detect Decimal('nan').
| | math.nan | decimal.Decimal('nan') | np.datetime64("NaT") | pd.NaT | math.inf | None |
|---|---|---|---|---|---|---|
| math.isnan | True | True | error | error | False | error |
| decimal.Decimal.is_nan | error | True | error | error | error | error |
| np.isnan | True | error | True | error | False | error |
| pd.isna | True | False | True | True | False | True |
| np.isnat | error | error | True | error | error | error |
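Because pd.isna misses Decimal('nan'), one option is to check the type explicitly before testing. The following is a standard-library-only sketch. is_missing is a hypothetical helper name, and it deliberately covers only None, float, and Decimal; datetime-like NaT values would still need pd.isna or np.isnat.

```python
import math
from decimal import Decimal

def is_missing(value):
    """Hypothetical helper: True for None, float NaN, and Decimal NaN."""
    if value is None:
        return True
    if isinstance(value, Decimal):
        # Decimal.is_nan() is True for both quiet and signaling NaN.
        return value.is_nan()
    if isinstance(value, float):
        return math.isnan(value)
    # Anything else (str, int, ...) is treated as present in this sketch.
    return False

print(is_missing(Decimal('nan')))  # True
print(is_missing(math.inf))        # False
```

Dispatching on the type avoids the errors listed in the table, at the cost of having to enumerate every NaN-like type you care about.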
Let's check the binary representation. You can see that it is a quiet NaN.
>>> import struct
>>> xs = struct.pack('>d', math.nan)
>>> xs
b'\x7f\xf8\x00\x00\x00\x00\x00\x00'
>>> xn = struct.unpack('>Q', xs)[0]
>>> xn
9221120237041090560
>>> bin(xn)
'0b111111111111000000000000000000000000000000000000000000000000000'
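To read that bit string more easily, the 64 bits can be split into the IEEE 754 binary64 fields: 1 sign bit, 11 exponent bits, and 52 fraction bits. A NaN has all exponent bits set and a non-zero fraction, and on common platforms the top fraction bit is the quiet flag. A minimal sketch; decode_double is a hypothetical helper name:

```python
import math
import struct

def decode_double(x):
    """Split a float into its (sign, exponent, fraction) bit fields."""
    bits = struct.unpack('>Q', struct.pack('>d', x))[0]
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    return sign, exponent, fraction

sign, exponent, fraction = decode_double(math.nan)
print(exponent == 0x7FF)     # True: all 11 exponent bits are set
print(fraction != 0)         # True: a zero fraction would mean infinity instead
print(bool(fraction >> 51))  # quiet flag; set for the bit pattern shown above
```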
- NaN in Python follows IEEE 754 NaN, but there are some pitfalls.
- Note the existence of Decimal('nan'), pd.NaT, and numpy.datetime64('NaT'), which are not float nan.
- The nan reachable from the numpy and pandas modules is an ordinary float NaN, just like math.nan, so you can use any of them. (It is better to stick to one for readability.)
- Note that pandas' isna(...) returns True, as a missing value, not only for nan but also for None, NaT, etc.
- pandas 1.0.0 introduces pd.NA as its missing value. Going forward, pd.NA rather than nan is expected to be used as the missing value. I will write about this in another article.
Finally, if you enjoy this kind of geeky topic, please come visit us at justInCase. https://www.wantedly.com/companies/justincase
That's all.
Verification environment
$ uname -a
Darwin MacBook-Pro-3.local 18.7.0 Darwin Kernel Version 18.7.0: Sat Oct 12 00:02:19 PDT 2019; root:xnu-4903.278.12~1/RELEASE_X86_64 x86_64
$ python
Python 3.7.4 (default, Nov 17 2019, 08:06:12)
[Clang 10.0.1 (clang-1001.0.46.4)] on darwin
$ pip list | grep -e numpy -e pandas
numpy 1.18.0
pandas 0.25.3