Update 1 (2020-01-25): Added a note that the behavior that looked like a bug was indeed a bug.
As of 2020-01-13, pandas 1.0.0rc0 has been released. One of its major features is the introduction of `pd.NA` as a missing value. This post summarizes its properties and how to use it.
Disclaimer: Everything here was verified with pandas 1.0.0rc0, and the behavior may well change in the future.
The verification environment is described at the end of this post.
- `pd.NA` is introduced to represent a missing value.
- `pd.NA` can be used with IntegerArray, BooleanArray, and StringArray.
- With `pd.NA`, missing values can also be represented in integer columns (no more silent conversion to float).
- `pd.NA` is a singleton object and is used consistently across all data types.
- Every comparison operator applied to `pd.NA` returns `pd.NA` (the same behavior as Julia's `missing` object and R's `NA`).
- Logical operators follow so-called three-valued logic.
- In `pd.read_csv()`, NA is recognized when `Int64`, `string`, or `boolean` is specified as the dtype. (`boolean` does not work in rc0; this is being handled as an issue.) ~~`Int64` and `string` can be specified, but `boolean` results in an error. It is unclear whether this behavior is a bug or by design (probably by design).~~
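As a minimal sketch of the comparison behavior listed above (assuming pandas 1.0 or later; the variable names are only illustrative), comparisons with `pd.NA` return `pd.NA` rather than a boolean, so a missing value is better detected with `pd.isna` than with `==`:

```python
import pandas as pd

# Comparing with pd.NA yields pd.NA, not True or False.
eq_scalar = pd.NA == 1
eq_self = pd.NA == pd.NA

# Element-wise comparison on a nullable integer Series keeps the missing slot.
s = pd.Series([1, 2, pd.NA], dtype="Int64")
eq_series = s == 1

# isna() is the reliable way to test for a missing value.
na_mask = s.isna().tolist()

print(eq_scalar, eq_self)
print(eq_series)
print(na_mask)
```

Because `pd.NA == pd.NA` is itself `pd.NA`, an `if value == pd.NA:` check cannot work; `pd.isna(value)` is the check to use.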
data type
pandas introduces a new class called `NAType`. Its purpose is to mark a value as missing.
>>> import pandas as pd
>>> pd.NA
<NA>
>>> type(pd.NA)
<class 'pandas._libs.missing.NAType'>
In `pd.Series` and `pd.DataFrame`, `pd.NA` is stored as object dtype if you do not specify a dtype, and as the given dtype if you do. `Int64Dtype` is the nullable integer dtype (an ExtensionDtype for int64 integer data, backed by an array of integer values with optional missing values) introduced in pandas 0.24. Note that the dtype must be written with an uppercase `I`, as `Int64` rather than `int64`. Technically, it was the introduction of pandas Extension Arrays that made ExtensionDtype usable.
>>> pd.Series([pd.NA]).dtype
dtype('O') # O means Object
# You can specify dtype either as a string alias or as the dtype object itself. Here it is given as a string.
>>> pd.Series([pd.NA], dtype="Int64").dtype
Int64Dtype()
>>> pd.Series([pd.NA], dtype="boolean").dtype
BooleanDtype
>>> pd.Series([pd.NA], dtype="string").dtype
StringDtype
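The same dtypes can also be passed as dtype objects instead of string aliases. The following is a small sketch, assuming pandas 1.0 or later, in which the variable names are illustrative only:

```python
import pandas as pd

# The same dtypes given as dtype objects instead of string aliases.
s_int = pd.Series([pd.NA], dtype=pd.Int64Dtype())
s_bool = pd.Series([pd.NA], dtype=pd.BooleanDtype())
s_str = pd.Series([pd.NA], dtype=pd.StringDtype())

# A dtype object compares equal to its string alias.
same_int = s_int.dtype == "Int64"
same_bool = s_bool.dtype == "boolean"

print(s_int.dtype, s_bool.dtype, s_str.dtype)
print(same_int, same_bool)
```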
The implementation of `NAType` is here:
https://github.com/pandas-dev/pandas/blob/493363ef60dd9045888336b5c801b2a3d00e976d/pandas/_libs/missing.pyx#L335-L485
Interestingly, the hash value is defined as `2 ** 61 - 1 == 2305843009213693951`. This causes no problem, because it does not collide with that integer as a dictionary key. As an aside unrelated to `pd.NA`, the hash of a Python integer actually wraps around at `2 ** 61 - 1`, so `hash(2 ** 61 - 1)` is 0.
>>> hash(pd.NA) == 2 ** 61 -1
True
>>> {pd.NA: "a", 2305843009213693951: "b"}
{<NA>: 'a', 2305843009213693951: 'b'}
>>> (hash(2**61 - 2), hash(2**61 - 1), hash(2**61))
(2305843009213693950, 0, 1)
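The wrap-around can be checked without pandas. This is a small sketch using the standard library's `sys.hash_info`, assuming CPython, where the modulus is `2 ** 61 - 1` on 64-bit builds:

```python
import sys

# Modulus used by CPython for numeric hashes (2**61 - 1 on 64-bit builds).
modulus = sys.hash_info.modulus

below = hash(modulus - 1)  # still below the modulus, so it hashes to itself
at = hash(modulus)         # wraps around to 0
above = hash(modulus + 1)  # wraps around to 1

print(modulus, below, at, above)
```

Since integer hashes are reduced modulo this value, no integer hashes to the modulus itself, which is presumably why it was chosen for `pd.NA`.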
The dtype must be specified with an uppercase `I`, as `Int64` rather than `int64`. Without it, the result of an operation is object dtype.
>>> pd.Series([1, 2]) + pd.Series([pd.NA, pd.NA])
0 <NA>
1 <NA>
dtype: object
>>> pd.Series([1, 2]) + pd.Series([pd.NA, pd.NA], dtype="Int64")
0 <NA>
1 <NA>
dtype: Int64
Specifying `int64` results in an error.
>>> pd.Series([pd.NA], dtype="int64").dtype
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.7/site-packages/pandas/core/series.py", line 304, in __init__
data = sanitize_array(data, index, dtype, copy, raise_cast_failure=True)
File "/usr/local/lib/python3.7/site-packages/pandas/core/construction.py", line 438, in sanitize_array
subarr = _try_cast(data, dtype, copy, raise_cast_failure)
File "/usr/local/lib/python3.7/site-packages/pandas/core/construction.py", line 535, in _try_cast
subarr = maybe_cast_to_integer_array(arr, dtype)
File "/usr/local/lib/python3.7/site-packages/pandas/core/dtypes/cast.py", line 1502, in maybe_cast_to_integer_array
casted = np.array(arr, dtype=dtype, copy=copy)
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NAType'
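One way to cope with this error is to fall back to the nullable dtype when the NumPy dtype cannot hold the data. This is only a sketch: `make_int_series` is a hypothetical helper, not a pandas function, and catching both `TypeError` and `ValueError` is an assumption meant to cover differences between pandas versions.

```python
import pandas as pd


def make_int_series(values):
    """Build an integer Series, using nullable Int64 only when needed.

    Hypothetical helper for illustration; not part of pandas.
    """
    try:
        # Plain NumPy-backed int64 cannot store pd.NA.
        return pd.Series(values, dtype="int64")
    except (TypeError, ValueError):
        # Nullable extension dtype that can store pd.NA.
        return pd.Series(values, dtype="Int64")


plain = make_int_series([1, 2])
nullable = make_int_series([1, pd.NA])

print(plain.dtype, nullable.dtype)
```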
Logical operations with boolean values behave the same way as Julia's `missing` and R's `NA`.
>>> pd.Series([True, False, pd.NA]) & True
0 True
1 False
2 NA
dtype: bool
>>> pd.Series([True, False, pd.NA]) | True
0 True
1 True
2 True
dtype: bool
>>> pd.NA & True
NA
>>> pd.NA & False
False
>>> pd.NA | True
True
>>> pd.NA | False
NA
>>> pd.Series([1, 2, pd.NA], dtype="Int64")
0 1
1 2
2 <NA>
dtype: Int64
>>> pd.Series([True, False, pd.NA], dtype="boolean")
0 True
1 False
2 <NA>
dtype: boolean
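The three-valued (Kleene) logic can be summarized with the nullable `boolean` dtype. This is a minimal sketch, assuming pandas 1.0 or later; the result is `pd.NA` only when the missing operand could actually change the outcome:

```python
import pandas as pd

s = pd.Series([True, False, pd.NA], dtype="boolean")

and_true = s & True    # NA stays unknown: it depends on the missing value
and_false = s & False  # always False, whatever the missing value would be
or_true = s | True     # always True, whatever the missing value would be
or_false = s | False   # NA stays unknown: it depends on the missing value

for name, result in [("& True", and_true), ("& False", and_false),
                     ("| True", or_true), ("| False", or_false)]:
    print(name, result.tolist())
```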
`pd.NA` propagates through the built-in `sum`, but `pd.Series.sum()` called with no arguments treats it as 0 and does not propagate it. To make it propagate, you need to specify `sum(skipna=False)`. However, with dtype `'Int64'`, `np.nan` was returned instead of `pd.NA`.
~~I searched the issues to see whether this was expected behavior or a bug, but could not tell, so I went ahead and filed an issue ticket.~~ The issue ticket I filed was accepted, and the fix seems to be included in rc1.
>>> sum([1, pd.NA])
<NA>
# pd.Series object
>>> pd.Series([1, pd.NA])
0 1
1 <NA>
dtype: object
>>> pd.Series([1, pd.NA]).sum()
1
>>> pd.Series([1, pd.NA]).sum(skipna=False)
<NA>
# pd.Series Int64
>>> pd.Series([1, pd.NA], dtype='Int64')
0 1
1 <NA>
dtype: Int64
>>> pd.Series([1, pd.NA], dtype='Int64').sum()
1
>>> pd.Series([1, pd.NA], dtype='Int64').sum(skipna=False)
nan
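For reference, the three ways of summing can be put side by side. This is a sketch assuming a pandas release that includes the fix mentioned above; the check uses `pd.isna` so that it does not depend on whether the missing result is `pd.NA` or `nan`:

```python
import pandas as pd

s = pd.Series([1, 2, pd.NA], dtype="Int64")

total_skip = s.sum()                # missing value is skipped
total_strict = s.sum(skipna=False)  # missing value propagates
total_builtin = sum([1, 2, pd.NA])  # built-in sum uses +, so NA propagates

print(total_skip, pd.isna(total_strict), total_builtin)
```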
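The treatment of exponentiation is consistent with R's `NA_integer_`. Julia's behavior puzzles me: it returns `missing` in every case. Note that in all three languages unary minus binds more loosely than exponentiation, so `-1 ** pd.NA` is evaluated as `-(1 ** pd.NA)`.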
>>> pd.NA ** 0
1
>>> 1 ** pd.NA
1
>>> -1 ** pd.NA
-1
> R.version.string
[1] "R version 3.6.1 (2019-07-05)"
> NA_integer_ ^ 0L
[1] 1
> 1L ^ NA_integer_
[1] 1
> -1L ^ NA_integer_
[1] -1
julia> VERSION
v"1.3.1"
julia> missing ^ 0
missing
julia> 1 ^ missing
missing
julia> -1 ^ missing
missing
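Operator precedence matters when reading the Python results. The sketch below separates the parenthesized and unparenthesized forms; the result for `(-1) ** pd.NA` is what pandas 1.0 or later is assumed to return and was not part of the original experiment:

```python
import pandas as pd

na_pow_zero = pd.NA ** 0  # anything to the power 0 is 1, so the result is known
one_pow_na = 1 ** pd.NA   # 1 to any power is 1, so the result is known

# Unary minus is applied after **, so this is -(1 ** pd.NA).
unparenthesized = -1 ** pd.NA

# With parentheses the base really is -1, and the result is unknown.
parenthesized = (-1) ** pd.NA

print(na_pow_zero, one_pow_na, unparenthesized, parenthesized)
```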
The experiments below use the following CSV file (`test.csv`).
X_int,X_bool,X_string
1,True,"a"
2,False,"b"
NA,NA,"NA"
If dtype is not specified, the behavior is the same as pandas 0.25.3.
>>> df1 = pd.read_csv("test.csv")
>>> df1
X_int X_bool X_string
0 1.0 True a
1 2.0 False b
2 NaN NaN NaN
>>> df1.dtypes
X_int float64
X_bool object
X_string object
dtype: object
`Int64` and `string` can be specified for dtype.
# dtype can also be given as dtype objects instead of string literals:
# df2 = pd.read_csv("test.csv", dtype={'X_int': pd.Int64Dtype(), 'X_string': pd.StringDtype()})
>>> df2 = pd.read_csv("test.csv", dtype={'X_int': 'Int64', 'X_string': 'string'})
>>> df2
X_int X_bool X_string
0 1 True a
1 2 False b
2 <NA> NaN <NA>
>>> df2.dtypes
X_int Int64
X_bool object
X_string string
dtype: object
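The same read can be reproduced without a file on disk. This sketch feeds the CSV text through `io.StringIO` (an assumption made for self-containment; the original experiment reads `test.csv`) and checks where the missing values land:

```python
import io

import pandas as pd

# Same content as test.csv, held in memory.
csv_text = 'X_int,X_bool,X_string\n1,True,"a"\n2,False,"b"\nNA,NA,"NA"\n'

df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"X_int": "Int64", "X_string": "string"},
)

# The integer column stays integer, and the last row is missing in both columns.
int_missing = df["X_int"].isna().tolist()
str_missing = df["X_string"].isna().tolist()

print(df.dtypes)
print(int_missing, str_missing)
```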
On the other hand, specifying `'boolean'` or `pd.BooleanDtype()` fails to read the column as boolean with NA. Of course, specifying `'bool'` is also an error. When I reported this as an issue, it was accepted. It seems to work fine in rc1.
>>> df3 = pd.read_csv("test.csv", dtype={'X_bool': 'boolean'})
Traceback (most recent call last):
File "pandas/_libs/parsers.pyx", line 1191, in pandas._libs.parsers.TextReader._convert_with_dtype
File "/usr/local/lib/python3.7/site-packages/pandas/core/arrays/base.py", line 232, in _from_sequence_of_strings
raise AbstractMethodError(cls)
pandas.errors.AbstractMethodError: This method must be defined in the concrete class type
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 454, in _read
data = parser.read(nrows)
File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1133, in read
ret = self._engine.read(nrows)
File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 2037, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 859, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 874, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas/_libs/parsers.pyx", line 951, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 1083, in pandas._libs.parsers.TextReader._convert_column_data
File "pandas/_libs/parsers.pyx", line 1114, in pandas._libs.parsers.TextReader._convert_tokens
File "pandas/_libs/parsers.pyx", line 1194, in pandas._libs.parsers.TextReader._convert_with_dtype
NotImplementedError: Extension Array: <class 'pandas.core.arrays.boolean.BooleanArray'> must implement _from_sequence_of_strings in order to be used in parser methods
>>> df3 = pd.read_csv("test.csv", dtype={'X_bool': pd.BooleanDtype()})
Traceback (most recent call last):
File "pandas/_libs/parsers.pyx", line 1191, in pandas._libs.parsers.TextReader._convert_with_dtype
File "/usr/local/lib/python3.7/site-packages/pandas/core/arrays/base.py", line 232, in _from_sequence_of_strings
raise AbstractMethodError(cls)
pandas.errors.AbstractMethodError: This method must be defined in the concrete class type
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 454, in _read
data = parser.read(nrows)
File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1133, in read
ret = self._engine.read(nrows)
File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 2037, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 859, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 874, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas/_libs/parsers.pyx", line 951, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 1083, in pandas._libs.parsers.TextReader._convert_column_data
File "pandas/_libs/parsers.pyx", line 1114, in pandas._libs.parsers.TextReader._convert_tokens
File "pandas/_libs/parsers.pyx", line 1194, in pandas._libs.parsers.TextReader._convert_with_dtype
NotImplementedError: Extension Array: <class 'pandas.core.arrays.boolean.BooleanArray'> must implement _from_sequence_of_strings in order to be used in parser methods
>>> df3 = pd.read_csv("test.csv", dtype={'X_bool': 'bool'})
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 454, in _read
data = parser.read(nrows)
File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1133, in read
ret = self._engine.read(nrows)
File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 2037, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 859, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 874, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas/_libs/parsers.pyx", line 951, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 1083, in pandas._libs.parsers.TextReader._convert_column_data
File "pandas/_libs/parsers.pyx", line 1114, in pandas._libs.parsers.TextReader._convert_tokens
File "pandas/_libs/parsers.pyx", line 1231, in pandas._libs.parsers.TextReader._convert_with_dtype
ValueError: Bool column has NA values in column 1
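For comparison, here is a sketch of the same read on a pandas version where the fix has landed (the post reports rc1; pandas 1.0.0 or later is assumed here). The CSV is supplied through `io.StringIO` for self-containment, which is not how the original experiment was run:

```python
import io

import pandas as pd

# Same content as test.csv, held in memory.
csv_text = 'X_int,X_bool,X_string\n1,True,"a"\n2,False,"b"\nNA,NA,"NA"\n'

df_bool = pd.read_csv(io.StringIO(csv_text), dtype={"X_bool": "boolean"})

# The column is nullable boolean, with the last row missing.
bool_missing = df_bool["X_bool"].isna().tolist()

print(df_bool["X_bool"])
print(bool_missing)
```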
(Impression) The root cause of all this is that NumPy has no missing value. As a result, introducing one at the pandas layer leads to various inconsistencies. The background is discussed here: https://dev.pandas.io/docs/user_guide/gotchas.html#why-not-make-numpy-like-r
I learned about this from a retweet on Twitter:
https://mobile.twitter.com/jorisvdbossche/status/1208476049690046465
- `pd.NA` is introduced to represent a missing value.
- `pd.NA` can be used with IntegerArray, BooleanArray, and StringArray.
- With `pd.NA`, missing values can also be represented in integer columns (no more silent conversion to float).
- `pd.NA` is a singleton object and is used consistently across all data types.
- Every comparison operator applied to `pd.NA` returns `pd.NA` (the same behavior as Julia's `missing` object and R's `NA`).
- Logical operators follow so-called three-valued logic.
- In `pd.read_csv()`, `Int64` and `string` can be specified, but `boolean` results in an error in rc0 (this was later confirmed to be a bug; see the update at the top).
If you enjoy this kind of geeky deep dive, please come visit us at justInCase. https://www.wantedly.com/companies/justincase
I verified everything in Docker.
FROM python:3.7.6
WORKDIR /home
RUN pip install pandas==1.0.0rc0
CMD ["/bin/bash"]
$ docker build -t pdna .
$ docker run -it --rm -v $(pwd):/home/ pdna
Inside Docker
root@286578c2496b:/home# cat /etc/issue
Debian GNU/Linux 10 \n \l
root@286578c2496b:/home# uname -a
Linux 286578c2496b 4.9.184-linuxkit #1 SMP Tue Jul 2 22:58:16 UTC 2019 x86_64 GNU/Linux
root@286578c2496b:/home# python -c "import pandas as pd; pd.show_versions()"
INSTALLED VERSIONS
------------------
commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Linux
OS-release : 4.9.184-linuxkit
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.0.0rc0
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 44.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None