Pandas 1.0.0 was released on January 29, 2020 (applause!). As of 02/14/2020, the latest version is 1.0.1.
Personally, I think the following changes are the important points:
- pandas' own NA value (`pandas.NA`)
- the experimental `string` dtype
Now, when analyzing data I often use the following libraries together with pandas. In particular, I would like to sort out dask's pandas 1.0 support status and other detailed behavior. The dask version is 2.10.1 as of 02/14/2020.
As for intake, I assume there is no problem as long as dask supports pandas 1.0. (Partly, I also have free time while waiting for dask jobs to finish.)
- Can dask handle `pandas.NA` properly? (related to 1.0)
- Can dask handle `dtype: string` properly? (related to 1.0)
- Can I/O, fastparquet in particular, read and write `pandas.NA` and `dtype: string` properly? (related to 1.0)
- And while we're at it, can dask handle `dtype: categorical` properly? (other)
Tom Augspurger seems to be working furiously on pandas 1.0 support, so my expectations are high.
For those who just want the results:

- dask can perform arithmetic and string operations even on columns containing `pandas.NA`
- dask cannot `set_index` on the extension dtypes such as `Int64` and `string`
- Neither pandas nor dask can index-filter with a boolean column containing `pandas.NA`
- dask produces `object` dtype even with `apply(meta='string')`, but it can be recovered with `astype('string')`
- When using `pandas.Categorical` in dask, filtering and aggregation seem to work
- When adding a new Categorical column to a dask DataFrame, you need to use `astype`
- dask cannot `to_parquet` `Int64` and `string` columns (with `engine='fastparquet'`)
First, prepare a clean verification environment.
The OS is macOS Catalina 10.15.2.
As for the Python version, pandas only specifies a minimum and dask appears to be Python 3.8 compatible, so 3.7.4 should cause no problems.
For dependencies, I install the optional packages at pandas' minimum recommended versions. However, since fastparquet and pyarrow cannot coexist on mac, I leave pyarrow out just in case (I don't use it anyway).
The verification work is done on JupyterLab.
pyenv virtualenv 3.7.4 pandas100
pyenv shell pandas100
pip install -r requirements.txt
requirements.txt
pandas==1.0.1
dask[complete]==2.10.1
fastparquet==0.3.3
jupyterlab==1.2.6
numpy==1.18.1
pytz==2019.3
python-dateutil==2.8.1
numexpr==2.7.1
beautifulsoup4==4.8.2
gcsfs==0.6.0
lxml==4.5.0
matplotlib==3.1.3
numba==0.48.0
openpyxl==3.0.3
pymysql==0.9.3
tables==3.6.1
s3fs==0.4.0
scipy==1.4.1
sqlalchemy==1.3.13
xarray==0.15.0
xlrd==1.2.0
xlsxwriter==1.2.7
xlwt==1.3.0
dask vs pandas 1.0
First, check `pandas.NA`. The behavior for each dtype:
| `s = ...` | `type(s.loc[3])` |
|---|---|
| `pandas.Series([1,2,3,None], dtype='int')` | TypeError |
| `pandas.Series([1,2,3,pandas.NA], dtype='int')` | TypeError |
| `pandas.Series([1,2,3,None], dtype='Int64')` | pandas._libs.missing.NAType |
| `pandas.Series([1,2,3,None], dtype='float')` | numpy.float64 |
| `pandas.Series([1,2,3,pandas.NA], dtype='float')` | TypeError |
| `pandas.Series([1,2,3,None], dtype='Int64').astype('float')` | numpy.float64 |
| `pandas.Series(['a', 'b', 'c', None], dtype='string')` | pandas._libs.missing.NAType |
| `pandas.Series(['a', 'b', 'c', None], dtype='object').astype('string')` | pandas._libs.missing.NAType |
| `pandas.Series([True, False, True, None], dtype='boolean')` | pandas._libs.missing.NAType |
| `pandas.Series([1, 0, 1, None], dtype='float').astype('boolean')` | pandas._libs.missing.NAType |
| `pandas.Series(pandas.to_datetime(['2020-01-01', '2020-01-02', '2020-01-03', None]))` | pandas._libs.tslibs.nattype.NaTType |
| `pandas.Series(pandas.to_timedelta(['00:00:01', '00:00:02', '00:00:03', None]))` | pandas._libs.tslibs.nattype.NaTType |
| `pandas.Series([object(), object(), object(), None], dtype='object')` | NoneType |
| `pandas.Series([object(), object(), object(), pandas.NA], dtype='object')` | pandas._libs.missing.NAType |
In summary:

- dtype `int` does not hold `pandas.NA` (it raises TypeError)
- dtypes `Int64`, `string`, and `boolean` hold `pandas.NA`
- dtype `float` converts missing values to `numpy.nan`
- dtypes `datetime64` and `timedelta64` convert them to `NaT`
- dtype `object` does not automatically convert `None` to `pandas.NA`
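The summary above can be condensed into a few executable checks. Here is a minimal sketch of how `pandas.NA` behaves and propagates (assumes pandas >= 1.0):

```python
import pandas as pd

s = pd.Series([1, 2, 3, None], dtype='Int64')

# missing values in an Int64 column become pandas.NA, not numpy.nan
assert s[3] is pd.NA

# pandas.NA propagates through arithmetic and comparisons
assert (s + 1)[3] is pd.NA
assert (s == 1)[3] is pd.NA   # the comparison yields boolean dtype with <NA>

# in a float column, the missing value is numpy.nan instead
f = pd.Series([1, 2, 3, None], dtype='float')
assert f.isna()[3] and f[3] != f[3]   # NaN is not equal to itself
```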
Let's investigate what happens when this goes through `dask.dataframe.from_pandas`.
>>> import pandas
>>> import dask.dataframe
>>> df = pandas.DataFrame({'i': [1,2,3,4],
... 'i64': pandas.Series([1,2,3,None], dtype='Int64'),
... 's': pandas.Series(['a', 'b', 'c' ,None], dtype='string'),
... 'f': pandas.Series([1,2,3,None], dtype='Int64').astype('float')})
>>> ddf = dask.dataframe.from_pandas(df, npartitions=1)
>>> df
i i64 s f
0 1 1 a 1.0
1 2 2 b 2.0
2 3 3 c 3.0
3 4 <NA> <NA> NaN
>>> ddf
Dask DataFrame Structure:
i i64 s f
npartitions=1
0 int64 Int64 string float64
3 ... ... ... ...
Indeed, `Int64` stays `Int64` on dask. The same is true for `string`.
>>> # (integer) arithmetic on an Int64 column
>>> df.i64 * 2
0 2
1 4
2 6
3 <NA>
Name: i64, dtype: Int64
>>> (ddf.i64 * 2).compute()
0 2
1 4
2 6
3 <NA>
Name: i64, dtype: Int64
Int64 -> Int64 operations work fine.
>>> # (floating point) arithmetic on an Int64 column
>>> df.i64 - df.f
0 0.0
1 0.0
2 0.0
3 NaN
dtype: float64
>>> (ddf.i64 - ddf.f).compute()
0 0.0
1 0.0
2 0.0
3 NaN
dtype: float64
Int64 -> float64 operations also work properly.
>>> # set_index on an Int64 column containing pandas.NA
>>> df.set_index('i64')
i s f i64_result i64-f
i64
1 1 a 1.0 2 0.0
2 2 b 2.0 4 0.0
3 3 c 3.0 6 0.0
NaN 4 <NA> NaN <NA> NaN
>>> ddf.set_index('i64').compute()
TypeError: data type not understood
>>> # what happens without pandas.NA?
>>> ddf['i64_nonnull'] = ddf.i64.fillna(1)
... ddf.set_index('i64_nonnull').compute()
TypeError: data type not understood
What! dask cannot `set_index` on an `Int64` column!
pandas can, of course.
>>> # set_index on a string column containing pandas.NA
>>> df.set_index('s')
i i64 f
s
a 1 1 1.0
b 2 2 2.0
c 3 3 3.0
NaN 4 <NA> NaN
>>> ddf.set_index('s').compute()
TypeError: Cannot perform reduction 'max' with string dtype
>>> # what happens without pandas.NA?
>>> ddf['s_nonnull'] = ddf.s.fillna('a')
... ddf.set_index('s_nonnull')
TypeError: Cannot perform reduction 'max' with string dtype
`string` doesn't work either. As it stands, this is unusable (for my use cases).
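Since `set_index` on the extension dtypes fails in dask 2.10, one workaround (a sketch under my own assumptions, not an official recommendation) is to fall back to a numpy-backed dtype before indexing. Shown here in plain pandas; the same `astype` should make the column acceptable to dask:

```python
import pandas as pd

df = pd.DataFrame({'i64': pd.Series([1, 2, 3, None], dtype='Int64'),
                   'v': [10.0, 20.0, 30.0, 40.0]})

# dask (2.10) cannot set_index on 'Int64'/'string' columns, so fall back
# to a numpy-backed dtype; -1 is an arbitrary sentinel for missing values
df['i64_np'] = df.i64.fillna(-1).astype('int64')
indexed = df.set_index('i64_np')
```

The obvious cost is that the missing-value information is squashed into a sentinel, so this only makes sense when the index column has no NAs or the sentinel is harmless.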
>>> # try the .str accessor methods
>>> df.s.str.startswith('a')
0 True
1 False
2 False
3 <NA>
Name: s, dtype: boolean
>>> ddf.s.str.startswith('a').compute()
0 True
1 False
2 False
3 <NA>
Name: s, dtype: boolean
Hmmm, this works.
>>> # filter with a boolean column containing pandas.NA
>>> df[df.s.str.startswith('a')]
ValueError: cannot mask with array containing NA / NaN values
>>> # is pandas.NA the culprit?
>>> df['s_nonnull'] = df.s.fillna('a')
... df[df.s_nonnull.str.startswith('a')]
i i64 s f i64_nonnull s_nonnull
0 1 1 a 1.0 1 a
3 4 <NA> <NA> NaN 1 a
>>> ddf[ddf.s.str.startswith('a')].compute()
ValueError: cannot mask with array containing NA / NaN values
>>> ddf['s_nonnull'] = ddf.s.fillna('a')
... ddf[ddf.s_nonnull.str.startswith('a')].compute()
i i64 s f i64_nonnull s_nonnull
0 1 1 a 1.0 1 a
3 4 <NA> <NA> NaN 1 a
What!? Filtering doesn't work if the column contains pandas.NA? This is no good!
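The mask fails because the boolean column itself contains `pandas.NA`. Besides `fillna` on the source column as above, you can also fill the mask itself. A minimal sketch in plain pandas (the same pattern applies on the dask side):

```python
import pandas as pd

df = pd.DataFrame({'s': pd.Series(['a', 'b', 'c', None], dtype='string')})

mask = df.s.str.startswith('a')    # boolean dtype, contains <NA>
# df[mask] raises ValueError; decide what NA should mean and fill it
filtered = df[mask.fillna(False)]  # treat missing as "does not match"
```

The advantage over filling the source column is that you state explicitly, at filter time, whether a missing value counts as a match.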
>>> # try specifying meta='Int64' in apply
>>> ddf['i10'] = ddf.i.apply(lambda v: v * 10, meta='Int64')
>>> ddf
Dask DataFrame Structure:
i i64 s f i64_nonnull s_nonnull i10
npartitions=1
0 int64 Int64 string float64 Int64 string int64
3 ... ... ... ... ... ... ...
>>> # try specifying meta='string' in apply
>>> ddf['s_double'] = ddf.s.apply(lambda v: v+v, meta='string')
>>> ddf
Dask DataFrame Structure:
i i64 s f i64_nonnull s_nonnull i10 s_double
npartitions=1
0 int64 Int64 string float64 Int64 string int64 object
3 ... ... ... ... ... ... ... ...
>>> # try astype('string')
>>> ddf['s_double'] = ddf['s_double'].astype('string')
>>> ddf
Dask DataFrame Structure:
i i64 s f i64_nonnull s_nonnull i10 s_double
npartitions=1
0 int64 Int64 string float64 Int64 string int64 string
3 ... ... ... ... ... ... ... ...
So specifying the dtype via `meta=` is not reflected? It can be recovered with `astype`, but that's a hassle...
- Arithmetic is OK
- In dask, these columns cannot be used as an index (the `pandas.NA`-capable dtypes cannot be set as an index in the first place)
- Neither pandas nor dask can filter with a mask containing `pandas.NA`!
- `.apply(meta='string')` etc. is ignored; you have to `astype` afterwards
dask vs pandas.Categorical
To investigate Categorical in pandas, this time I will use `CategoricalDtype`. The basic usage of `CategoricalDtype` is shown in the sample code below.
>>> # first, create a CategoricalDtype
>>> int_category = pandas.CategoricalDtype(categories=[1,2,3,4,5],
... ordered=True)
>>> int_category
CategoricalDtype(categories=[1, 2, 3, 4, 5], ordered=True)
>>> int_category.categories
Int64Index([1, 2, 3, 4, 5], dtype='int64')
>>> # create a pandas.Series with it
>>> int_series = pandas.Series([1,2,3], dtype=int_category)
>>> int_series
0 1
1 2
2 3
dtype: category
Categories (5, int64): [1 < 2 < 3 < 4 < 5]
>>> # at creation time, values not in the categories are converted to NaN
>>> int_series = pandas.Series([1,2,3,6], dtype=int_category)
>>> int_series
0 1
1 2
2 3
3 NaN
dtype: category
Categories (5, int64): [1 < 2 < 3 < 4 < 5]
>>> # after creation, assigning an out-of-category value raises an error
>>> int_series.loc[3] = 10
ValueError: Cannot setitem on a Categorical with a new category, set the categories first
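As the error message itself suggests, the fix is to register the category first and then assign. A small sketch:

```python
import pandas as pd

int_category = pd.CategoricalDtype(categories=[1, 2, 3, 4, 5], ordered=True)
s = pd.Series([1, 2, 3], dtype=int_category)

# register the new value as a category first, then assignment succeeds
s = s.cat.add_categories([10])
s.loc[2] = 10
```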
Next, try using Categorical on dask.
>>> import pandas
>>> import dask.dataframe
>>> # create a pandas DataFrame
>>> df = pandas.DataFrame({'a': pandas.Series([1, 2, 3, 1, 2, 3], dtype=int_category),
... 'b': pandas.Series([1, 2, 3, 1, 2, 3], dtype='int64')})
>>> df
a b
0 1 1
1 2 2
2 3 3
3 1 1
4 2 2
5 3 3
>>> # convert to a dask.dataframe.DataFrame
>>> ddf = dask.dataframe.from_pandas(df, npartitions=1)
>>> ddf
Dask DataFrame Structure:
a b
npartitions=1
0 category[known] int64
5 ... ...
So far so good: the categorical dtype carried over to dask as-is.
# in pandas, assigning a new category value is properly rejected
>>> df.loc[2, 'a'] = 30
ValueError: Cannot setitem on a Categorical with a new category, set the categories first
# in dask, item assignment is not supported in the first place, Categorical or not
>>> ddf.loc['a', 3] = 10
TypeError: '_LocIndexer' object does not support item assignment
# in pandas, arithmetic on category values is also properly rejected
>>> df.a * 2
TypeError: unsupported operand type(s) for *: 'Categorical' and 'int'
# in dask too, arithmetic on category values is properly rejected
>>> ddf.a * 2
TypeError: unsupported operand type(s) for *: 'Categorical' and 'int'
# try passing the CategoricalDtype as meta in dask's apply
>>> ddf['c'] = ddf.a.apply(lambda v: v, meta=int_category)
Dont know how to create metadata from category
# does dask's apply do better with meta='category'?
>>> ddf['c'] = ddf.a.apply(lambda v: v, meta='category')
>>> ddf.dtypes
a category
b int64
c object
dtype: object
>>> # check whether the declared dtypes match the computed result
>>> ddf.compute().dtypes
a category
b int64
c category
dtype: object
>>> # try astype
>>> ddf['c'] = ddf.c.astype(int_category)
>>> ddf
Dask DataFrame Structure:
a b c
npartitions=1
0 category[known] int64 category[known]
5 ... ... ...
I see. The category constraints are maintained, but `.apply(meta=)` leaves dask's dtype bookkeeping inconsistent.
It can be recovered with `astype`, but it's a hassle...
Is filtering the only thing that works, then? Let's check aggregation.
>>> # try aggregation
>>> ddf.groupby('a').b.mean().compute()
a
1 1.0
2 2.0
3 3.0
4 NaN
5 NaN
Name: b, dtype: float64
>>> # is the dtype broken by being used as the index?
>>> ddf.groupby('a').b.mean().reset_index()
Dask DataFrame Structure:
a b
npartitions=1
category[known] float64
... ...
Dask Name: reset_index, 34 tasks
Hmm, aggregation seems to be supported.
- When using `pandas.Categorical` with dask, filtering and aggregation seem to be fine
- When adding a new Categorical column to a dask DataFrame, use `astype`
to_parquet vs pandas 1.0
>>> # first, create a pandas DataFrame
>>> df = pandas.DataFrame(
{
'i64': pandas.Series([1, 2, 3,None], dtype='Int64'),
'i64_nonnull': pandas.Series([1, 2, 3, 4], dtype='Int64'),
's': pandas.Series(['a', 'b', 'c',None], dtype='string'),
's_nonnull': pandas.Series(['a', 'b', 'c', 'd'], dtype='string'),
}
)
>>> df
i64 i64_nonnull s s_nonnull
0 1 1 a a
1 2 2 b b
2 3 3 c c
3 <NA> 4 <NA> d
>>> # convert to a dask.dataframe.DataFrame
>>> ddf = dask.dataframe.from_pandas(df, npartitions=1)
>>> ddf
Dask DataFrame Structure:
i64 i64_nonnull s s_nonnull
npartitions=1
0 Int64 Int64 string string
3 ... ... ... ...
For starters, try `to_parquet`.
>>> ddf.to_parquet('test1', engine='fastparquet')
ValueError: Dont know how to convert data type: Int64
Seriously... well, I half expected it. Even if `Int64` doesn't work, maybe `string` will...
>>> ddf.to_parquet('test2', engine='fastparquet')
ValueError: Dont know how to convert data type: string
No good.

- `Int64` and `string` columns cannot be written with `to_parquet`.
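Until fastparquet understands the extension dtypes, one workaround (a sketch, my own assumption rather than anything official) is to downcast before writing: `Int64` to `float64` so that NA survives as NaN, and `string` to `object`. The pandas 1.0 dtypes are lost on disk, so this is a stopgap:

```python
import pandas as pd

df = pd.DataFrame({'i64': pd.Series([1, 2, 3, None], dtype='Int64'),
                   's': pd.Series(['a', 'b', 'c', None], dtype='string')})

# fastparquet (0.3.3) can't serialize Int64/string, so downcast first;
# the NA values become NaN/None, which parquet can represent
out = df.astype({'i64': 'float64', 's': 'object'})
# out.to_parquet('test3', engine='fastparquet')  # assumed to succeed after downcasting
```

On read-back you would have to `astype` back to `Int64`/`string` yourself to restore the 1.0 dtypes.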
How was it? Perhaps nobody has read this far. Maybe I should have split it into separate posts?
I hope this helps people who are considering pandas 1.0.
See you.