Pandas was updated from version 1.0 to 1.1.0 on July 28, 2020. This article summarizes the main additions in 1.1.0 and also, somewhat belatedly, the main additions in the January 2020 update from 0.25.3 to 1.0.0.
For official information, refer to:
https://pandas.pydata.org/pandas-docs/dev/whatsnew/v1.0.0.html
https://pandas.pydata.org/pandas-docs/dev/whatsnew/v1.1.0.html
The 1.0 behavior described here was verified with 1.0.5, the 1.1 behavior with 1.1.0, and the 0.25 behavior with 0.25.1.
1.0
pd.NA
Up to 0.25, missing values were represented differently depending on the dtype: np.nan for float, np.nan or None for object (string), and pd.NaT for datetime data.
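As a quick illustration, here is a minimal sketch of those sentinels; the constructors are standard pandas, and the comments note which missing-value marker each dtype uses, as described above.

import numpy as np
import pandas as pd

pd.Series([1.0, np.nan])                          # float64, missing value shown as NaN
pd.Series(["abc", None])                          # object, missing value kept as None (or np.nan)
pd.Series(pd.to_datetime(["2020-01-01", None]))   # datetime64[ns], missing value is pd.NaT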
In 1.0, pd.NA was introduced as a common representation of missing values.
For example, the third element of
pd.Series([1, 2, None], dtype="Int64")
is np.nan in version 0.25 but becomes pd.NA in 1.0.
Also, up to 0.25 a numeric column containing a missing value (np.nan) was forced to float64, whereas in 1.0 it can remain an integer column (for example Int8) holding pd.NA.
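A minimal sketch of this nullable-integer behavior; the comments reflect the 1.0 results described above.

import pandas as pd

s = pd.Series([1, 2, None], dtype="Int64")
s.dtype        # Int64 (nullable integer, not forced to float64)
s[2]           # <NA> in 1.0 (np.nan in 0.25)
s[2] is pd.NA  # True in 1.0

pd.Series([1, 2, None], dtype="Int8").dtype  # Int8 also holds pd.NA for the missing entry
pd.Series([1, 2, None]).dtype                # without a nullable dtype this is still float64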
A string type (StringDtype) has been added to represent a Series (or DataFrame column) of string data. When dealing with a Series (or column) of strings, using the string type is recommended.
Up to 0.25, a Series (or column) containing string data was simply of object type, so something like
pd.Series(['abc', True, 'def'], dtype="object")
(a mixture of strings and booleans) could not be prevented.
From 1.0,
pd.Series(['abc', 'def'], dtype="string")
gives a Series (or column) that only accepts strings, and
pd.Series(['abc', True, 'def'], dtype="string")
is an error.
The third element of
pd.Series(['abc', 'def', None], dtype="string")
is pd.NA.
However,
pd.Series(['abc', True, 'def'])
(no dtype specified) is still object type as before, so this mixture remains allowed.
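A minimal sketch of the string dtype behavior described above; the comments show the expected 1.0 results.

import pandas as pd

s = pd.Series(['abc', 'def', None], dtype="string")
s.dtype   # string
s[2]      # <NA>

pd.Series(['abc', True, 'def']).dtype               # object, mixture still allowed
# pd.Series(['abc', True, 'def'], dtype="string")   # error in 1.0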
A boolean type (BooleanDtype) has been added to represent boolean data. When dealing with a Series (or column) of booleans (True or False), using the boolean type is recommended.
pd.Series([True, False, 0], dtype="boolean")
is an error. (If dtype is not specified, it is allowed without error; with dtype="bool", 0 is converted to False.)
Regarding the handling of missing values, the third elements of
pd.Series([True, False, np.nan])
pd.Series([True, False, None])
are np.nan and None, respectively.
With
pd.Series([True, False, np.nan], dtype="boolean")
pd.Series([True, False, None], dtype="boolean")
the third element is pd.NA.
With
pd.Series([True, False, np.nan], dtype="bool")
pd.Series([True, False, None], dtype="bool")
the third element is True and False, respectively.
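A minimal sketch contrasting the new boolean dtype with the old numpy-backed bool; the comments show the results described above.

import numpy as np
import pandas as pd

pd.Series([True, False, np.nan], dtype="boolean")[2]   # <NA>
pd.Series([True, False, None], dtype="boolean")[2]     # <NA>

# The numpy-backed "bool" dtype coerces the missing value instead:
pd.Series([True, False, np.nan], dtype="bool")[2]      # True (np.nan is truthy)
pd.Series([True, False, None], dtype="bool")[2]        # False (None is falsy)

# pd.Series([True, False, 0], dtype="boolean")         # error: 0 is not bool-like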
df = pd.DataFrame({'x': ['abc', None, 'def'],
'y': [1, 2, np.nan],
'z': [True, False, True]})
gives column x: object, column y: float64, column z: bool, even though the new string and boolean types exist.
In that case,
df.convert_dtypes()
converts the columns to x: string, y: Int64, z: boolean, and the None and np.nan values become pd.NA.
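A runnable sketch of the conversion above; the comments summarize the expected dtypes.

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': ['abc', None, 'def'],
                   'y': [1, 2, np.nan],
                   'z': [True, False, True]})
print(df.dtypes)                    # x: object, y: float64, z: bool
print(df.convert_dtypes().dtypes)   # x: string, y: Int64, z: boolean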
The pd.NA and new dtype features above are experimental and subject to change.
The ignore_index argument has been added to DataFrame.sort_values() and DataFrame.drop_duplicates(). With ignore_index=True, the index of the result is reassigned sequentially from 0. Good news for anyone annoyed by pandas indexes.
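A minimal sketch of the ignore_index option; the example DataFrames are made up for illustration, and the comments show the expected index labels.

import pandas as pd

df = pd.DataFrame({"a": [3, 1, 2]})
df.sort_values("a").index.tolist()                     # [1, 2, 0] (original labels kept)
df.sort_values("a", ignore_index=True).index.tolist()  # [0, 1, 2] (renumbered from 0)

df2 = pd.DataFrame({"a": [3, 3, 1]})
df2.drop_duplicates().index.tolist()                   # [0, 2]
df2.drop_duplicates(ignore_index=True).index.tolist()  # [0, 1]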
1.1
dtype="string", astype("string")
pd.Series([1, "abc", np.nan], dtype="string")
pd.Series([1, 2, np.nan], dtype="Int64").astype("string")
Now all elements become strings. Up to 1.0, these raised an error unless every element was already a string or NaN.
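A minimal sketch of the 1.1 behavior; the comments show the expected values, with <NA> for the missing element.

import numpy as np
import pandas as pd

pd.Series([1, "abc", np.nan], dtype="string").tolist()              # ['1', 'abc', <NA>]
pd.Series([1, 2, np.nan], dtype="Int64").astype("string").tolist()  # ['1', '2', <NA>]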
groupby
df = pd.DataFrame([[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]], columns=["a", "b", "c"])
df.groupby(by=["b"], dropna=False).sum()
The result is

     a  c
b
1.0  2  3
2.0  2  5
NaN  1  4

so rows whose value in the by column is NA are now aggregated as a group of their own. This behavior is similar to group_by in R's dplyr.
With dropna=True or when dropna is not specified, rows whose value in the by column is NA are excluded from the aggregation.
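For comparison, a minimal sketch of the default behavior; the comment block shows the expected result.

import pandas as pd

df = pd.DataFrame([[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]], columns=["a", "b", "c"])
print(df.groupby(by=["b"]).sum())  # dropna=True by default: the row with b = NA is dropped
#      a  c
# b
# 1.0  2  3
# 2.0  2  5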