Pandas was updated from version 1.0 to 1.1.0 on July 28, 2020. This article summarizes the main additions in 1.1.0 and also, somewhat belatedly, the main additions in the January 2020 update from 0.25.3 to 1.0.0.
For official information, refer to:
https://pandas.pydata.org/pandas-docs/dev/whatsnew/v1.0.0.html
https://pandas.pydata.org/pandas-docs/dev/whatsnew/v1.1.0.html
The 1.0 behavior described here was verified with 1.0.5, the 1.1 behavior with 1.1.0, and the 0.25 behavior with 0.25.1.
1.0
pd.NA
Up to 0.25, missing values were represented differently depending on the dtype: np.nan for float, np.nan or None for object (string), and pd.NaT for datetime data.
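As a quick illustration, here is a minimal sketch of those sentinels; the constructors are standard pandas, and the comments note which missing-value marker each dtype uses, as described above.

import numpy as np
import pandas as pd

pd.Series([1.0, np.nan])                          # float64, missing value shown as NaN
pd.Series(["abc", None])                          # object, missing value kept as None (or np.nan)
pd.Series(pd.to_datetime(["2020-01-01", None]))   # datetime64[ns], missing value is pd.NaT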
In 1.0, pd.NA was introduced as a common representation of missing values.
For example, the third element of
pd.Series([1, 2, None], dtype="Int64")
is np.nan in version 0.25 but becomes pd.NA in 1.0.
Also, up to 0.25 a numeric column containing a missing value (np.nan) was forced to float64, whereas in 1.0 it can remain an integer column (for example Int8) holding pd.NA.
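A minimal sketch of this nullable-integer behavior; the comments reflect the 1.0 results described above.

import pandas as pd

s = pd.Series([1, 2, None], dtype="Int64")
s.dtype        # Int64 (nullable integer, not forced to float64)
s[2]           # <NA> in 1.0 (np.nan in 0.25)
s[2] is pd.NA  # True in 1.0

pd.Series([1, 2, None], dtype="Int8").dtype  # Int8 also holds pd.NA for the missing entry
pd.Series([1, 2, None]).dtype                # without a nullable dtype this is still float64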
A string type (StringDtype) has been added to represent a Series (or DataFrame column) of string data. When dealing with a Series (or column) of strings, using the string type is recommended.
Up to 0.25, a Series (or column) containing string data was simply of object type, so something like
pd.Series(['abc', True, 'def'], dtype="object")
(a mixture of strings and booleans) could not be prevented.
From 1.0,
pd.Series(['abc', 'def'], dtype="string")
gives a Series (or column) that only accepts strings, and
pd.Series(['abc', True, 'def'], dtype="string")
is an error.
The third element of
pd.Series(['abc', 'def', None], dtype="string")
is pd.NA.
However,
pd.Series(['abc', True, 'def'])
(no dtype specified) is still object type as before, so this mixture remains allowed.
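A minimal sketch of the string dtype behavior described above; the comments show the expected 1.0 results.

import pandas as pd

s = pd.Series(['abc', 'def', None], dtype="string")
s.dtype   # string
s[2]      # <NA>

pd.Series(['abc', True, 'def']).dtype               # object, mixture still allowed
# pd.Series(['abc', True, 'def'], dtype="string")   # error in 1.0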
A boolean type (BooleanDtype) has been added to represent boolean data. When dealing with a Series (or column) of booleans (True or False), using the boolean type is recommended.
pd.Series([True, False, 0], dtype="boolean")
is an error. (If dtype is not specified, it is allowed without error; with dtype="bool", 0 is converted to False.)
Regarding the handling of missing values, the third elements of
pd.Series([True, False, np.nan])
pd.Series([True, False, None])
are np.nan and None, respectively.
With
pd.Series([True, False, np.nan], dtype="boolean")
pd.Series([True, False, None], dtype="boolean")
the third element is pd.NA.
With
pd.Series([True, False, np.nan], dtype="bool")
pd.Series([True, False, None], dtype="bool")
the third element is True and False, respectively.
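A minimal sketch contrasting the new boolean dtype with the old numpy-backed bool; the comments show the results described above.

import numpy as np
import pandas as pd

pd.Series([True, False, np.nan], dtype="boolean")[2]   # <NA>
pd.Series([True, False, None], dtype="boolean")[2]     # <NA>

# The numpy-backed "bool" dtype coerces the missing value instead:
pd.Series([True, False, np.nan], dtype="bool")[2]      # True (np.nan is truthy)
pd.Series([True, False, None], dtype="bool")[2]        # False (None is falsy)

# pd.Series([True, False, 0], dtype="boolean")         # error: 0 is not bool-like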
df = pd.DataFrame({'x': ['abc', None, 'def'],
'y': [1, 2, np.nan],
'z': [True, False, True]})
gives column x: object, column y: float64, column z: bool, even though the new string and boolean types exist.
In that case,
df.convert_dtypes()
converts the columns to x: string, y: Int64, z: boolean, and the None and np.nan values become pd.NA.
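A runnable sketch of the conversion above; the comments summarize the expected dtypes.

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': ['abc', None, 'def'],
                   'y': [1, 2, np.nan],
                   'z': [True, False, True]})
print(df.dtypes)                    # x: object, y: float64, z: bool
print(df.convert_dtypes().dtypes)   # x: string, y: Int64, z: boolean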
The pd.NA and new dtype features above are experimental and subject to change.
The ignore_index argument has been added to DataFrame.sort_values() and DataFrame.drop_duplicates(). With ignore_index=True, the index of the result is reassigned sequentially from 0. Good news for anyone annoyed by pandas indexes.
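A minimal sketch of the ignore_index option; the example DataFrames are made up for illustration, and the comments show the expected index labels.

import pandas as pd

df = pd.DataFrame({"a": [3, 1, 2]})
df.sort_values("a").index.tolist()                     # [1, 2, 0] (original labels kept)
df.sort_values("a", ignore_index=True).index.tolist()  # [0, 1, 2] (renumbered from 0)

df2 = pd.DataFrame({"a": [3, 3, 1]})
df2.drop_duplicates().index.tolist()                   # [0, 2]
df2.drop_duplicates(ignore_index=True).index.tolist()  # [0, 1]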
1.1
dtype="string", astype("string")
pd.Series([1, "abc", np.nan], dtype="string")
pd.Series([1, 2, np.nan], dtype="Int64").astype("string")
Now all elements become strings. Up to 1.0, these raised an error unless every element was already a string or NaN.
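A minimal sketch of the 1.1 behavior; the comments show the expected values, with <NA> for the missing element.

import numpy as np
import pandas as pd

pd.Series([1, "abc", np.nan], dtype="string").tolist()              # ['1', 'abc', <NA>]
pd.Series([1, 2, np.nan], dtype="Int64").astype("string").tolist()  # ['1', '2', <NA>]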
groupby
df = pd.DataFrame([[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]], columns=["a", "b", "c"])
df.groupby(by=["b"], dropna=False).sum()
The result is

     a  c
b
1.0  2  3
2.0  2  5
NaN  1  4

so rows whose value in the by column is NA are now aggregated as a group of their own. This behavior is similar to group_by in R's dplyr.
With dropna=True or when dropna is not specified, rows whose value in the by column is NA are excluded from the aggregation.
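For comparison, a minimal sketch of the default behavior; the comment block shows the expected result.

import pandas as pd

df = pd.DataFrame([[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]], columns=["a", "b", "c"])
print(df.groupby(by=["b"]).sum())  # dropna=True by default: the row with b = NA is dropped
#      a  c
# b
# 1.0  2  3
# 2.0  2  5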