Pandas memo ~ None, np.nan, empty string ~

I got tripped up by how pandas handles None and np.nan, so these are my personal notes.

Verified in the following environments (the results were identical in both):

python2.7.5, pandas==0.24.2
python3.6.1, pandas==0.25.3

Summary

	None	np.nan	Empty string
DataFrame conversion	Converted to np.nan unless the column ends up as object (e.g. dtype=object is specified)	np.nan cannot be converted to int, so columns containing np.nan basically become float	Treated as a string (non-numeric), so it is not treated as a missing value; columns containing an empty string basically become object
read_csv	-	Both empty fields and empty strings ("") in the CSV are read as np.nan, whichever dtype is specified	-
fillna, dropna	Treated as a missing value	Treated as a missing value	Not treated as a missing value
groupby	Treated as a missing value and ignored	Treated as a missing value and ignored	Not treated as a missing value
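
As a quick sanity check of the table, the following minimal sketch asks pandas directly which of the three values it regards as missing. It only relies on `pd.isna`; the variable names are illustrative and not part of the original notes.

```python
import numpy as np
import pandas as pd

# The three values compared in the summary table
candidates = {"None": None, "np.nan": np.nan, "empty string": ""}

# pd.isna returns True for values that pandas treats as missing
is_missing = {name: bool(pd.isna(value)) for name, value in candidates.items()}
print(is_missing)
```

Only None and np.nan are reported as missing, which is why the empty string behaves differently in every section below.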

Verification results

DataFrame conversion by specifying dtype

Check how each column's type changes when the following data is passed to DataFrame with different dtype arguments.


import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        # Column A: int + None
        "A": [1, 2, 3, None],
        # Column B: str + empty string
        "B": ["1", "2", "3", ""],
        # Column C: int + np.nan
        "C": [1, 2, 3, np.nan],
        # Column D: int only
        "D": [1, 2, 3, 4]
    }
)

dtype not specified

None is converted to np.nan, and as a result every column containing np.nan becomes float64.

Column A: None is converted to np.nan and the column becomes float64
Column B: no value conversion, the column is object
Column C: the column becomes float64
Column D: no value conversion, the column becomes int64


df = pd.DataFrame(
    {
        "A": [1, 2, 3, None],
        "B": ["1", "2", "3", ""],
        "C": [1, 2, 3, np.nan],
        "D": [1, 2, 3, 4]
    }
)

print(df)

     A  B    C  D
0  1.0  1  1.0  1
1  2.0  2  2.0  2
2  3.0  3  3.0  3
3  NaN     NaN  4

print(df.dtypes)

A    float64
B     object
C    float64
D      int64
dtype: object

print(df.values)

array([[1.0, '1', 1.0, 1],
       [2.0, '2', 2.0, 2],
       [3.0, '3', 3.0, 3],
       [nan, '', nan, 4]], dtype=object)
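
To see at a glance which columns picked up missing values after this conversion, a per-column count can help. This is a minimal sketch using `DataFrame.isna`; the explicit comparison against `""` is my own addition to show that empty strings have to be counted separately.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "A": [1, 2, 3, None],
        "B": ["1", "2", "3", ""],
        "C": [1, 2, 3, np.nan],
        "D": [1, 2, 3, 4],
    }
)

# Number of values per column that pandas treats as missing
missing_counts = df.isna().sum().to_dict()
print(missing_counts)

# Empty strings are not missing values, so count them explicitly
empty_count = int((df["B"] == "").sum())
print(empty_count)
```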

Specify object

No values are converted; None stays None.


df = pd.DataFrame(
    {
        "A": [1, 2, 3, None],
        "B": ["1", "2", "3", ""],
        "C": [1, 2, 3, np.nan],
        "D": [1, 2, 3, 4]
    },
    dtype=object
)

print(df)

      A  B    C  D
0     1  1    1  1
1     2  2    2  2
2     3  3    3  3
3  None     NaN  4

print(df.dtypes)

A    object
B    object
C    object
D    object
dtype: object

print(df.values)

array([[1, '1', 1, 1],
       [2, '2', 2, 2],
       [3, '3', 3, 3],
       [None, '', nan, 4]], dtype=object)
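
Even though None and np.nan are both kept as-is with dtype=object, they are still different Python objects. The sketch below, which uses single Series instead of the full DataFrame for brevity, shows that an identity check tells them apart while `isna` treats both as missing.

```python
import numpy as np
import pandas as pd

a = pd.Series([1, 2, 3, None], dtype=object)
c = pd.Series([1, 2, 3, np.nan], dtype=object)

# With dtype=object the original Python objects are stored unchanged
a_last_is_none = a.iloc[3] is None
c_last_is_float_nan = isinstance(c.iloc[3], float) and np.isnan(c.iloc[3])
print(a_last_is_none, c_last_is_float_nan)

# isna treats both None and np.nan as missing
print(a.isna().tolist(), c.isna().tolist())
```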

Specify float

An empty string cannot be converted to float, so only the column containing the empty string stays object.

Column A: None is converted to np.nan and becomes float64 type
Column B: Empty string cannot be converted to float and becomes object type
Column C: float64 type
Column D: float64 type


df = pd.DataFrame(
    {
        "A": [1, 2, 3, None],
        "B": ["1", "2", "3", ""],
        "C": [1, 2, 3, np.nan],
        "D": [1, 2, 3, 4]
    },
    dtype=float
)

print(df)

     A  B    C    D
0  1.0  1  1.0  1.0
1  2.0  2  2.0  2.0
2  3.0  3  3.0  3.0
3  NaN     NaN  4.0

print(df.dtypes)

A    float64
B     object
C    float64
D    float64
dtype: object

print(df.values)

array([[1.0, '1', 1.0, 1.0],
       [2.0, '2', 2.0, 2.0],
       [3.0, '3', 3.0, 3.0],
       [nan, '', nan, 4.0]], dtype=object)
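
If column B should become numeric anyway, one option is to convert it explicitly after construction. This is a sketch, not something verified in the original notes: it assumes `pd.to_numeric` with `errors="coerce"`, which turns unparsable values into NaN.

```python
import pandas as pd

b = pd.Series(["1", "2", "3", ""])

# errors="coerce" replaces anything that cannot be parsed as a number with NaN
b_numeric = pd.to_numeric(b, errors="coerce")
print(b_numeric.dtype)
print(b_numeric.isna().tolist())
```

After this conversion the former empty string is a regular missing value and is picked up by fillna, dropna and groupby.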

Specify int

Columns that cannot be converted to int64 (those containing None, np.nan, or strings) become object.

Columns A to C: cannot be converted to int64 and become object
Column D: becomes int64


df = pd.DataFrame(
    {
        "A": [1, 2, 3, None],
        "B": ["1", "2", "3", ""],
        "C": [1, 2, 3, np.nan],
        "D": [1, 2, 3, 4]
    },
    dtype=int
)

print(df)

      A  B    C  D
0     1  1    1  1
1     2  2    2  2
2     3  3    3  3
3  None     NaN  4

print(df.dtypes)

A    object
B    object
C    object
D     int64
dtype: object

print(df.values)

array([[1, '1', 1, 1],
       [2, '2', 2, 2],
       [3, '3', 3, 3],
       [None, '', nan, 4]], dtype=object)
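
When an integer column has to hold missing values, the nullable integer dtype may be an alternative to falling back to object or float. The sketch below assumes the `"Int64"` extension dtype (capital I), which to my knowledge was introduced as experimental in pandas 0.24; it was not part of the original verification.

```python
import pandas as pd

# "Int64" (capital I) is the nullable integer dtype, unlike NumPy's "int64"
a = pd.Series([1, 2, 3, None], dtype="Int64")
print(a.dtype)
print(a.isna().tolist())

# Missing values are skipped by default in reductions
total = int(a.sum())
print(total)
```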

read_csv with dtype specified

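Check how each column's type changes when the following CSV is read with different dtype arguments.

The columns are laid out as follows (these descriptions are not part of the file):

Column A: int + empty field
Column B: string + empty string
Column C: float + empty field
Column D: int only

`sample.csv`
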
A,B,C,D
1,"1",1.0,1
2,"2",2.0,2
3,"3",3.0,3
,"",,4

dtype not specified

Both empty fields and empty strings ("") are read as np.nan, and int columns are converted to float accordingly.

Column A: the empty field is converted to np.nan and the column becomes float64
Column B: the quoted digits are parsed as numbers, the empty string is converted to np.nan, and the column becomes float64
Column C: the empty field is converted to np.nan and the column becomes float64
Column D: no value conversion, the column becomes int64


df = pd.read_csv("sample.csv")

print(df)

     A    B    C  D
0  1.0  1.0  1.0  1
1  2.0  2.0  2.0  2
2  3.0  3.0  3.0  3
3  NaN  NaN  NaN  4

print(df.dtypes)

A    float64
B    float64
C    float64
D      int64
dtype: object

print(df.values)

array([[ 1.,  1.,  1.,  1.],
       [ 2.,  2.,  2.,  2.],
       [ 3.,  3.,  3.,  3.],
       [nan, nan, nan,  4.]])

Specify object

Empty fields and empty strings are converted to np.nan, while all other values are read as str.


df = pd.read_csv("sample.csv", dtype=object)

print(df)

     A    B    C  D
0    1    1  1.0  1
1    2    2  2.0  2
2    3    3  3.0  3
3  NaN  NaN  NaN  4

print(df.dtypes)

A    object
B    object
C    object
D    object
dtype: object

print(df.values)

array([['1', '1', '1.0', '1'],
       ['2', '2', '2.0', '2'],
       ['3', '3', '3.0', '3'],
       [nan, nan, nan, '4']], dtype=object)
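
If the empty strings need to survive the read, `read_csv` has a `keep_default_na` parameter that switches off the default NaN markers. The following is a minimal sketch under that assumption; it uses an in-memory CSV (a shortened, hypothetical variant of `sample.csv`) and `dtype=str` so that every column is read as text.

```python
import io

import pandas as pd

# Shortened stand-in for sample.csv: the last row has an empty field and a quoted empty string
csv_text = 'A,B\n1,"x"\n,""\n'

# Default behaviour: both the empty field and "" become NaN
default_df = pd.read_csv(io.StringIO(csv_text), dtype=str)
print(default_df.isna().values.tolist())

# keep_default_na=False: nothing is parsed as NaN, empty strings are kept
raw_df = pd.read_csv(io.StringIO(csv_text), dtype=str, keep_default_na=False)
print(raw_df.values.tolist())
```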

Specify float

All columns are converted to float64 type


df = pd.read_csv("sample.csv", dtype=float)

print(df)

     A    B    C    D
0  1.0  1.0  1.0  1.0
1  2.0  2.0  2.0  2.0
2  3.0  3.0  3.0  3.0
3  NaN  NaN  NaN  4.0

print(df.dtypes)

A    float64
B    float64
C    float64
D    float64
dtype: object

print(df.values)

array([[ 1.,  1.,  1.,  1.],
       [ 2.,  2.,  2.,  2.],
       [ 3.,  3.,  3.,  3.],
       [nan, nan, nan,  4.]])

Specify int

Since empty fields and empty strings are converted to np.nan, the columns cannot be read as int and an error is raised.


df = pd.read_csv("sample.csv", dtype=int)

ValueError: Integer column has NA values in column 0
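
One way to handle this error is to catch it and fall back to a nullable integer dtype for the affected column. This is a sketch, not the author's method: it assumes a recent pandas where `read_csv` accepts the `"Int64"` extension dtype, and it uses a reduced, hypothetical two-column CSV.

```python
import io

import pandas as pd

# Reduced stand-in for sample.csv: column A has an empty field, column D does not
csv_text = "A,D\n1,1\n2,2\n,4\n"

try:
    pd.read_csv(io.StringIO(csv_text), dtype=int)
    int_read_failed = False
except ValueError as exc:
    # Raised because column A contains a missing value
    int_read_failed = True
    print(exc)

# Fallback: nullable "Int64" for the column with gaps, plain int64 for the rest
df = pd.read_csv(io.StringIO(csv_text), dtype={"A": "Int64", "D": "int64"})
print(df.dtypes)
```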

Behavior of fillna and dropna

Behavior when fillna and dropna are applied to the following data


df = pd.DataFrame(
    {
        # Column A: int + None
        "A": [1, 2, 3, None],
        # Column B: str + empty string
        "B": ["1", "2", "3", ""],
        # Column C: int + np.nan
        "C": [1, 2, 3, np.nan],
        # Column D: int only
        "D": [1, 2, 3, 4]
    },
    dtype="object"
)

print(df.values)

array([[1, '1', 1, 1],
       [2, '2', 2, 2],
       [3, '3', 3, 3],
       [None, '', nan, 4]], dtype=object)

With df.fillna('FILL'), None and np.nan are replaced, but the empty string is left as it is.


print(df.fillna('FILL'))

      A  B     C  D
0     1  1     1  1
1     2  2     2  2
2     3  3     3  3
3  FILL     FILL  4

print(df.fillna('FILL').values)

array([[1, '1', 1, 1],
       [2, '2', 2, 2],
       [3, '3', 3, 3],
       ['FILL', '', 'FILL', 4]], dtype=object)

dropna behaves the same way: rows or columns containing np.nan or None are dropped, but empty strings are not treated as missing values.


print(df.dropna(axis=1))

   B  D
0  1  1
1  2  2
2  3  3
3     4

print(df.dropna(axis=1).values)

array([['1', 1],
       ['2', 2],
       ['3', 3],
       ['', 4]], dtype=object)
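
If empty strings should also count as missing, they can be turned into NaN before calling fillna or dropna. The sketch below uses `Series.mask` for that step; this normalization is my own addition and was not part of the original verification.

```python
import pandas as pd

b = pd.Series(["1", "2", "3", ""])

# mask replaces the entries where the condition is True with NaN
b_clean = b.mask(b == "")

filled = b_clean.fillna("FILL").tolist()
dropped = b_clean.dropna().tolist()
print(filled)
print(dropped)
```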

Behavior of groupby

The verification uses the following data frame


df = pd.DataFrame(
    {
        # Column A: int + None
        "A": [1, 2, 3, None],
        # Column B: str + empty string
        "B": ["1", "2", "3", ""],
        # Column C: int + np.nan
        "C": [1, 2, 3, np.nan],
        # Column D: int only
        "D": [1, 2, 3, 4]
    },
    dtype="object"
)

When grouping by a column that contains None or np.nan, the rows whose key is None or np.nan are ignored (they are missing from the result).


print(df.groupby("A").max().reset_index())

   A  B  C  D
0  1  1  1  1
1  2  2  2  2
2  3  3  3  3

print(df.groupby("A").max().reset_index().values)

array([[1, '1', 1, 1],
       [2, '2', 2, 2],
       [3, '3', 3, 3]], dtype=object)

print(df.groupby("C").max().reset_index())

   C  A  B  D
0  1  1  1  1
1  2  2  2  2
2  3  3  3  3

print(df.groupby("C").max().reset_index().values)

array([[1, 1, '1', 1],
       [2, 2, '2', 2],
       [3, 3, '3', 3]], dtype=object)

When grouping by a column that contains an empty string, the empty string forms its own group and is not ignored.


print(df.groupby("B").max().reset_index())

   B    A    C  D
0     NaN  NaN  4
1  1  1.0  1.0  1
2  2  2.0  2.0  2
3  3  3.0  3.0  3

print(df.groupby("B").max().reset_index().values)

array([['', nan, nan, 4],
       ['1', 1.0, 1.0, 1],
       ['2', 2.0, 2.0, 2],
       ['3', 3.0, 3.0, 3]], dtype=object)
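
On newer pandas versions the missing keys do not have to be dropped. The sketch below assumes the `dropna` parameter of `groupby`, which as far as I know was added in pandas 1.1 and is therefore not available in the versions verified above; the small frame and its column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "key": ["a", None, "", "a"],
        "val": [1, 2, 3, 4],
    }
)

# Default: the row with a missing key is dropped, the empty string is its own group
default_sums = df.groupby("key")["val"].sum().to_dict()
print(default_sums)

# dropna=False keeps the missing key as an additional group (assumes pandas >= 1.1)
kept_sums = df.groupby("key", dropna=False)["val"].sum()
print(len(kept_sums))
```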