I kept getting tripped up by None and np.nan in pandas, so these are my personal notes.
The environments used for verification are as follows (there was no difference in the results between them).
| | None | np.nan | Empty string |
|---|---|---|---|
| DataFrame conversion | Converted to np.nan unless dtype=object is specified | np.nan cannot be represented as an int, so columns containing np.nan basically become float | Treated as a string (non-numeric) value, so it is not treated as missing; columns containing an empty string basically become object |
| read_csv | - | Both empty fields and empty strings in the csv are read as np.nan, regardless of which dtype is specified | - |
| fillna, dropna | Treated as a missing value | Treated as a missing value | Not treated as a missing value |
| groupby | Treated as a missing value and ignored | Treated as a missing value and ignored | Not treated as a missing value |
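As a quick sanity check of the table above, pd.isna() reports None and np.nan as missing but not the empty string (a minimal sketch, separate from the verification below):

```python
import numpy as np
import pandas as pd

# None and np.nan are recognized as missing values; the empty string is not
print(pd.isna(None))    # True
print(pd.isna(np.nan))  # True
print(pd.isna(""))      # False
```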
Checking how the column types change when the following data is given different dtypes.
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        # Column A: int + None
        "A": [1, 2, 3, None],
        # Column B: str + empty string
        "B": ["1", "2", "3", ""],
        # Column C: int + np.nan
        "C": [1, 2, 3, np.nan],
        # Column D: int only
        "D": [1, 2, 3, 4]
    }
)
None seems to be converted to np.nan, and along with that, the columns that contain np.nan become float64.
df = pd.DataFrame(
    {
        "A": [1, 2, 3, None],
        "B": ["1", "2", "3", ""],
        "C": [1, 2, 3, np.nan],
        "D": [1, 2, 3, 4]
    }
)
print(df)
A B C D
0 1.0 1 1.0 1
1 2.0 2 2.0 2
2 3.0 3 3.0 3
3 NaN NaN 4
print(df.dtypes)
A float64
B object
C float64
D int64
dtype: object
print(df.values)
array([[1.0, '1', 1.0, 1],
[2.0, '2', 2.0, 2],
[3.0, '3', 3.0, 3],
[nan, '', nan, 4]], dtype=object)
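To confirm that the value passed in as None really is stored as np.nan here, the individual cell can be inspected (a small sketch using the frame above; row label 3 is the last row):

```python
# The cell that was given as None is now a float NaN, not the None object
val = df.loc[3, "A"]
print(val)           # nan
print(val is None)   # False
print(pd.isna(val))  # True
```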
With dtype=object, none of the values are converted: None stays as None.
df = pd.DataFrame(
    {
        "A": [1, 2, 3, None],
        "B": ["1", "2", "3", ""],
        "C": [1, 2, 3, np.nan],
        "D": [1, 2, 3, 4]
    },
    dtype=object
)
print(df)
A B C D
0 1 1 1 1
1 2 2 2 2
2 3 3 3 3
3 None NaN 4
print(df.dtypes)
A object
B object
C object
D object
dtype: object
print(df.values)
array([[1, '1', 1, 1],
[2, '2', 2, 2],
[3, '3', 3, 3],
[None, '', nan, 4]], dtype=object)
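Conversely, with dtype=object the original Python objects are kept: the cell given as None is still the None object, and the np.nan in column C is still a float NaN (same sketch style as above):

```python
# With dtype=object the original objects survive untouched
print(df.loc[3, "A"] is None)   # True: still the None object
print(pd.isna(df.loc[3, "C"]))  # True: still np.nan
```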
With dtype=float, the empty string cannot be converted to float, so only the column containing the empty string remains object.
df = pd.DataFrame(
    {
        "A": [1, 2, 3, None],
        "B": ["1", "2", "3", ""],
        "C": [1, 2, 3, np.nan],
        "D": [1, 2, 3, 4]
    },
    dtype=float
)
print(df)
A B C D
0 1.0 1 1.0 1.0
1 2.0 2 2.0 2.0
2 3.0 3 3.0 3.0
3 NaN NaN 4.0
print(df.dtypes)
A float64
B object
C float64
D float64
dtype: object
print(df.values)
array([[1.0, '1', 1.0, 1.0],
[2.0, '2', 2.0, 2.0],
[3.0, '3', 3.0, 3.0],
[nan, '', nan, 4.0]], dtype=object)
With dtype=int, the columns that cannot be converted to int64 (those containing np.nan or None) end up as object.
df = pd.DataFrame(
    {
        "A": [1, 2, 3, None],
        "B": ["1", "2", "3", ""],
        "C": [1, 2, 3, np.nan],
        "D": [1, 2, 3, 4]
    },
    dtype=int
)
print(df)
A B C D
0 1 1 1 1
1 2 2 2 2
2 3 3 3 3
3 None NaN 4
print(df.dtypes)
A object
B object
C object
D int64
dtype: object
print(df.values)
array([[1, '1', 1, 1],
[2, '2', 2, 2],
[3, '3', 3, 3],
[None, '', nan, 4]], dtype=object)
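If an integer column that can actually hold missing values is needed, newer pandas versions (introduced around 0.24) offer the nullable Int64 extension dtype. This is only a hedged sketch and was not part of the verification above:

```python
# Nullable integer dtype (newer pandas): the column stays integer-typed
# even though it contains a missing value
s = pd.Series([1, 2, 3, None], dtype="Int64")
print(s.dtype)  # Int64
print(s)
```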
Checking what happens to the column types when the following csv is read with different dtypes.
sample.csv (column A: int + empty field, column B: string + empty string, column C: float + empty field, column D: int only)
A,B,C,D
1,"1",1.0,1
2,"2",2.0,2
3,"3",3.0,3
,"",,4
Both empty fields and empty strings are read as np.nan, and the int columns are converted to float accordingly.
df = pd.read_csv("sample.csv")
print(df)
A B C D
0 1.0 1.0 1.0 1
1 2.0 2.0 2.0 2
2 3.0 3.0 3.0 3
3 NaN NaN NaN 4
print(df.dtypes)
A float64
B float64
C float64
D int64
dtype: object
print(df.values)
array([[ 1., 1., 1., 1.],
[ 2., 2., 2., 2.],
[ 3., 3., 3., 3.],
[nan, nan, nan, 4.]])
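If the empty fields and empty strings in the csv should be kept as empty strings instead of being turned into np.nan, read_csv's keep_default_na option can be switched off (a sketch; the resulting dtypes were not verified as part of these notes):

```python
# keep_default_na=False stops read_csv from mapping empty fields to NaN,
# so they come back as '' (the columns containing them should then be object)
df = pd.read_csv("sample.csv", keep_default_na=False)
print(df.dtypes)
print(df.values)
```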
With dtype=object, empty fields and empty strings are still converted to np.nan, while the other values are read as str.
df = pd.read_csv("sample.csv", dtype=object)
print(df)
A B C D
0 1 1 1.0 1
1 2 2 2.0 2
2 3 3 3.0 3
3 NaN NaN NaN 4
print(df.dtypes)
A object
B object
C object
D object
dtype: object
print(df.values)
array([['1', '1', '1.0', '1'],
['2', '2', '2.0', '2'],
['3', '3', '3.0', '3'],
[nan, nan, nan, '4']], dtype=object)
All columns are converted to float64 type
df = pd.read_csv("sample.csv", dtype=float)
print(df)
A B C D
0 1.0 1.0 1.0 1.0
1 2.0 2.0 2.0 2.0
2 3.0 3.0 3.0 3.0
3 NaN NaN NaN 4.0
print(df.dtypes)
A float64
B float64
C float64
D float64
dtype: object
print(df.values)
array([[ 1., 1., 1., 1.],
[ 2., 2., 2., 2.],
[ 3., 3., 3., 3.],
[nan, nan, nan, 4.]])
Since empty fields and empty strings are converted to np.nan, the columns cannot be read as int and an error occurs.
df = pd.read_csv("sample.csv", dtype=int)
ValueError: Integer column has NA values in column 0
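One workaround when an int column is really wanted is to read the csv normally (the missing values become NaN and the column becomes float), fill them, and then cast. This is only a sketch, and the fill value 0 is an arbitrary choice for illustration:

```python
# Read normally, fill the missing values, then cast back to int.
# Filling with 0 is just an assumption made for this example.
df = pd.read_csv("sample.csv")
df["A"] = df["A"].fillna(0).astype(int)
print(df.dtypes)
```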
Behavior of fillna / dropna on the following data.
df = pd.DataFrame(
    {
        # Column A: int + None
        "A": [1, 2, 3, None],
        # Column B: str + empty string
        "B": ["1", "2", "3", ""],
        # Column C: int + np.nan
        "C": [1, 2, 3, np.nan],
        # Column D: int only
        "D": [1, 2, 3, 4]
    },
    dtype="object"
)
print(df.values)
array([[1, '1', 1, 1],
[2, '2', 2, 2],
[3, '3', 3, 3],
[None, '', nan, 4]], dtype=object)
Calling df.fillna('FILL') converts the None and np.nan values, but the empty string remains as it is.
print(df.fillna('FILL'))
A B C D
0 1 1 1 1
1 2 2 2 2
2 3 3 3 3
3 FILL FILL 4
print(df.fillna('FILL').values)
array([[1, '1', 1, 1],
[2, '2', 2, 2],
[3, '3', 3, 3],
['FILL', '', 'FILL', 4]], dtype=object)
Similarly with dropna: the rows/columns containing np.nan or None are dropped, but empty strings are not treated as missing values.
print(df.dropna(axis=1))
B D
0 1 1
1 2 2
2 3 3
3 4
print(df.dropna(axis=1).values)
array([['1', 1],
['2', 2],
['3', 3],
['', 4]], dtype=object)
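Since the empty string is not seen as missing, a common trick is to convert it to np.nan first with replace; after that, fillna and dropna treat it like the other missing values (a sketch using the same df as above):

```python
# Turn empty strings into np.nan so fillna / dropna also catch them
df_replaced = df.replace("", np.nan)
print(df_replaced.fillna("FILL"))  # column B's last value also becomes FILL
print(df_replaced.dropna(axis=1))  # only column D survives now
```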
The groupby behavior is verified using the following DataFrame.
df = pd.DataFrame(
    {
        # Column A: int + None
        "A": [1, 2, 3, None],
        # Column B: str + empty string
        "B": ["1", "2", "3", ""],
        # Column C: int + np.nan
        "C": [1, 2, 3, np.nan],
        # Column D: int only
        "D": [1, 2, 3, 4]
    },
    dtype="object"
)
When grouping on a column that contains None or np.nan, the rows whose key is None or np.nan are ignored (treated as missing).
print(df.groupby("A").max().reset_index())
A B C D
0 1 1 1 1
1 2 2 2 2
2 3 3 3 3
print(df.groupby("A").max().reset_index().values)
array([[1, '1', 1, 1],
[2, '2', 2, 2],
[3, '3', 3, 3]], dtype=object)
print(df.groupby("C").max().reset_index())
C A B D
0 1 1 1 1
1 2 2 2 2
2 3 3 3 3
print(df.groupby("C").max().reset_index().values)
array([[1, 1, '1', 1],
[2, 2, '2', 2],
[3, 3, '3', 3]], dtype=object)
If the grouping column contains an empty string, that row is not ignored.
print(df.groupby("B").max().reset_index())
B A C D
0 NaN NaN 4
1 1 1.0 1.0 1
2 2 2.0 2.0 2
3 3 3.0 3.0 3
print(df.groupby("B").max().reset_index().values)
array([['', nan, nan, 4],
       ['1', 1.0, 1.0, 1],
       ['2', 2.0, 2.0, 2],
       ['3', 3.0, 3.0, 3]], dtype=object)
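The same replace trick works for groupby: converting the empty string to np.nan beforehand makes that row be ignored as well (a sketch using the same df):

```python
# After replacing '' with np.nan, the empty-string row is also dropped by groupby
print(df.replace("", np.nan).groupby("B").max().reset_index())
```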