When reading a csv or excel file, if there is a blank cell, it will be read as a missing value (NaN).
A summary of how to find, delete, and convert matrices that contain this missing value (NaN).
Contains cells with no values entered or blank matrices.
Read csv file
import pandas as pd
df = pd.read_csv('~/desktop/test.csv')
df
See below for details on reading csv files with python.
-The blank cell becomes "NaN". -Formula errors "#REF!", "# DIV / 0!", And "#NAME?" Are not recognized as missing values (as they were). ・ The first line is recognized as a header
** 1-1. Mapping the entire table **
df.isna()
└ "df": Data with a table
Map the entire table
df.isna()
▼ Output
NaN is displayed as "True" and others are displayed as "False".
▼ Original data
pd.isna (specified range)
└ "pd.isna ()": pd becomes an object
Map specified columns
pd.isna(df['col1'])
#output
0 False
1 False
2 False
3 False
4 True
5 False
Name: col1, dtype: bool
Map specified line
pd.isna(df['col1'])
#output
0 False
1 False
2 False
3 False
4 True
5 False
Name: col1, dtype: bool
Map specified line
pd.isna(df.loc[[3]])
▽ Output
Map specified line
pd.isna(df[1:4])
▽ Output
If you put a value in "()" of pd.isna ()
, it will be judged whether it is NaN.
⇒Enter the cell element.
Judge cell (specified by matrix name)
pd.isna(df.loc[0,'col4'])
#output
#True
Judge cell (specified by matrix name)
pd.isna(df.iloc[1,3])
#output
#True
There are various ways to specify the range of a DataFrame table. Click here for details
True if applicable. False if not applicable.
Whether it is a row or a column is switched by the presence or absence of the option "axis = 1".
-Df.isna (). All ()
: ** Column ** list
・ Df.isna (). All (axis = 1)
: ** Line ** list
List of columns
df.isna().all()
#output
col1 False
Unnamed: 1 False
col3 False
col4 True
Unnamed: 4 True
col6 False
dtype: bool
The column names "col4" and "Unnamed: 4" are empty.
List of rows
df.isna().all(axis=1)
0 False
1 False
2 False
3 False
4 True
5 False
dtype: bool
Line name "4" is blank
True if applicable. False if not applicable.
Whether it is a row or a column is switched by the presence or absence of the option "axis = 1".
-Df.isna (). Any ()
: ** Column ** list
・ Df.isna (). Any (axis = 1)
: ** Line ** list
List of columns
df.isna().any()
#output
col1 True
Unnamed: 1 True
col3 True
col4 True
Unnamed: 4 True
col6 True
dtype: bool
All columns contain NaN.
List of rows
df.isna().any(axis=1)
0 True
1 True
2 True
3 True
4 True
5 True
dtype: bool
All lines contain NaN.
It can be used when you want to see the state quickly.
ʻAny ()` returns True if there is at least one True in the list, False otherwise.
Check if there is at least one blank column
df.isna().all().any()
#output
# True
If there is at least one ** all blank row ** in the table, False is returned if there is no True.
Check if there is at least one blank line
df.isna().all(axis=1).any()
#output
# True
Does it contain at least one NaN?
df.isna().any().any()
#output
# True
Whether it is a row or a column is switched by the presence or absence of the option "axis = 1".
・ Df.isna (). Sum ()
: Count number per ** column **
・ Df.isna (). Sum (axis = 1)
: Count number per ** line **
Count the NaN contained in each column
df.isna().sum()
#output
col1 1
Unnamed: 1 2
col3 1
col4 6
Unnamed: 4 6
col6 2
dtype: int64
Count NaN contained in each row
df.isna().sum(axis=1)
#output
0 2
1 3
2 3
3 2
4 6
5 2
dtype: int64
Furthermore, by doing sum ()
, the NaN number of each line is added up.
Number of NaNs in the entire table
df.isna().sum().sum()
#output
# 18
len (df ['line name'])
--count (df ['line name'])
Find the number of NaNs in the specified row or column.
② count (df ['line name'])
└ Number of elements other than NaN contained in the specified line
By setting (1)-(2), the number of NaNs contained in the specified line is calculated.
Supplement
count method: counts the number of elements other than NaN
** ▼ Target column (column containing two NaN. Column name: col6) **
Target column
df.col6
0 A
1 NaN
2 C
3 D
4 NaN
5 F
Number of NaNs in the specified row
len(df['col6'])-df['col6'].count()
#output
# 2
option | Contents |
---|---|
axis=0 | Line (optional) |
axis=1 | Column |
how='any' | Delete if there is even one NaN(Optional) |
how='all' | Delete all NaN |
thresh=n | Not NaN(Valid)Do not delete if there are n or more cells |
subset=['A',,] | Delete rows that contain NaN in the specified column name (remove NaN from the specified column name) |
subset=['a',,], axis=1 | Delete columns that contain NaN in the specified row name (remove NaN from the specified row name) |
inplace=True | Allow overwriting |
① axis = 0 (omitted) ②axis=1 ③dropna(how='all').dropna(how='all', axis=1)
dropna(how='all')
dropna(how='all', axis=0)
Remove all blank lines
df.dropna(how='all')
▽ Output
Line 4 deleted.
②axis=1
dropna(how='all', axis=1)
Delete "columns" that contain spaces.
Remove all blank columns
df.dropna(how='all')
▽ Output
Removed two columns, col4 and Unnamed: 4.
Removed rows and columns where all elements are blank.
After removing blank lines with dropna (how ='all')
Remove blank columns with dropna (how ='all', axis = 1)
.
Remove blank rows and columns
df.dropna(how='all').dropna(how='all', axis=1)
▽ Output
Row 4, column col4, Unnamed: 4 deleted.
① Default ②axis=1
There are four NaNs. (Row, Column) = (4,1), (5,2), (1,5), (4,5)
The default of the dropna method is that how ='any'
is specified as an option.
dropna()
dropna(how='any')
dropna(how='any', axis=0)
In each case, lines 1, 4, 5, and 6 including NaN have been deleted.
Since how ='any'
can be omitted, the following two processes are the same.
dropna(axis=1)
dropna(how='any', axis=1)
All contain one or more NaNs, col1, Unnamed: 1, col5 deleted.
①df.dropna(thresh=n) ②df.dropna(thresh=n, axis=1)
▽ Supplement thresh is an abbreviation for "threshold".
・ 10 rows x 6 columns table -The number of NaN is 1 in the 0th row (9 cells are not). The number of NaNs increases one column at a time. ・ All 4th columns are blank
Deleted rows with one or less cells other than NaN. Same as deleting all NaN lines.
** ▼ When thresh = 1 **
Specifying the threshold (n)=1)
df.dropna(thresh=1)
There are no NaN-only cells, so all rows remain.
Specifying the threshold (n)=4)
df.dropna(thresh=4)
Rows with 4 or less cells other than NaN, rows 1, 2 and 3 are deleted.
Specifying the threshold (n)=7)
df.dropna(thresh=7)
Since there are no rows with 7 or more cells other than NaN, delete all.
Deleted columns with one or less cells other than NaN. Same as deleting all NaN columns.
Specifying the threshold (n)=1)
df.dropna(thresh=1, axis=1)
All NaN column 4 removed.
Specifying the threshold (n)=7)
df.dropna(thresh=7, axis=1)
Columns with 7 or less cells other than NaN, rows 4, 5, and 6 are deleted.
Specifying the threshold (n)=11)
df.dropna(thresh=11, axis=1)
Delete all because there are no columns with 11 or more cells other than NaN.
①df.dropna(subset=['AAA', 'BBB']) ②df.dropna(subset=['aaa', 'bbb'], axis=1)
・ 10 rows x 6 columns ・ The row name is row x row number ・ Column name is col x column number
①df.dropna(subset=['AAA', 'BBB']) Delete the row by specifying the column name.
df.dropna(subset=['AAA', 'BBB'])
└ "df": Variable with table
└ "subset": Option to specify the name of the table or column
└ "AAA" "BBB": Column name (optional)
-Whether the deletion target is row or column is specified by the presence or absence of axis.
Delete row by specifying by column name
df.dropna(subset=['col1','col3','col5'])
Lines 1,2,3,4,5 deleted. The specified columns col1,3,5 no longer contain NaN.
Delete the column with NaN in rows 0 and row 3. (NaN disappears from the specified column)
Delete row by specifying by column name
df.dropna(subset=['row0','row3'], axis=1)
Removed columns col3, col4, col5.
Do not overwrite by default.
└inplace=False
Describe the following to allow overwriting.
└**inplace=True
**
Overwrite permission
df.dropna(thresh=5, inplace=True)
df
No overwrite
df.dropna(thresh=5)
df
The table remains the same.
The replacement method and the conditions of NaN to be replaced are specified as options.
option | Contents |
---|---|
axis=0 | Line (omitted) |
axis=1 | Column |
value | Value to convert NaN |
method='bfill' | Convert to lower cell value of NaN |
method='backfill' | 'bfill'Same as |
method='ffill' | Convert NaN to the value in the upper cell |
method='pad' | 'ffill'Same as |
method='bfill', axis=1 | Convert to NaN right cell value |
method='ffill', axis=1 | Convert to the value in the left cell of NaN |
limit=n | Replace up to nth NaN. Do not replace any more. |
inplace=True | Allow overwriting |
** ② Specify the number of NaN to replace ** 1-1. fillna('AAA', limit=n) 1-2. fillna('AAA', limit=n, axis=1) 1-3. fillna(method='ffill', limit=n) 1-4. fillna(method='ffill', limit=n, axis=1)
** ③ Allow overwriting **
fillna('AAA')
df.fillna('AAA')
** ▼ Example: When you specify to replace with the value in the right cell **
Before replacement: NaN NaN NaN AAA After replacement: AAA AAA AAA AAA
** ▼ method option list **
method | meaning | Contents |
---|---|---|
ffill | forward fill | Replace with the value in the front cell (row: top, column: left) |
pad | padding | Same as ffill |
bfill | backward fill | Replace with the value in the back cell (row: bottom, column: right) |
backfill | backward fill | Same as bfill |
When specifying a line, determine whether it is before or after based on the line number. (Up or down)
When specifying a column, determine whether it is the front or the back based on the column number. (Left or right)
error
ValueError: Cannot specify both 'value' and 'method'.
Replace with the value in the top row.
ffill
df.fillna(method='ffill')
Convert to a value above NaN. Not replaced if the top row is NaN.
Replace with the value in the front column.
ffill,axis=1
df.fillna(method='ffill', axis=1)
Replace with the value in the last row.
bfill
df.fillna(method='bfill')
Replace with the value in the back column.
bfill,axis=1
df.fillna(method='bfill', axis=1)
By using the limit option, you can specify the number of NaN to replace.
limit=n
Replace up to nth. Do not replace any more.
** Example: When limit = 2, the value to be replaced AAA, and the axis is a column **
■ Before replacement AAA NaN NaN AAA AAA AAA NaN NaN NaN AAA AAA NaN AAA NaN AAA NaN NaN AAA NaN NaN
■ After replacement AAA AAA AAA AAA AAA AAA AAA AAA NaN AAA AAA AAA AAA AAA AAA AAA AAA AAA NaN NaN
-Replace each line from the left up to the second NaN.
From the top in line number order, NaN replaces up to the second.
limit=2
df.fillna('AAA', limit=2)
Replace from top to bottom for each column. Do not replace any more.
Replace from the top to the third NaN in line number order.
limit=3,axis=1
df.fillna('AAA', limit=3, axis=1)
Replaces up to the third NaN in line number order with the previous value.
ffill,limit=3
df.fillna(method='ffill', limit=3)
Replace up to the third NaN in column number order with the previous value.
ffill,limit=3,axis=1
df.fillna(method='ffill', limit=3)
Do not overwrite by default.
└inplace=False
Describe the following to allow overwriting.
└**inplace=True
**
Overwrite permission
df.fillna('AAA', inplace=True)
df
No overwrite
df.fillna('AAA')
df
The table remains the same.
Recommended Posts