[Python] How to extract / delete / convert a matrix containing missing values (NaN)

When reading a csv or excel file, if there is a blank cell, it will be read as a missing value (NaN).

A summary of how to find, delete, and convert matrices that contain this missing value (NaN).

**table of contents**

[Identification of elements / matrices containing missing values](# 1 Identification of element matrices containing missing values)
1. isna()
2. isnull()
[Count the number of missing values](# 2 Count the number of missing values)
[Delete missing value](# 3 Delete missing value)
[List of dropna options](List of options for #dropna)
[Delete all missing value matrix](# 5 Delete all missing value matrix)
[axis = 0 (omitted)](# axis0 omitted) 2. axis=1 3. dropna(how='all').dropna(how='all', axis=1)
[Delete by specifying the number of non-missing cells](# 6 Delete by specifying the number of non-missing cells)
1. df.dropna(thresh=n)
2. df.dropna(thresh=n, axis=1)
[Specify the row / column where you want to delete NaN](Specify the matrix where you want to delete # 7nan)
1. df.dropna(subset=['AAA', 'BBB'])
2. df.dropna(subset=['aaa', 'bbb'], axis=1)
[Allow delete overwrite](# 8 Allow delete overwrite)
[Convert Missing Value](# 9 Convert Missing Value)
[Replace entire table](#Replace entire table) 1. fillna('AAA')
[method option](# 1-2-method option) 3. fillna(method='ffill') 4. fillna(method='ffill', axis=1) 5. fillna(method='bfill') 6. fillna(method='bfill', axis=1)
[Specify how many NaNs to replace](#Specify how many nans to replace) 1. fillna('AAA', limit=n) 2. fillna('AAA', limit=n, axis=1) 3. fillna(method='ffill', limit=n) 4. fillna(method='ffill', limit=n, axis=1)
[Allow overwrite](# Allow overwrite)

## 1. 1. Identification of elements / matrices containing missing values **①isna()** 　1-1. isna().all() 　1-2. isna().any() 　1-3. isna().all().any() 　1-4. isna().any().any() **②isnull()**
** ▼ Usage ** -** Element identification ** is ** isna method **. -To find out if ** all ** of a row or column is NaN, combine the ** isna method and the all method **. -To find out if there is NaN in the ** element of the row or column **, combine the ** isna method and any method **. -Isnull method is ** isna method alias ** (alias). The official page mainly uses isna.
## Original data to use Read the 6-by-6 csv file as shown below.

Contains cells with no values entered or blank matrices.

▼ [Reference] Read the test.csv file on the desktop.

`Read csv file`


import pandas as pd
df = pd.read_csv('~/desktop/test.csv')
df

See below for details on reading csv files with python.

How to read csv file Application for reading csv files

### Read data (stored in variable df) ![image.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/563526/d1656f44-8997-228f-ae49-1d46eb24da4c.png)

-The blank cell becomes "NaN". -Formula errors "#REF!", "# DIV / 0!", And "#NAME?" Are not recognized as missing values (as they were). ・ The first line is recognized as a header

①isna() Used to check if each element is NaN. If it is NaN, "True" is returned, otherwise "False" is returned. (Boolean type: boolean)

** 1-1. Mapping the entire table **

df.isna() └ "df": Data with a table

`Map the entire table`


df.isna()

▼ Output

NaN is displayed as "True" and others are displayed as "False".

▼ Original data

** 1-2. Mapping by specific row, column, element ** It is also possible to map a specific matrix or element by specifying the target range.

pd.isna (specified range) └ "pd.isna ()": pd becomes an object

** ▼ Map specified columns **

`Map specified columns`


pd.isna(df['col1'])

#output
0    False
1    False
2    False
3    False
4     True
5    False
Name: col1, dtype: bool

** ▼ Map specified line **

`Map specified line`


pd.isna(df['col1'])

#output
0    False
1    False
2    False
3    False
4     True
5    False
Name: col1, dtype: bool

** ▼ Map specified line **

`Map specified line`


pd.isna(df.loc[[3]])

▽ Output

** ▼ Map multiple lines **

`Map specified line`


pd.isna(df[1:4])

▽ Output

** ▼ Judge the specified cell **

If you put a value in "()" of pd.isna (), it will be judged whether it is NaN. ⇒Enter the cell element.

** ① Judge the cell with row name "0" and column name "col4" ** -Cell specification: `df.loc [0,'col4']`

`Judge cell (specified by matrix name)`


pd.isna(df.loc[0,'col4'])

#output
#True

** ② Judge the cell with row number "1" and column number "3" ** -Cell specification: `df.iloc [1,3]`

`Judge cell (specified by matrix name)`


pd.isna(df.iloc[1,3])

#output
#True

There are various ways to specify the range of a DataFrame table. Click here for details

1-1. isna().all() List rows or columns with ** all NaN ** contents.

True if applicable. False if not applicable.

Whether it is a row or a column is switched by the presence or absence of the option "axis = 1".

-Df.isna (). All () : ** Column ** list ・ Df.isna (). All (axis = 1): ** Line ** list

** ▼ Original table ** ![image.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/563526/bd35b75c-b12b-62fc-75b3-339b4a4f126e.png)
** ▼ List of columns ** `df.isna().all()`

`List of columns`


df.isna().all()

#output
col1          False
Unnamed: 1    False
col3          False
col4           True
Unnamed: 4     True
col6          False
dtype: bool

The column names "col4" and "Unnamed: 4" are empty.

** ▼ List of lines ** `df.isna().all(axis=1)`

`List of rows`


df.isna().all(axis=1)

0    False
1    False
2    False
3    False
4     True
5    False
dtype: bool

Line name "4" is blank

1-2. isna().any() List rows or columns that contain ** at least one NaN **.

True if applicable. False if not applicable.

Whether it is a row or a column is switched by the presence or absence of the option "axis = 1".

-Df.isna (). Any () : ** Column ** list ・ Df.isna (). Any (axis = 1): ** Line ** list

** ▼ Original table ** ![image.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/563526/bd35b75c-b12b-62fc-75b3-339b4a4f126e.png)
** ▼ List of columns ** `df.isna().any()`

`List of columns`


df.isna().any()

#output
col1          True
Unnamed: 1    True
col3          True
col4          True
Unnamed: 4    True
col6          True
dtype: bool

All columns contain NaN.

** ▼ List of lines ** `df.isna().any(axis=1)`

`List of rows`


df.isna().any(axis=1)

0    True
1    True
2    True
3    True
4    True
5    True
dtype: bool

All lines contain NaN.

1-3. isna().all().any() ** If there is at least one blank column in the table ** returns False if there is no True.

It can be used when you want to see the state quickly.

** ▼ What you are doing ** Get a list showing the state of each column as True or False with ʻisna (). All`.

ʻAny ()` returns True if there is at least one True in the list, False otherwise.

`Check if there is at least one blank column`


df.isna().all().any()

#output
# True

1-3'. isna().all(axis=1).any() By specifying "axis = 1" in the option of all method,

If there is at least one ** all blank row ** in the table, False is returned if there is no True.

`Check if there is at least one blank line`


df.isna().all(axis=1).any()

#output
# True

1-4. isna().any().any() Returns True ** if there is at least one NaN in the table, False otherwise.

`Does it contain at least one NaN?`


df.isna().any().any()

#output
# True

②isnull() isna method alias Even if the above isna is replaced with isnull, the process is exactly the same.
>> Link to official page - [isna()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.isna.html) - [all()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.all.html) - [any()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.any.html) - [isnull()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.isnull.html)
## 2. Count the number of missing values ①isna().sum() ②isna().sum().sum() ③len() - count()
①isna().sum() Count how many NaNs each row or column contains.

Whether it is a row or a column is switched by the presence or absence of the option "axis = 1".

・ Df.isna (). Sum (): Count number per ** column ** ・ Df.isna (). Sum (axis = 1): Count number per ** line **

** ▼ Original table ** ![image.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/563526/bd35b75c-b12b-62fc-75b3-339b4a4f126e.png)
** ▼ Count NaN in column ** `df.isna().sum()`

`Count the NaN contained in each column`


df.isna().sum()

#output

col1          1
Unnamed: 1    2
col3          1
col4          6
Unnamed: 4    6
col6          2
dtype: int64

** ▼ Count NaN in line ** `df.isna().sum(axis=1)`

`Count NaN contained in each row`


df.isna().sum(axis=1)

#output

0    2
1    3
2    3
3    2
4    6
5    2
dtype: int64

②isna().sum().sum() Count the number of NaNs contained in the entire table.
** ▼ What you are doing ** Count the number of NaNs for each row with ʻisna (). Sum ()`.

Furthermore, by doing sum (), the NaN number of each line is added up.

`Number of NaNs in the entire table`


df.isna().sum().sum()

#output
# 18

③len() - count()

len (df ['line name']) --count (df ['line name'])

Find the number of NaNs in the specified row or column.

Target is one column or one row

** ▼ What you are doing ** ① `len (df ['line name'])` └ Count the number of elements in the specified row

② count (df ['line name']) └ Number of elements other than NaN contained in the specified line

By setting (1)-(2), the number of NaNs contained in the specified line is calculated.

Supplement
count method: counts the number of elements other than NaN

** ▼ Target column (column containing two NaN. Column name: col6) **

`Target column`


df.col6

0      A
1    NaN
2      C
3      D
4    NaN
5      F

** ▼ Execution result **

`Number of NaNs in the specified row`


len(df['col6'])-df['col6'].count()

#output
# 2

>> Official page -[sum method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.sum.html) --[count method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.count.html)
## 3. 3. Delete missing values Missing value NaN can be deleted by using the dropna method.

▼ Dropna option list

option	Contents
axis=0	Line (optional)
axis=1	Column
how='any'	Delete if there is even one NaN(Optional)
how='all'	Delete all NaN
thresh=n	Not NaN(Valid)Do not delete if there are n or more cells
subset=['A',,]	Delete rows that contain NaN in the specified column name (remove NaN from the specified column name)
subset=['a',,], axis=1	Delete columns that contain NaN in the specified row name (remove NaN from the specified row name)
inplace=True	Allow overwriting

** Original table ** ![image.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/563526/84b5b6ed-38e4-a9b3-14ac-16e6d18ce605.png)
### 5. Delete all missing matrix -Specify "** how ='all'**" as an option. -Specify the row or column with axis.

① axis = 0 (omitted) 　②axis=1 　③dropna(how='all').dropna(how='all', axis=1)

#### ① axis = 0 (omitted) -Delete "lines" containing spaces. -Default is "axis = 0" (optional) └ The following two processes are the same.

dropna(how='all') dropna(how='all', axis=0)

`Remove all blank lines`


df.dropna(how='all')

▽ Output

Line 4 deleted.

②axis=1 dropna(how='all', axis=1)

Delete "columns" that contain spaces.

`Remove all blank columns`


df.dropna(how='all')

▽ Output

Removed two columns, col4 and Unnamed: 4.

③dropna(how='all').dropna(how='all', axis=1)　

Removed rows and columns where all elements are blank.

After removing blank lines with dropna (how ='all') Remove blank columns with dropna (how ='all', axis = 1).

`Remove blank rows and columns`


df.dropna(how='all').dropna(how='all', axis=1)

▽ Output

Row 4, column col4, Unnamed: 4 deleted.

### 6. Delete a matrix that contains at least one missing value -Specify "** how ='any' **" as an option. ・ The default is this state (optional) -Specify the row or column with axis. -The original table is not overwritten └ Overwrite setting is also possible as an option.

① Default 　②axis=1

** ▼ Original table **

There are four NaNs. (Row, Column) = (4,1), (5,2), (1,5), (4,5)

#### ① Default ** Deleted any line containing NaN. ** **

The default of the dropna method is that how ='any' is specified as an option.

The following three are the same. dropna() dropna(how='any') dropna(how='any', axis=0)

** ▼ Execution example **

In each case, lines 1, 4, 5, and 6 including NaN have been deleted.

②axis=1 Removed ** columns ** containing at least one NaN.

Since how ='any' can be omitted, the following two processes are the same.

dropna(axis=1) dropna(how='any', axis=1)

** ▼ Execution example **

All contain one or more NaNs, col1, Unnamed: 1, col5 deleted.

## 6. Delete by specifying the number of cells that are not missing values Example: Ignore if there are n or more non-missing values. If it is less than that, delete it.

①df.dropna(thresh=n) ②df.dropna(thresh=n, axis=1)

▽ Supplement thresh is an abbreviation for "threshold".

### ▼ Original data

・ 10 rows x 6 columns table -The number of NaN is 1 in the 0th row (9 cells are not). The number of NaNs increases one column at a time. ・ All 4th columns are blank

①df.dropna(thresh=n) Example: `df.dropna (thresh = 1)` └ "df": Table data └ "thresh = 1": Threshold 1

Deleted rows with one or less cells other than NaN. Same as deleting all NaN lines.

** ▼ When thresh = 1 **

`Specifying the threshold (n)=1）`


df.dropna(thresh=1)

There are no NaN-only cells, so all rows remain.

** ▼ When thresh = 4 **

`Specifying the threshold (n)=4）`


df.dropna(thresh=4)

Rows with 4 or less cells other than NaN, rows 1, 2 and 3 are deleted.

** ▼ When thresh = 7 **

`Specifying the threshold (n)=7）`


df.dropna(thresh=7)

Since there are no rows with 7 or more cells other than NaN, delete all.

②df.dropna(thresh=n, axis=1) Example: `df.dropna (thresh = 1, axis = 1)` └ "df": Table data └ "thresh = 1": Threshold 1

Deleted columns with one or less cells other than NaN. Same as deleting all NaN columns.

** ▼ When thresh = 1, axis = 1 **

`Specifying the threshold (n)=1）`


df.dropna(thresh=1, axis=1)

All NaN column 4 removed.

** ▼ When thresh = 7, axis = 1 **

`Specifying the threshold (n)=7）`


df.dropna(thresh=7, axis=1)

Columns with 7 or less cells other than NaN, rows 4, 5, and 6 are deleted.

** ▼ When thresh = 11, axis = 1 **

`Specifying the threshold (n)=11）`


df.dropna(thresh=11, axis=1)

Delete all because there are no columns with 11 or more cells other than NaN.

## 7. Specify the row / column for which you want to delete NaN Use the subset option. The specified row / column is not deleted, but the NaN contained in it is deleted.

If you want to delete a row, specify the column name with subset.
If you want to delete the column, specify the row name with subset.

①df.dropna(subset=['AAA', 'BBB']) ②df.dropna(subset=['aaa', 'bbb'], axis=1)

### ▼ Original data |![image.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/563526/d1a720eb-4f71-9f8e-ffd3-d7bc3d357768.png)| |:--|

・ 10 rows x 6 columns ・ The row name is row x row number ・ Column name is col x column number

①df.dropna(subset=['AAA', 'BBB']) Delete the row by specifying the column name.

df.dropna(subset=['AAA', 'BBB']) └ "df": Variable with table └ "subset": Option to specify the name of the table or column └ "AAA" "BBB": Column name (optional)

-Whether the deletion target is row or column is specified by the presence or absence of axis.

** ▼ Delete specified line ** `df.dropna(subset=['col1','col3','col5'])` └ Delete the row that has at least one NaN in columns col1, col3, col5. ⇛ NaN disappears from the specified column.

`Delete row by specifying by column name`


df.dropna(subset=['col1','col3','col5'])

Lines 1,2,3,4,5 deleted. The specified columns col1,3,5 no longer contain NaN.

②df.dropna(subset=['aaa', 'bbb'], axis=1) Delete the column by specifying the row name. NaN is deleted from the specified line.
`df.dropna(subset=['row0','row3'], axis=1)`

Delete the column with NaN in rows 0 and row 3. (NaN disappears from the specified column)

`Delete row by specifying by column name`


df.dropna(subset=['row0','row3'], axis=1)

Removed columns col3, col4, col5.

## 8. Allow delete overwrite Use the inplace option.

Do not overwrite by default. 　└inplace=False

Describe the following to allow overwriting. 　└**inplace=True**

** ▼ Allow overwriting **

`Overwrite permission`


df.dropna(thresh=5, inplace=True)
df

** ▼ Default (do not overwrite) **

`No overwrite`


df.dropna(thresh=5)
df

The table remains the same.

>> Official page [dropna method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html?highlight=dropna)
## 9. Convert missing values You can replace NaN with the specified value with the fillna method.

The replacement method and the conditions of NaN to be replaced are specified as options.

### ▼ Fillna option list

option	Contents
axis=0	Line (omitted)
axis=1	Column
value	Value to convert NaN
method='bfill'	Convert to lower cell value of NaN
method='backfill'	'bfill'Same as
method='ffill'	Convert NaN to the value in the upper cell
method='pad'	'ffill'Same as
method='bfill', axis=1	Convert to NaN right cell value
method='ffill', axis=1	Convert to the value in the left cell of NaN
limit=n	Replace up to nth NaN. Do not replace any more.
inplace=True	Allow overwriting

** ① Replace the entire table ** 　1-1. fillna('AAA') 1-2. method option 　1-3. fillna(method='ffill') 　1-4. fillna(method='ffill', axis=1) 　1-5. fillna(method='bfill') 　1-6. fillna(method='bfill', axis=1)

** ② Specify the number of NaN to replace ** 　1-1. fillna('AAA', limit=n) 　1-2. fillna('AAA', limit=n, axis=1) 　1-3. fillna(method='ffill', limit=n) 　1-4. fillna(method='ffill', limit=n, axis=1)

** ③ Allow overwriting **

### ▼ Original data

## ① Replace the entire table 1-1. fillna('AAA') `fillna('AAA')` └ "AAA": Replace NaN with the specified character (optional)

`fillna('AAA')`


df.fillna('AAA')

### 1-2. method option It can be replaced with a * value adjacent to NaN by using the method option.

If NaN is continuous, proceed to the cell with the value.

** ▼ Example: When you specify to replace with the value in the right cell **

Before replacement: NaN NaN NaN AAA After replacement: AAA AAA AAA AAA

If there is no cell with a value, it will be NaN.

** ▼ method option list **

method	meaning	Contents
ffill	forward fill	Replace with the value in the front cell (row: top, column: left)
pad	padding	Same as ffill
bfill	backward fill	Replace with the value in the back cell (row: bottom, column: right)
backfill	backward fill	Same as bfill

When specifying a line, determine whether it is before or after based on the line number. (Up or down)

When specifying a column, determine whether it is the front or the back based on the column number. (Left or right)

The method option cannot be used together with value (an error will occur).

`error`


ValueError: Cannot specify both 'value' and 'method'.

1-3. fillna(method='ffill') `fillna(method='ffill')`

Replace with the value in the top row.

`ffill`


df.fillna(method='ffill')

Convert to a value above NaN. Not replaced if the top row is NaN.

1-4. fillna(method='ffill', axis=1) `fillna(method='ffill', axis=1)`

Replace with the value in the front column.

`ffill,axis=1`


df.fillna(method='ffill', axis=1)

1-5. fillna(method='bfill') `fillna(method='bfill')`

Replace with the value in the last row.

`bfill`


df.fillna(method='bfill')

1-6. fillna(method='bfill', axis=1) `fillna(method='bfill', axis=1)`

Replace with the value in the back column.

`bfill,axis=1`


df.fillna(method='bfill', axis=1)

(2) Specify the number of NaN to replace

By using the limit option, you can specify the number of NaN to replace.

limit=n Replace up to nth. Do not replace any more.

** Example: When limit = 2, the value to be replaced AAA, and the axis is a column **

■ Before replacement AAA NaN NaN AAA AAA AAA NaN NaN NaN AAA AAA NaN AAA NaN AAA NaN NaN AAA NaN NaN

■ After replacement AAA AAA AAA AAA AAA AAA AAA AAA NaN AAA AAA AAA AAA AAA AAA AAA AAA AAA NaN NaN

-Replace each line from the left up to the second NaN.

1-1. fillna('AAA', limit=n) `fillna('AAA', limit=3)`

From the top in line number order, NaN replaces up to the second.

`limit=2`


df.fillna('AAA', limit=2)

Replace from top to bottom for each column. Do not replace any more.

1-２. fillna('AAA', limit=n, axis=1) `fillna('AAA', limit=3, axis=1)`

Replace from the top to the third NaN in line number order.

`limit=3,axis=1`


df.fillna('AAA', limit=3, axis=1)

1-3. fillna(method='ffill', limit=n) `fillna(method='ffill', limit=3)`

Replaces up to the third NaN in line number order with the previous value.

If the beginning is NaN, it will not be replaced (it will not be counted). The replacement starts after the cell with the first value.

`ffill,limit=3`


df.fillna(method='ffill', limit=3)

1-4. fillna(method='ffill', limit=n, axis=1) `fillna(method='ffill', limit=3, axis=1)`

Replace up to the third NaN in column number order with the previous value.

If the beginning is NaN, it will not be replaced (it will not be counted). The replacement starts after the cell with the first value.

`ffill,limit=3,axis=1`


df.fillna(method='ffill', limit=3)

### ③ Allow overwriting Use the inplace option.

Do not overwrite by default. 　└inplace=False

Describe the following to allow overwriting. 　└**inplace=True**

** ▼ Allow overwriting **

`Overwrite permission`


df.fillna('AAA', inplace=True)
df

** ▼ Default (do not overwrite) **

`No overwrite`


df.fillna('AAA')
df

The table remains the same.

>> Official page [dropna method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html?highlight=dropna)
[Back to top](#python How to extract, delete, and convert a matrix containing missing value nan)