About this article

I wrote a function that quickly calculates the tree shown as **Fig. 1. Analytical subjects** that I often see in epidemiological studies.
It assumes the tree you get when listwise removal (listwise deletion) is performed.
The code is Python, but I also note in the article how to call it from R.
In essence, I just implemented something like a SQL analytic function in Python.

What is a listwise removal tree in the first place?

Something like this ↓ Building it with fiddly functions in Excel takes about half an hour. (It is hard on the eyes and the back.)

(Please tell me if this kind of figure has a formal name.)

What's the hassle?

If you just want the number of missing values in each variable, **df.isnull().sum()** is all you need, but ...

1. The missing data for **x1** was ● people.
2. In the data excluding the missing data of **x1**, the missing data of **x2** was ▲ people.
3. In the data excluding the missing data of **x1** and **x2**, the missing data of **x3** was ■ people.
4. ···

To get counts like these, you need to write something like an analytic function in SQL.

Ah, that is a hassle (in Python).
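To make the difference concrete, here is a minimal sketch on a made-up six-row data frame. The `toy` data and its values are hypothetical (they are not taken from any real study), and the step-by-step filtering is written out by hand only to show what has to be counted.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 6 subjects, 3 variables with some missing values.
toy = pd.DataFrame({
    "x1": [1.0, np.nan, 3.0, 4.0, 5.0, 6.0],
    "x2": [1.0, np.nan, np.nan, 4.0, 5.0, 6.0],
    "x3": [np.nan, 2.0, np.nan, 4.0, 5.0, 6.0],
})

# Plain per-column counts: every column is counted on the full data.
per_column = toy.isnull().sum()              # x1: 1, x2: 2, x3: 2

# Sequential counts: each column is counted only on the rows that survived
# the exclusions made for the earlier columns.
missing_x1 = toy["x1"].isnull().sum()        # 1
step1 = toy[toy["x1"].notna()]               # 5 rows remain
missing_x2 = step1["x2"].isnull().sum()      # 1 (the other x2 gap was already excluded)
step2 = step1[step1["x2"].notna()]           # 4 rows remain
missing_x3 = step2["x3"].isnull().sum()      # 1
step3 = step2[step2["x3"].notna()]           # 3 rows remain

print(per_column.tolist(), [missing_x1, missing_x2, missing_x3], len(step3))
```

The per-column counts and the sequential counts differ for **x2** and **x3**, which is exactly why **df.isnull().sum()** alone is not enough for the tree.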

Then to the main subject

`python`


import pandas as pd
import numpy as np

def caluculate_missing_tree(df):
    d ={}
    d[0]= df.loc[df[df.columns[0]].isnull() != True]
    for i in range(len(df.columns)-1):
        d[1+i]= d[i].loc[d[i][d[i].columns[1+i]].isnull() != True]

    le = []
    colnames = []
    missing_tree = pd.DataFrame()

    for i in range(len(df.columns)):
        le.append(len(d[i]))
    for i in range(len(df.columns)):
        colnames.append(df.columns[i])


    missing_tree['col_name'] = colnames
    missing_tree['Size'] = le

    return missing_tree

Just pass a data frame whose columns are arranged in the order you want the tree drawn as the argument of **caluculate_missing_tree()**.

For example, let's try it with the **titanic** data.
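As a cross-check, the same sizes can also be computed without keeping intermediate data frames. The following is only a sketch of an alternative approach, not the function from this article: the name `sequential_complete_counts` and the `toy` data are hypothetical, and it relies on a running minimum across columns (`cummin(axis=1)`) to mark rows that are still complete up to each column.

```python
import numpy as np
import pandas as pd

def sequential_complete_counts(df):
    """Rows remaining after sequentially dropping missing values, column by column."""
    # 1 where a cell is observed, 0 where it is missing.
    observed = df.notna().astype(int)
    # Running minimum from left to right: stays 1 only while every column
    # so far has been observed for that row.
    still_complete = observed.cummin(axis=1)
    # Column sums give the number of rows surviving up to each column.
    sizes = still_complete.sum(axis=0)
    return pd.DataFrame({"col_name": df.columns, "Size": sizes.to_numpy()})

# Hypothetical toy data, used only for illustration.
toy = pd.DataFrame({
    "x1": [1.0, np.nan, 3.0, 4.0, 5.0, 6.0],
    "x2": [1.0, np.nan, np.nan, 4.0, 5.0, 6.0],
    "x3": [np.nan, 2.0, np.nan, 4.0, 5.0, 6.0],
})

result = sequential_complete_counts(toy)
print(result)
```

On the same input, this should give the same **Size** column as **caluculate_missing_tree()**, so it can serve as a quick sanity check.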

`python`



import pandas as pd 
import numpy as np
import os 

df = pd.read_csv("train.csv")
df.shape #(891, 12)

df.isnull().sum()  #Missing data for each variable

--------------------------------
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

If you feed this data to the function ...

`python`




caluculate_missing_tree(df)

--------------------------------------

	col_name	Size
0	PassengerId	891
1	Survived	891
2	Pclass  	891
3	Name	    891
4	Sex	        891
5	Age	        714
6	SibSp	    714
7	Parch	    714
8	Ticket	    714
9	Fare	    714
10	Cabin	    185
11	Embarked	183

It was calculated in an instant. Happy.
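The tree figure usually also shows how many subjects were excluded at each step. One possible way to derive that from the table is sketched below. The **Size** values are copied from the output above, while the `Excluded` column name is my own assumption and not part of the original function.

```python
import pandas as pd

# Sizes copied from the output above (Titanic train.csv, 891 rows in total).
tree = pd.DataFrame({
    "col_name": ["PassengerId", "Survived", "Pclass", "Name", "Sex", "Age",
                 "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked"],
    "Size": [891, 891, 891, 891, 891, 714, 714, 714, 714, 714, 185, 183],
})

n_total = 891  # number of rows before any exclusion

# Size of the previous step; the first step is compared with the full data.
previous = tree["Size"].shift(1, fill_value=n_total)

# Hypothetical extra column: subjects dropped because this variable was missing.
tree["Excluded"] = previous - tree["Size"]

print(tree)
```

With these numbers, 177 subjects are excluded at **Age**, 529 at **Cabin** and 2 at **Embarked**, which accounts for the whole drop from 891 to 183.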

Description of contents

The idea is to use **.loc** to create, one after another, datasets that satisfy the condition (no missing data).

`python`


df  # original data
df1 = df.loc[df['x1'].isnull() != True]    # data with x1 missing data removed
df2 = df1.loc[df1['x2'].isnull() != True]  # data with x1 and x2 missing data removed
df3 = df2.loc[df2['x3'].isnull() != True]  # data with x1, x2 and x3 missing data removed
...
...

Like this.

Furthermore, when I think about writing it as a **for loop**, it looks like this.

`python`


d[0]= df.loc[df[df.columns[0]].isnull() != True]  # this line stays outside the for loop

# --- from here on, this is what the for loop does ---
d[1]= d[0].loc[d[0][d[0].columns[1]].isnull() != True]
d[2]= d[2-1].loc[d[2-1][d[2-1].columns[2]].isnull() != True]
d[3]= d[3-1].loc[d[3-1][d[3-1].columns[3]].isnull() != True]

However, it was a little difficult to automate the creation of separately named data frames (df1, df2, ...) with a **for loop**.

So I created a dictionary to hold multiple data frames and stored the data frame corresponding to each variable in it.

`python`


    d ={}
    d[0]= df.loc[df[df.columns[0]].isnull() != True]
    for i in range(len(df.columns)-1):
        d[1+i]= d[i].loc[d[i][d[i].columns[1+i]].isnull() != True]

Like this. For example, in the **titanic** data:

- **d[0]** is the data with rows missing **PassengerId** removed
- **d[1]** is the data with rows missing **PassengerId, Survived** removed
- **d[2]** is the data with rows missing **PassengerId, Survived, Pclass** removed
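Another way to read the dictionary: each **d[i]** should be the same as dropping rows with a missing value in any of the first **i + 1** columns. The sketch below checks this on a made-up data frame (the `toy` data is hypothetical), using `DataFrame.dropna(subset=...)` as the reference.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only for illustration.
toy = pd.DataFrame({
    "x1": [1.0, np.nan, 3.0, 4.0, 5.0, 6.0],
    "x2": [1.0, np.nan, np.nan, 4.0, 5.0, 6.0],
    "x3": [np.nan, 2.0, np.nan, 4.0, 5.0, 6.0],
})

# Same construction as in the article: d[i] keeps the rows that have
# no missing value in columns 0..i.
d = {}
d[0] = toy.loc[toy[toy.columns[0]].notna()]
for i in range(len(toy.columns) - 1):
    d[i + 1] = d[i].loc[d[i][toy.columns[i + 1]].notna()]

# Compare each d[i] with a one-shot dropna over the first i + 1 columns.
matches = []
for i in range(len(toy.columns)):
    expected = toy.dropna(subset=list(toy.columns[: i + 1]))
    matches.append(d[i].equals(expected))

print(matches)
```

If every entry of `matches` is `True`, the step-by-step filtering and the one-shot `dropna` agree on this data.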

After that, it is only **natural** to want a data frame with the **variable names** and the **sample sizes** as its columns, so the result is easy to check.

`python`



    le = []
    colnames = []
    missing_tree = pd.DataFrame()

    for i in range(len(df.columns)):
        le.append(len(d[i]))
    for i in range(len(df.columns)):
        colnames.append(df.columns[i])


    missing_tree['col_name'] = colnames
    missing_tree['Size'] = le

    return missing_tree

The number of rows, **len(d[i])**, of each data frame from which the missing values had been removed was stored in **le**. Similarly, the variable names corresponding to each data frame were stored in **colnames**, and the two were put together into one table.

`python`




caluculate_missing_tree(df)

--------------------------------------

	col_name	Size
0	PassengerId	891
1	Survived	891
2	Pclass  	891
3	Name	    891
4	Sex	        891
5	Age	        714
6	SibSp	    714
7	Parch	    714
8	Ticket	    714
9	Fare	    714
10	Cabin	    185
11	Embarked	183

Ta-da! (second time)

Implementation method in R

Use an R Notebook and the **reticulate** library. (I may add details later.)
