This kind of thing ↓ It took me about half an hour to grind it out with Excel functions. (It is hard on the eyes and the lower back.)
(Please let me know if this kind of figure has a formal name.)
If you just want the number of missing values in each variable, **df.isnull().sum()** is all you need, but ...
Here you need the number of rows that remain after removing missing values column by column, which calls for something like an analytic function in SQL.
Ah, what a hassle (in Python).
python
import pandas as pd
import numpy as np
def caluculate_missing_tree(df):
    # d[k] holds the rows with no missing values in columns 0..k
    d = {}
    d[0] = df.loc[df[df.columns[0]].isnull() != True]
    for i in range(len(df.columns) - 1):
        d[1 + i] = d[i].loc[d[i][d[i].columns[1 + i]].isnull() != True]
    le = []        # number of remaining rows at each step
    colnames = []  # column name at each step
    missing_tree = pd.DataFrame()
    for i in range(len(df.columns)):
        le.append(len(d[i]))
    for i in range(len(df.columns)):
        colnames.append(df.columns[i])
    missing_tree['col_name'] = colnames
    missing_tree['Size'] = le
    return missing_tree
Just pass **caluculate_missing_tree()** a data frame whose columns are in the order in which you want to draw the tree.
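Because each row of the result only counts records that are complete in every column before it, the column order changes the numbers. The following is a minimal sketch on made-up data: the frame `toy` and its columns `a`, `b`, `c` are hypothetical, and `missing_tree_sizes` is a compact stand-in written for this illustration, not the function above.

```python
import numpy as np
import pandas as pd

def missing_tree_sizes(df):
    # Hypothetical compact stand-in: filter column by column, record the row count.
    sizes = []
    remaining = df
    for col in df.columns:
        remaining = remaining.loc[remaining[col].notnull()]
        sizes.append(len(remaining))
    return pd.DataFrame({"col_name": list(df.columns), "Size": sizes})

# Made-up data: 'b' has one missing value, 'c' has two.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [1.0, np.nan, 3.0, 4.0],
    "c": [np.nan, np.nan, "x", "y"],
})

tree_abc = missing_tree_sizes(toy)                   # column order a, b, c
tree_cba = missing_tree_sizes(toy[["c", "b", "a"]])  # column order c, b, a
print(tree_abc)
print(tree_cba)
```

With the order a, b, c the sizes shrink step by step, while putting the sparsest column first drops most rows at the very first step, so reordering the columns with `df[[...]]` before the call is how you choose the shape of the tree.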
For example, try it with the **titanic** data.
python
import pandas as pd
import numpy as np
import os
df = pd.read_csv("train.csv")
df.shape #(891, 12)
df.isnull().sum() #Missing data for each variable
--------------------------------
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
If you feed this data to the function ...
python
caluculate_missing_tree(df)
--------------------------------------
       col_name  Size
0   PassengerId   891
1      Survived   891
2        Pclass   891
3          Name   891
4           Sex   891
5           Age   714
6         SibSp   714
7         Parch   714
8        Ticket   714
9          Fare   714
10        Cabin   185
11     Embarked   183
It was calculated in an instant. Happy.
The idea is to use **.loc** to build one data set after another, each keeping only the rows that meet the condition (no missing values).
python
# df: original data
df1 = df.loc[df['x1'].isnull() != True]    # rows with missing x1 removed
df2 = df1.loc[df1['x2'].isnull() != True]  # rows with missing x1 or x2 removed
df3 = df2.loc[df2['x3'].isnull() != True]  # rows with missing x1, x2 or x3 removed
...
...
Like this.
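To see the chained filtering on concrete numbers, here is a small sketch on a made-up frame. The column names `x1`, `x2`, `x3` follow the snippet above; the frame name `toy` and all values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data: each column has exactly one missing value, in a different row.
toy = pd.DataFrame({
    "x1": [1.0, np.nan, 3.0, 4.0, 5.0],
    "x2": [1.0, 2.0, np.nan, 4.0, 5.0],
    "x3": [1.0, 2.0, 3.0, np.nan, 5.0],
})

toy1 = toy.loc[toy["x1"].isnull() != True]    # rows with missing x1 removed
toy2 = toy1.loc[toy1["x2"].isnull() != True]  # ... and missing x2 removed
toy3 = toy2.loc[toy2["x3"].isnull() != True]  # ... and missing x3 removed

sizes = [len(toy), len(toy1), len(toy2), len(toy3)]
print(sizes)
```

Each step filters the result of the previous step, so the row count can only stay the same or shrink.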
Going one step further and thinking about how to write this as a **for statement**, it looks like this.
python
d[0] = df.loc[df[df.columns[0]].isnull() != True]  # this is outside the for statement
# --- what the for statement does from here on ---
d[1] = d[1-1].loc[d[1-1][d[1-1].columns[1]].isnull() != True]
d[2] = d[2-1].loc[d[2-1][d[2-1].columns[2]].isnull() != True]
d[3] = d[3-1].loc[d[3-1][d[3-1].columns[3]].isnull() != True]
However, automating the creation of df1, df2, ... with a **for statement** was a little tricky.
So I created a dictionary to hold multiple data frames, and stored the data frame corresponding to each variable in it.
python
d = {}
d[0] = df.loc[df[df.columns[0]].isnull() != True]
for i in range(len(df.columns) - 1):
    d[1 + i] = d[i].loc[d[i][d[i].columns[1 + i]].isnull() != True]
Like this. For example, in the **titanic** data:
・**d[0]** is the data with rows missing PassengerId removed
・**d[1]** is the data with rows missing PassengerId or Survived removed
・**d[2]** is the data with rows missing PassengerId, Survived, or Pclass removed
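One way to convince yourself of what each **d[k]** contains is to compare it with `DataFrame.dropna` restricted to the first k+1 columns. The sketch below does this on a made-up frame; `toy` and its columns `id`, `age`, `cabin` are hypothetical and only loosely modelled on the titanic columns.

```python
import numpy as np
import pandas as pd

# Made-up data: 'age' has one missing value, 'cabin' has two.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "age": [20.0, np.nan, 35.0, 41.0],
    "cabin": ["C1", None, None, "E4"],
})

# Same construction as above: d[k] keeps the rows complete in columns 0..k.
d = {}
d[0] = toy.loc[toy[toy.columns[0]].isnull() != True]
for i in range(len(toy.columns) - 1):
    d[1 + i] = d[i].loc[d[i][d[i].columns[1 + i]].isnull() != True]

# Cross-check: dropna over the first k+1 columns should give the same frame.
matches = [
    d[k].equals(toy.dropna(subset=list(toy.columns[: k + 1])))
    for k in range(len(toy.columns))
]
print(matches)
```

If every entry of `matches` is `True`, the dictionary built in the loop and the `dropna` calls agree on this toy data.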
After that, it is only **natural** to think of building a data frame with the **variable names** and the **sample sizes** as columns, so the result is easy to check.
python
# (continuation of the body of caluculate_missing_tree)
le = []
colnames = []
missing_tree = pd.DataFrame()
for i in range(len(df.columns)):
    le.append(len(d[i]))
for i in range(len(df.columns)):
    colnames.append(df.columns[i])
missing_tree['col_name'] = colnames
missing_tree['Size'] = le
return missing_tree  # last line of the function
The number of rows, len(d[i]), of each data frame from which the missing values were removed is stored in **le**. Likewise, the variable name corresponding to each data frame is stored in **colnames**, and the two are put side by side in one table.
python
caluculate_missing_tree(df)
--------------------------------------
       col_name  Size
0   PassengerId   891
1      Survived   891
2        Pclass   891
3          Name   891
4           Sex   891
5           Age   714
6         SibSp   714
7         Parch   714
8        Ticket   714
9          Fare   714
10        Cabin   185
11     Embarked   183
Ta-da! (second time)
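As a side note, the same counts can probably be obtained without keeping an intermediate data frame per column. The sketch below is an alternative I am suggesting, not the method used in this post: it takes a running logical AND of "value is present" across the columns with NumPy's `np.logical_and.accumulate`. The function name `missing_tree_vectorized` and the frame `toy` are hypothetical.

```python
import numpy as np
import pandas as pd

def missing_tree_vectorized(df):
    # True where a value is present, as a plain NumPy boolean array.
    present = df.notna().to_numpy()
    # Running AND across columns: True only while every column so far is present.
    complete_so_far = np.logical_and.accumulate(present, axis=1)
    # Column-wise sums give the number of surviving rows at each step.
    return pd.DataFrame({
        "col_name": list(df.columns),
        "Size": complete_so_far.sum(axis=0),
    })

# Made-up data: 'b' has one missing value, 'c' has two.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [1.0, np.nan, 3.0, 4.0],
    "c": [np.nan, np.nan, "x", "y"],
})
result = missing_tree_vectorized(toy)
print(result)
```

This avoids copying the data once per column, which may matter for wide tables, at the cost of being a little less obvious than the step-by-step **.loc** version.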