This is the third installment of a project to write up notes from a hands-on session in which everyone tackles Kaggle's famous "House Prices" problem. It's more of a memo than a commentary, but I hope it helps someone somewhere.
To state the conclusion first: there were quite a few missing values. On closer inspection, however, most of them are not really "missing". The fact that "there is no value" is meaningful in itself.
Excerpt from a reference article.
When you download the data from Kaggle, you will notice that it also contains a file called "data_description.txt". This file describes what each variable holds. From it we learn that the majority of missing values do not mean that information is absent; the missingness itself is information. For example, take a look at PoolQC (pool quality), which has the most missing values. A missing value in this variable means that the house has no pool, so the missing value itself is information. For the other categorical variables as well, a missing value simply means that the facility or equipment does not exist. For numeric variables, a missing value just means that the corresponding area is zero; it is not a lack of information. Therefore, missing values in categorical variables and numeric variables are filled in as follows.
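As a rough illustration of the kind of fill described in the excerpt, here is a minimal sketch. The toy data and the placeholder string "NA" are my own assumptions, not taken from the reference article: numeric columns are filled with 0 and all other columns with a placeholder meaning "does not exist".

```python
import numpy as np
import pandas as pd

# Toy stand-in for the combined data (values are made up for illustration).
alldata = pd.DataFrame({
    "PoolQC": ["Gd", np.nan, np.nan],        # categorical: NaN = no pool
    "GarageType": ["Attchd", np.nan, "Detchd"],
    "GarageArea": [548.0, np.nan, 460.0],    # numeric: NaN = area of zero
    "LotArea": [8450, 9600, 11250],          # no missing values
})

# Columns that contain at least one missing value.
na_col_list = alldata.columns[alldata.isnull().any()].tolist()

for col in na_col_list:
    if pd.api.types.is_numeric_dtype(alldata[col]):
        # Numeric variable: missing means "zero area".
        alldata[col] = alldata[col].fillna(0.0)
    else:
        # Categorical variable: missing means "facility does not exist".
        # "NA" is an assumed placeholder string.
        alldata[col] = alldata[col].fillna("NA")

print(alldata)
```

Filling with an explicit category rather than dropping rows keeps the "no pool" and "no garage" information available to the model.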
A categorical variable is, apparently, a field whose stored value is a code that stands for some meaning. Example: 1: Male, 2: Female, etc. https://www1.doshisha.ac.jp/~mjin/R/Chap_45/45.html
A numeric variable, in contrast, seems to be data that simply represents a quantity, the opposite of a categorical variable.
First of all, before filling anything in, the index values (the column names) are extracted for each data type. (I'm breaking this down step by step so that I understand it by the time I finish, even if I don't fully follow it right now.)
na_float_cols = alldata[na_col_list].dtypes[alldata[na_col_list].dtypes=='float64'].index.tolist()
`alldata[na_col_list]`: The subset of the data consisting of the columns that contain missing values.
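This memo does not show how `na_col_list` is built, so the following is only one plausible way to do it, on a made-up toy frame: count the missing values per column and keep the names of the columns whose count is above zero.

```python
import numpy as np
import pandas as pd

# Toy stand-in for alldata (values are made up for illustration).
alldata = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0],
    "PoolQC": [np.nan, np.nan, "Gd"],
    "LotArea": [8450, 9600, 11250],
})

# Number of missing values in each column.
na_counts = alldata.isnull().sum()

# Names of the columns that have at least one missing value.
na_col_list = na_counts[na_counts > 0].index.tolist()
print(na_col_list)  # ['LotFrontage', 'PoolQC']

# Selecting with a list of names returns only those columns.
print(alldata[na_col_list])
```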
alldata[na_col_list].dtypes=='float64'
This checks the data type of each column. `.dtypes` returns the data types of all columns at once, and `== 'float64'` then tests each of them. The following is the result of `alldata[na_col_list].dtypes` alone.
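To see what this comparison produces, here is a small sketch on a made-up frame (the column values are assumptions for illustration). `.dtypes` gives a Series indexed by column name, and comparing it with `'float64'` gives a boolean Series with the same index.

```python
import numpy as np
import pandas as pd

# Toy stand-in for alldata (values are made up for illustration).
alldata = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0],
    "MasVnrArea": [196.0, 0.0, np.nan],
    "PoolQC": [np.nan, np.nan, "Gd"],
})
na_col_list = ["LotFrontage", "MasVnrArea", "PoolQC"]

# Series: index = column names, values = each column's dtype.
dtypes = alldata[na_col_list].dtypes
print(dtypes)

# Element-wise comparison: True where the column is float64.
is_float = dtypes == "float64"
print(is_float.tolist())  # [True, True, False]
```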
https://note.nkmk.me/python-numpy-dtype-astype/
alldata[na_col_list].dtypes[alldata[na_col_list].dtypes=='float64']
This keeps only the entries for numeric variables. The following is the result of `alldata[na_col_list].dtypes`. The comparison in the brackets seems to filter it down to the entries whose type is float64.
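The filtering step is ordinary boolean indexing on a Series. A minimal sketch on a made-up frame (the data are assumptions for illustration): passing a boolean Series inside `[...]` keeps only the entries where it is True.

```python
import numpy as np
import pandas as pd

# Toy stand-in for alldata (values are made up for illustration).
alldata = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0],
    "MasVnrArea": [196.0, 0.0, np.nan],
    "PoolQC": [np.nan, np.nan, "Gd"],
})
na_col_list = ["LotFrontage", "MasVnrArea", "PoolQC"]

dtypes = alldata[na_col_list].dtypes

# Keep only the entries of the dtypes Series where the mask is True.
float_dtypes = dtypes[dtypes == "float64"]
print(float_dtypes)  # only LotFrontage and MasVnrArea remain
```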
`.index`: I'm noting this too, because the role of `.index` was different from what I expected. Until now I had only seen it in the context of "setting the index", but this time it is used for "getting the index".
Reference: https://www.mathpython.com/en/python-list-index/
The following is the output of `alldata[na_col_list].dtypes[alldata[na_col_list].dtypes=='float64'].index`. I see, only the index is taken out.
`.tolist()`: Converts the retrieved index to a list. Come to think of it, Python has quite a few types that look like arrays, doesn't it? I nearly got stuck on that point, so I'm noting it as well.
Reference: https://note.nkmk.me/python-numpy-list/
Reference: https://algorithm.joho.info/programming/python/list-tuple-dict-chigai/
The following is, as before, the output of `alldata[na_col_list].dtypes[alldata[na_col_list].dtypes=='float64'].index.tolist()`. Now, at last, the numeric columns are available as a list.
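Putting the last two steps together on a made-up frame (the data are assumptions for illustration): `.index` returns a pandas `Index` of the remaining column names, and `.tolist()` turns it into a plain Python list. For comparison, the sketch also uses `select_dtypes`, which is not used in the original code but should give the same column names in fewer steps.

```python
import numpy as np
import pandas as pd

# Toy stand-in for alldata (values are made up for illustration).
alldata = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0],
    "MasVnrArea": [196.0, 0.0, np.nan],
    "PoolQC": [np.nan, np.nan, "Gd"],
})
na_col_list = ["LotFrontage", "MasVnrArea", "PoolQC"]

dtypes = alldata[na_col_list].dtypes

# .index: the labels (column names) of the filtered Series, as a pandas Index.
float_index = dtypes[dtypes == "float64"].index
print(type(float_index))

# .tolist(): the same labels as a plain Python list.
na_float_cols = float_index.tolist()
print(na_float_cols)  # ['LotFrontage', 'MasVnrArea']

# Alternative (not in the original code): select columns by dtype directly.
alt_cols = alldata[na_col_list].select_dtypes(include="float64").columns.tolist()
print(alt_cols == na_float_cols)  # True
```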
It's been about two weeks since the last update, but I'll do my best to keep updating. (It's about time I studied Python from the basics and got things organized... Python seems to cram everything into one line...)