This is the second part of my notes from a hands-on session in which everyone takes on Kaggle's famous "House Prices" problem; it continues from the first post. It is more of a memo than a commentary, but I hope it helps someone somewhere.
To give the conclusion first: the training data has quite a few missing values.
train.isnull().sum()[train.isnull().sum()>0].sort_values(ascending=False)
A quick word on what a missing value is. When a data file is prepared, something has to be entered in every cell, even where no data was collected. That entry only signals that there was no data, so it has to be excluded from the analysis. For that reason, the entry used (the missing value) must be clearly distinguishable from valid data.
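To make the idea concrete, here is a small sketch of how pandas treats such placeholder entries when reading a CSV. The CSV text below is made up for illustration (only the column names are borrowed from the House Prices data). By default `read_csv` turns both empty cells and the literal string `NA` into NaN, which is one plausible reason a column such as PoolQC shows so many missing values.

```python
import io

import pandas as pd

# Made-up CSV text: one literal "NA", one real value, one empty cell.
csv_text = "Id,PoolQC,LotArea\n1,NA,8450\n2,Gd,9600\n3,,11250\n"

# Default behaviour: "NA" and empty cells are both read as missing (NaN).
df = pd.read_csv(io.StringIO(csv_text))
print(df["PoolQC"].isnull().tolist())

# With keep_default_na=False (and no na_values), the strings are kept as they are.
raw = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
print(raw["PoolQC"].tolist())
```

Which behaviour is appropriate depends on whether `NA` in a given column means "unknown" or "does not exist".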
.isnull()
.sum()
.sum(): The familiar addition. Depending on the argument, it adds down each column (the default) or across each row.
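As a minimal sketch of how `.isnull()` and `.sum()` combine, the toy frame below stands in for `train` (the values are invented; only the column names come from the dataset). `.isnull()` produces a True/False table, and because True counts as 1, summing it counts the missing values.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `train`; the values are invented for illustration.
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, "Gd"],
    "LotFrontage": [65.0, np.nan, 80.0],
    "LotArea": [8450, 9600, 11250],
})

mask = toy.isnull()            # same shape as toy, True where a value is missing
per_column = mask.sum()        # default axis=0: missing count for each column
per_row = mask.sum(axis=1)     # axis=1: missing count for each row

print(per_column)
print(per_row)
```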
Reference: https://deepage.net/features/pandas-sum.html
The result when only train.isnull().sum() is run:
[train.isnull().sum()>0]: Think of this as a key that picks out only the columns that have missing values.
The result when only train.isnull().sum()[train.isnull().sum()>0] is run:
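The bracket part is ordinary boolean indexing on a Series. A small sketch with invented counts (the Series below is a hypothetical stand-in for `train.isnull().sum()`):

```python
import pandas as pd

# Hypothetical per-column missing counts, standing in for train.isnull().sum().
counts = pd.Series({"LotArea": 0, "LotFrontage": 1, "PoolQC": 3, "Street": 0})

has_missing = counts > 0             # boolean Series: True where the count is positive
only_missing = counts[has_missing]   # keeps only the entries where the mask is True

print(has_missing)
print(only_missing)
```

Storing the counts in a variable first, as here, also avoids computing `isnull().sum()` twice.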
.sort_values()
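A short sketch of what `.sort_values(ascending=False)` does to such a Series, again with invented counts: it reorders the entries by value, largest first, and the column names travel with their values.

```python
import pandas as pd

# Hypothetical missing counts for three columns (numbers are invented).
only_missing = pd.Series({"LotFrontage": 1, "PoolQC": 3, "Alley": 2})

# ascending=False puts the columns with the most missing values at the top.
ranked = only_missing.sort_values(ascending=False)

print(ranked)
```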
The test data is checked the same way as the training data, so I will omit the explanation.
test.isnull().sum()[test.isnull().sum()>0].sort_values(ascending=False)
.index.tolist()
# Check the data types of the columns that contain missing values
na_col_list = alldata.isnull().sum()[alldata.isnull().sum()>0].index.tolist() # list of columns with missing values
alldata[na_col_list].dtypes.sort_values() # data type of each of those columns
na_col_list = alldata.isnull().sum()[alldata.isnull().sum()>0].index.tolist()
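Broken into steps, that line can be read as in the sketch below. The `alldata` frame here is a toy stand-in with invented values; in the original it is presumably the combined training and test data prepared earlier.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `alldata`; values are invented for illustration.
alldata = pd.DataFrame({
    "PoolQC": [np.nan, "Gd", np.nan],
    "GarageArea": [548.0, np.nan, 460.0],
    "LotArea": [8450, 9600, 11250],
})

na_counts = alldata.isnull().sum()          # missing count per column
with_gaps = na_counts[na_counts > 0]        # keep columns that have any missing value
na_col_list = with_gaps.index.tolist()      # the index holds the column names; make it a plain list

print(na_col_list)
```

The resulting plain Python list can then be passed straight back to the frame, as in `alldata[na_col_list]`.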
.dtypes
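`.dtypes` returns a Series that maps each column name to its data type. The sketch below uses a toy `alldata` with invented values and, as an illustrative extra not taken from the original, `pd.api.types.is_numeric_dtype` to tell numeric columns from the rest.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `alldata`; values are invented for illustration.
alldata = pd.DataFrame({
    "PoolQC": [np.nan, "Gd", np.nan],
    "GarageArea": [548.0, np.nan, 460.0],
})

dtypes = alldata.dtypes   # Series: column name -> dtype
print(dtypes)

# Flag each column as numeric or not (the exact text dtype name varies by pandas version).
is_numeric = {col: pd.api.types.is_numeric_dtype(dt) for col, dt in dtypes.items()}
print(is_numeric)
```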
What follows is a view on how to treat the data statistically. I recommend simply reading it through for understanding; it is a separate matter from understanding the programming.
Both the training data and the test data have a considerable number of missing values. In a case like this, it is tempting to delete the columns with many missing values. Before doing that, though, Kaggle provides a document that describes the variables in detail, so let's look at it first. When you download the data from Kaggle, you will notice that it also contains a file called "data_description.txt", which explains what is stored in each variable.
Reading it shows that most of the missing values do not mean "no information"; the missing value itself is information. Take PoolQC (pool quality), which has the most missing values. A missing value there means the house has no pool, so the gap itself tells us something. The same holds for the other categorical variables: a missing value simply means that the facility or equipment does not exist. For numeric variables, a missing value only means that the area in question is zero, which is again not an absence of information.
Therefore, the following completion is performed for the missing values of the categorical variables and the numeric variables.
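A minimal sketch of that kind of completion, under stated assumptions: the `alldata` frame is a toy stand-in with invented values, the label `"NA"` for "facility does not exist" is a choice made here for illustration, and the split into numeric and categorical columns uses `select_dtypes`. This is one way to express the idea, not necessarily the code used in the hands-on.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `alldata`; values are invented for illustration.
alldata = pd.DataFrame({
    "PoolQC": [np.nan, "Gd", np.nan],
    "GarageArea": [548.0, np.nan, 460.0],
    "LotArea": [8450, 9600, 11250],
})

na_counts = alldata.isnull().sum()
na_col_list = na_counts[na_counts > 0].index.tolist()

# Split the columns that have gaps into numeric and categorical ones.
num_cols = alldata[na_col_list].select_dtypes(include="number").columns.tolist()
cat_cols = [c for c in na_col_list if c not in num_cols]

filled = alldata.copy()
filled[cat_cols] = filled[cat_cols].fillna("NA")   # assumed label meaning "does not exist"
filled[num_cols] = filled[num_cols].fillna(0)      # missing area treated as zero

print(filled)
```

Since the reasoning above applies to most columns rather than all of them, it seems safer to check each column against data_description.txt before applying a blanket fill like this.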
Hmm. So far, all I have done is look at the data.