[Hands-on for beginners] Read Kaggle's "Predicting Home Prices" line by line (Part 2: Checking Missing Values)

Theme

Part 1 covered reading the data. This is the second part of a series of notes from a hands-on session in which everyone tackles Kaggle's well-known "House Prices" problem. It is more of a memo than a commentary, but I hope it helps someone somewhere.

Today's work

Checking the missing values (imputation is not reached yet)

To give the conclusion first: there are quite a few missing values.

Missing values in the training data (counts)

train.isnull().sum()[train.isnull().sum()>0].sort_values(ascending=False)

Missing value

When a data file is prepared, something has to be entered in every cell, even where the data is missing. Whatever is entered there only signals that no data actually existed, so it must be excluded from the analysis. For that reason, a value that can be clearly told apart from valid data (a missing value) is entered instead. In pandas, such entries show up as NaN, and .isnull() detects them.
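To make the idea concrete, here is a small sketch of how marker values in a file turn into NaN when loaded. The CSV text and the -999 marker are made up for illustration and are not taken from the Kaggle files. pandas treats strings such as NA (and empty fields) as missing by default, and na_values lets you register additional markers.

```python
import io

import pandas as pd

# Hypothetical CSV content: "NA" marks a missing PoolQC,
# and -999 is an invented marker for a missing LotFrontage.
csv_text = "Id,LotFrontage,PoolQC\n1,65,NA\n2,-999,Gd\n3,80,NA\n"

# "NA" is recognized as missing by default; "-999" is added via na_values.
df = pd.read_csv(io.StringIO(csv_text), na_values=["-999"])

# Number of missing values per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)
```

After loading, both kinds of marker are plain NaN, so the rest of the analysis only has to deal with one representation.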

.isnull()

.sum()

.sort_values()
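The one-liner above chains these three methods, so it may help to see each step separately. This is only a sketch on a tiny made-up DataFrame (the values are hypothetical, not the real training data), with the intermediate results given names of my own.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data (hypothetical values).
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "GarageType": ["Attchd", np.nan, "Detchd", "Attchd"],
})

# Step 1: .isnull() returns a True/False table, True where a value is missing.
is_missing = toy.isnull()

# Step 2: .sum() adds up each column; True counts as 1,
# so this is the number of missing values per column.
missing_counts = is_missing.sum()

# Step 3: the boolean mask keeps only columns with at least one missing value.
only_missing = missing_counts[missing_counts > 0]

# Step 4: .sort_values(ascending=False) puts the columns with the most
# missing values first.
result = only_missing.sort_values(ascending=False)
print(result)
```

Computing the counts once and reusing them, as in steps 2 and 3, also avoids calling isnull().sum() twice as the one-liner does.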

Missing values in the test data

The explanation is the same as for the training data, so I will omit it.

test.isnull().sum()[test.isnull().sum()>0].sort_values(ascending=False)

Missing values (data types of the affected columns)

.index.tolist()

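# Check the data types of the columns that contain missing values
# (alldata is not defined in this part; it is assumed to come from an earlier part)
na_col_list = alldata.isnull().sum()[alldata.isnull().sum()>0].index.tolist()  # names of the columns with missing values, as a list
alldata[na_col_list].dtypes.sort_values()  # data type of each of those columns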

.dtypes
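Here is a small sketch of what .index.tolist() and .dtypes do in the code above. Everything in it is an assumption for illustration: the two tiny frames are made up, and alldata_toy is built by stacking them with pd.concat, which is my guess at how alldata is prepared, not something shown in this part.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature train/test frames.
train_toy = pd.DataFrame({
    "LotArea": [8450, 9600],
    "LotFrontage": [65.0, np.nan],
    "PoolQC": [np.nan, "Gd"],
})
test_toy = pd.DataFrame({
    "LotArea": [11250, 9550],
    "LotFrontage": [80.0, 60.0],
    "PoolQC": [np.nan, np.nan],
})

# Assumption: the combined data is train and test stacked row-wise.
alldata_toy = pd.concat([train_toy, test_toy], ignore_index=True)

na_counts = alldata_toy.isnull().sum()

# .index holds the column names; .tolist() turns them into a plain Python list.
na_cols = na_counts[na_counts > 0].index.tolist()

# .dtypes gives one data type per column.
col_types = alldata_toy[na_cols].dtypes
print(col_types)

# Split the affected columns into numeric and non-numeric (categorical).
numeric_na = alldata_toy[na_cols].select_dtypes(include="number").columns.tolist()
categorical_na = alldata_toy[na_cols].select_dtypes(exclude="number").columns.tolist()
```

Knowing which affected columns are numeric and which are categorical matters because the two kinds are filled in differently.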

[Image: Screenshot 2020-05-25 12.21.40.png]

Understanding and handling the missing values

This part is about how to treat the data statistically. Just read it through as ordinary text. It is a separate matter from understanding the code.

Both the training data and the test data have a considerable number of missing values. In such a case, you will be tempted to delete the columns with many missing values. Before doing that, though, Kaggle provides a document that describes the variables in detail, so let's look at it first. When you download the data from Kaggle, you will notice that it also contains a file called "data_description.txt". This file explains what is stored in each variable.

Reading it, we learn that most of the missing values do not mean that information is absent. The missing value itself is information. For example, take PoolQC (pool quality), which has the most missing values. A missing value in this variable means that the house has no pool, so the fact that it is missing tells us something. The same holds for the other categorical variables: a missing value simply means that the facility or equipment does not exist. For numeric variables, a missing value just means that the corresponding area is zero, so again it is not a lack of information.

Therefore, the missing values in categorical variables and in numeric variables are each imputed in a way that reflects this meaning.
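As a rough sketch of that idea (not the code used in this series, where the imputation comes in later parts), the two cases could be handled like this. The toy DataFrame, the placeholder string "NA", and the fill value 0.0 are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy data with one categorical and one numeric column (hypothetical values).
toy = pd.DataFrame({
    "PoolQC": [np.nan, "Gd", np.nan],
    "GarageArea": [548.0, np.nan, 460.0],
})

filled = toy.copy()
for col in filled.columns:
    if not filled[col].isnull().any():
        continue  # nothing to fill in this column
    if pd.api.types.is_numeric_dtype(filled[col]):
        # Numeric: missing is taken to mean "the area is zero".
        filled[col] = filled[col].fillna(0.0)
    else:
        # Categorical: missing is taken to mean "the facility does not exist",
        # recorded here with the placeholder label "NA".
        filled[col] = filled[col].fillna("NA")

print(filled)
```

Filling with an explicit label keeps "no pool" as a category of its own instead of throwing the column away.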

That's it.

Hmm. All I did today was look at the data.
