[Hands-on for beginners] Reading Kaggle's "Predicting House Prices" line by line (Part 6: Transforming the distribution of the objective variable)

Theme

This is the 6th entry in my notes from a hands-on where everyone tackles Kaggle's famous "House Prices" problem. It is more a memo than a commentary, but I hope it helps someone somewhere. Preparation was completed last time, and we are finally at the analysis stage.

Today's work

Transforming the distribution of the objective variable

Check the distribution of SalePrice (house price) in the training data. While filling in missing values earlier, we found that most homes do not have a pool. Conversely, this means a few luxury homes do have pools, so the distribution of house prices is presumably quite skewed.

I recall that it is important to plot with such a hypothesis in mind. But first, let's output the graph as instructed.

sns.distplot(train['SalePrice'])

About seaborn

"What is sns?" I forgot it after the beginning, but it was in the library I was importing first. This.

import seaborn as sns

I see, it's seaborn.

Checking the contents of train['SalePrice']

Next, just to be sure, check the contents of train['SalePrice']. I see, one price lined up per row. (screenshot: contents of train['SalePrice'])
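To reproduce this check without the screenshot, something like the following works. The DataFrame here is a hypothetical stand-in with a few made-up prices; in the hands-on, train is the Kaggle training data loaded with pandas.

```python
import pandas as pd

# Hypothetical stand-in for the Kaggle training data;
# in the hands-on, train comes from pd.read_csv('train.csv').
train = pd.DataFrame({'SalePrice': [208500, 181500, 223500, 140000, 250000]})

print(train['SalePrice'].head())      # one price per row, as in the screenshot
print(train['SalePrice'].describe())  # quick summary statistics
```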

Output graph

And the output graph looks like this.

sns.distplot(train['SalePrice'])

(graph: distribution of SalePrice)

Log transformation

As expected, the distribution has a long tail extending far to the right. Applying a log transform brings it closer to a normal distribution.

But first, let's confirm: what is a log transform?

sns.distplot(np.log(train['SalePrice']))

The array before and after the log transform

Let's output just this part.

np.log(train['SalePrice'])

I see, the values have been compressed. (screenshot: train['SalePrice'] after np.log)
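To see concretely how the log compresses large values, here is a small numeric sketch with made-up prices: an 8x ratio in price becomes a modest additive difference in log space.

```python
import numpy as np

prices = np.array([100_000, 200_000, 800_000])
logs = np.log(prices)

print(logs)               # roughly [11.51, 12.21, 13.59]

# Doubling the price always adds the same amount, log(2), in log space:
print(logs[1] - logs[0])  # == np.log(2), about 0.693
```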

Output graph 2

sns.distplot(np.log(train['SalePrice']))

(graph: distribution of log(SalePrice))

It now looks fairly close to a normal distribution.
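One practical point to remember: if the model is trained on log(SalePrice), its predictions come out in log space and must be inverted with np.exp before use. A tiny check:

```python
import numpy as np

log_price = np.log(200_000.0)
price = np.exp(log_price)  # exp inverts the log transform
print(price)               # 200000.0 (up to floating-point error)
```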

Building a predictive model

I wanted to move on to this, but it looks like time has run out, so that's it for today.

Since there are quite a lot of variables this time, we want to impose a strong penalty on the coefficients, so we will build the prediction model with Lasso regression.
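As a preview of the next session, here is a minimal sketch of Lasso regression with scikit-learn. The data here is synthetic and hypothetical; in the hands-on, X would be the dummy-encoded features and y the log-transformed SalePrice.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples, 20 features, where only
# the first feature actually drives the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# alpha controls the strength of the L1 penalty; larger alpha
# drives more coefficients exactly to zero.
model = Lasso(alpha=0.1)
model.fit(X_tr, y_tr)

print(model.score(X_te, y_te))   # R^2 on held-out data
print((model.coef_ != 0).sum())  # number of features the model kept
```

The L1 penalty is what makes Lasso attractive here: with many (mostly dummy) variables, it zeroes out the uninformative ones rather than merely shrinking them.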

After finishing the preparation, I read up on Lasso regression and wrapped up.

Lasso regression

That's it.

Having entered the analysis phase, I realized I need to shore up my background knowledge, mainly on regression analysis.
