This is the sixth entry in a series of notes from a hands-on session where everyone takes on Kaggle's well-known "House Prices" problem. It is more of a memo than a commentary, but I hope it helps someone somewhere. The preparation was finished last time, so this time we finally enter the analysis stage.
First, check the distribution of SalePrice (the house price) in the training data. While filling in missing values, we found that most homes do not have a pool. Seen the other way around, a few mansions do have pools, so we can hypothesize that the distribution of house prices is quite skewed.
I recall being told that it is important to draw plots with a hypothesis like this in mind. For now, though, I simply output the graph as instructed.
sns.distplot(train['SalePrice'])
"What is sns?" I had forgotten since the beginning, but it is one of the libraries imported at the very start. This one.
import seaborn as sns
I see, seaborn.
After that, just in case, I checked the contents of train['SalePrice']. I see, it is the column of sale prices, lined up one per row.
And the output graph looks like this.
[Figure: output of sns.distplot(train['SalePrice'])]
As expected, the tail of the distribution stretches far to the right. Applying a log transformation should bring it closer to a normal distribution.
But first, a check: "What is a log transformation?"
sns.distplot(np.log(train['SalePrice']))
I will output just this part on its own.
np.log(train['SalePrice'])
I see, the values are compressed.
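To make "compressed" concrete, here is a minimal sketch with three made-up prices (hypothetical values chosen for illustration, not taken from the data set). A 15-fold gap between the cheapest and the most expensive house shrinks to a difference of about 2.7 on the log scale, and np.exp undoes the transformation, which matters later when predictions made on the log scale have to be turned back into prices.

```python
import numpy as np

# Hypothetical prices for illustration only (not from the Kaggle data)
prices = np.array([50_000.0, 150_000.0, 750_000.0])

log_prices = np.log(prices)    # natural logarithm, applied element-wise
restored = np.exp(log_prices)  # inverse transform back to the original scale

print(log_prices)
print(restored)
```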
[Figure: output of sns.distplot(np.log(train['SalePrice']))]
It now looks fairly close to a normal distribution.
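The "looks fairly normal" impression can also be checked with a number. One simple option is the sample skewness from pandas, which is near 0 for a symmetric distribution and positive for a long right tail. The sketch below again uses synthetic log-normal values as a stand-in for train['SalePrice'] (an assumption, not the real data), so the exact figures will differ from the hands-on.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed stand-in for train['SalePrice'] (assumption)
rng = np.random.default_rng(0)
sale_price = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=5000))

raw_skew = sale_price.skew()          # clearly positive: long right tail
log_skew = np.log(sale_price).skew()  # close to 0 after the log transformation

print(f"skewness before log: {raw_skew:.2f}")
print(f"skewness after log:  {log_skew:.2f}")
```

On the real data, the same two calls would be train['SalePrice'].skew() and np.log(train['SalePrice']).skew().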
I wanted to move on to building the prediction model, but it looks like time has run out, so that's it for today.
Since the number of variables is quite large this time, we want to impose a strong penalty on the coefficients, so the plan is to build the prediction model using Lasso regression.
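As a note to myself on what "a strong penalty on the coefficients" does, here is a minimal sketch using scikit-learn's Lasso. It is not the model from the hands-on: the data is synthetic (50 features, of which only the first 3 actually influence the target), and alpha=0.1 is an arbitrary value picked for illustration. The point is that the L1 penalty drives the coefficients of unhelpful variables to exactly zero.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): many features, only the first three matter
rng = np.random.default_rng(0)
n_samples, n_features = 200, 50
X = rng.normal(size=(n_samples, n_features))
true_coef = np.zeros(n_features)
true_coef[:3] = [3.0, -2.0, 1.5]
y = X @ true_coef + rng.normal(scale=0.5, size=n_samples)

# Standardize first so the penalty treats every feature on the same scale;
# alpha controls the penalty strength (0.1 is an arbitrary illustrative value)
model = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
model.fit(X, y)

coef = model.named_steps["lasso"].coef_
n_nonzero = int(np.sum(coef != 0))
print(f"{n_nonzero} of {n_features} coefficients are non-zero")
```

Raising alpha removes more variables, and lowering it keeps more; in practice the value is usually chosen by cross-validation, for example with LassoCV.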
With that as preparation, I looked into Lasso regression and finished for the day.
Now that we have entered the analysis stage, I realized that I need to shore up my background knowledge, mainly about regression analysis.