This time, we will tackle the SalePrice prediction problem. To begin with, I would like to make a prediction from a very simple first-order (linear) regression equation. The real fun of this kind of problem lies in processing and optimizing a large number of features, but I would like to start with a simple prediction.
Reference URL https://www.kaggle.com/katotaka/kaggle-prediction-house-prices
The versions used are as follows.
Python 3.7.6 numpy 1.18.1 pandas 1.0.1 matplotlib 3.1.3 scikit-learn 0.22.1
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
import seaborn as sns
#Settings for inline display in Jupyter Notebook (without this, the graph will open in a separate window)
%matplotlib inline
I imported pandas for loading the CSV files, numpy for array processing, matplotlib and seaborn for drawing graphs, and sklearn.linear_model for the regression.
df = pd.read_csv("train.csv")
df
Not everything can be displayed at once because there are so many features, but the data lists many housing conditions (area, whether the house faces a road, whether it has a pool, and so on). We will evaluate how these conditions affect the sale price and use them to predict it.
corrmat = df.corr()
k = 10 #number of variables for heatmap
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(df[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()
I would like to find a feature with a high correlation coefficient with respect to the house price, so let's take a look at the seaborn heatmap of the 10 features most strongly correlated with SalePrice. The feature with the highest correlation coefficient with SalePrice is OverallQual (overall quality). It makes intuitive sense that the higher the quality, the higher the price.
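The same ranking can also be read off as a plain sorted list instead of a heatmap. The following is a minimal sketch under stated assumptions: the table `toy` and its values are synthetic stand-ins I made up for illustration (not the real train.csv), and the column `Unrelated` is hypothetical. Selecting the numeric columns explicitly also avoids errors on text columns in newer pandas releases.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for train.csv (made-up values, not the real data)
rng = np.random.default_rng(0)
n = 200
qual = rng.integers(1, 11, size=n)          # quality levels 1..10
area = rng.normal(1500, 400, size=n)        # living area
unrelated = rng.normal(0, 1, size=n)        # a feature with no effect on price
price = 40000 * qual + 50 * area + rng.normal(0, 20000, size=n)

toy = pd.DataFrame({
    "OverallQual": qual,
    "GrLivArea": area,
    "Unrelated": unrelated,
    "SalePrice": price,
})

# Correlation of every numeric column with SalePrice, strongest first
ranking = (
    toy.select_dtypes(include="number")
       .corr()["SalePrice"]
       .drop("SalePrice")
       .sort_values(ascending=False)
)
print(ranking)
```

With the real data, replacing `toy` with `df` should give the same ordering that the heatmap shows in its SalePrice row.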
X = df[["OverallQual"]].values
y = df["SalePrice"].values
slr = LinearRegression()
slr.fit(X,y)
#Scatter plot creation
plt.scatter(X,y)
plt.xlabel('OverallQual')
plt.ylabel('House Price($)')
#Draw the fitted regression line
plt.plot(X, slr.predict(X), color='red')
#graph display
plt.show()
I plotted the relationship between OverallQual and SalePrice. The fitted line captures the general trend. However, where OverallQual is low, the line underestimates the price. There is also a large spread in prices where OverallQual is high. I think these cases could be predicted more precisely with the help of other features, but this time we will predict with the model as it is.
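To put numbers on what the plot shows, one option is to print the fitted slope and intercept and then summarize the residuals (actual minus predicted) for each quality level. This is only a sketch: the data below is synthetic, and I assumed a price that grows faster than linearly with quality so that a straight line misses at the ends. The real data may behave differently.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in (assumption): price curves upward with quality
rng = np.random.default_rng(1)
qual = rng.integers(1, 11, size=500)
price = 30000 + 3000 * qual**2 + rng.normal(0, 10000, size=500)

X_toy = qual.reshape(-1, 1)
model = LinearRegression()
model.fit(X_toy, price)

# The fitted first-order equation: price = slope * OverallQual + intercept
slope = model.coef_[0]
intercept = model.intercept_
print(f"price = {slope:.0f} * OverallQual + {intercept:.0f}")

# Residual = actual - predicted; a positive mean means the line underestimates
residuals = pd.DataFrame({
    "OverallQual": qual,
    "residual": price - model.predict(X_toy),
})
by_level = residuals.groupby("OverallQual")["residual"].agg(["mean", "std"])
print(by_level)
```

In the article's notebook, the same table could be built from `X`, `y`, and `slr`; a positive mean residual at low OverallQual would correspond to the underestimation seen in the graph.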
#Read test data
df_test = pd.read_csv('test.csv')
#Set the OverallQual values of the test data to X_test
X_test = df_test[["OverallQual"]].values
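y_test_pred = slr.predict(X_test)
#Store the predictions in a SalePrice column (test.csv has no such column)
df_test["SalePrice"] = y_test_pred
df_test[["Id", "SalePrice"]].to_csv("submission.csv", index = False)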
The score when submitted to Kaggle was 0.84342 (4563rd place out of 4720 teams). In the next article, I would like to analyze the data in more detail and aim for a better score.
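For reference, as far as I understand the competition rules, the score is the root mean squared error between the logarithm of the predicted price and the logarithm of the actual price, so lower is better. The sketch below shows how such a score could be estimated locally; the function name `rmse_of_logs` and the sample prices are my own, chosen only for illustration.

```python
import numpy as np

def rmse_of_logs(y_true, y_pred):
    """Root mean squared error between log(y_true) and log(y_pred)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # The logarithm is undefined for zero or negative prices
    if (y_true <= 0).any() or (y_pred <= 0).any():
        raise ValueError("log is undefined for non-positive prices")
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))

# Made-up prices, just to show the call
actual = [200000, 150000, 300000]
predicted = [220000, 120000, 310000]
print(rmse_of_logs(actual, predicted))
```

Because the metric works on logarithms, a straight-line model that produces zero or negative prices for some houses cannot be scored this way without first adjusting those predictions, which is worth checking before submitting.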