We'll work through the SalePrice forecasting problem used as a tutorial competition on Kaggle. Last time we made a simple prediction with linear regression; this time we will focus on data preprocessing such as missing value imputation, conversion of categorical variables, and creation of new features. The machine learning and prediction are covered in the second half.
Second half: https://qiita.com/Fumio-eisan/items/7e13695ef5ccc6acf61c
I used this kernel as a reference: https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard
First, download train.csv and test.csv from the competition page below into any folder. https://www.kaggle.com/c/house-prices-advanced-regression-techniques
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
%matplotlib inline
import matplotlib.pyplot as plt # Matlab-style plotting
import seaborn as sns
color = sns.color_palette()
sns.set_style('darkgrid')
import warnings
def ignore_warn(*args, **kwargs):
    pass
warnings.warn = ignore_warn  # ignore annoying warnings (from sklearn and seaborn)
from scipy import stats
from scipy.stats import norm, skew #for some statistics
pd.set_option('display.float_format', lambda x: '{:.3f}'.format(x)) #Limiting floats output to 3 decimal points
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
First, read in both the train and test data.
train_ID = train['Id']
test_ID = test['Id']
train.drop("Id", axis = 1, inplace = True)
test.drop("Id", axis = 1, inplace = True)
The Id column is not needed as a feature, so save it to separate variables and then drop it from both DataFrames.
fig, ax = plt.subplots()
ax.scatter(x = train['GrLivArea'], y = train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()
Plotting GrLivArea against SalePrice shows two outliers with GrLivArea above 4000 but SalePrice below 300,000. These points make the prediction harder, so we delete them.
#Deleting outliers
train = train.drop(train[(train['GrLivArea']>4000) & (train['SalePrice']<300000)].index)
#Check the graphic again
fig, ax = plt.subplots()
ax.scatter(train['GrLivArea'], train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()
The outliers have been removed. However, deleting outliers should not be done casually; it is better to first consider what those points mean as data.
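Incidentally, before running the drop above, it can be worth inspecting exactly which rows match the condition; a minimal sketch using the same condition:
# Inspect the rows matched by the outlier condition before deleting them
outliers = train[(train['GrLivArea'] > 4000) & (train['SalePrice'] < 300000)]
print(outliers[['GrLivArea', 'SalePrice', 'OverallQual']])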
Next, take the logarithm of SalePrice so that its distribution becomes closer to a normal distribution. I won't go into the detailed reasons here and plan to summarize them separately; the following articles explain the motivation:
https://qiita.com/ttskng/items/2a33c1ca925e4501e609 https://ishitonton.hatenablog.com/entry/2019/02/24/184253
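One way to see the effect quantitatively is the skewness of the distribution (scipy's skew is already imported above). A minimal sketch comparing the raw column with its log1p transform, without modifying the data yet:
# Compare the skewness of SalePrice before and after a log(1 + x) transform
print("Skewness (raw)  : {:.3f}".format(skew(train['SalePrice'])))
print("Skewness (log1p): {:.3f}".format(skew(np.log1p(train['SalePrice']))))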
#We use the numpy function log1p, which applies log(1 + x) to all elements of the column
train["SalePrice"] = np.log1p(train["SalePrice"])
#Check the new distribution
sns.distplot(train['SalePrice'] , fit=norm);
# Get the fitted parameters used by the function
(mu, sigma) = norm.fit(train['SalePrice'])
print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))
#Now plot the distribution
plt.legend(['Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
loc='best')
plt.ylabel('Frequency')
plt.title('SalePrice distribution')
#Get also the QQ-plot
fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()
**Before the change** (distribution and Q-Q plot of the raw SalePrice)
**After the change** (distribution and Q-Q plot after the log transform)
Taking the logarithm brings the SalePrice distribution much closer to a normal distribution.
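Keep in mind that the model will now be trained on log(1 + SalePrice), so predictions have to be mapped back with np.expm1 before submission. A minimal sketch, where pred_log is a hypothetical array of predictions on the log scale:
# pred_log is hypothetical: predictions a model would produce on the log(1 + price) scale
pred_log = np.array([12.0, 12.2, 12.5])
pred_price = np.expm1(pred_log)  # invert log1p to get prices back on the original scale
print(pred_price)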
Next, we will handle missing values, create new features, and convert categorical variables into dummy variables. First, concatenate the train and test data into a single DataFrame.
ntrain = train.shape[0]
ntest = test.shape[0]
y_train = train.SalePrice.values
all_data = pd.concat((train, test)).reset_index(drop=True)
all_data.drop(['SalePrice'], axis=1, inplace=True)
print("all_data size is : {}".format(all_data.shape))
Next, we will process the missing values.
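Before filling anything in, it helps to check which columns actually contain missing values. A minimal sketch using the all_data frame built above:
# Percentage of missing values per column, largest first (columns with no NAs dropped)
all_data_na = (all_data.isnull().sum() / len(all_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)
print(all_data_na.head(20))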
all_data["PoolQC"] = all_data["PoolQC"].fillna("None")
all_data["MiscFeature"] = all_data["MiscFeature"].fillna("None")
all_data["Alley"] = all_data["Alley"].fillna("None")
all_data["Fence"] = all_data["Fence"].fillna("None")
all_data["FireplaceQu"] = all_data["FireplaceQu"].fillna("None")
all_data["LotFrontage"] = all_data.groupby("Neighborhood")["LotFrontage"].transform(
lambda x: x.fillna(x.median()))
for col in ('GarageType', 'GarageFinish', 'GarageQual', 'GarageCond'):
all_data[col] = all_data[col].fillna('None')
for col in ('BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF','TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath'):
all_data[col] = all_data[col].fillna(0)
for col in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'):
all_data[col] = all_data[col].fillna('None')
all_data["MasVnrType"] = all_data["MasVnrType"].fillna("None")
all_data["MasVnrArea"] = all_data["MasVnrArea"].fillna(0)
all_data['MSZoning'] = all_data['MSZoning'].fillna(all_data['MSZoning'].mode()[0])
all_data = all_data.drop(['Utilities'], axis=1)
all_data["Functional"] = all_data["Functional"].fillna("Typ")
all_data['Electrical'] = all_data['Electrical'].fillna(all_data['Electrical'].mode()[0])
all_data['KitchenQual'] = all_data['KitchenQual'].fillna(all_data['KitchenQual'].mode()[0])
all_data['Exterior1st'] = all_data['Exterior1st'].fillna(all_data['Exterior1st'].mode()[0])
all_data['Exterior2nd'] = all_data['Exterior2nd'].fillna(all_data['Exterior2nd'].mode()[0])
all_data['SaleType'] = all_data['SaleType'].fillna(all_data['SaleType'].mode()[0])
all_data['MSSubClass'] = all_data['MSSubClass'].fillna("None")
Next, convert the categorical variables into dummy variables. **One caveat when doing this: combine the train and test data first and process them together. If you create the dummy variables on the two datasets separately, the numbers of columns may not match, and you will not be able to predict at the end.**
I processed them separately at first, and later the numbers of columns did not match. I tried to force the columns into alignment by exporting the missing and extra dummy columns to CSV, but eventually gave up. Combine the data from the beginning.
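If the dummies do have to be created separately for some reason, pandas can still reconcile the column sets afterwards. A sketch of one way to do it (train_d and test_d are hypothetical DataFrames produced by separate get_dummies calls):
# Hypothetical: dummy variables created on train and test separately
train_d = pd.get_dummies(train.drop('SalePrice', axis=1))
test_d = pd.get_dummies(test)
# Align test's columns to train's; categories missing from test become all-zero columns
train_d, test_d = train_d.align(test_d, join='left', axis=1, fill_value=0)
print(train_d.shape, test_d.shape)
Here, however, we simply encode the combined all_data: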
all_data = pd.get_dummies(all_data)
print(all_data.shape)
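After the dummy encoding, all_data presumably gets split back into its training and test portions for the modeling in the second half. A minimal sketch using the row count saved earlier:
# Split the combined frame back into train and test rows using the saved row count
train_X = all_data[:ntrain]
test_X = all_data[ntrain:]
print(train_X.shape, test_X.shape, y_train.shape)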
That's all for this part. In the second half, we will actually train models and work on improving their accuracy.
Second half: https://qiita.com/Fumio-eisan/items/7e13695ef5ccc6acf61c
The full program is here. https://github.com/Fumio-eisan/houseprice20200301