We'll work through the SalePrice forecasting problem used as a tutorial competition on Kaggle. Last time we made a simple prediction with linear regression; this time we will focus on data preprocessing such as missing value imputation, conversion of categorical variables, and creation of new features. The machine learning and prediction are covered in the second half.
Second half: https://qiita.com/Fumio-eisan/items/7e13695ef5ccc6acf61c
I used this kernel as a reference: https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard
First, download train.csv and test.csv from the competition page below into any folder. https://www.kaggle.com/c/house-prices-advanced-regression-techniques
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
%matplotlib inline
import matplotlib.pyplot as plt # Matlab-style plotting
import seaborn as sns
color = sns.color_palette()
sns.set_style('darkgrid')
import warnings
def ignore_warn(*args, **kwargs):
    pass
warnings.warn = ignore_warn  # ignore annoying warnings (from sklearn and seaborn)
from scipy import stats
from scipy.stats import norm, skew #for some statistics
pd.set_option('display.float_format', lambda x: '{:.3f}'.format(x)) #Limiting floats output to 3 decimal points
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
First, read in both the train and test data.
train_ID = train['Id']
test_ID = test['Id']
train.drop("Id", axis = 1, inplace = True)
test.drop("Id", axis = 1, inplace = True)
The Id column is not needed as a feature, so save it to separate variables and then drop it from both DataFrames.
fig, ax = plt.subplots()
ax.scatter(x = train['GrLivArea'], y = train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()
Plotting GrLivArea against SalePrice shows two outliers with GrLivArea above 4000 but SalePrice below 300,000. These points make the prediction harder, so we delete them.
#Deleting outliers
train = train.drop(train[(train['GrLivArea']>4000) & (train['SalePrice']<300000)].index)
#Check the graphic again
fig, ax = plt.subplots()
ax.scatter(train['GrLivArea'], train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()
The outliers have been removed. However, deleting outliers should not be done casually; it is better to first consider what those points mean as data.
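Incidentally, before running the drop above, it can be worth inspecting exactly which rows match the condition; a minimal sketch using the same condition:
# Inspect the rows matched by the outlier condition before deleting them
outliers = train[(train['GrLivArea'] > 4000) & (train['SalePrice'] < 300000)]
print(outliers[['GrLivArea', 'SalePrice', 'OverallQual']])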
Next, take the logarithm of SalePrice so that its distribution becomes closer to a normal distribution. I won't go into the detailed reasons here and plan to summarize them separately; the following articles explain the motivation:
https://qiita.com/ttskng/items/2a33c1ca925e4501e609 https://ishitonton.hatenablog.com/entry/2019/02/24/184253
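One way to see the effect quantitatively is the skewness of the distribution (scipy's skew is already imported above). A minimal sketch comparing the raw column with its log1p transform, without modifying the data yet:
# Compare the skewness of SalePrice before and after a log(1 + x) transform
print("Skewness (raw)  : {:.3f}".format(skew(train['SalePrice'])))
print("Skewness (log1p): {:.3f}".format(skew(np.log1p(train['SalePrice']))))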
#We use the numpy function log1p, which applies log(1 + x) to all elements of the column
train["SalePrice"] = np.log1p(train["SalePrice"])
#Check the new distribution
sns.distplot(train['SalePrice'] , fit=norm);
# Get the fitted parameters used by the function
(mu, sigma) = norm.fit(train['SalePrice'])
print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))
#Now plot the distribution
plt.legend(['Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
loc='best')
plt.ylabel('Frequency')
plt.title('SalePrice distribution')
#Get also the QQ-plot
fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()
**Before the change** (distribution and Q-Q plot of the raw SalePrice)
**After the change** (distribution and Q-Q plot after the log transform)
Taking the logarithm brings the SalePrice distribution much closer to a normal distribution.
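Keep in mind that the model will now be trained on log(1 + SalePrice), so predictions have to be mapped back with np.expm1 before submission. A minimal sketch, where pred_log is a hypothetical array of predictions on the log scale:
# pred_log is hypothetical: predictions a model would produce on the log(1 + price) scale
pred_log = np.array([12.0, 12.2, 12.5])
pred_price = np.expm1(pred_log)  # invert log1p to get prices back on the original scale
print(pred_price)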
Next, we will handle missing values, create new features, and convert categorical variables into dummy variables. First, concatenate the train and test data into a single DataFrame.
ntrain = train.shape[0]
ntest = test.shape[0]
y_train = train.SalePrice.values
all_data = pd.concat((train, test)).reset_index(drop=True)
all_data.drop(['SalePrice'], axis=1, inplace=True)
print("all_data size is : {}".format(all_data.shape))
Next, we will process the missing values.
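Before filling anything in, it helps to check which columns actually contain missing values. A minimal sketch using the all_data frame built above:
# Percentage of missing values per column, largest first (columns with no NAs dropped)
all_data_na = (all_data.isnull().sum() / len(all_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)
print(all_data_na.head(20))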
all_data["PoolQC"] = all_data["PoolQC"].fillna("None")
all_data["MiscFeature"] = all_data["MiscFeature"].fillna("None")
all_data["Alley"] = all_data["Alley"].fillna("None")
all_data["Fence"] = all_data["Fence"].fillna("None")
all_data["FireplaceQu"] = all_data["FireplaceQu"].fillna("None")
all_data["LotFrontage"] = all_data.groupby("Neighborhood")["LotFrontage"].transform(
lambda x: x.fillna(x.median()))
for col in ('GarageType', 'GarageFinish', 'GarageQual', 'GarageCond'):
all_data[col] = all_data[col].fillna('None')
for col in ('BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF','TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath'):
all_data[col] = all_data[col].fillna(0)
for col in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'):
all_data[col] = all_data[col].fillna('None')
all_data["MasVnrType"] = all_data["MasVnrType"].fillna("None")
all_data["MasVnrArea"] = all_data["MasVnrArea"].fillna(0)
all_data['MSZoning'] = all_data['MSZoning'].fillna(all_data['MSZoning'].mode()[0])
all_data = all_data.drop(['Utilities'], axis=1)
all_data["Functional"] = all_data["Functional"].fillna("Typ")
all_data['Electrical'] = all_data['Electrical'].fillna(all_data['Electrical'].mode()[0])
all_data['KitchenQual'] = all_data['KitchenQual'].fillna(all_data['KitchenQual'].mode()[0])
all_data['Exterior1st'] = all_data['Exterior1st'].fillna(all_data['Exterior1st'].mode()[0])
all_data['Exterior2nd'] = all_data['Exterior2nd'].fillna(all_data['Exterior2nd'].mode()[0])
all_data['SaleType'] = all_data['SaleType'].fillna(all_data['SaleType'].mode()[0])
all_data['MSSubClass'] = all_data['MSSubClass'].fillna("None")
Next, convert the categorical variables into dummy variables. **One caveat when doing this: combine the train and test data first and process them together. If you create the dummy variables on the two datasets separately, the numbers of columns may not match, and you will not be able to predict at the end.**
I processed them separately at first, and later the numbers of columns did not match. I tried to force the columns into alignment by exporting the missing and extra dummy columns to CSV, but eventually gave up. Combine the data from the beginning.
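If the dummies do have to be created separately for some reason, pandas can still reconcile the column sets afterwards. A sketch of one way to do it (train_d and test_d are hypothetical DataFrames produced by separate get_dummies calls):
# Hypothetical: dummy variables created on train and test separately
train_d = pd.get_dummies(train.drop('SalePrice', axis=1))
test_d = pd.get_dummies(test)
# Align test's columns to train's; categories missing from test become all-zero columns
train_d, test_d = train_d.align(test_d, join='left', axis=1, fill_value=0)
print(train_d.shape, test_d.shape)
Here, however, we simply encode the combined all_data: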
all_data = pd.get_dummies(all_data)
print(all_data.shape)
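After the dummy encoding, all_data presumably gets split back into its training and test portions for the modeling in the second half. A minimal sketch using the row count saved earlier:
# Split the combined frame back into train and test rows using the saved row count
train_X = all_data[:ntrain]
test_X = all_data[ntrain:]
print(train_X.shape, test_X.shape, y_train.shape)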
That's all for this part. In the second half, we will actually train models and work on improving their accuracy.
Second half: https://qiita.com/Fumio-eisan/items/7e13695ef5ccc6acf61c
The full program is here. https://github.com/Fumio-eisan/houseprice20200301