Previously, I implemented the prediction with linear regression only; this time I also tried non-linear models.
I continued from the data preprocessing described in this article: "Data Preprocessing" - Kaggle Popular Tutorial
I built the following four models:
① Linear regression (LinearRegression)
② Ridge regression (Ridge)
③ Support vector machine regression (SVR)
④ Random forest regression (RandomForestRegressor)
#Explanatory variables and objective variables
x = df_train[['OverallQual', 'YearBuilt', 'TotalBsmtSF', 'GrLivArea']]
y = df_train['SalePrice']
#Import module
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
#Separate training data and test data
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
def calc_model(model):
    #Train the model
    model.fit(X_train, y_train)
    #Predict on X_test
    pred_y = model.predict(X_test)
    #Get the mean squared error
    score = mean_squared_error(y_test, pred_y)
    return score
#Linear regression
from sklearn.linear_model import LinearRegression
#Build a model
lr = LinearRegression()
#Calculate mean square error
lr_score = calc_model(lr)
lr_score
# >>>output
0.02824050462867693
#Ridge regression
from sklearn.linear_model import Ridge
#Build a model
ridge = Ridge()
#Calculate mean square error
ridge_score = calc_model(ridge)
ridge_score
# >>>output
0.028202963714955512
#Support vector machine regression
from sklearn.svm import SVR
#Build a model
svr = SVR()
#Calculate mean square error
svr_score = calc_model(svr)
svr_score
# >>>output
0.08767857928794534
#Random forest regression
from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor()
#Calculate mean square error
forest_score = calc_model(forest)
forest_score
# >>>output
0.03268455739481754
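As a result, the non-linear models had a larger mean squared error (SVR: 0.0877, random forest: 0.0327) than linear regression (0.02824) and Ridge regression (0.02820). Ridge regression had the smallest error, so it is used for the prediction below.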
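One possible reason for the large SVR error is feature scale: with the default RBF kernel, an unscaled column such as GrLivArea (values in the thousands) can dominate a column such as OverallQual (values from 1 to 10). The following is a minimal sketch on synthetic data, not the Kaggle data; the column meanings, the coefficients, and the helper name `holdout_mse` are assumptions made for illustration. It compares a plain SVR with an SVR wrapped in a StandardScaler pipeline.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (assumption): a small-scale "quality" column
# and a large-scale "area" column, with a log-price-like target.
rng = np.random.default_rng(0)
n = 400
quality = rng.integers(1, 11, size=n).astype(float)
area = rng.uniform(500, 3000, size=n)
X = np.column_stack([quality, area])
y = 11.0 + 0.15 * quality + 0.0002 * area + rng.normal(0, 0.05, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)


def holdout_mse(model):
    """Fit on the training split and return the MSE on the held-out split."""
    model.fit(X_train, y_train)
    return mean_squared_error(y_test, model.predict(X_test))


# Same SVR settings; the only difference is standardizing the features first.
unscaled_mse = holdout_mse(SVR())
scaled_mse = holdout_mse(make_pipeline(StandardScaler(), SVR()))
print(f"SVR without scaling: {unscaled_mse:.4f}")
print(f"SVR with scaling:    {scaled_mse:.4f}")
```

On the real data the gap may be smaller or absent, so the same pipeline would need to be passed to calc_model to check whether scaling actually helps here.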
#Test data preprocessing
#Extract the value of Id
df_test_index = df_test['Id']
#Select the same explanatory variables as in training (copy to avoid modifying a view)
df_test = df_test[['OverallQual', 'YearBuilt', 'TotalBsmtSF', 'GrLivArea']].copy()
#Check for missing values
df_test.isnull().sum()
# >>>output
OverallQual 0
YearBuilt 0
TotalBsmtSF 1
GrLivArea 0
dtype: int64
Fill the one missing value in TotalBsmtSF with the column mean.
#Fill missing values with the mean
df_test['TotalBsmtSF'] = df_test['TotalBsmtSF'].fillna(df_test['TotalBsmtSF'].mean())
#Check for missing values
df_test.isnull().sum()
# >>>output
OverallQual 0
YearBuilt 0
TotalBsmtSF 0
GrLivArea 0
dtype: int64
There are no missing values.
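The fill above uses the mean of the test data itself. An alternative that is sometimes preferred is to learn the fill value from the training data and reuse it for the test data. Below is a minimal sketch with scikit-learn's SimpleImputer on small made-up frames; the row values and the names `train_part` and `test_part` are hypothetical and only stand in for the real data frames.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

cols = ['OverallQual', 'YearBuilt', 'TotalBsmtSF', 'GrLivArea']

# Made-up rows standing in for the training and test frames (assumption).
train_part = pd.DataFrame(
    [[7, 2003, 856.0, 1710], [6, 1976, 1262.0, 1262], [8, 2001, 920.0, 1786]],
    columns=cols)
test_part = pd.DataFrame(
    [[5, 1961, 882.0, 896], [6, 1958, np.nan, 1329]],
    columns=cols)

# Learn the column means from the training rows only.
imputer = SimpleImputer(strategy='mean')
imputer.fit(train_part)

# Apply the learned means to the test rows; transform returns a NumPy array,
# so wrap it back into a DataFrame with the original columns and index.
test_filled = pd.DataFrame(
    imputer.transform(test_part), columns=cols, index=test_part.index)

print(test_filled.isnull().sum())
print(test_filled.loc[1, 'TotalBsmtSF'])
```

With a single missing value, as here, the difference between the two approaches is likely to be small; the sketch mainly shows how the same fill value can be shared between training and test data.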
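#Import modules
import numpy as np
import pandas as pd
#Predict with the trained Ridge model
pred_y = ridge.predict(df_test)
#Create a data frame (np.exp assumes SalePrice was log-transformed during preprocessing)
submission = pd.DataFrame({'Id': df_test_index,
                           'SalePrice': np.exp(pred_y)})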
#Output to CSV file
submission.to_csv('submission.csv', index=False)
The submission scored 0.17184, so the result did not improve.
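For a rough comparison between the hold-out error and the submitted score, the mean squared error can be converted to a root mean squared error. This sketch assumes that SalePrice was log-transformed during preprocessing (as the np.exp call in the submission code suggests) and that the submitted score is a root mean squared error on log prices; both are assumptions, not something shown in this article.

```python
import math

# Hold-out mean squared error of the Ridge model reported above.
ridge_mse = 0.028202963714955512

# Taking the square root puts it on the same scale as an RMSE-type score.
ridge_rmse = math.sqrt(ridge_mse)
print(round(ridge_rmse, 4))  # 0.1679
```

Under those assumptions, the hold-out value of about 0.168 is close to the submitted score of 0.17184, which suggests the hold-out split is a reasonable guide to the submission result.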