As the next step after Titanic, I tried predicting house prices, which is another introductory Kaggle competition. There are quite a few articles about Titanic, but fewer about House Prices, so I am posting this. Since I am a beginner, my score was low, and I would appreciate any advice.
Data preprocessing was done with reference to this article: "Data Preprocessing" – a popular Kaggle tutorial.
Since this is a regression problem, I will try linear regression, Lasso regression, and Ridge regression.
#Prepare training data
X_train = df_train[['OverallQual', 'YearBuilt', 'TotalBsmtSF', 'GrLivArea']]
y_train = df_train['SalePrice']
#Split the training data into training and validation sets
from sklearn.model_selection import train_test_split
train_X, test_X, train_y, test_y = train_test_split(
X_train, y_train, random_state=42)
__ Model building __
#Linear regression
#Import module
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
#Linear regression
lr = LinearRegression()
lr.fit(train_X, train_y)
print("Linear regression:{}".format(lr.score(test_X, test_y)))
#Lasso regression
lasso = Lasso()
lasso.fit(train_X, train_y)
print("Lasso regression:{}".format(lasso.score(test_X, test_y)))
#Ridge regression
ridge = Ridge()
ridge.fit(train_X, train_y)
print("Ridge regression:{}".format(ridge.score(test_X, test_y)))
The results are as follows:
__ Linear regression: 0.8320945695605152 __
__ Lasso regression: 0.5197737962239536 __
__ Ridge regression: 0.8324316647361567 __
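The Lasso score is far below the other two, most likely because the default `alpha=1.0` is applied to unscaled features, which can zero out small-scale but informative columns. A minimal sketch of the fix on toy data (the synthetic "quality" and "area" columns and their coefficients below are assumptions standing in for the notebook's features, not the competition data): standardize the features and let `LassoCV` choose alpha by cross-validation.

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data standing in for the house-price features (an assumption):
# a small-scale but strongly predictive "quality" column, a large-scale
# "area" column, and a small-scale target like log(SalePrice).
rng = np.random.default_rng(42)
n = 500
quality = rng.integers(1, 11, n).astype(float)   # scale ~1-10
area = rng.normal(1500, 400, n)                  # scale ~1000s
X = np.column_stack([quality, area])
y = 0.1 * quality + 0.0003 * area + rng.normal(0, 0.1, n)

train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=42)

# Default Lasso (alpha=1.0) on raw features vs. standardized features
# with alpha chosen by 5-fold cross-validation.
plain = Lasso().fit(train_X, train_y)
tuned = make_pipeline(StandardScaler(), LassoCV(cv=5)).fit(train_X, train_y)

s_plain = plain.score(test_X, test_y)
s_tuned = tuned.score(test_X, test_y)
print("Lasso (default):", s_plain)
print("Lasso (scaled + tuned alpha):", s_tuned)
```

On this toy data the default penalty shrinks the small-scale quality coefficient to zero, so the default model scores poorly while the scaled, tuned model recovers it.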
__ Data reading __
#Read test data
df_test = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/test.csv')
output
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
0 1461 20 RH 80.0 11622 Pave NaN Reg Lvl AllPub ... 120 0 NaN MnPrv NaN 0 6 2010 WD Normal
1 1462 20 RL 81.0 14267 Pave NaN IR1 Lvl AllPub ... 0 0 NaN NaN Gar2 12500 6 2010 WD Normal
2 1463 60 RL 74.0 13830 Pave NaN IR1 Lvl AllPub ... 0 0 NaN MnPrv NaN 0 3 2010 WD Normal
3 1464 60 RL 78.0 9978 Pave NaN IR1 Lvl AllPub ... 0 0 NaN NaN NaN 0 6 2010 WD Normal
4 1465 120 RL 43.0 5005 Pave NaN IR1 HLS AllPub ... 144 0 NaN NaN NaN 0 1 2010 WD Normal
5 rows × 80 columns
__ Check for missing values __
#Check for missing values
df_test[['OverallQual', 'YearBuilt', 'TotalBsmtSF', 'GrLivArea']].isnull().sum()
output
OverallQual 0
YearBuilt 0
TotalBsmtSF 1
GrLivArea 0
dtype: int64
There is a missing value in TotalBsmtSF (basement floor area). Here, the missing value is filled with the column's mean.
#Fill the missing value with the mean
df_test['TotalBsmtSF'] = df_test['TotalBsmtSF'].fillna(df_test['TotalBsmtSF'].mean())
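As an aside, filling with the test set's own mean leaks test statistics into preprocessing; the more standard approach is to compute the fill value on the training data and apply it to the test data. A sketch with scikit-learn's `SimpleImputer` (the toy frames below are assumptions for illustration, not the competition data):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames standing in for df_train / df_test (assumed for illustration).
df_train_demo = pd.DataFrame({'TotalBsmtSF': [800.0, 1000.0, 1200.0]})
df_test_demo = pd.DataFrame({'TotalBsmtSF': [900.0, np.nan]})

# Fit the imputer on training data only, then apply that same mean to the test set.
imputer = SimpleImputer(strategy='mean')
imputer.fit(df_train_demo[['TotalBsmtSF']])
df_test_demo[['TotalBsmtSF']] = imputer.transform(df_test_demo[['TotalBsmtSF']])
print(df_test_demo['TotalBsmtSF'].tolist())  # NaN filled with the train mean (1000.0)
```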
__ Perform the remaining preprocessing __
#Extract ID
df_test_index = df_test['Id']
#Logarithmic conversion
df_test['GrLivArea'] = np.log(df_test['GrLivArea'])
#Convert categorical variables
df_test = pd.get_dummies(df_test)
#Confirm that no missing values remain
df_test[df_test['TotalBsmtSF'].isnull()]
X_test = df_test[['OverallQual', 'YearBuilt', 'TotalBsmtSF', 'GrLivArea']]
__ Linear regression __
#Linear regression
#Predicted value
pred_y = lr.predict(X_test)
#Creating a data frame
submission = pd.DataFrame({'Id': df_test_index,
'SalePrice': np.exp(pred_y)})
#Output to CSV file
submission.to_csv('submission_lr.csv', index=False)
__ Lasso regression __
#Lasso regression
#Predicted value
pred_y = lasso.predict(X_test)
#Creating a data frame
submission = pd.DataFrame({'Id': df_test_index,
'SalePrice': np.exp(pred_y)})
#Output to CSV file
submission.to_csv('submission_lasso.csv', index=False)
__ Ridge regression __
#Ridge regression
#Predicted value
pred_y = ridge.predict(X_test)
#Creating a data frame
submission = pd.DataFrame({'Id': df_test_index,
'SalePrice': np.exp(pred_y)})
#Output to CSV file
submission.to_csv('submission_ridge.csv', index=False)
The Ridge regression submission scored 0.16450 (lower is better).
So, how can the score be improved?
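One common next step, sketched here on synthetic data (`make_regression` stands in for the real feature matrix, an assumption): tune Ridge's alpha over a grid with `RidgeCV`, and judge the model by cross-validated score rather than a single train/test split.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the notebook's feature matrix (an assumption).
X, y = make_regression(n_samples=500, n_features=4, noise=15.0, random_state=0)

# Search alpha over a log-spaced grid; RidgeCV uses efficient
# leave-one-out cross-validation by default.
ridge = RidgeCV(alphas=np.logspace(-3, 3, 13))
ridge.fit(X, y)
print("chosen alpha:", ridge.alpha_)

# Estimate generalization with 5-fold cross-validated R^2
# instead of a single split.
scores = cross_val_score(ridge, X, y, cv=5)
print("mean CV R^2:", scores.mean())
```

Other directions would be adding more of the 80 available features and engineering new ones, but alpha tuning and proper cross-validation are the cheapest first moves.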
Next time I will try another tutorial.