It has been about half a year since I started studying Python and machine learning in January of this year.
I began by learning the basic syntax and then reproduced other people's data preprocessing and model building, but I wanted to see what I could do on my own, so I tried House Prices, one of Kaggle's introductory competitions.
Also, since this is my first post as a machine learning amateur, I'd appreciate it if you could go easy on the clumsy writing and the gaps in my knowledge, and point out any mistakes gently in the comments.
This post is aimed at readers who are "not deeply versed in engineering or programming" and "beginners within their first year of machine learning".
I don't do any complicated data processing or model building here; I tackle the problem with the basic techniques of a fledgling that has only just left the nest.
I hope it serves as a reference for those who have just grasped the basics of machine learning and want to process data and build models on their own.
One of Kaggle's competitions poses the problem of predicting house prices, a so-called regression problem. The topic is easy to understand, and the amount of data and the number of features are modest, so it is a perfect subject for Kaggle beginners to tackle. https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview
First, let's read the training data and test data.
read_data.ipynb
import pandas as pd
pd.set_option("display.max_columns" , 200)
pd.set_option("display.max_rows" , 100000)
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
With pd.set_option you can raise the maximum number of columns and rows displayed for a DataFrame to any number you like. By default, a DataFrame with many columns or rows is shown with some of them hidden, so setting these options is recommended if you want to see everything without truncation.
View training data.
train.head()
Id | MSSubClass | MSZoning | LotFrontage | ... |
---|---|---|---|---|
1 | 60 | RL | 65.0 | ... |
2 | 20 | RL | 80.0 | ... |
3 | 60 | RL | 68.0 | ... |
Only a part of it is shown here, but my rough impressions were:
・Numerical values (continuous variables) and strings (categorical variables) are mixed.
・Some variables have missing values.
・Even among the numerical variables, the scales differ.
That was about it.
I recommend getting a rough grasp of what each variable means at this stage. The definitions are all listed in data_description.txt, so please read through it, unless reading English brings you out in hives.
After this, I checked a list of which variables had missing values.
train.isnull().sum()
Id               0
MSSubClass       0
MSZoning         0
LotFrontage    259
LotArea          0
Street           0
Alley         1369
...
More than ten features had missing values. At this point I just made sure I knew exactly which columns contain missing values; the imputation itself comes later.
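For reference, here is a compact way to list only the columns that actually have missing values (just a convenience sketch, not part of the original code):

# Show only the columns that contain at least one missing value
missing = train.isnull().sum()
print(missing[missing > 0])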
Oh, and it's better to split the training data into explanatory variables and the objective variable before processing. The "Id" column in the training data is also unnecessary, so delete it.
train_X = train.iloc[:,:-1]
train_y = train["SalePrice"]
train_X = train_X.drop(["Id"] , axis=1)
Don't forget to specify axis=1 when dropping an unnecessary column with drop.
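As an aside, newer pandas versions (0.21+) also accept a columns= keyword, which saves you from remembering the axis argument; the line below is just an equivalent alternative to the drop above:

# Equivalent to train_X.drop(["Id"], axis=1)
train_X = train.iloc[:, :-1].drop(columns=["Id"])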
Next is the conversion process of categorical variables.
There are two types of variables in the training data, categorical variables and continuous variables, but first we will process the categorical variables.
For example, MSZoning (A, C, FV ..) and LotShape (Reg, IR1 ..) are categorical variables.
To build the model, these values have to be converted into numbers, so we apply a label encoder (e.g. clothing sizes: S, M, L → 1, 2, 3).
Rolling my eyes at the length, I created this horrifying list of variables.
By the way, in the last line each variable is cast to a string with astype(str); if you skip this, the following error is thrown, so be careful.
TypeError: argument must be a string or number
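To see why the cast helps, here is a tiny toy sketch with made-up values (not part of the original code): NaN is a float, so a column that mixes NaN and strings contains two incompatible types, and casting everything to str first avoids the problem.

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

s = pd.Series(["RL", "RM", np.nan, "RL"])   # made-up column mixing strings and NaN (a float)
le = LabelEncoder()
# le.fit_transform(s)                       # raises a TypeError (the exact message depends on the version)
print(le.fit_transform(s.astype(str)))      # works: NaN is encoded as the string "nan"
print(le.classes_)                          # e.g. ['RL' 'RM' 'nan']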
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
columns = ["MSZoning","Street","Alley","LotFrontage","Street","Alley","LotShape","LandContour","Utilities","LotConfig","LandSlope","Neighborhood","Condition1","Condition2","BldgType","HouseStyle","RoofStyle","RoofMatl","Exterior1st","Exterior2nd","MasVnrType","ExterQual","ExterCond","Foundation","BsmtQual","BsmtCond","BsmtExposure","BsmtFinType1","BsmtFinType2","Heating","HeatingQC","CentralAir","Electrical","KitchenQual","Functional","FireplaceQu","GarageType","GarageFinish","GarageQual","GarageCond","PavedDrive","PoolQC","Fence","MiscFeature","SaleType","SaleCondition"]
for col in columns:
train_X[col] = le.fit_transform(train_X[col].astype(str))
What do you think? Applying the label encoder converts the MSZoning value from RL to 3, like this:
MSSubClass | MSZoning | LotFrontage | ... |
---|---|---|---|
60 | 3 | 65.0 | ... |
20 | 3 | 80.0 | ... |
60 | 3 | 68.0 | ... |
Just in case, let's check the unique value of MSZoning.
train_X["MSZoning"].unique()
array([3, 4, 0, 1, 2])
You can see that each value, such as RL and RM, has been properly encoded into a number.
The other categorical variables are converted in the same way; you can check them, for example, with the quick loop below.
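A minimal sketch of such a check, using the columns list defined above:

# Print the first few encoded values of each converted column
for col in columns:
    print(col, train_X[col].unique()[:5])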
Next is the imputation of missing values. Missing values appeared in both categorical and continuous variables, so we deal with each in turn.
Let's check the missing values again below.
train_X.isnull().sum()
The number of missing values has dropped dramatically compared with before. This is because the label encoder replaced the missing values in the categorical columns with a specific number. For a moment I wondered whether that was really okay, but since those missing values are still identifiable as their own category, I decided there was no problem.
Three variables still had missing values: LotFrontage, MasVnrArea, and GarageYrBlt.
In the end, I imputed them as follows:
・LotFrontage with the mean
・MasVnrArea with the median
・GarageYrBlt with 0
For the first two, I felt it was better to impute with the mean or median rather than drop the rows with missing values, so I first checked how the data is distributed.
import matplotlib.pyplot as plt

# Size of the figure that holds the graphs
plt.figure(figsize=(10, 4))
# Placed on the left side of a 1-row, 2-column grid
plt.subplot(1, 2, 1)
plt.hist(train["MasVnrArea"].dropna(), bins=30, label="MasVnrArea")  # dropna so the histogram ignores missing values
plt.legend(loc="best")
print("[MasVnrArea] mean: {:.2f}  median: {}".format(train["MasVnrArea"].mean(), train["MasVnrArea"].median()))
# Placed on the right side of the 1-row, 2-column grid
plt.subplot(1, 2, 2)
plt.hist(train["LotFrontage"].dropna(), bins=30, label="LotFrontage")  # dropna so the histogram ignores missing values
print("[LotFrontage] mean: {:.2f}  median: {}".format(train["LotFrontage"].mean(), train["LotFrontage"].median()))
plt.legend(loc="best")
Looking at the graphs together with the means and medians, for MasVnrArea the two values are quite different.
Since most of its values are 0, imputing with the mean would be pulled toward the extreme values, so for MasVnrArea I decided to use the median.
For LotFrontage the two values are about the same, so for now I went with the mean.
# Impute with the median
train_X["MasVnrArea"] = train_X["MasVnrArea"].fillna(train_X["MasVnrArea"].median())
# Impute with the mean
train_X["LotFrontage"] = train_X["LotFrontage"].fillna(train_X["LotFrontage"].mean())
Finally, GarageYrBlt, the year the garage was built. The records where this is missing presumably have no garage at all, which would mean the other garage-related variables originally had NaN there and have since been replaced with numbers by the label encoder. I'll only sketch the check below, but it does indeed look that way.
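As a quick sanity check (just a sketch, using the raw data before label encoding): for the rows where GarageYrBlt is missing, the categorical garage columns should be NaN and the numeric ones 0 if the "no garage" reading is right.

# Other garage-related columns for the rows where GarageYrBlt is missing
garage_cols = ["GarageType", "GarageFinish", "GarageQual", "GarageCond", "GarageCars", "GarageArea"]
print(train.loc[train["GarageYrBlt"].isnull(), garage_cols].head())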
So even if we fill the missing GarageYrBlt values with 0, the model should still be able to recognize, from the combination with the other garage-related variables, that "ah, this house has no garage".
Working under that hypothesis, replace the missing values with 0.
train_X["GarageYrBlt"] = train_X["GarageYrBlt"].fillna(0)
Now none of the variables have missing values and everything has been converted to numbers. I could feed this into a machine learning model as is, but then a thought struck me:
"Aren't there too many variables? Shouldn't I narrow them down?"
So let's try narrowing down the variables.
This time, a variable is kept if the correlation coefficient between it and the objective variable SalePrice is at or above a certain threshold.
First, I wanted a quick visual overview of the correlations, so let's use a seaborn heatmap.
import seaborn as sns

plt.figure(figsize=(20, 15))
sns.heatmap(train.corr(), cmap="Blues", annot=True)  # on newer pandas, use train.corr(numeric_only=True)
The size of the heatmap is set with figsize. As for the arguments, corr() is the method that computes the correlation coefficients, cmap sets the color scheme of the heatmap, and annot controls whether the correlation coefficient is written in each cell. Looking at the rightmost column shows the correlation between SalePrice and each variable, and at a glance the variables with a coefficient of 0.3 or more look important. So let's output the correlation coefficients in descending order and pick up only the variables at 0.3 or above.
train.corr()["SalePrice"].sort_values(ascending=False)
SalePrice      1.000000
OverallQual    0.790982
GrLivArea      0.708624
GarageCars     0.640409
GarageArea     0.623431
...
Outputting them in descending order gives this format, but to extract just the variable names it seems you need to turn it into a DataFrame.
import numpy as np

# Convert to a DataFrame and pick out the column names
df = pd.DataFrame(train.corr()["SalePrice"].sort_values(ascending=False))
df = df.query("0.3 <= SalePrice < 1.0")
columns_needed = np.array(df.index)
train_X = train_X[columns_needed]
The second line uses df.query to extract the records (variable + value) whose correlation coefficient falls in the specified range. The condition also requires the value to be below 1.0 in order to exclude SalePrice itself. The third line pulls out the variable names, which form the index, via df.index and turns them into an array, giving an array of only the variables whose correlation coefficient is 0.3 or more. The fourth line reduces the training data to just those variables.
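As an aside, the same selection can be written without converting to a DataFrame, by boolean-indexing the correlation Series directly; this is just an equivalent sketch:

# Keep the names of variables whose correlation with SalePrice is in [0.3, 1.0)
corr_s = train.corr()["SalePrice"]          # on newer pandas: train.corr(numeric_only=True)["SalePrice"]
columns_needed = corr_s[(corr_s >= 0.3) & (corr_s < 1.0)].sort_values(ascending=False).index
train_X = train_X[columns_needed]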
The processing of the training data is finally done. But we must not forget to process the test data as well.
Actually, I first forgot to process the test data, built the model as it was, got an error, and only then noticed the test data was untouched. That was my first moment of despair.
Fine, I thought, I'll just process it the same way as the training data, but when I did so almost verbatim it errored out again. That was my second moment of despair.
Why?!?!
Reading the error, it seemed to be caused by missing values, so I calmly checked the missing values in the test data.
Id               0
MSSubClass       0
MSZoning         4
LotFrontage    227
LotArea          0
Street           0
Alley         1352
...
Apparently the test data differs from the training data both in how many values are missing and in which variables are missing.
Hmm... this is tough for a beginner, but let's deal with it one step at a time. (A quick way to spot the columns that are missing only in the test set is sketched below.)
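A small sketch for spotting columns that have missing values only in the test set:

# Columns with missing values in the test set but not in the training set
missing_train = set(train.columns[train.isnull().any()])
missing_test = set(test.columns[test.isnull().any()])
print(sorted(missing_test - missing_train))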
The beginning is the same process as the training data.
#Remove ID from test data
test_X = test.drop(["Id"] , axis=1)
# Apply the label encoder to the categorical variables
# (Note: strictly speaking, the encoders fit on the training data should be reused
#  here with transform(); refitting on the test set can map the same category to a
#  different number.)
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
columns = ["MSZoning","Street","Alley","LotShape","LandContour","Utilities","LotConfig","LandSlope","Neighborhood","Condition1","Condition2","BldgType","HouseStyle","RoofStyle","RoofMatl","Exterior1st","Exterior2nd","MasVnrType","ExterQual","ExterCond","Foundation","BsmtQual","BsmtCond","BsmtExposure","BsmtFinType1","BsmtFinType2","Heating","HeatingQC","CentralAir","Electrical","KitchenQual","Functional","FireplaceQu","GarageType","GarageFinish","GarageQual","GarageCond","PavedDrive","PoolQC","Fence","MiscFeature","SaleType","SaleCondition"]
for col in columns:
test_X[col] = le.fit_transform(test_X[col].astype(str))
Next is the processing of missing values.
Even after applying the label encoder, more continuous variables have missing values here than in the training data.
Let's fill in the missing values in the same way as before.
# Impute with the mean
test_X["LotFrontage"] = test_X["LotFrontage"].fillna(test_X["LotFrontage"].mean())
test_X["BsmtUnfSF"] = test_X["BsmtUnfSF"].fillna(test_X["BsmtUnfSF"].mean())
test_X["BsmtFullBath"] = test_X["BsmtFullBath"].fillna(test_X["BsmtFullBath"].mean())
test_X["GarageArea"] = test_X["GarageArea"].fillna(test_X["GarageArea"].mean())
test_X["TotalBsmtSF"] = test_X["TotalBsmtSF"].fillna(test_X["TotalBsmtSF"].mean())
test_X["MasVnrArea"] = test_X["MasVnrArea"].fillna(test_X["MasVnrArea"].mean())
test_X["BsmtFinType2"] = test_X["BsmtFinType2"].fillna(test_X["BsmtFinType2"].mean())
test_X["BsmtFinSF1"] = test_X["BsmtFinSF1"].fillna(test_X["BsmtFinSF1"].mean())
test_X["BsmtFinSF2"] = test_X["BsmtFinSF2"].fillna(test_X["BsmtFinSF2"].mean())
test_X["BsmtHalfBath"] = test_X["BsmtHalfBath"].fillna(test_X["BsmtHalfBath"].mean())
test_X["GarageCars"] = test_X["GarageCars"].fillna(test_X["GarageCars"].mean())
# Impute with 0
test_X["GarageYrBlt"] = test_X["GarageYrBlt"].fillna(0)
Finally, as with the training data, narrow it down to the variables whose correlation coefficient is 0.3 or higher. columns_needed is the array of such variables created while processing the training data.
test_X = test_X[columns_needed]
The test data is also complete. It's finally time to build the model.
This time I built the model with a random forest. I won't go into what a random forest is in this article, but I chose it because it is said to be more accurate than linear regression or an SVM, and because it was something I felt I could build myself.
I only vaguely understand the theory behind the model, but... well, that's how it is for beginners.
Let's make it.
After importing the random forest, create a list of parameters for grid search.
#Model building in random forest
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor()
#Grid search for parameters
from sklearn.model_selection import GridSearchCV
parameters = {"n_estimators":[10,30,50,70,100,130],
"criterion":["mae","mse"],
"max_depth":[3,5,7,10,15],
"max_features":["auto"],
"random_state":[0],
"n_jobs":[-1]}
For the details of grid search, please refer to other articles; roughly speaking, it tries every combination of the given parameters and selects the one that produces the best model.
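For reference, the grid above has 6 × 2 × 5 × 1 × 1 × 1 = 60 parameter combinations, and with the 5-fold cross-validation specified below, GridSearchCV ends up fitting 60 × 5 = 300 models, which is why the search takes a while.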
I picked the parameters that seemed worth optimizing while consulting the scikit-learn documentation. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
Create a model by grid search using the above parameter list.
#Model creation with grid search
clf = GridSearchCV(rf , parameters , scoring="neg_root_mean_squared_error" , cv=5)
clf.fit(train_X , train_y)
#Get the best parameters
print(clf.best_estimator_)
As for the arguments of GridSearchCV, scoring is set to RMSE, the competition's evaluation metric on Kaggle, and cv, the number of cross-validation folds, is set to 5. I chose both while referring to the pages below, but keep in mind that these are not necessarily the best possible settings. https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mae', max_depth=15, max_features='auto', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=130, n_jobs=-1, oob_score=False, random_state=0, verbose=0, warm_start=False)
The search takes a while to run, but I got the best parameters without any trouble.
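If you're curious, GridSearchCV also exposes the best cross-validation score and the full table of results; a quick sketch:

print(clf.best_score_)                    # best mean CV score (negative RMSE, so closer to 0 is better)
results = pd.DataFrame(clf.cv_results_)   # one row per parameter combination
print(results[["params", "mean_test_score"]].head())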
We're finally here. At last, let's predict on the test data.
pred_y = clf.predict(test_X)
pred_y
array([122153.44615385, 149360.57692308, 174654.18846154, ..., 160791.61538462, 106482.30769231, 236272.41923077])
The predictions came out without a hitch. Next, let's turn them into a submission file, hoping they earn a decent score.
test["SalePrice"] = pred_y
test[["Id","SalePrice"]].head()
test[["Id","SalePrice"]].to_csv("submission.csv", index=False)
You should now have a file called submission.csv in your local working directory.
By the way, note that if you leave index=True (the default), an extra index column is written into the file and it will no longer match the format required for submission.
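As an optional sanity check (just a sketch), you can read the file back and confirm it has exactly the two required columns:

check = pd.read_csv("submission.csv")
print(check.columns.tolist())   # should be ['Id', 'SalePrice']
print(len(check))               # should equal the number of rows in the test data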
Here are the results: a score of 0.15453 and a rank of 3,440th (around the 67% mark overall).
Yes, it's low... I do feel a real sense of accomplishment at having handled the data processing and model building on my own, but seeing the result is still disappointing.
However...
After a lot of trial and error, I eventually climbed to 2,074th place with a score of 0.13576 (around the 40% mark overall).
I'd like to write up what I did to get there in part two.