This is a record of my participation in the [5th _Beginner Limited Competition] House Price Forecast for Ames, an American City, a beginner-only competition held at SIGNATE. Continued from the previous post.
train.csv: 3000 rows, 47 columns. Roughly half of the features listed in data_description.txt; columns that behave oddly appear to have been removed.
test.csv: 2000 rows.
LightGBM is said to handle missing values well (or does it just error out?), so I checked just in case, but the number of missing values is 0 in every column. No special handling was needed.
print(train_data.isnull().sum())
print(test_data.isnull().sum())
'SalePrice' is the target: the selling price in dollars. Maximum: 418000, minimum: 80000. That is about all there is to say about it.
sns.displot(train_data['SalePrice'], height=5, aspect=1)
Since the distribution looks roughly normal, it seems it could even be fed to a linear regression model as-is. The data above $250,000 looks like outliers, but I am using a tree-based model and the number of such rows is reasonable, so I left it as-is. (Tree-based models split on threshold values, so normality probably has nothing to do with it anyway; the concern just happens to disappear.)
I looked at some of the important features directly. I skipped 'Garage Cars' and 'Garage Area' because seaborn froze and could not render their graphs.
fig, ax = plt.subplots(2, 2, figsize=(10, 10))
sns.histplot(x=train_data['BsmtFin SF 1'], ax=ax[0, 0])
sns.histplot(x=train_data['Bsmt Unf SF'], ax=ax[0, 1])
sns.histplot(x=train_data['Bsmt Full Bath'], ax=ax[1, 1])
sns.histplot(x=train_data['Total Bsmt SF'], ax=ax[1, 0])
Since the test data had the same tendency, there seemed to be no major problems with the features.
Below are the features that caught my attention.
There are no sale prices below $100,000 above 1500 square feet (x-axis), and conversely there is a thick band around $250,000 (y-axis), so there does appear to be a correlation.
I thought it would be fine to drop the single point around $400,000 near 500 square feet as an outlier, but decision trees are robust to outliers, so I did not deal with it.
Strictly speaking, you should keep that one data point if you understand why it was priced so high, and delete it if you do not.
(Intuitively, BsmtFin SF 1 is the most important feature, so perhaps the harm the outlier does to BsmtFin SF 1 outweighs the benefit of keeping it for the other features, and it should simply be removed?)
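The plot itself is not reproduced here, but this is a minimal sketch of how I assume it was drawn, using the same train_data DataFrame as the snippets above; the filter thresholds used to inspect the candidate outlier are hypothetical.
import seaborn as sns
import matplotlib.pyplot as plt
# SalePrice vs. BsmtFin SF 1 scatter plot
sns.scatterplot(x=train_data['BsmtFin SF 1'], y=train_data['SalePrice'])
plt.show()
# Inspect the lone ~$400,000 point near 500 sq ft (hypothetical thresholds)
print(train_data[(train_data['BsmtFin SF 1'].between(300, 700)) &
                 (train_data['SalePrice'] > 390000)])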
Total Bsmt SF: This appears to be basement area, so negative values look odd. I initially left them alone because the test data also contains negative values, but it still felt wrong, so I forced negative values to 0.
'Garage Cars' and 'Garage Area' were in the same state judging from their minimum values, so I applied the same fix.
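A minimal sketch of that "clamp negatives to 0" step, assuming the same train_data / test_data DataFrames as in the snippets above:
# Force negative values to 0 in the three affected columns.
for col in ['Total Bsmt SF', 'Garage Cars', 'Garage Area']:
    train_data[col] = train_data[col].clip(lower=0)
    test_data[col] = test_data[col].clip(lower=0)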
Garage Cars: I assume this is the number of cars the garage can hold, but half-baked values such as 1.998002859 are suspicious.
In step 1-2, the object-type columns were converted to categorical type all at once so they can be passed to train(). Since these features are strings representing some kind of category, I think that is appropriate.
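A minimal sketch of that conversion, assuming the object columns are turned into pandas' category dtype (which the LightGBM Dataset can treat as categorical):
# Convert every object-dtype (string) column to category so train() accepts it.
for col in train_data.select_dtypes(include='object').columns:
    train_data[col] = train_data[col].astype('category')
    test_data[col] = test_data[col].astype('category')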
For the remaining features, I checked whether they can be used with the current model as-is. In conclusion, the following columns were converted to categorical data.
Column name | Overview | Reason |
---|---|---|
MS SubClass | Identifies the type of dwelling involved in the sale. The type of dwelling being sold? | Although each value is numeric, it actually encodes a type, such as 190 = 2 FAMILY CONVERSION - ALL STYLES AND AGES. |
Overall Qual | Rates the overall material and finish of the house, i.e. an evaluation of the property. | Although each value is numeric, it is a 10-grade rating, such as 10 = Very Excellent. |
Overall Cond | Rates the overall condition of the house, i.e. an evaluation of the property. | Although each value is numeric, it is a 10-grade rating, such as 10 = Very Excellent. |
Year Built | Original construction date. Year of construction. | Since it is a year, it was converted to categorical data. |
Year Remod/Add | Remodel date (same as construction date if no remodeling or additions). Year of renovation? However, it is a mystery that there is a lot of data with Year Built > Year Remod/Add. | Since it is a year, it was converted to categorical data. |
Mo Sold | Month Sold (MM). Month of sale. | Since it is a month, it was converted to categorical data. |
Yr Sold | Year Sold (YYYY). Year of sale. | Since it is a year, it was converted to categorical data. |
Strictly, I should also check whether each categorical column contains values other than the types described in data_description.txt, but I skipped that because the data so far seemed relatively clean.
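A minimal sketch of converting the columns in the table above, under the same train_data / test_data assumption as before (doing this after any year arithmetic avoids type headaches):
# Numeric columns that are really codes / years / months -> category dtype.
to_categorical = ['MS SubClass', 'Overall Qual', 'Overall Cond',
                  'Year Built', 'Year Remod/Add', 'Mo Sold', 'Yr Sold']
for col in to_categorical:
    train_data[col] = train_data[col].astype('category')
    test_data[col] = test_data[col].astype('category')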
The additional features I came up with, and whether each was adopted, are as follows.
Feature | Assumption | Adopted / rejected |
---|---|---|
Year of sale (Yr Sold) - year of construction (Year Built) | Roughly the age of the house. | Adopted as the age column (see the sketch after this table). |
Year of renovation (Year Remod/Add) - year of construction (Year Built) | If there has been no renovation, 'Year Remod/Add' = 'Year Built', so I wanted a feature expressing that. A boolean would be enough, but I figured the size of the gap might matter anyway, so I computed the difference. | Adopted as the sub_year column (see the sketch after this table). |
Garage size (Garage Area) / number of cars (Garage Cars) | The size per car can be computed; I thought a roomier garage might signal a sense of luxury. | Not adopted. I was unsure what the negative Garage Cars data meant, and since land is plentiful in the United States, such a small difference in area probably doesn't matter. |
Inflation rate or price index | 'Yr Sold' ranges over 2006-2010. Assuming a 3% annual inflation rate, a $100,000 property in 2006 would be about $112,000 in 2010 (is the calculation right?). I thought it would be nice to have a value such as how much this year's dollar is worth relative to 100 dollars in 2000. | Not adopted. I stopped, telling myself it was a lot of work to look up and that it is already baked into the yearly Yr Sold data. ~~I'm rather curious, so someone please give it a try.~~ Tried later in "What I did 4". Somewhat effective. |
Lehman shock | 'Yr Sold' ranges over 2006-2010. Since the Lehman shock was in 2008, I thought housing-related prices were very likely affected before and after it. (The Lehman shock was tied to the bursting of the housing bubble, etc.) | Not adopted. I stopped, telling myself I could not come up with a concrete indicator and that it is baked into the yearly Yr Sold data. Looking at news around the subprime mortgage crisis, there is probably a good indicator somewhere. |
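A minimal sketch of the two adopted features; the new column names 'age' and 'sub_year' follow the table above, and the cast back to int covers the case where the year columns have already been made categorical.
for df in (train_data, test_data):
    yr_sold = df['Yr Sold'].astype(int)
    year_built = df['Year Built'].astype(int)
    remod = df['Year Remod/Add'].astype(int)
    df['age'] = yr_sold - year_built        # years between construction and sale
    df['sub_year'] = remod - year_built     # 0 means never remodelled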
RMSE: 26523.4069341 Ranking: 85/552, so top 16%
RMSE improved by about 30 and the ranking was almost unchanged, so the effect was small. Judging by feature importance, the added features landed around the top 40%, so it may not have been entirely in vain.
Up to this point, LightGBM had been run with almost default parameters, so I tuned them in the direction of avoiding overfitting. Reference: https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html (various other googled pages are probably mixed in as well...).
# Before the change
lgbm_params = {
'task': 'train',
'boosting_type': 'gbdt',
'objective': 'regression',
'metric': 'rmse',
'max_depth' : 10
}
model = lgb.cv(lgbm_params, lgb_train,
num_boost_round=1000,
early_stopping_rounds=50,
verbose_eval=50,
nfold=7,
shuffle=True,
stratified=False,
seed=42,
return_cvbooster=True,
)
# After the change (parameters adjusted to avoid overfitting)
lgbm_params = {
'task': 'train',
'boosting_type': 'gbdt',
'objective': 'regression',
'metric': 'rmse',
'learning_rate': 0.01,
'max_depth' : 7,
'num_leaves': 80,
'bagging_fraction': 0.8,
'bagging_freq': 1,
'feature_fraction': 0.9,
}
model = lgb.cv(lgbm_params, lgb_train,
num_boost_round=10000,
early_stopping_rounds=100,
verbose_eval=50,
nfold=10,
shuffle=True,
stratified=False,
seed=42,
return_cvbooster=True,
)
Parameter | Before -> after | Intent of the change |
---|---|---|
learning_rate | 0.1 -> 0.01 | It seemed to be overfitting, so I cut it to 1/10 to learn in finer steps. In exchange, num_boost_round is multiplied by 10. |
max_depth | 10 -> 7 | I remember that 5-10 is typical and that larger values overfit more easily, so I lowered it. |
num_leaves | ? -> 80 | Per the official documentation, a value slightly below the maximum of 2^(max_depth) is good. |
bagging_fraction, bagging_freq | 1 -> 0.8 / 0 -> 1 | The bagging ratio. It is said to help avoid overfitting, so I lowered it a little. |
feature_fraction | 1 -> 0.9 | The fraction of features subsampled. It is said to help avoid overfitting, so I lowered it a little. I don't fully understand the difference from bagging_fraction, but the explanation of bagging_fraction says "like feature_fraction, but this will randomly select part of data without resampling". https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters.rst |
Below are the learning curves obtained by rerunning the train() code from "What I did 1" with only the parameters changed; the left side is before the change, the right side after.
Before the change, the eval score peaked at the 27th iteration, whereas after the change the 404th iteration is the most accurate. The score is also higher than before the change, and the decline toward the right is smaller, which suggests overfitting has been suppressed.
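As an aside, here is a minimal sketch of how such a learning curve can be drawn. The original train() code is not shown in this post, so the validation Dataset name (lgb_eval) and the callback-style API are my assumptions; depending on the LightGBM version, early stopping may be passed as early_stopping_rounds= instead of a callback.
import lightgbm as lgb
import matplotlib.pyplot as plt
# Record the per-iteration RMSE on train/eval during training, then plot it.
evals_result = {}
booster = lgb.train(
    lgbm_params,
    lgb_train,
    num_boost_round=10000,
    valid_sets=[lgb_train, lgb_eval],          # lgb_eval: held-out Dataset (assumed name)
    valid_names=['train', 'eval'],
    callbacks=[
        lgb.record_evaluation(evals_result),   # stores the metric history
        lgb.early_stopping(100),
    ],
)
lgb.plot_metric(evals_result, metric='rmse')   # train vs. eval RMSE per iteration
plt.show()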
RMSE: 26196.7174197 Ranking: 30/553, so the top 6%
It was effective overall. RMSE improved by about 330 and the ranking rose by about 55 places.
Since there seemed to be room for further parameter improvement, I ran a parameter search. It seems you can do this in one shot with a library called optuna; supposedly you just replace LightGBM's cv() function with LightGBMTunerCV(), but...
# Before: plain lgb.cv() with the hand-set parameters
lgbm_params = {
'task': 'train',
'boosting_type': 'gbdt',
'objective': 'regression',
'metric': 'rmse',
'learning_rate': 0.01,
'max_depth' : 7,
'num_leaves': 80,
'bagging_fraction': 0.8,
'bagging_freq': 1,
'feature_fraction': 0.9,
}
model = lgb.cv(lgbm_params, lgb_train,
num_boost_round=10000,
early_stopping_rounds=100,
verbose_eval=50,
nfold=10,
shuffle=True,
stratified=False,
seed=42,
return_cvbooster=True,
)
# After: let optuna's LightGBMTunerCV search the parameters
# (LightGBMTunerCV comes from optuna's LightGBM integration module,
#  e.g. import optuna.integration.lightgbm as lgb)
lgbm_params = {
'task': 'train',
'boosting_type': 'gbdt',
'objective': 'regression',
'metric': 'rmse',
# the parameters to be searched are removed from here
}
tuner_cv = lgb.LightGBMTunerCV(
lgbm_params, lgb_train,
num_boost_round=10000,
early_stopping_rounds=100,
verbose_eval=20,
nfold=10,
shuffle=True,
stratified=False,
seed=42,
return_cvbooster=True,
)
tuner_cv.run()
It really did work. LightGBMTunerCV takes about 7 minutes, while cv() takes about 20 seconds. This may be the first time this computer has made full use of its CPU.
print(tuner_cv.best_params)
{'task': 'train', 'boosting_type': 'gbdt', 'objective': 'regression', 'metric': 'rmse', 'feature_pre_filter': False, 'lambda_l1': 0.12711821269550255, 'lambda_l2': 6.733049435313309e-05, 'num_leaves': 62, 'feature_fraction': 0.4, 'bagging_fraction': 1.0, 'bagging_freq': 0, 'min_child_samples': 20}
Note: sample source code that uses 'import optuna' (rather than the LightGBM integration module) seems to be out of date.
RMSE: 26718.3022355 Ranking: 119/557, so top 21%
Overall, the score dropped significantly. Would it be better to do a grid search myself? It also bothered me that best_params did not contain max_depth; it may simply be unnecessary because it can be derived from the other parameters. For now, I learned that I have not yet mastered optuna.
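For reference, a minimal sketch of feeding the tuned parameters back into a plain cv() run, mirroring the earlier call; whether this is worthwhile given the score drop is another matter, and re-adding learning_rate is my own assumption (LightGBMTunerCV does not tune it, so best_params above does not contain it).
# Reuse the parameters optuna found, with the low learning_rate added back by hand.
tuned_params = dict(tuner_cv.best_params)
tuned_params['learning_rate'] = 0.01
cv_result = lgb.cv(tuned_params, lgb_train,
                   num_boost_round=10000,
                   early_stopping_rounds=100,
                   verbose_eval=50,
                   nfold=10,
                   shuffle=True,
                   stratified=False,
                   seed=42,
                   return_cvbooster=True,
                   )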
After data conversion and parameter setting, I was able to enter the top 6%.
Since there still seems to be room for improvement through parameter optimization, I will search the parameters a little more if I have time.
I read in an article somewhere that CatBoost's default parameters are excellent. Would it be quicker to just switch to that algorithm?
In my spare time, I experimented with the following data changes.
Data | Overview | Effect |
---|---|---|
Categorize the Order column | Observation number / order number. I had overlooked it, but since there were also rows with matching values (purchased at the same time?), I made it categorical. | No effect. It may be because I did it at the same time as other changes, but RMSE and importance are almost unchanged. |
Delete the "set negatives to 0" step for 'Total Bsmt SF', 'Garage Cars', 'Garage Area' | In [What I did 3] I overwrote negative values with 0 because they looked strange, but it felt heavy-handed, so I removed that processing. | No effect. It may be because I did it at the same time as other changes, but RMSE and importance are almost unchanged. |
Rounding of 'Garage Cars' | Looking at the data, the values are roughly an integer minus about 0.00199, such as -0.00199... and 1.998..., as if some float rounding error had crept in. So I uniformly added 0.002, cast to int, and dropped the decimals (see the sketch after this table). | No effect. It may be because I did it at the same time as other changes, but RMSE is almost unchanged. Importance dropped. |
Rounding of 'Bsmt Full Bath' | As with Garage Cars, fractional values appeared, so I cast it to int and dropped the decimals. | No effect. It may be because I did it at the same time as other changes, but RMSE and importance are almost unchanged. |
Addition of a consumer price index | I wrote above that I skipped this, but I was curious, so I did it. https://ecodb.net/country/US/imf_cpi.html https://jp.investing.com/economic-calendar/cpi-733 I roughly worked out monthly values from the above sites and added the index of the month before the sale to a new column. For example, if the sale date is January 2010, the December 2009 index is added. *1 | Effective. RMSE improved by about 30 and the ranking rose by 1 place. Since the index dips only around September 2008, the data may be capturing the Lehman shock. |
Correction of 'Year Remod/Add' | I wrote above that it is a mystery that there is a lot of data with 'Year Built' > 'Year Remod/Add'; looking a little more closely, the amount of 1950 data is abnormally large. Assuming that 'Year Remod/Add' may be forced to 1950 when nothing is entered in the input form, I overwrote 'Year Remod/Add' with the value of 'Year Built' wherever 'Year Built' > 'Year Remod/Add' (see the sketch after this table). A small amount of 'Year Built' > 'Year Remod/Add' data that was not 1950 remained, though. | Effective. RMSE improved by about 60 and the ranking rose by 3 places. This data seems to have been a sticking point. |
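A minimal sketch of the 'Year Remod/Add' correction and the rounding tweaks from the table above, assuming the same train_data / test_data DataFrames and that these columns are still numeric at this point.
for df in (train_data, test_data):
    # 'Year Remod/Add' correction: treat Year Built > Year Remod/Add as a bogus
    # (presumably defaulted-to-1950) value and overwrite it with Year Built.
    mask = df['Year Built'] > df['Year Remod/Add']
    df.loc[mask, 'Year Remod/Add'] = df.loc[mask, 'Year Built']
    # Rounding: values like 1.998... look like float error, so nudge up by 0.002
    # and truncate to an integer.
    for col in ['Garage Cars', 'Bsmt Full Bath']:
        df[col] = (df[col] + 0.002).astype(int)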
At this point in the ranking, it seems that letting your RMSE rise by just 10 is enough to be overtaken by one or two people.
RMSE: 26106.3566493 Rank: 27/582, so it was the top 4.6%.
The 1st place RMSE is 25825.5265928, so there is still quite a gap.