Using the House Sales in King County, USA dataset from Kaggle, I trained a model with XGBoost and turned it into an API server with Flask. Building this machine learning API server involves four main steps. First, exploratory data analysis (EDA) is performed to understand the House Sales data. Next, the data is preprocessed so that it can be used for training. Then a model is trained with machine learning; here I use XGBoost. Finally, the API server is implemented with Flask.
Libraries such as Anaconda, XGBoost, joblib, Flask, and flask-cors are installed.
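If you are setting these up yourself, something like the following works (a sketch; exact package names and channels depend on your environment, and joblib ships with scikit-learn in the versions used here):

conda install numpy pandas matplotlib scikit-learn
pip install xgboost flask flask-cors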
- Understanding the House Sales data (EDA)
- Preprocessing the dataset
- Building a model with machine learning
- Creating an API server with Flask
First, load the required libraries and the House Sales dataset downloaded from Kaggle. You can control how many columns are displayed with set_option.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', 4)
df = pd.read_csv('house_sales/kc_house_data.csv')
df.head()
| | id | date | ... | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|
| 0 | 7129300520 | 20141013T000000 | ... | 1340 | 5650 |
| 1 | 6414100192 | 20141209T000000 | ... | 1690 | 7639 |
| 2 | 5631500400 | 20150225T000000 | ... | 2720 | 8062 |
| 3 | 2487200875 | 20141209T000000 | ... | 1360 | 5000 |
| 4 | 1954400510 | 20150218T000000 | ... | 1800 | 7503 |
5 rows × 21 columns
First, let's look at the histogram of sqft_living, one of the features. You can see that a small number of very large values pull the distribution away from a normal shape.
plt.figure(figsize = (12,8))
plt.hist(df["sqft_living"])
plt.savefig('House_Sales_Explanatory Data Analysis_hist01.png', bbox_inches='tight')
Let's also display a histogram of price, the value we will predict. As before, large values and outliers skew the distribution, so most of the mass is concentrated on the left.
plt.figure(figsize = (12,8))
plt.hist(df["price"])
plt.savefig('House_Sales_Explanatory Data Analysis_hist02.png', bbox_inches='tight')
The following function trims outliers using the interquartile range (IQR): rows outside 1.5 × IQR from the quartiles are removed.
def outlier_iqr(df, columns=None):
    if columns is None:
        columns = df.columns
    for col in columns:
        q1 = df[col].describe()['25%']
        q3 = df[col].describe()['75%']
        iqr = q3 - q1
        outlier_min = q1 - iqr * 1.5
        outlier_max = q3 + iqr * 1.5
        df = df[(df[col] >= outlier_min) & (df[col] <= outlier_max)]
    return df
df_1 = outlier_iqr(df, columns = ['price'])
df_1.shape
(20454, 21)
Let's display the histogram of price again after the interquartile-range trimming. With the outliers removed, the histogram now has a shape close to a normal distribution. Comparing shapes, the number of rows is slightly smaller than before, but not dramatically so.
plt.figure(figsize = (12,8))
plt.hist(df_1["price"])
plt.savefig('House_Sales_Explanatory Data Analysis_hist03.png', bbox_inches='tight')
df.shape
(21613, 21)
For confirmation, let's also check the histograms of the other numeric features. Their shapes look generally reasonable.
fig, axes = plt.subplots(2,3, figsize = (18, 12))
axes.ravel()[0].hist(df_1["sqft_living"])
axes.ravel()[1].hist(df_1["sqft_above"])
axes.ravel()[2].hist(df_1["sqft_basement"])
axes.ravel()[3].hist(df_1["lat"])
axes.ravel()[4].hist(df_1["long"])
axes.ravel()[5].hist(df_1["sqft_living15"])
axes.ravel()[0].set_title("sqft_living")
axes.ravel()[1].set_title("sqft_above")
axes.ravel()[2].set_title("sqft_basement")
axes.ravel()[3].set_title("lat")
axes.ravel()[4].set_title("long")
axes.ravel()[5].set_title("sqft_living15")
plt.savefig('House_Sales_Explanatory Data Analysis_hist04.png', bbox_inches='tight')
Here, we drop the features that are not needed for training: id, date, sqft_lot, sqft_lot15, and zipcode.
df_1 = df_1.drop(columns = ['id', 'date', 'sqft_lot','sqft_lot15','zipcode'])
The yr_built feature holds the construction year, which is hard to use directly as training data, so we derive a new feature, age, from it. Similarly, yr_renovated holds the renovation year, so we convert it to the number of years since renovation (renovated_age); rows that were never renovated (yr_renovated == 0) are treated as renovated in 2020.
df_1["age"] = 2020 - df_1["yr_built"]
df_1.loc[(df_1['yr_renovated'] == 0), 'yr_renovated'] = 2020
Numerical data is standardized here using StandardScaler.
from sklearn.preprocessing import StandardScaler
num_feature = ['sqft_living', 'sqft_above', 'sqft_basement', 'lat', 'long', 'sqft_living15']
for col in num_feature:
    scaler = StandardScaler()
    df_1[col] = scaler.fit_transform(np.array(df_1[col].values).reshape(-1, 1))
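As a quick sanity check (not part of the original flow), the standardized columns should now have mean ≈ 0 and standard deviation ≈ 1:

df_1[num_feature].describe().loc[['mean', 'std']]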
Let's check the histograms of the numeric features once more. Since standardization is a linear transformation, the shapes are unchanged from before; only the scale of the axes differs.
fig, axes = plt.subplots(2,3, figsize = (18, 12))
axes.ravel()[0].hist(df_1["sqft_living"])
axes.ravel()[1].hist(df_1["sqft_above"])
axes.ravel()[2].hist(df_1["sqft_basement"])
axes.ravel()[3].hist(df_1["lat"])
axes.ravel()[4].hist(df_1["long"])
axes.ravel()[5].hist(df_1["sqft_living15"])
axes.ravel()[0].set_title("sqft_living")
axes.ravel()[1].set_title("sqft_above")
axes.ravel()[2].set_title("sqft_basement")
axes.ravel()[3].set_title("lat")
axes.ravel()[4].set_title("long")
axes.ravel()[5].set_title("sqft_living15")
plt.savefig('House_Sales_Explanatory Data Analysis_hist05.png', bbox_inches='tight')
Finally, save it in CSV as data for machine learning.
df_price = df_1["price"]
df_price.to_csv('House_Sales_Explanatory_Price.csv')
For the training data, the unneeded columns are dropped, the categorical features are converted to dummy variables with get_dummies, and the result is saved as CSV.
df_1 = df_1.drop(columns = ['price', 'yr_built', 'yr_renovated'])
df_1 = pd.get_dummies(df_1, columns = ['bedrooms', 'bathrooms', 'floors', 'waterfront', 'view', 'condition', 'grade', 'age', 'renovated_age'], drop_first = True)
df_1.to_csv('House_Sales_Explanatory_Preprocessing.csv')
Next, we train a model using the preprocessed data from the previous step. It is cleaner to keep this as a separate project. Import the required libraries again.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
Load the preprocessed CSV file saved earlier with pandas. It contains an unneeded index column called Unnamed: 0, so we drop it. set_option limits the display to four columns.
pd.set_option('display.max_columns', 4)
df = pd.read_csv('House_Sales_Explanatory_Preprocessing.csv')
df = df.drop(columns = ['Unnamed: 0'])
df.head()
| | sqft_living | sqft_above | ... | renovated_age_80 | renovated_age_86 |
|---|---|---|---|---|---|
| 0 | -1.026685 | -0.725963 | ... | 0 | 0 |
| 1 | 0.769106 | 0.635702 | ... | 0 | 0 |
| 2 | -1.556379 | -1.289885 | ... | 0 | 0 |
| 3 | -0.018975 | -0.904768 | ... | 0 | 0 |
| 4 | -0.380717 | -0.038253 | ... | 0 | 0 |
The price column, which is the target, is loaded in the same way.

df_price = pd.read_csv('House_Sales_Explanatory_Price.csv', header=None, names=['price'])
df_price.head()

| | price |
|---|---|
| 0 | 221900.0 |
| 1 | 538000.0 |
| 2 | 180000.0 |
| 3 | 604000.0 |
| 4 | 510000.0 |
For this model, preliminary training runs showed no big difference in accuracy between a model trained on all of the preprocessed features and one trained with the less important features removed, so I decided to use only the features below. All of the numeric features are kept; of the categorical features, only grade is kept and the rest are dropped. The main reason is practical: when eventually building a front end for an application on top of this model, handling many categorical inputs would be a hassle.
df = df[["sqft_living","sqft_above","sqft_basement","lat","long","sqft_living15","grade_3","grade_4","grade_5","grade_6","grade_7"
,"grade_8","grade_9","grade_10","grade_11","grade_12"]]
| | sqft_living | sqft_above | ... | grade_11 | grade_12 |
|---|---|---|---|---|---|
| 0 | -1.026685 | -0.725963 | ... | 0 | 0 |
| 1 | 0.769106 | 0.635702 | ... | 0 | 0 |
| 2 | -1.556379 | -1.289885 | ... | 0 | 0 |
| 3 | -0.018975 | -0.904768 | ... | 0 | 0 |
| 4 | -0.380717 | -0.038253 | ... | 0 | 0 |
5 rows × 16 columns
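As an aside, an equivalent and less typo-prone way to build that column list is to select the grade dummies programmatically (a sketch, assuming the numeric feature names above):

num_cols = ["sqft_living", "sqft_above", "sqft_basement", "lat", "long", "sqft_living15"]
grade_cols = [c for c in df.columns if c.startswith("grade_")]
df = df[num_cols + grade_cols]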
Here, we import XGBoost and train the model. The parameters are left mostly at their defaults.
import xgboost as xgb

X_train, X_test, y_train, y_test = train_test_split(df, df_price, random_state = 0)

params = {
    'silent' : 1,                  # 'silent' is replaced by 'verbosity' in XGBoost >= 1.0
    'max_depth' : 6,
    'min_child_weight' : 1,
    'eta' : 0.1,
    'tree_method' : 'exact',
    'objective' : 'reg:linear',    # renamed to 'reg:squarederror' in newer XGBoost
    'eval_metric' : 'rmse',
    'predictor' : 'cpu_predictor'
}

dtrain = xgb.DMatrix(X_train, label = y_train)
dtest = xgb.DMatrix(X_test, label = y_test)

model = xgb.train(params = params,
                  dtrain = dtrain,
                  num_boost_round = 200,
                  early_stopping_rounds = 10,
                  evals = [(dtest, 'test')])
[0] test-rmse:471544
Will train until test-rmse hasn't improved in 10 rounds.
[1] test-rmse:427350
[2] test-rmse:387757
[3] test-rmse:352314
[4] test-rmse:320602
[5] test-rmse:292132
[6] test-rmse:266667
[7] test-rmse:244148
[8] test-rmse:223983
[9] test-rmse:206046
[10] test-rmse:190112
[11] test-rmse:176111
[12] test-rmse:163754
[13] test-rmse:152820
[14] test-rmse:143269
[15] test-rmse:134879
[16] test-rmse:127772
[17] test-rmse:121362
[18] test-rmse:115939
[19] test-rmse:111405
[20] test-rmse:107280
[21] test-rmse:103750
[22] test-rmse:100928
[23] test-rmse:98446.5
[24] test-rmse:96280.3
[25] test-rmse:94419.2
[26] test-rmse:92933.6
[27] test-rmse:91644.1
[28] test-rmse:90581.3
[29] test-rmse:89422.8
[30] test-rmse:88575.7
[31] test-rmse:88038.8
[32] test-rmse:87254.6
[33] test-rmse:86857.1
[34] test-rmse:86527.8
[35] test-rmse:86238.3
[36] test-rmse:85950
[37] test-rmse:85705
[38] test-rmse:85532.4
[39] test-rmse:85346.7
[40] test-rmse:85204.1
[41] test-rmse:85058.9
[42] test-rmse:84926.7
[43] test-rmse:84845.4
[44] test-rmse:84671.9
[45] test-rmse:84539.6
[46] test-rmse:84380.6
[47] test-rmse:84287.2
[48] test-rmse:84254.7
[49] test-rmse:84168.9
[50] test-rmse:84106.6
[51] test-rmse:83858.5
[52] test-rmse:83829.8
[53] test-rmse:83809.5
[54] test-rmse:83726
[55] test-rmse:83704.2
[56] test-rmse:83650.4
[57] test-rmse:83422.6
[58] test-rmse:83405.8
[59] test-rmse:83281
[60] test-rmse:83293.6
[61] test-rmse:83289.4
[62] test-rmse:83251.9
[63] test-rmse:83237.5
[64] test-rmse:83055.6
[65] test-rmse:83051.9
[66] test-rmse:82938.8
[67] test-rmse:82932.7
[68] test-rmse:82933.2
[69] test-rmse:82859
[70] test-rmse:82829.6
[71] test-rmse:82840.5
[72] test-rmse:82823
[73] test-rmse:82827.4
[74] test-rmse:82834.6
[75] test-rmse:82845.9
[76] test-rmse:82839.4
[77] test-rmse:82828.5
[78] test-rmse:82829.7
[79] test-rmse:82651.8
[80] test-rmse:82660
[81] test-rmse:82637.3
[82] test-rmse:82514.6
[83] test-rmse:82497.6
[84] test-rmse:82484.7
[85] test-rmse:82486.3
[86] test-rmse:82486.8
[87] test-rmse:82496
[88] test-rmse:82491.4
[89] test-rmse:82486.6
[90] test-rmse:82290.3
[91] test-rmse:82265.1
[92] test-rmse:82261.5
[93] test-rmse:82236.5
[94] test-rmse:82236.4
[95] test-rmse:82111.9
[96] test-rmse:82111.1
[97] test-rmse:82111.3
[98] test-rmse:82108
[99] test-rmse:82097.1
[100] test-rmse:82077.4
[101] test-rmse:82041.9
[102] test-rmse:82040
[103] test-rmse:82042.6
[104] test-rmse:82044.2
[105] test-rmse:82033.7
[106] test-rmse:82041.1
[107] test-rmse:82028.4
[108] test-rmse:82030.7
[109] test-rmse:82036.4
[110] test-rmse:82028.6
[111] test-rmse:82020.3
[112] test-rmse:82025.5
[113] test-rmse:82024.9
[114] test-rmse:82034
[115] test-rmse:82025.2
[116] test-rmse:81957.5
[117] test-rmse:81950.9
[118] test-rmse:81959.8
[119] test-rmse:81936.7
[120] test-rmse:81935.9
[121] test-rmse:81937
[122] test-rmse:81945.8
[123] test-rmse:81894.8
[124] test-rmse:81885.2
[125] test-rmse:81899.3
[126] test-rmse:81877
[127] test-rmse:81875.7
[128] test-rmse:81859.6
[129] test-rmse:81849.7
[130] test-rmse:81851.2
[131] test-rmse:81839.4
[132] test-rmse:81850.8
[133] test-rmse:81846
[134] test-rmse:81836.2
[135] test-rmse:81827.2
[136] test-rmse:81832.3
[137] test-rmse:81859.6
[138] test-rmse:81856.6
[139] test-rmse:81850
[140] test-rmse:81847.6
[141] test-rmse:81842.8
[142] test-rmse:81794.5
[143] test-rmse:81803.8
[144] test-rmse:81829.3
[145] test-rmse:81815.9
[146] test-rmse:81813.6
[147] test-rmse:81741
[148] test-rmse:81728.8
[149] test-rmse:81714.4
[150] test-rmse:81708.6
[151] test-rmse:81592.3
[152] test-rmse:81621.7
[153] test-rmse:81624.8
[154] test-rmse:81629.3
[155] test-rmse:81615.7
[156] test-rmse:81617.7
[157] test-rmse:81613.9
[158] test-rmse:81612.9
[159] test-rmse:81594.9
[160] test-rmse:81595.1
[161] test-rmse:81581.7
[162] test-rmse:81595.3
[163] test-rmse:81603.8
[164] test-rmse:81601.2
[165] test-rmse:81600.5
[166] test-rmse:81552.3
[167] test-rmse:81557.6
[168] test-rmse:81565.5
[169] test-rmse:81566.6
[170] test-rmse:81581.9
[171] test-rmse:81570.5
[172] test-rmse:81571.8
[173] test-rmse:81569.4
[174] test-rmse:81494.3
[175] test-rmse:81476.3
[176] test-rmse:81454
[177] test-rmse:81422.6
[178] test-rmse:81426.1
[179] test-rmse:81410.8
[180] test-rmse:81425.1
[181] test-rmse:81418.2
[182] test-rmse:81419.4
[183] test-rmse:81409.6
[184] test-rmse:81392.1
[185] test-rmse:81389.3
[186] test-rmse:81391.1
[187] test-rmse:81414.5
[188] test-rmse:81369.9
[189] test-rmse:81368.3
[190] test-rmse:81358.4
[191] test-rmse:81347.7
[192] test-rmse:81355.4
[193] test-rmse:81349.2
[194] test-rmse:81343
[195] test-rmse:81346.3
[196] test-rmse:81345.5
[197] test-rmse:81374.6
[198] test-rmse:81358.5
[199] test-rmse:81359.4
Let's run a grid search to see whether the model can be made a little more accurate. First, generate the parameter combinations to search over.
gridsearch_params = [
    (max_depth, eta)
    for max_depth in [6, 7, 8]
    for eta in [0.1, 0.05, 0.01]
]
gridsearch_params
[(6, 0.1), (6, 0.05), (6, 0.01), (7, 0.1), (7, 0.05), (7, 0.01), (8, 0.1), (8, 0.05), (8, 0.01)]
Here we compute which combination of parameters is the most accurate under cross-validation. The result is Best params (8, 0.01).
min_rmse = float('Inf')
best_param = []
for max_depth, eta in gridsearch_params:
    print('max_depth = {}, eta = {}'.format(max_depth, eta))
    params['max_depth'] = max_depth
    params['eta'] = eta
    cv_results = xgb.cv(
        params,
        dtrain,
        num_boost_round = 1000,
        seed = 0,
        nfold = 5,
        metrics = {'rmse'},
        early_stopping_rounds = 5
    )
    mean_rmse = cv_results['test-rmse-mean'].min()
    boost_rounds = cv_results['test-rmse-mean'].argmin()
    print('RMSE {} for {} rounds'.format(mean_rmse, boost_rounds))
    if mean_rmse < min_rmse:
        min_rmse = mean_rmse
        best_param = (max_depth, eta)
print('Best params {}, RMSE {}'.format(best_param, min_rmse))
max_depth = 6, eta = 0.1
RMSE 81689.0296874 for 123 rounds
max_depth = 6, eta = 0.05
RMSE 81545.2953126 for 267 rounds
max_depth = 6, eta = 0.01
RMSE 82118.7765624 for 999 rounds
max_depth = 7, eta = 0.1
RMSE 81372.990625 for 161 rounds
max_depth = 7, eta = 0.05
RMSE 81372.7171876 for 202 rounds
max_depth = 7, eta = 0.01
RMSE 81308.89999979999 for 999 rounds
max_depth = 8, eta = 0.1
RMSE 81277.4515624 for 96 rounds
max_depth = 8, eta = 0.05
RMSE 81155.2687498 for 201 rounds
max_depth = 8, eta = 0.01
RMSE 81080.3156252 for 849 rounds
Best params (8, 0.01), RMSE 81080.3156252
Retrain the model using the parameters found by the grid search.
params['max_depth'] = 8
params['eta'] = 0.01
model = xgb.train(params = params,
                  dtrain = dtrain,
                  num_boost_round = 1000,
                  early_stopping_rounds = 5,
                  evals = [(dtest, 'test')])
[0] test-rmse:515961
Will train until test-rmse hasn't improved in 5 rounds.
[1] test-rmse:511040
[2] test-rmse:506173
[3] test-rmse:501356
[4] test-rmse:496588
[5] test-rmse:491874
[6] test-rmse:487210
[7] test-rmse:482584
[8] test-rmse:478012
[9] test-rmse:473483
[10] test-rmse:468999
[11] test-rmse:464566
[12] test-rmse:460175
[13] test-rmse:455822
[14] test-rmse:451526
[15] test-rmse:447271
[16] test-rmse:443066
[17] test-rmse:438893
[18] test-rmse:434766
[19] test-rmse:430688
[20] test-rmse:426657
[21] test-rmse:422664
[22] test-rmse:418718
[23] test-rmse:414801
[24] test-rmse:410941
[25] test-rmse:407098
[26] test-rmse:403307
[27] test-rmse:399547
[28] test-rmse:395829
[29] test-rmse:392152
[30] test-rmse:388512
[31] test-rmse:384923
[32] test-rmse:381353
[33] test-rmse:377832
[34] test-rmse:374343
[35] test-rmse:370902
[36] test-rmse:367481
[37] test-rmse:364100
[38] test-rmse:360764
[39] test-rmse:357461
[40] test-rmse:354187
[41] test-rmse:350960
[42] test-rmse:347750
[43] test-rmse:344572
[44] test-rmse:341426
[45] test-rmse:338314
[46] test-rmse:335245
[47] test-rmse:332213
[48] test-rmse:329190
[49] test-rmse:326207
[50] test-rmse:323269
[51] test-rmse:320358
[52] test-rmse:317476
[53] test-rmse:314619
[54] test-rmse:311801
[55] test-rmse:309001
[56] test-rmse:306233
[57] test-rmse:303498
[58] test-rmse:300790
[59] test-rmse:298113
[60] test-rmse:295470
[61] test-rmse:292855
[62] test-rmse:290269
[63] test-rmse:287704
[64] test-rmse:285165
[65] test-rmse:282666
[66] test-rmse:280184
[67] test-rmse:277741
[68] test-rmse:275313
[69] test-rmse:272907
[70] test-rmse:270532
[71] test-rmse:268190
[72] test-rmse:265866
[73] test-rmse:263562
[74] test-rmse:261295
[75] test-rmse:259046
[76] test-rmse:256816
[77] test-rmse:254625
[78] test-rmse:252448
[79] test-rmse:250296
[80] test-rmse:248164
[81] test-rmse:246063
[82] test-rmse:243969
[83] test-rmse:241918
[84] test-rmse:239880
[85] test-rmse:237853
[86] test-rmse:235862
[87] test-rmse:233881
[88] test-rmse:231939
[89] test-rmse:230004
[90] test-rmse:228095
[91] test-rmse:226206
[92] test-rmse:224346
[93] test-rmse:222499
[94] test-rmse:220670
[95] test-rmse:218861
[96] test-rmse:217075
[97] test-rmse:215311
[98] test-rmse:213567
[99] test-rmse:211827
[100] test-rmse:210120
[101] test-rmse:208421
[102] test-rmse:206737
[103] test-rmse:205088
[104] test-rmse:203457
[105] test-rmse:201827
[106] test-rmse:200212
[107] test-rmse:198636
[108] test-rmse:197085
[109] test-rmse:195530
[110] test-rmse:194010
[111] test-rmse:192494
[112] test-rmse:191011
[113] test-rmse:189524
[114] test-rmse:188077
[115] test-rmse:186631
[116] test-rmse:185212
[117] test-rmse:183809
[118] test-rmse:182411
[119] test-rmse:181043
[120] test-rmse:179675
[121] test-rmse:178325
[122] test-rmse:177006
[123] test-rmse:175698
[124] test-rmse:174401
[125] test-rmse:173124
[126] test-rmse:171857
[127] test-rmse:170612
[128] test-rmse:169374
[129] test-rmse:168161
[130] test-rmse:166952
[131] test-rmse:165766
[132] test-rmse:164596
[133] test-rmse:163434
[134] test-rmse:162284
[135] test-rmse:161156
[136] test-rmse:160034
[137] test-rmse:158914
[138] test-rmse:157821
[139] test-rmse:156745
[140] test-rmse:155676
[141] test-rmse:154628
[142] test-rmse:153593
[143] test-rmse:152568
[144] test-rmse:151559
[145] test-rmse:150558
[146] test-rmse:149572
[147] test-rmse:148603
[148] test-rmse:147644
[149] test-rmse:146701
[150] test-rmse:145766
[151] test-rmse:144831
[152] test-rmse:143911
[153] test-rmse:143000
[154] test-rmse:142102
[155] test-rmse:141215
[156] test-rmse:140345
[157] test-rmse:139482
[158] test-rmse:138627
[159] test-rmse:137799
[160] test-rmse:136970
[161] test-rmse:136155
[162] test-rmse:135347
[163] test-rmse:134549
[164] test-rmse:133771
[165] test-rmse:132997
[166] test-rmse:132246
[167] test-rmse:131489
[168] test-rmse:130746
[169] test-rmse:130024
[170] test-rmse:129296
[171] test-rmse:128587
[172] test-rmse:127886
[173] test-rmse:127192
[174] test-rmse:126505
[175] test-rmse:125824
[176] test-rmse:125160
[177] test-rmse:124501
[178] test-rmse:123857
[179] test-rmse:123216
[180] test-rmse:122583
[181] test-rmse:121954
[182] test-rmse:121339
[183] test-rmse:120737
[184] test-rmse:120148
[185] test-rmse:119561
[186] test-rmse:118982
[187] test-rmse:118408
[188] test-rmse:117840
[189] test-rmse:117286
[190] test-rmse:116739
[191] test-rmse:116198
[192] test-rmse:115670
[193] test-rmse:115143
[194] test-rmse:114633
[195] test-rmse:114128
[196] test-rmse:113628
[197] test-rmse:113133
[198] test-rmse:112648
[199] test-rmse:112167
[200] test-rmse:111694
[201] test-rmse:111232
[202] test-rmse:110769
[203] test-rmse:110309
[204] test-rmse:109870
[205] test-rmse:109429
[206] test-rmse:109001
[207] test-rmse:108584
[208] test-rmse:108159
[209] test-rmse:107745
[210] test-rmse:107338
[211] test-rmse:106934
[212] test-rmse:106543
[213] test-rmse:106161
[214] test-rmse:105774
[215] test-rmse:105404
[216] test-rmse:105032
[217] test-rmse:104666
[218] test-rmse:104306
[219] test-rmse:103951
[220] test-rmse:103605
[221] test-rmse:103256
[222] test-rmse:102918
[223] test-rmse:102581
[224] test-rmse:102258
[225] test-rmse:101929
[226] test-rmse:101614
[227] test-rmse:101305
[228] test-rmse:101001
[229] test-rmse:100687
[230] test-rmse:100393
[231] test-rmse:100106
[232] test-rmse:99803.7
[233] test-rmse:99521.5
[234] test-rmse:99228
[235] test-rmse:98952.9
[236] test-rmse:98687
[237] test-rmse:98407.6
[238] test-rmse:98145.9
[239] test-rmse:97895.6
[240] test-rmse:97630.3
[241] test-rmse:97373.5
[242] test-rmse:97131.5
[243] test-rmse:96879.7
[244] test-rmse:96638.5
[245] test-rmse:96409.2
[246] test-rmse:96174.5
[247] test-rmse:95950.8
[248] test-rmse:95724
[249] test-rmse:95504.3
[250] test-rmse:95286
[251] test-rmse:95063.2
[252] test-rmse:94852.8
[253] test-rmse:94646.3
[254] test-rmse:94438.7
[255] test-rmse:94227.9
[256] test-rmse:94032
[257] test-rmse:93828.1
[258] test-rmse:93637
[259] test-rmse:93447.4
[260] test-rmse:93264
[261] test-rmse:93072.1
[262] test-rmse:92886.1
[263] test-rmse:92699.4
[264] test-rmse:92519.7
[265] test-rmse:92341.1
[266] test-rmse:92158.9
[267] test-rmse:91984.2
[268] test-rmse:91818.9
[269] test-rmse:91667.4
[270] test-rmse:91508.6
[271] test-rmse:91340.9
[272] test-rmse:91179.8
[273] test-rmse:91036.4
[274] test-rmse:90880.2
[275] test-rmse:90730.6
[276] test-rmse:90586.1
[277] test-rmse:90440.3
[278] test-rmse:90301.1
[279] test-rmse:90168.4
[280] test-rmse:90031.9
[281] test-rmse:89908.5
[282] test-rmse:89775.1
[283] test-rmse:89654.3
[284] test-rmse:89526.7
[285] test-rmse:89395.2
[286] test-rmse:89275.8
[287] test-rmse:89160.1
[288] test-rmse:89035.6
[289] test-rmse:88924.4
[290] test-rmse:88812.4
[291] test-rmse:88696.1
[292] test-rmse:88588.3
[293] test-rmse:88483.3
[294] test-rmse:88367.4
[295] test-rmse:88265.7
[296] test-rmse:88159
[297] test-rmse:88060.3
[298] test-rmse:87956.8
[299] test-rmse:87859.5
[300] test-rmse:87763.9
[301] test-rmse:87660.7
[302] test-rmse:87573.6
[303] test-rmse:87475.7
[304] test-rmse:87378.3
[305] test-rmse:87287.8
[306] test-rmse:87194.3
[307] test-rmse:87113.9
[308] test-rmse:87024.7
[309] test-rmse:86936.5
[310] test-rmse:86847.3
[311] test-rmse:86761.9
[312] test-rmse:86679.8
[313] test-rmse:86612.5
[314] test-rmse:86528
[315] test-rmse:86449.6
[316] test-rmse:86374.7
[317] test-rmse:86297.3
[318] test-rmse:86216.9
[319] test-rmse:86147
[320] test-rmse:86085.8
[321] test-rmse:86018.1
[322] test-rmse:85941.5
[323] test-rmse:85878.8
[324] test-rmse:85815.2
[325] test-rmse:85755.3
[326] test-rmse:85691.4
[327] test-rmse:85631.7
[328] test-rmse:85554.2
[329] test-rmse:85478.2
[330] test-rmse:85420.4
[331] test-rmse:85355
[332] test-rmse:85282.9
[333] test-rmse:85212.7
[334] test-rmse:85156.1
[335] test-rmse:85089.2
[336] test-rmse:85042.1
[337] test-rmse:84977.9
[338] test-rmse:84916.3
[339] test-rmse:84865
[340] test-rmse:84819.4
[341] test-rmse:84764.9
[342] test-rmse:84698.7
[343] test-rmse:84655.8
[344] test-rmse:84595.1
[345] test-rmse:84546.1
[346] test-rmse:84496.5
[347] test-rmse:84446.9
[348] test-rmse:84401.1
[349] test-rmse:84349.7
[350] test-rmse:84312.6
[351] test-rmse:84263.9
[352] test-rmse:84217.4
[353] test-rmse:84176.9
[354] test-rmse:84126.8
[355] test-rmse:84081.5
[356] test-rmse:84037.6
[357] test-rmse:84001.1
[358] test-rmse:83961.8
[359] test-rmse:83922.8
[360] test-rmse:83884.8
[361] test-rmse:83842.4
[362] test-rmse:83805.7
[363] test-rmse:83771.6
[364] test-rmse:83738.9
[365] test-rmse:83701.5
[366] test-rmse:83668
[367] test-rmse:83633.7
[368] test-rmse:83591.7
[369] test-rmse:83552.1
[370] test-rmse:83514.7
[371] test-rmse:83479.3
[372] test-rmse:83440.2
[373] test-rmse:83412.3
[374] test-rmse:83380.3
[375] test-rmse:83346.3
[376] test-rmse:83309.6
[377] test-rmse:83272.6
[378] test-rmse:83243.7
[379] test-rmse:83211.3
[380] test-rmse:83184.4
[381] test-rmse:83151.7
[382] test-rmse:83119.6
[383] test-rmse:83089.4
[384] test-rmse:83056.3
[385] test-rmse:83023.5
[386] test-rmse:82994.4
[387] test-rmse:82964.4
[388] test-rmse:82936.3
[389] test-rmse:82907.3
[390] test-rmse:82873.7
[391] test-rmse:82845.4
[392] test-rmse:82816.8
[393] test-rmse:82790.7
[394] test-rmse:82766
[395] test-rmse:82740.9
[396] test-rmse:82719.9
[397] test-rmse:82695.1
[398] test-rmse:82672.3
[399] test-rmse:82647.9
[400] test-rmse:82629.9
[401] test-rmse:82602
[402] test-rmse:82581.2
[403] test-rmse:82562.3
[404] test-rmse:82541
[405] test-rmse:82524.3
[406] test-rmse:82504
[407] test-rmse:82490.7
[408] test-rmse:82472
[409] test-rmse:82448
[410] test-rmse:82424.8
[411] test-rmse:82408.9
[412] test-rmse:82395.4
[413] test-rmse:82373.6
[414] test-rmse:82358.9
[415] test-rmse:82336.1
[416] test-rmse:82322.6
[417] test-rmse:82301.7
[418] test-rmse:82282.6
[419] test-rmse:82268.4
[420] test-rmse:82253.9
[421] test-rmse:82229.1
[422] test-rmse:82207
[423] test-rmse:82188.9
[424] test-rmse:82176.5
[425] test-rmse:82170.7
[426] test-rmse:82157
[427] test-rmse:82151.2
[428] test-rmse:82139.1
[429] test-rmse:82126.4
[430] test-rmse:82108.7
[431] test-rmse:82098.1
[432] test-rmse:82087.3
[433] test-rmse:82075.5
[434] test-rmse:82063.7
[435] test-rmse:82054
[436] test-rmse:82039.1
[437] test-rmse:82027.3
[438] test-rmse:82014.6
[439] test-rmse:82005.3
[440] test-rmse:81993.7
[441] test-rmse:81984.7
[442] test-rmse:81973.3
[443] test-rmse:81955.5
[444] test-rmse:81943.4
[445] test-rmse:81932.8
[446] test-rmse:81918.6
[447] test-rmse:81909.1
[448] test-rmse:81899.2
[449] test-rmse:81886.5
[450] test-rmse:81873.7
[451] test-rmse:81863
[452] test-rmse:81854.2
[453] test-rmse:81842.5
[454] test-rmse:81831.3
[455] test-rmse:81821.2
[456] test-rmse:81811.4
[457] test-rmse:81804.7
[458] test-rmse:81789.8
[459] test-rmse:81784.3
[460] test-rmse:81779.4
[461] test-rmse:81771.3
[462] test-rmse:81756.4
[463] test-rmse:81751.9
[464] test-rmse:81739.6
[465] test-rmse:81730.1
[466] test-rmse:81719.8
[467] test-rmse:81710.2
[468] test-rmse:81701.1
[469] test-rmse:81689.9
[470] test-rmse:81685
[471] test-rmse:81675.6
[472] test-rmse:81670.9
[473] test-rmse:81659.8
[474] test-rmse:81651.6
[475] test-rmse:81641.8
[476] test-rmse:81632.5
[477] test-rmse:81629.2
[478] test-rmse:81619.2
[479] test-rmse:81611
[480] test-rmse:81608
[481] test-rmse:81599.1
[482] test-rmse:81588.6
[483] test-rmse:81578.9
[484] test-rmse:81573.4
[485] test-rmse:81570.2
[486] test-rmse:81558.9
[487] test-rmse:81554.5
[488] test-rmse:81544.7
[489] test-rmse:81533.8
[490] test-rmse:81526.6
[491] test-rmse:81518.9
[492] test-rmse:81512.2
[493] test-rmse:81498.3
[494] test-rmse:81495.6
[495] test-rmse:81488.1
[496] test-rmse:81478.6
[497] test-rmse:81469.1
[498] test-rmse:81463
[499] test-rmse:81462.4
[500] test-rmse:81454.6
[501] test-rmse:81453.7
[502] test-rmse:81450.8
[503] test-rmse:81443
[504] test-rmse:81434
[505] test-rmse:81430.1
[506] test-rmse:81427.5
[507] test-rmse:81421.4
[508] test-rmse:81421.2
[509] test-rmse:81415.5
[510] test-rmse:81413
[511] test-rmse:81409.5
[512] test-rmse:81396.6
[513] test-rmse:81395.1
[514] test-rmse:81395.3
[515] test-rmse:81392.2
[516] test-rmse:81391.4
[517] test-rmse:81388.2
[518] test-rmse:81383.8
[519] test-rmse:81379.3
[520] test-rmse:81379.8
[521] test-rmse:81376.9
[522] test-rmse:81376.9
[523] test-rmse:81375.1
[524] test-rmse:81370.2
[525] test-rmse:81365.3
[526] test-rmse:81364.3
[527] test-rmse:81363.6
[528] test-rmse:81362.5
[529] test-rmse:81358.9
[530] test-rmse:81354.8
[531] test-rmse:81353.9
[532] test-rmse:81355.2
[533] test-rmse:81355.2
[534] test-rmse:81356.4
[535] test-rmse:81356.3
[536] test-rmse:81352.6
[537] test-rmse:81347.5
[538] test-rmse:81347.6
[539] test-rmse:81349.5
[540] test-rmse:81350.3
[541] test-rmse:81351.2
[542] test-rmse:81351.6
Stopping. Best iteration:
[537] test-rmse:81347.5
Let's plot the feature importances of the trained model. Looking at this, we can see that features such as long, lat, and sqft_living are very important.
fig, ax = plt.subplots(figsize = (12,12))
xgb.plot_importance(model, max_num_features = 12, height = 0.8, ax = ax)
plt.savefig('house_sails_feature_importance03.png', bbox_inches='tight')
Let's measure the accuracy of the trained model. The r2_score comes out to about 0.847.
from sklearn.metrics import r2_score
preds = model.predict(dtest)
r2 = r2_score(y_test, preds)
print(r2)
0.8473346069012444
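One caveat worth knowing: since training stopped early, the booster holds more trees than the best iteration, and in the older XGBoost API used here predict() uses all of them unless told otherwise. Evaluating at the best iteration would look like this (a sketch; newer XGBoost versions replace ntree_limit with iteration_range):

preds_best = model.predict(dtest, ntree_limit = model.best_ntree_limit)
print(r2_score(y_test, preds_best))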
Finally, use joblib to save the trained model as a pkl file. Now we have a model trained by machine learning saved to disk.
from sklearn.externals import joblib  # deprecated path; on scikit-learn >= 0.23 use 'import joblib' instead
joblib.dump(model, 'house_sales_model.pkl')
['house_sales_model.pkl']
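Before moving on to the API server, it is worth a quick sanity check that the pickle round-trips (a sketch reusing the variables defined above):

loaded_model = joblib.load('house_sales_model.pkl')
print(loaded_model.predict(xgb.DMatrix(X_test))[:5])  # should match preds[:5]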
Now we serve the trained model as an API. Flask, Python's lightweight web microframework, is used for the API server development. The flow is: build a virtual environment with conda, test a simple API server, and then put the XGBoost model behind it.
The virtual environment uses Anaconda's conda. In the terminal, create a folder for application development (housesails_api in this case) and move into it. Then conda create creates the virtual environment and conda activate activates it.
mkdir housesails_api
cd housesails_api
conda create -n housesailsenv
conda activate housesailsenv
To develop an API server in Flask, let's first create and test a simple API server. Create the following folders and files in the folder you created earlier. If you can write the code below into each file, start the API server, and reach it with curl, the simple API server test is successful.
Create folders and files so that they have the following hierarchy. If you want to create an empty file, it is convenient to use the touch command.
housesails_api
├── api
│ ├── __init__.py
│ └── views
│ └── user.py
├── housesails_app.py
└── house_sales_model.pkl
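One way to create this layout from inside housesails_api (a sketch; house_sales_model.pkl is copied over from the training step):

mkdir -p api/views
touch api/__init__.py api/views/user.py housesails_app.py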
Write the following code into the files you just created. Three files are needed to test the simple API server: api/views/user.py, api/__init__.py, and housesails_app.py. vim is convenient when writing in the terminal, and an editor such as Atom when working in a GUI.
api/views/user.py
from flask import Blueprint, request, make_response, jsonify

# Routing settings
user_router = Blueprint('user_router', __name__)

# Specify path and HTTP method
@user_router.route('/users', methods=['GET'])
def get_user_list():
    return make_response(jsonify({
        'users': [
            {
                'id': 1,
                'name': 'John'
            }
        ]
    }))
api/__init__.py
from flask import Flask, make_response, jsonify
from .views.user import user_router

def create_app():
    app = Flask(__name__)
    app.register_blueprint(user_router, url_prefix='/api')
    return app

app = create_app()
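At this point you can already exercise the simple endpoint. One way (a sketch, assuming Flask's development server on the default port 5000) is to point the flask command at the api package and hit /api/users with curl; it should return the John user as JSON:

FLASK_APP=api flask run
curl http://localhost:5000/api/users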
housesails_app.py
import json
from flask import Flask
from flask import request
from flask import abort
import pandas as pd
from sklearn.externals import joblib  # on scikit-learn >= 0.23, use 'import joblib' instead
import xgboost as xgb

model = joblib.load("house_sales_model.pkl")
app = Flask(__name__)

# Column headers for the payload, in the order the model expects
headers = ['sqft_living','sqft_above','sqft_basement','lat','long','sqft_living15','grade_3','grade_4','grade_5','grade_6','grade_7','grade_8','grade_9','grade_10','grade_11','grade_12']

@app.route('/house_sails', methods=['POST'])
def housesails():
    if not request.json:
        abort(400)
    # Parse the comma-separated feature string into a one-row DataFrame
    payload = request.json['data']
    values = [float(i) for i in payload.split(',')]
    data1 = pd.DataFrame([values], columns=headers, dtype=float)
    predict = model.predict(xgb.DMatrix(data1))
    return json.dumps(str(predict[0]))

if __name__ == "__main__":
    app.run(debug=True, port=5000)
After rewriting the code, start the API server again with python housesails_app.py. Once the server is up, test communication with the curl command below. If a predicted price (a decimal value) comes back for the posted JSON data, it worked: the model trained by machine learning is now running as an API server.
curl http://localhost:5000/house_sails -s -X POST -H "Content-Type: application/json" -d '{"data": "-1.026685, -0.725963, -0.652987, -0.323607, -0.307144, -0.946801, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0"}'
Finally, add authentication to the API server. This time we implement basic authentication, which requires the Flask-HTTPAuth library (see the install command below). You can implement it by adding the sections marked # BasicAuth to the earlier housesails_app.py code.
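Flask-HTTPAuth can be installed with pip inside the housesailsenv environment:

pip install Flask-HTTPAuth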
housesails_app.py
import json
from flask import Flask
from flask import request
from flask import abort
from flask_httpauth import HTTPBasicAuth
import pandas as pd
from sklearn.externals import joblib  # on scikit-learn >= 0.23, use 'import joblib' instead
import xgboost as xgb

model = joblib.load("house_sales_model.pkl")
app = Flask(__name__)

# BasicAuth
auth = HTTPBasicAuth()

users = {
    "user01": "password01",
    "user02": "password02"
}

@auth.get_password
def get_pw(username):
    if username in users:
        return users.get(username)
    return None

# Column headers for the payload, in the order the model expects
headers = ['sqft_living','sqft_above','sqft_basement','lat','long','sqft_living15','grade_3','grade_4','grade_5','grade_6','grade_7','grade_8','grade_9','grade_10','grade_11','grade_12']

@app.route('/house_sails', methods=['POST'])
# BasicAuth
@auth.login_required
def housesails():
    if not request.json:
        abort(400)
    payload = request.json['data']
    values = [float(i) for i in payload.split(',')]
    data1 = pd.DataFrame([values], columns=headers, dtype=float)
    predict = model.predict(xgb.DMatrix(data1))
    return json.dumps(str(predict[0]))

if __name__ == "__main__":
    app.run(debug=True, port=5000)
Start the API server again with python housesails_app.py. Once the server is up, test communication with the curl command below; the credentials are passed with --user user01:password01. If the request goes through and returns a prediction, authentication is working.
curl http://localhost:5000/house_sails --user user01:password01 -s -X POST -H "Content-Type: application/json" -d '{"data": "-1.026685, -0.725963, -0.652987, -0.323607, -0.307144, -0.946801, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0"}'
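For completeness, the same authenticated request from Python with the requests library looks like this (a sketch using the endpoint, credentials, and payload above):

import requests

resp = requests.post(
    'http://localhost:5000/house_sails',
    json={'data': '-1.026685, -0.725963, -0.652987, -0.323607, -0.307144, -0.946801, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0'},
    auth=('user01', 'password01'),
)
print(resp.json())  # the predicted price, returned as a string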