Previous posts in this series:

You will become an engineer in 100 days - Day 76 - Programming - About machine learning
You will become an engineer in 100 days - Day 70 - Programming - About scraping
You will become an engineer in 100 days - Day 66 - Programming - About natural language processing
You will become an engineer in 100 days - Day 63 - Programming - About probability 1
You will become an engineer in 100 days - Day 59 - Programming - Algorithms
You will become an engineer in 100 days - Day 53 - Git - About Git
You will become an engineer in 100 days - Day 42 - Cloud - About cloud services
You will become an engineer in 100 days - Day 36 - Database - About databases
You will become an engineer in 100 days - Day 24 - Python - Basics of the Python language 1
You will become an engineer in 100 days - Day 18 - JavaScript - JavaScript basics 1
You will become an engineer in 100 days - Day 14 - CSS - CSS basics 1
You will become an engineer in 100 days - Day 6 - HTML - HTML basics 1
This time we continue the story about machine learning.

Here I will explain what you can actually do with machine learning. Broadly, there are three things:

・Regression
・Classification
・Clustering

Roughly speaking, all of them amount to prediction; what differs is what is being predicted.

・Regression: predict numerical values
・Classification: predict categories
・Clustering: group similar data together (without predefined categories)
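As a rough sketch of how these map onto code (my addition, not from the original article; the class names below are standard scikit-learn estimators):

```python
# A minimal sketch: one standard scikit-learn estimator per task
from sklearn.linear_model import LinearRegression    # regression: predict numbers
from sklearn.linear_model import LogisticRegression  # classification: predict categories
from sklearn.cluster import KMeans                   # clustering: group similar data

reg = LinearRegression()    # trained with fit(X, y), y numeric
clf = LogisticRegression()  # trained with fit(X, y), y category labels
km = KMeans(n_clusters=3)   # trained with fit(X) alone -- no labels needed
```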
A regression model predicts numerical values.

The data used this time is the Boston house price dataset bundled with scikit-learn.
Column | Description |
---|---|
CRIM | Per-capita crime rate by town |
ZN | Proportion of residential land zoned for lots over 25,000 sq. ft. |
INDUS | Proportion of non-retail business acres per town |
CHAS | Charles River dummy variable (1 if the tract bounds the river, 0 otherwise) |
NOX | Nitric oxide concentration (parts per 10 million) |
RM | Average number of rooms per dwelling |
AGE | Proportion of owner-occupied units built before 1940 |
DIS | Weighted distances to five Boston employment centers |
RAD | Index of accessibility to radial highways |
TAX | Full-value property tax rate per $10,000 |
PTRATIO | Pupil-teacher ratio by town |
B | Proportion of Black residents by town |
LSTAT | Percentage of lower-status population |
MEDV | Median value of owner-occupied homes in $1000s |
`MEDV` is the objective variable we want to predict; the others are the explanatory variables.
First, let's see what kind of data it is.
```python
import pandas as pd
from sklearn.datasets import load_boston

# Load the data (note: load_boston was removed in scikit-learn 1.2;
# this example assumes an older version)
boston = load_boston()

# Create a data frame
boston_df = pd.DataFrame(data=boston.data, columns=boston.feature_names)
boston_df['MEDV'] = boston.target

# Data overview
print(boston_df.shape)
boston_df.head()
```
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.00632 | 18 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 396.9 | 4.98 | 24 |
1 | 0.02731 | 0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.9 | 9.14 | 21.6 |
2 | 0.02729 | 0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
3 | 0.03237 | 0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
4 | 0.06905 | 0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.9 | 5.33 | 36.2 |
It contains numerical data.
Let's visualize the relationships between the columns.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Pair plots of the explanatory variables against MEDV (split into two halves for readability)
sns.pairplot(data=boston_df[list(boston_df.columns[0:6]) + ['MEDV']])
plt.show()

sns.pairplot(data=boston_df[list(boston_df.columns[6:13]) + ['MEDV']])
plt.show()
```
Let's also look at the correlations between the columns.

```python
boston_df.corr()
```
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CRIM | 1 | -0.200469 | 0.406583 | -0.055892 | 0.420972 | -0.219247 | 0.352734 | -0.37967 | 0.625505 | 0.582764 | 0.289946 | -0.385064 | 0.455621 | -0.388305 |
ZN | -0.200469 | 1 | -0.533828 | -0.042697 | -0.516604 | 0.311991 | -0.569537 | 0.664408 | -0.311948 | -0.314563 | -0.391679 | 0.17552 | -0.412995 | 0.360445 |
INDUS | 0.406583 | -0.533828 | 1 | 0.062938 | 0.763651 | -0.391676 | 0.644779 | -0.708027 | 0.595129 | 0.72076 | 0.383248 | -0.356977 | 0.6038 | -0.483725 |
CHAS | -0.055892 | -0.042697 | 0.062938 | 1 | 0.091203 | 0.091251 | 0.086518 | -0.099176 | -0.007368 | -0.035587 | -0.121515 | 0.048788 | -0.053929 | 0.17526 |
NOX | 0.420972 | -0.516604 | 0.763651 | 0.091203 | 1 | -0.302188 | 0.73147 | -0.76923 | 0.611441 | 0.668023 | 0.188933 | -0.380051 | 0.590879 | -0.427321 |
RM | -0.219247 | 0.311991 | -0.391676 | 0.091251 | -0.302188 | 1 | -0.240265 | 0.205246 | -0.209847 | -0.292048 | -0.355501 | 0.128069 | -0.613808 | 0.69536 |
AGE | 0.352734 | -0.569537 | 0.644779 | 0.086518 | 0.73147 | -0.240265 | 1 | -0.747881 | 0.456022 | 0.506456 | 0.261515 | -0.273534 | 0.602339 | -0.376955 |
DIS | -0.37967 | 0.664408 | -0.708027 | -0.099176 | -0.76923 | 0.205246 | -0.747881 | 1 | -0.494588 | -0.534432 | -0.232471 | 0.291512 | -0.496996 | 0.249929 |
RAD | 0.625505 | -0.311948 | 0.595129 | -0.007368 | 0.611441 | -0.209847 | 0.456022 | -0.494588 | 1 | 0.910228 | 0.464741 | -0.444413 | 0.488676 | -0.381626 |
TAX | 0.582764 | -0.314563 | 0.72076 | -0.035587 | 0.668023 | -0.292048 | 0.506456 | -0.534432 | 0.910228 | 1 | 0.460853 | -0.441808 | 0.543993 | -0.468536 |
PTRATIO | 0.289946 | -0.391679 | 0.383248 | -0.121515 | 0.188933 | -0.355501 | 0.261515 | -0.232471 | 0.464741 | 0.460853 | 1 | -0.177383 | 0.374044 | -0.507787 |
B | -0.385064 | 0.17552 | -0.356977 | 0.048788 | -0.380051 | 0.128069 | -0.273534 | 0.291512 | -0.444413 | -0.441808 | -0.177383 | 1 | -0.366087 | 0.333461 |
LSTAT | 0.455621 | -0.412995 | 0.6038 | -0.053929 | 0.590879 | -0.613808 | 0.602339 | -0.496996 | 0.488676 | 0.543993 | 0.374044 | -0.366087 | 1 | -0.737663 |
MEDV | -0.388305 | 0.360445 | -0.483725 | 0.17526 | -0.427321 | 0.69536 | -0.376955 | 0.249929 | -0.381626 | -0.468536 | -0.507787 | 0.333461 | -0.737663 | 1 |
Except for a few pairs of columns, the correlations do not appear to be especially high.
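A heatmap can make this correlation matrix easier to scan than the raw table. This is a small sketch of my own using seaborn's standard `heatmap` function, not part of the original article:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Draw the correlation matrix as a color-coded heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(boston_df.corr(), cmap='coolwarm', vmin=-1, vmax=1)
plt.show()
```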
A regression model uses these columns to predict the value of a single objective variable.
**Data split**

First, split the data into training and test sets. This time we will split them 6:4.
```python
from sklearn.model_selection import train_test_split

# Split into training and test data at a 6:4 ratio
X = boston_df.drop('MEDV', axis=1)
Y = boston_df['MEDV']
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=0)
```
**Modeling**

Next, create a prediction model. Here we use a linear regression model, with `MEDV` as the objective variable and `CRIM` through `LSTAT` as the explanatory variables.
Roughly speaking, we build an expression of the form $Y = a x_1 + b x_2 + \cdots + \epsilon$.

$a$ and $b$ are called regression coefficients; each one shows how much its variable contributes to the prediction of the objective variable.

$\epsilon$ is called the residual and represents how far each data point deviates from the expression.

A linear regression model finds the coefficients by minimizing the sum of the squared residuals over all the data points.
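As an illustration of this least-squares idea (my addition, on toy data rather than the Boston dataset), the coefficients can be computed directly with NumPy's `lstsq`:

```python
import numpy as np

# Toy data: y = 2*x1 + 3*x2 + noise (coefficients chosen for illustration)
rng = np.random.default_rng(0)
X = rng.random((100, 2))
y = 2 * X[:, 0] + 3 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Prepend a column of ones so an intercept is fitted as well
X1 = np.column_stack([np.ones(len(X)), X])

# Least squares: find beta minimizing the sum of squared residuals ||y - X1 @ beta||^2
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(beta)  # approximately [0, 2, 3]
```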
We use scikit-learn's `linear_model` module.
```python
from sklearn import linear_model

# Train a linear regression model
model = linear_model.LinearRegression()
model.fit(x_train, y_train)
```
Importing the module and calling `fit` is all it takes to build the model.
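Once fitted, the learned regression coefficients and intercept can be inspected through the standard `coef_` and `intercept_` attributes, for example:

```python
import pandas as pd

# Show the coefficient learned for each explanatory variable
coef_df = pd.DataFrame({'column': x_train.columns, 'coefficient': model.coef_})
print(coef_df)
print('intercept :', model.intercept_)
```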
**Accuracy verification**

To verify the accuracy of a regression model, we look at how far the predictions deviate from the measured values.

Commonly used metrics are the mean squared error (MSE), the root mean squared error (RMSE), and the coefficient of determination $R^2$.
MSE is the mean of the squared errors; if it is small on both the training data and the test data, the model is judged to perform well.

RMSE is the square root of the MSE.

$R^2$ equals 1 when the MSE is 0, and the closer it is to 1, the better the model's performance.
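For reference, the standard definitions of these metrics, where $y_i$ is a measured value, $\hat{y}_i$ the corresponding prediction, and $\bar{y}$ the mean of the measured values:

```math
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2, \qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}, \qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}
```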
```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_pred = model.predict(x_test)
print('MSE : {0}'.format(mean_squared_error(y_test, y_pred)))
print('RMSE : {0}'.format(np.sqrt(mean_squared_error(y_test, y_pred))))
print('R^2 : {0}'.format(model.score(x_test, y_test)))
```
```
MSE : 25.79036215070245
RMSE : 5.078421226198399
R^2 : 0.6882607142538019
```
Looking at the accuracy, the RMSE is about 5.0. Since `MEDV` is in units of $1000, the predictions deviate from the actual house prices by roughly $5,000 on average.
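Since the criterion above is that the MSE should be small on both the training and the test data, it is worth comparing the two. This is a quick check of my own that reuses the fitted `model`:

```python
# Compare training and test error to check for overfitting
y_train_pred = model.predict(x_train)
print('train MSE : {0}'.format(mean_squared_error(y_train, y_train_pred)))
print('test MSE : {0}'.format(mean_squared_error(y_test, y_pred)))
```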
**Residual plot**

To see how far the prediction model was off, let's visualize the residuals.
```python
# Plot the residuals against the predicted values
plt.scatter(y_pred, y_pred - y_test, c='red', marker='s')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')

# Draw a horizontal line at y = 0
plt.hlines(y=0, xmin=-10, xmax=50, lw=2, color='blue')
plt.xlim([10, 50])
plt.show()
```
Plotting the predictions against their residuals shows how large each deviation is; some points are off by quite a lot.

In this way, we aim to improve accuracy and reduce error by selecting data, preprocessing it, adjusting model parameter values, and so on. One sketch of such tuning follows.
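As one example of such an adjustment (a sketch I am adding, not the method used in this article): standardize the features and add L2 regularization using scikit-learn's `Ridge`, whose `alpha` parameter controls the regularization strength:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# Standardize the features, then fit a ridge regression model
ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
ridge.fit(x_train, y_train)
print('R^2 : {0}'.format(ridge.score(x_test, y_test)))
```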
Today I explained how a regression model works. There are many other kinds of regression models.

First, get a grip on what regression is, and then learn how to build and validate a model.
19 days until you become an engineer
Otsu py's HP: http://www.otupy.net/
Youtube: https://www.youtube.com/channel/UCaT7xpeq8n1G_HcJKKSOXMw
Twitter: https://twitter.com/otupython