What is supervised learning?

Types of machine learning

Machine learning is divided into three main areas.

1, supervised learning: The machine learns from accumulated data and uses it to predict new or future data, or to classify it. Examples include stock price forecasting and image identification.

2, unsupervised learning: The machine discovers the structure and relationships in the accumulated data on its own. It is used for analyzing customer trends in retail stores and in Google image recognition.

3, Reinforcement learning: The form of learning is similar to unsupervised learning, but rewards and goals are set at training time, and the machine learns to maximize the reward. It is often used for game-playing AI, such as Go.

Of these, supervised learning can be broadly divided into two methods: regression and classification.
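As a minimal sketch of the difference between the two, the example below fits a regression model (continuous target) and a classification model (discrete label) with scikit-learn, the library introduced in the next section. The study-hours data, the scores, and the pass/fail labels are made-up values for illustration, and LogisticRegression is used only as one common classifier, not as the method this article covers.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical toy data: hours studied (one feature per sample).
hours = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])

# Regression: the target is a continuous number (an exam score).
scores = np.array([35.0, 45.0, 50.0, 62.0, 70.0, 81.0])
reg = LinearRegression().fit(hours, scores)
predicted_score = reg.predict([[3.5]])[0]

# Classification: the target is a discrete label (0 = fail, 1 = pass).
passed = np.array([0, 0, 0, 1, 1, 1])
clf = LogisticRegression().fit(hours, passed)
predicted_label = clf.predict([[5.5]])[0]

print(predicted_score)  # a number on a continuous scale
print(predicted_label)  # one of the labels 0 or 1
```

The regression model returns a number, while the classifier returns one of the labels it was trained on.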

Machine learning with scikit-learn

We will use scikit-learn, a Python library for machine learning.


# Import the required modules.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Next, load the data you want to train on. The original data source is not shown here,
# so synthetic data stands in for it, generated the same way as in the examples later in this article.
# The data is split into four variables: train_X, test_X, train_y, and test_y.
X, y = make_regression(n_samples=100, n_features=1, n_targets=1, noise=5.0, random_state=42)
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=42)

# Build a learner.
# A learner is an object designed to be trained according to a learning model (learning method).
# scikit-learn's LinearRegression learns from the data and returns predictions.
# The details of LinearRegression are covered in the next and subsequent sections.
model = LinearRegression()

# Train the learner on the teacher data (existing data used for learning).
model.fit(train_X, train_y)

# Have the learner make predictions on test data prepared separately from the teacher data.
pred_y = model.predict(test_X)

# Compute the coefficient of determination, an evaluation value used to check the learner's performance.
score = model.score(test_X, test_y)

Linear regression

What is linear regression?

Regression analysis is an approach that estimates the data you want to predict from its relationship with data you already know. In short, we call it "regression" when the value being predicted is a number.

An easy example of regression (prediction) is working out how many kilometers you cover when traveling at 100 km/h for one hour. In the relationship distance = 100 × time, 100 is the coefficient of the data (the time).
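The speed example can be sketched in code as follows. The travel times are made-up values, and the distances are generated from the rule distance = 100 × time, so the fitted coefficient should come out very close to 100.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Time travelled in hours (feature) and distance covered in km (target),
# generated from the rule distance = 100 * time.
hours = np.array([[0.5], [1.0], [1.5], [2.0], [3.0]])
distance_km = 100.0 * hours.ravel()

model = LinearRegression()
model.fit(hours, distance_km)

# The learned coefficient and intercept.
print(model.coef_[0], model.intercept_)

# Predicted distance after one hour.
one_hour_km = model.predict([[1.0]])[0]
print(one_hour_km)
```

Because the data follows the rule exactly, the model recovers the coefficient almost perfectly; real data is noisy, so the estimate is only approximate.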

In linear regression, by looking at the coefficients of the data used for prediction, you can see how much each piece of data contributes to the data you want to predict.

Looking at the magnitude of each contribution is essential when, for example, you build a formula to maximize profit from shopping and purchase data, because it helps you understand what measures to take.
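To illustrate how coefficients reflect contribution, here is a small sketch on synthetic data. The two features and the true weights (5.0 and 0.5) are assumptions chosen for the example. Comparing coefficient sizes this way is only meaningful when the features are on a similar scale, as they are here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Two hypothetical features on the same scale (standard normal).
X = rng.normal(size=(200, 2))

# The target depends strongly on feature 0 and only weakly on feature 1,
# plus a small amount of noise.
y = 5.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)

# The estimated coefficients should be close to the true weights 5.0 and 0.5.
print(model.coef_)
```

The larger coefficient belongs to the feature that moves the target more, which is what "magnitude of contribution" refers to.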

Coefficient of determination

The coefficient of determination is an index showing how well the data predicted by linear regression matches the actual data. It also indicates how much you can trust the coefficient (magnitude of contribution) of each piece of data.

For example, if the predicted scores are around 70 but the actual scores are around 20, the coefficient of determination will be close to 0. If the actual scores are around 71, it will be close to 1.

The coefficient of determination is at most 1 and usually falls between 0 and 1; the larger the value, the better the model fits. (The value computed by scikit-learn's score method can even be negative when the model predicts worse than always guessing the mean.) A value of about 0.8 or more can be regarded as good accuracy. However, a value below 0.8 does not mean that the model is useless.

If the coefficient of determination is reasonably large (the threshold varies from person to person, but roughly 0.4 or more), the magnitudes of the contributions can be trusted to some extent.
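As a rough sketch, the coefficient of determination can be computed by hand as 1 minus the ratio of the squared prediction errors to the squared deviations of the actual values from their mean. The scores below are made-up numbers, and r_squared is a hypothetical helper name; scikit-learn's r2_score is used only to cross-check the result.

```python
import numpy as np
from sklearn.metrics import r2_score

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# Made-up actual scores and two sets of predictions.
actual = np.array([20.0, 45.0, 71.0, 90.0])
good_pred = np.array([22.0, 44.0, 70.0, 88.0])  # close to the actual values
poor_pred = np.array([70.0, 70.0, 20.0, 30.0])  # far from the actual values

good_r2 = r_squared(actual, good_pred)
poor_r2 = r_squared(actual, poor_pred)

print(good_r2, r2_score(actual, good_pred))
print(poor_r2, r2_score(actual, poor_pred))
```

The close predictions give a value near 1. The poor predictions give a negative value here, because they are worse than simply predicting the mean of the actual scores every time.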

Linear simple regression

Linear simple regression (more commonly called simple linear regression) is a regression analysis that predicts one piece of data (e.g., the amount of water) from one other piece of data (e.g., time). It is often used to investigate relationships in data and only rarely to make predictions.

Here, the data you want to predict is $y$, and the data used for prediction is $x$.

Assuming that the relationship $y = ax + b$ holds, we estimate $a$ and $b$.

There are various methods for estimating $a$ and $b$, but this time we will use the least squares method. It determines $a$ and $b$ so that the sum of the squares of the differences between the actual $y$ values and the estimated values $ax + b$ is minimized.

In the figure below, $a$ and $b$ are determined so that the sum of the squared vertical distances between the line and the orange data points is minimized. In this way, we draw the straight line closest to the existing data and infer future data from it.

The error is squared to prevent positive and negative errors from canceling each other out. For example, if you simply add errors of +2 and -2, the total is 0, and the errors cancel.
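The least squares estimate for a single x can be sketched with NumPy alone. The data points below are made-up values, and the formulas used are the standard closed-form solution for a line (slope = covariance of x and y divided by variance of x); np.polyfit serves only as a cross-check.

```python
import numpy as np

# Made-up data that roughly follows y = 2x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form least squares solution for y = a*x + b.
x_mean, y_mean = x.mean(), y.mean()
a = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
b = y_mean - a * x_mean

# Errors between the actual y and the fitted line.
residuals = y - (a * x + b)
plain_sum = residuals.sum()           # positive and negative errors cancel out
squared_sum = (residuals ** 2).sum()  # squared errors cannot cancel

print(a, b)
print(plain_sum, squared_sum)

# Cross-check against NumPy's degree-1 polynomial fit, which returns [slope, intercept].
slope, intercept = np.polyfit(x, y, 1)
print(slope, intercept)
```

The plain sum of the errors is essentially zero even though the line does not pass through every point, which is why the squared errors are what gets minimized.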

Now, in order to actually perform regression analysis, it is convenient to use a model called LinearRegression in the linear_model module of scikit-learn.

from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Here, we generate data for regression.
X, y = make_regression(n_samples=100, n_features=1, n_targets=1, noise=5.0, random_state=42)

train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=42)

model = LinearRegression()

model.fit(train_X, train_y)

# Print the coefficient of determination.
print(model.score(test_X, test_y))
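As a follow-up sketch, the fitted $a$ and $b$ can be read from the model's coef_ and intercept_ attributes. The code repeats the same data generation so that it runs on its own; the value x = 1.5 is an arbitrary input chosen for illustration.

```python
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Same synthetic data and split as in the example above.
X, y = make_regression(n_samples=100, n_features=1, n_targets=1, noise=5.0, random_state=42)
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=42)

model = LinearRegression()
model.fit(train_X, train_y)

# a is the slope and b is the intercept of the fitted line y = a*x + b.
a = model.coef_[0]
b = model.intercept_
print(a, b)

# Predict y for a new x value; the result equals a * 1.5 + b.
pred = model.predict([[1.5]])[0]
print(pred, a * 1.5 + b)
```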

Linear multiple regression

Linear multiple regression is a regression analysis in which one piece of data you want to predict (e.g., a restaurant's overall rating) is predicted from multiple pieces of data (e.g., a food quality score and a customer service score). Prediction accuracy tends to be high when the pieces of data used for prediction are only weakly related to one another.

Again, we use the least squares method to estimate the relationship between the data to be predicted and the data used for prediction. In the case of multiple regression, the data used for prediction are $x_0$, $x_1$, $x_2$, ..., and we assume the relationship

$$y = \beta_0 x_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \epsilon$$

We will estimate $\beta_0$, $\beta_1$, $\beta_2$, ..., $\epsilon$.

You can see that there are more $x$ terms than in simple regression.
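A least squares fit with several x can be sketched with NumPy's np.linalg.lstsq. The data, the true weights (2.0, -1.0, 0.5), and the constant term 4.0 are all assumptions made for this example; the constant is handled by adding a column of ones, which is one common convention and not necessarily how the formula above treats its last term.

```python
import numpy as np

rng = np.random.default_rng(1)

# Three hypothetical features and a target built from known weights plus noise.
X = rng.normal(size=(50, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] + 4.0 + rng.normal(scale=0.1, size=50)

# Add a column of ones so that the constant term is estimated as well.
X_design = np.column_stack([np.ones(len(X)), X])

# Solve for the weights that minimize the sum of squared errors.
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# beta[0] is the constant term; beta[1:] are the weights of the three features.
print(beta)
```

The estimates should land close to 4.0, 2.0, -1.0, and 0.5, with small differences caused by the noise.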

Linear multiple regression can also be performed with the LinearRegression model in the linear_model module of scikit-learn. The values of $\beta_0$, $\beta_1$, $\beta_2$, ..., $\epsilon$ that best fit the existing data are determined automatically and used for prediction.

from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Generate 10 features (x) by setting n_features=10.
# Set n_informative=3 so that only 3 of them are actually used to generate y.
X, y = make_regression(n_samples=100, n_features=10, n_informative=3, n_targets=1, noise=5.0, random_state=42)
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=42)

model = LinearRegression()
model.fit(train_X, train_y)
# Print the coefficient of determination.
print(model.score(test_X, test_y))
# You can also make predictions for test_X by calling model.predict(test_X).
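To see which x actually contribute, the sketch below compares the estimated coefficients with the true ones. It relies on the coef=True option of make_regression, which additionally returns the coefficients used to generate y; the data itself is generated with the same settings as above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# coef=True makes make_regression also return the true coefficients.
X, y, true_coef = make_regression(n_samples=100, n_features=10, n_informative=3, n_targets=1,
                                  noise=5.0, coef=True, random_state=42)
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=42)

model = LinearRegression()
model.fit(train_X, train_y)

# Only 3 of the 10 true coefficients are non-zero (the informative features).
for i, (estimated, true) in enumerate(zip(model.coef_, true_coef)):
    print(f"x{i}: estimated={estimated:8.2f}  true={true:8.2f}")

# Largest gap between an estimated and a true coefficient.
max_error = np.max(np.abs(model.coef_ - true_coef))
print(max_error)
```

The estimates for the uninformative features should stay near zero, while the informative ones should approach their true values, up to the effect of the noise.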

Python: Supervised Learning (Regression)