A computer program is said to learn from experience E (data) with respect to task T (what the application wants to do) and performance measure P if its performance on T, as measured by P, improves with experience E.
Example) Stock price prediction
Task T → predict the next stock price from past stock price data
Performance measure P → the difference between the predicted and actual stock price
Experience E → past stock price data
Regression is the problem of predicting a continuous-valued output from discrete or continuous-valued inputs.
- Predicting with a straight line → linear regression problem
- Predicting with a curve → non-linear regression problem
- Input (each element is called an explanatory variable or feature): m-dimensional vector
- Output (objective variable): scalar value
Explanatory variables: $x = (x_1, x_2, \dots, x_m)^T \in \mathbb{R}^m$
Objective variable: $y \in \mathbb{R}$
Example) House price prediction
Explanatory variables: number of rooms, site area, age
Objective variable: price
- One of the machine learning models for solving regression problems
- Supervised learning
- A model that outputs a linear combination of the input and m-dimensional parameters
Linear combination (inner product of input and parameter)
Parameters: $w = (w_1, w_2, \dots, w_m)^T \in \mathbb{R}^m$
Explanatory variables: $x = (x_1, x_2, \dots, x_m)^T \in \mathbb{R}^m$
Predicted value: $\hat{y}$
Linear combination:
\hat{y} = w^Tx + w_0= \sum_{j=1}^{m} w_jx_j + w_0
The data are divided into training data and validation data in order to measure the model's generalization performance: not merely how well it fits the data used for training, but how well it can predict unknown data.
Mean squared error between the data and the model output:
MSE_{train} = \frac{1}{n_{train}}\sum_{i=1}^{n_{train}}(\hat{y}_i^{(train)}-y_i^{(train)})^2
- Search for the parameters that minimize the mean squared error on the training data
- To minimize the mean squared error, find the point where its gradient becomes 0
Differentiating the MSE with respect to $W$, setting the derivative to 0, and solving gives the regression coefficients $\hat{W}$:
\hat{W} = (X^{(train)T}X^{(train)})^{-1}X^{(train)T}y^{(train)}
Therefore, the predicted value $\hat{y}$ is
\hat{y}=X\hat{W} = X(X^{(train)T}X^{(train)})^{-1}X^{(train)T}y^{(train)}
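As a minimal sketch of this closed-form solution, the following NumPy snippet fits synthetic data with the normal equation; the data and true weights are illustrative assumptions, and in practice `np.linalg.pinv` or `np.linalg.lstsq` is numerically safer than an explicit inverse.

```python
import numpy as np

# A minimal sketch of the normal-equation solution above, on synthetic data.
rng = np.random.default_rng(0)
X_raw = rng.uniform(0, 1, size=(100, 3))            # 100 samples, 3 explanatory variables
y_train = X_raw @ np.array([2.0, -1.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=100)

X_train = np.hstack([np.ones((100, 1)), X_raw])     # leading column of ones estimates w_0
W_hat = np.linalg.inv(X_train.T @ X_train) @ X_train.T @ y_train
y_hat = X_train @ W_hat                              # predicted values

print("estimated [w_0, w_1, w_2, w_3]:", W_hat)     # close to [3.0, 2.0, -1.0, 0.5]
```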
[Exercise results] https://github.com/Tomo-Horiuchi/rabbit/blob/master/Part1/%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92/01_%E7%B7%9A%E5%9E%8B%E5%9B%9E%E5%B8%B0%E3%83%A2%E3%83%87%E3%83%AB/skl_regression.ipynb
Reference for model evaluation metrics: https://funatsu-lab.github.io/open-course-ware/basic-theory/accuracy-index/#r2
The data was split 7:3 into training and test sets, each model was fitted, and the coefficient of determination and mean squared error were obtained.
- Simple linear regression analysis
  Explanatory variable: number of rooms
  Objective variable: price
  MSE Train: 44.983, Test: 40.412
- Multiple regression analysis (2 variables)
  Explanatory variables: number of rooms, crime rate
  Objective variable: price
  MSE Train: 40.586, Test: 34.377
Since the MSE is closer to 0 and $R^2$ is closer to 1, the multiple regression analysis is the more accurate of the two.
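A rough sketch of this 7:3 comparison is given below. It assumes the Boston housing data can be fetched from OpenML (the original notebook loads it from scikit-learn), and the column names `RM` (number of rooms) and `CRIM` (crime rate) are assumptions about that dataset, so the exact numbers will differ from the notebook.

```python
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Hypothetical reproduction: simple regression (rooms) vs. multiple regression (rooms + crime rate).
boston = fetch_openml(name="boston", version=1, as_frame=True)
X, y = boston.data, boston.target.astype(float)

for cols in (["RM"], ["RM", "CRIM"]):
    X_sel = X[cols].astype(float)
    X_tr, X_te, y_tr, y_te = train_test_split(X_sel, y, test_size=0.3, random_state=0)
    model = LinearRegression().fit(X_tr, y_tr)
    print(cols,
          "MSE train:", mean_squared_error(y_tr, model.predict(X_tr)),
          "MSE test:", mean_squared_error(y_te, model.predict(X_te)),
          "R^2 test:", r2_score(y_te, model.predict(X_te)))
```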
Nonlinear regression modeling is required for phenomena that have a complex nonlinear structure.
- Phenomena that can be captured with linear models are limited
- A linear model fitted to non-linear data is not convincing
- A mechanism for capturing non-linear structure is needed
In the case of non-linearity, modeling is performed using a method called the basis expansion method.
- As the regression function, use a linear combination of the parameter vector and known nonlinear functions called basis functions
- The unknown parameters are estimated by the least squares method or maximum likelihood method, as in the linear regression model
- Frequently used basis functions:
  - Polynomial functions
  - Gaussian basis functions
  - Spline functions
y_i=f(x_i)+\epsilon_i\qquad y_i=\omega_0+\sum_{j=1}^{m}\omega_j\phi_j(x_i)+\epsilon_i
The input $x$ is first mapped non-linearly by $\phi$, and a linear combination of the results is then taken, so the model remains linear in the parameters. Here $\phi_j$ are the basis functions.
Polynomial basis function:

\phi_j(x)=x^j
Gaussian basis function:

\phi_j(x)=\exp\Biggl(-\frac{(x-\mu_j)^2}{2h_j}\Biggr)
The basis expansion method can be estimated within the same framework as linear regression.
Explanatory variable:
x_i=(x_{i1},x_{i2},\dots,x_{im})\in \mathbb{R}^m
Non-linear function vector:
\phi(x_i)=(\phi_1(x_i),\phi_2(x_i),\dots,\phi_k(x_i))^T\in\mathbb{R}^k
Design matrix for nonlinear functions:
\Phi^{(train)}=(\phi(x_1),\phi(x_2),\dots,\phi(x_n))^T\in \mathbb{R}^{n\times k}
Maximum Likelihood Prediction:
\hat{y}=\Phi(\Phi^{(train)T}\Phi^{(train)})^{-1}\Phi^{(train)T}y^{(train)}
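As an illustration, here is a minimal NumPy sketch that builds a Gaussian design matrix and reuses the same normal equation; the basis-function centers, width, and toy data are assumptions.

```python
import numpy as np

# Sketch: Gaussian basis expansion followed by ordinary least squares on the design matrix.
def gaussian_basis(x, mu, h):
    return np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * h))

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 50)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=50)        # nonlinear target

mu = np.linspace(0, 1, 9)                                         # basis-function centers
Phi = np.hstack([np.ones((50, 1)), gaussian_basis(x, mu, 0.02)])  # design matrix (n x k)
w_hat = np.linalg.pinv(Phi.T @ Phi) @ Phi.T @ y                   # same normal equation as before
y_hat = Phi @ w_hat                                               # fitted values
```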
Underfitting: a model that does not achieve a sufficiently small error even on the training data.
Countermeasures
- Use a more expressive model
Overfitting: a model whose training error is small but far from its test-set error; the error is small on the training data but large on the validation data.
Countermeasures
- Increase the number of training data
- Suppress expressiveness by removing unnecessary basis functions
- Suppress expressiveness using regularization methods, etc.
Minimize a function to which a regularization term (penalty term) has been added; the value of this term increases with the complexity of the model.
Regularization term: $\gamma R(w)$
There are several types depending on the norm used, and each yields estimators with different properties.
$\gamma$: regularization parameter; it adjusts the smoothness of the model's curve.
Ridge regression uses the L2 norm as the regularization term. It estimates the parameters closer to 0, which is called shrinkage estimation.
Lasso regression uses the L1 norm as the regularization term. It estimates some parameters as exactly 0, which is called sparse estimation.
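A short scikit-learn sketch contrasts the two; the polynomial degree and the alpha values are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import PolynomialFeatures

# Sketch: L2 (Ridge) vs. L1 (Lasso) regularization on degree-9 polynomial features.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, (30, 1))
y = np.sin(2 * np.pi * x).ravel() + rng.normal(scale=0.1, size=30)

X_poly = PolynomialFeatures(degree=9, include_bias=False).fit_transform(x)

ridge = Ridge(alpha=0.1).fit(X_poly, y)                     # shrinks coefficients toward 0
lasso = Lasso(alpha=0.01, max_iter=100000).fit(X_poly, y)   # sets some coefficients exactly to 0

print("Ridge coefficients:", ridge.coef_)
print("Lasso coefficients:", lasso.coef_)                   # note the exact zeros (sparse estimation)
```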
Holdout method: the data are divided into two parts, one for training and one for testing, and the test part is used to estimate prediction accuracy and error rate. When only a small amount of data is available, it has the drawback of not giving a reliable performance evaluation. In a nonlinear regression model based on the basis expansion method, the number, positions, and tuning of the basis functions are determined by the model that minimizes the holdout error.
Cross-validation: the data are divided into training and validation folds, the model is evaluated on each validation fold, and the average accuracy is called the CV value. The evaluation result is more reliable than that of the holdout method.
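For example, with scikit-learn on synthetic data (the model, fold count, and data are assumptions):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Sketch: 5-fold cross-validation; the CV value is the average score over the folds.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

scores = cross_val_score(Ridge(alpha=0.1), X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("fold scores:", scores)
print("CV value (mean):", scores.mean())
```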
The true function is $y = 1 - 48x + 218x^2 - 315x^3 + 145x^4$.
Data are generated by adding random noise to the true function, and the true function is then predicted from these data by linear regression and by nonlinear regression.
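A sketch of this experiment is shown below; using `KernelRidge` for the RBF fit is an assumption about the notebook, and the noise level is illustrative.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import LinearRegression

# Sketch: noisy samples from the true quartic, fitted with a linear model and an
# RBF kernel ridge model (regularization strength alpha = 0.0002, as in the text).
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, (100, 1))
true_y = 1 - 48 * x + 218 * x**2 - 315 * x**3 + 145 * x**4
y = (true_y + rng.normal(scale=0.5, size=(100, 1))).ravel()

linear = LinearRegression().fit(x, y)
nonlinear = KernelRidge(alpha=0.0002, kernel="rbf").fit(x, y)

print("linear R^2:", linear.score(x, y))
print("nonlinear R^2:", nonlinear.score(x, y))
```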
[Exercise results] https://github.com/Tomo-Horiuchi/rabbit/blob/master/Part1/%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92/02_%E9%9D%9E%E7%B7%9A%E5%BD%A2%E5%9B%9E%E5%B8%B0%E3%83%A2%E3%83%87%E3%83%AB/skl_nonlinear%20regression.ipynb
Linear regression accuracy: 0.3901811689763751
Nonlinear regression accuracy: 0.8824933990551088 (RBF basis functions, regularization strength 0.0002)
Linear regression does not represent the true function, while nonlinear regression reproduces it roughly, so nonlinear regression is the more accurate. Without regularization the accuracy rises to about 0.9999, but the figure shows that the model is overfitted and far from the true function.
Since Lasso regression makes the parameters sparse, checking the estimated coefficients shows that they are all 0.
The logistic regression model is for classification.
What is a classification problem? The problem of assigning a class to a given input.
Data handled in classification:
Input: $x = (x_1, x_2, \dots, x_m)^T \in \mathbb{R}^m$ (m-dimensional vector)
Output: $y \in \{0, 1\}$ (value of 0 or 1)
Example) Titanic data, Iris data
If such data are fed as-is into a regression model, the output has no meaning as a probability. The logistic regression model therefore feeds the linear combination of the input and the m-dimensional parameters into the sigmoid function.
A monotonically increasing function that always outputs a value between 0 and 1 for any real-valued input.
Sigmoid function:

\sigma(x) = \frac{1}{1 + \exp(-ax)}
Features:
- Increasing $a$ increases the slope of the curve near $x = 0$
- For extremely large $a$ it approaches the unit step function
- Changing the bias shifts the position of the step
- Its derivative can be expressed in terms of the function itself
- The likelihood function is easy to compute
Differentiating the sigmoid function gives

\frac{\partial \sigma(x)}{\partial x} = a\sigma(x)\bigl(1 - \sigma(x)\bigr)

so the derivative can be expressed using the sigmoid function itself.
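A quick numerical check of this identity (the gain $a$ and the test points are illustrative assumptions):

```python
import numpy as np

# Verify numerically that d/dx sigma(x) = a * sigma(x) * (1 - sigma(x)).
def sigmoid(x, a=1.0):
    return 1.0 / (1.0 + np.exp(-a * x))

x = np.linspace(-5, 5, 11)
a = 2.0
analytic = a * sigmoid(x, a) * (1.0 - sigmoid(x, a))
numeric = (sigmoid(x + 1e-6, a) - sigmoid(x - 1e-6, a)) / 2e-6   # central difference
print(np.allclose(analytic, numeric, atol=1e-6))                  # True
```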
The value we want (the probability that $Y = 1$) can be written as

P(Y = 1 \mid x) = \sigma(w_0 + w^T x)

A data point is classified as $Y = 1$ if this probability is 0.5 or more, and as $Y = 0$ if it is less than 0.5. Maximum likelihood estimation is used to determine the parameters of this formula.
The logistic regression model uses the Bernoulli distribution. For a given distribution, the data it generates change as its parameter changes. Maximum likelihood estimation estimates, from the data, the plausible distribution that would have generated them; that is, it selects the parameters that maximize the likelihood function.
In the likelihood, the data are fixed and the parameters are varied. The probability of $y = y_1$ in a single trial is

P(y_1) = p^{y_1}(1 - p)^{1 - y_1}

and the probability that $y_1, \dots, y_n$ occur together in $n$ trials (with $p$ fixed) is

L(p) = \prod_{i=1}^{n} p^{y_i}(1 - p)^{1 - y_i}

Here the given data $y$ are fixed and $p$ is treated as the variable to be estimated. The estimate is most plausible where the likelihood is maximal, so it suffices to solve the optimization problem for $p$.
- Since the probability $p$ is given by the sigmoid function, the parameters to be estimated are the weight parameters
- Find the plausible parameters that generate the objective variables from the explanatory variables
With the weights $W$ unknown, the likelihood function becomes

L(w) = \prod_{i=1}^{n} p_i^{y_i}(1 - p_i)^{1 - y_i}, \qquad p_i = \sigma(w_0 + w^T x_i)
- Because the likelihood involves many multiplications, taking the logarithm makes differentiation easier
- The point that maximizes the log-likelihood is the same as the point that maximizes the likelihood
- Taking the negative of the log-likelihood turns the problem into a minimization, matching the minimization of the least squares method
Therefore, take the logarithm of the above $L(w)$, multiply it by minus one, and solve the resulting minimization problem:

E(w) = -\log L(w) = -\sum_{i=1}^{n}\Bigl(y_i \log p_i + (1 - y_i)\log(1 - p_i)\Bigr)
In the case of logistic regression, the parameters that minimize this function cannot be obtained analytically: differentiating the log-likelihood with respect to the parameters and setting it to 0 has no closed-form solution, unlike the least squares case. The parameters are therefore updated sequentially by gradient descent, an iterative-learning approach in which the learning rate adjusts how readily the parameters converge. When the parameters stop being updated, the gradient has become 0, meaning the optimum within the range of the iterative learning has been found.

Plain gradient descent, however, has the drawback that all of the input data are needed for a single parameter update, so when the input data become huge, computation time and memory shortage become problems. Stochastic gradient descent, which updates the parameters using randomly chosen samples one at a time, addresses this.
An initial value is given to the parameters, they are updated little by little, and when the updates converge, the parameters at that point are adopted as the optimum. The learning rate $\eta$ represents the "step length" of each parameter update: if it is too small, convergence takes a long time; if it is too large, the optimum is "jumped over" (the point we actually want to find becomes hard to reach).
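A minimal sketch of stochastic gradient descent for logistic regression on synthetic data (the learning rate, number of epochs, and data are assumptions):

```python
import numpy as np

# Sketch: minimize the negative log-likelihood by SGD; the per-sample gradient is (p - y) * x.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(float)  # noisy labels

Xb = np.hstack([np.ones((200, 1)), X])     # bias column
w = np.zeros(3)
eta = 0.1                                  # learning rate ("step length")

for epoch in range(100):
    for i in rng.permutation(len(y)):      # SGD: one sample per parameter update
        p = sigmoid(Xb[i] @ w)
        w -= eta * (p - y[i]) * Xb[i]      # gradient of the NLL for a single sample

p_all = np.clip(sigmoid(Xb @ w), 1e-12, 1 - 1e-12)
nll = -np.sum(y * np.log(p_all) + (1 - y) * np.log(1 - p_all))
print("weights:", w, "negative log-likelihood:", nll)
```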
[Confusion matrix]

| | Validation data: positive | Validation data: negative |
|---|---|---|
| Predicted: positive | True Positive (TP) | False Positive (FP) |
| Predicted: negative | False Negative (FN) | True Negative (TN) |
[Precision, recall, F-value]

Accuracy: the proportion of all predictions that are correct, $(TP + TN)/(TP + FP + FN + TN)$
Recall: of the data that are actually positive, the proportion predicted positive, $TP/(TP + FN)$
Precision: of the data predicted positive, the proportion that are actually positive, $TP/(TP + FP)$
F-value: the harmonic mean of precision and recall
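A tiny sketch computing these metrics directly from confusion-matrix counts (the counts are illustrative):

```python
# Sketch: metrics computed directly from confusion-matrix counts (illustrative numbers).
TP, FP, FN, TN = 50, 10, 5, 35

accuracy  = (TP + TN) / (TP + FP + FN + TN)   # proportion of all predictions that are correct
recall    = TP / (TP + FN)                    # of the actual positives, how many were found
precision = TP / (TP + FP)                    # of the predicted positives, how many were correct
f_value   = 2 * precision * recall / (precision + recall)

print(accuracy, recall, precision, f_value)
```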
[Exercise results] https://github.com/Tomo-Horiuchi/rabbit/blob/master/Part1/%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92/04_%E3%83%AD%E3%82%B8%E3%82%B9%E3%83%86%E3%82%A3%E3%82%AF%E3%82%B9%E5%9B%9E%E5%B8%B0%E3%83%A2%E3%83%87%E3%83%AB/skl_logistic_regression.ipynb
[Discussion] The results show that the higher the fare, the higher the probability of survival, and the higher the passenger class, the higher the probability of survival. Most upper-class women survived.
- Dimensionality reduction
- Fewer variables with little loss of information
- Can be visualized in 2D or 3D
Training data: $x_i = (x_{i1}, x_{i2}, \dots, x_{im}) \in \mathbb{R}^m$
Mean vector: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

The amount of information is regarded as the size of the variance, and we search for the projection axis (coefficient vector $a_j$) that maximizes the variance of the projected data.
The variance after the linear transformation $s_j = \bar{X}a_j$ (where $\bar{X}$ is the mean-centered data matrix) is

\mathrm{Var}(s_j) = a_j^T \mathrm{Var}(\bar{X}) a_j

with the constraint $a_j^T a_j = 1$. Therefore, the objective function is

\arg\max_{a_j} \; a_j^T \mathrm{Var}(\bar{X}) a_j \quad \text{subject to} \quad a_j^T a_j = 1

Here, the following Lagrange function is used to solve the constrained problem:

E(a_j) = a_j^T \mathrm{Var}(\bar{X}) a_j - \lambda\,(a_j^T a_j - 1)

Differentiating and setting the derivative to 0 shows that the projection axes are the eigenvectors of the variance-covariance matrix, and the corresponding eigenvalues are the projected variances.
Contribution rate: how much of the original information is retained after compression; it is the ratio of the variance (eigenvalue) of the k-th principal component to the total variance:

c_k = \frac{\lambda_k}{\sum_{i=1}^{m}\lambda_i}
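For reference, a scikit-learn sketch of the contribution rate; using the breast-cancer dataset here is an assumption about the exercise, so the exact percentages will differ.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Sketch: contribution (explained variance) rate of the first principal components.
X = StandardScaler().fit_transform(load_breast_cancer().data)

pca = PCA(n_components=3).fit(X)
print("contribution rate per component:", pca.explained_variance_ratio_)
print("cumulative, 2 components:", pca.explained_variance_ratio_[:2].sum())
print("cumulative, 3 components:", pca.explained_variance_ratio_.sum())
```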
[Exercise results] https://github.com/Tomo-Horiuchi/rabbit/blob/master/Part1/%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92/03_%E4%B8%BB%E6%88%90%E5%88%86%E5%88%86%E6%9E%90/skl_pca.ipynb
[Discussion] The contribution rate is about 60% when compressing to 2 dimensions and about 70% when compressing to 3 dimensions. Even when compressed to 2 dimensions, the data can still be classified to some extent. Before compression the classification accuracy was 97%, and it drops when the data are compressed to 2 dimensions. Reducing the dimensions made the data easier to visualize and understand.
A machine learning method for classification problems: take the K nearest neighbors of a query point and assign the point to the class to which the majority of them belong. The larger k is, the smoother the decision boundary becomes.
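A minimal NumPy sketch of this procedure (the value of k and the toy data are assumptions):

```python
import numpy as np

# Sketch: k-nearest-neighbor classification by majority vote among the k closest points.
def knn_predict(X_train, y_train, X_query, k=3):
    preds = []
    for q in X_query:
        dists = np.linalg.norm(X_train - q, axis=1)    # distance to every training point
        nearest = y_train[np.argsort(dists)[:k]]       # labels of the k nearest points
        preds.append(np.bincount(nearest).argmax())    # majority vote
    return np.array(preds)

rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y_train = np.array([0] * 50 + [1] * 50)
print(knn_predict(X_train, y_train, np.array([[0.0, 0.0], [3.0, 3.0]]), k=5))  # [0 1]
```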
[Exercise results] https://github.com/Tomo-Horiuchi/rabbit/blob/master/Part1/%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92/05_knn/np_knn.ipynb
[Discussion] It was confirmed that the larger k is, the smoother the decision boundary becomes.
- Unsupervised learning
- Clustering method
- Classifies the given data into k clusters
1. Set the initial center value of each cluster
2. For each data point, calculate the distance to each cluster center and assign the point to the nearest cluster
3. Recalculate the mean vector (center) of each cluster
4. Repeat steps 2 and 3 until convergence (a NumPy sketch follows below)
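A minimal NumPy sketch of these steps (the value of k, the convergence check, and the toy data are assumptions):

```python
import numpy as np

# Sketch of the steps above: initialize centers, assign points, recompute means, repeat.
def kmeans(X, k=3, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]     # 1. initial cluster centers
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                          # 2. assign each point to the nearest center
        new_centers = np.array([X[labels == j].mean(axis=0)    # 3. recompute the mean of each cluster
                                for j in range(k)])
        if np.allclose(new_centers, centers):                  # 4. stop once the centers no longer move
            break
        centers = new_centers
    return labels, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.5, (50, 2)) for m in (0.0, 3.0, 6.0)])
labels, centers = kmeans(X, k=3)
print(centers)
```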
[Exercise results] https://github.com/Tomo-Horiuchi/rabbit/blob/master/Part1/%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92/06_k-%E5%B9%B3%E5%9D%87%E6%B3%95/np_kmeans.ipynb
[Discussion] The given data could be classified into three clusters. It was confirmed that the clustering result changes when the value of k is changed.