A computer program is said to learn from experience E (data) with respect to task T (what the application wants to do) and performance measure P if its performance on T, as measured by P, improves with experience E.
Example) Stock price prediction
Task T → predict the next stock price from past stock price data
Performance measure P → the difference between the predicted and actual stock price
Experience E → past stock price data
Regression is the problem of predicting a continuous-valued output from discrete or continuous-valued inputs.
- Predicting with a straight line → linear regression problem
- Predicting with a curve → non-linear regression problem
- Input (each element is called an explanatory variable or feature): m-dimensional vector
- Output (objective variable): scalar value
Explanatory variables: $x = (x_1, x_2, \dots, x_m)^T \in \mathbb{R}^m$
Objective variable: $y \in \mathbb{R}$
Example) House price prediction
Explanatory variables: number of rooms, site area, age
Objective variable: price
- One of the machine learning models for solving regression problems
- Supervised learning
- A model that outputs a linear combination of the input and m-dimensional parameters
Linear combination (inner product of input and parameter)
Parameters: $w = (w_1, w_2, \dots, w_m)^T \in \mathbb{R}^m$
Explanatory variables: $x = (x_1, x_2, \dots, x_m)^T \in \mathbb{R}^m$
Predicted value: $\hat{y}$
Linear combination:
\hat{y} = w^Tx + w_0= \sum_{j=1}^{m} w_jx_j + w_0
The data are divided into training data and validation data in order to measure the model's generalization performance: not merely how well it fits the data used for training, but how well it can predict unknown data.
Mean squared error between the data and the model output:
MSE_{train} = \frac{1}{n_{train}}\sum_{i=1}^{n_{train}}(\hat{y}_i^{(train)}-y_i^{(train)})^2
- Search for the parameters that minimize the mean squared error on the training data
- To minimize the mean squared error, find the point where its gradient becomes 0
Differentiating the MSE with respect to $W$, setting the derivative to 0, and solving gives the regression coefficients $\hat{W}$:
\hat{W} = (X^{(train)T}X^{(train)})^{-1}X^{(train)T}y^{(train)}
Therefore, the predicted value $\hat{y}$ is
\hat{y}=X\hat{W} = X(X^{(train)T}X^{(train)})^{-1}X^{(train)T}y^{(train)}
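As a minimal sketch of this closed-form solution, the following NumPy snippet fits synthetic data with the normal equation; the data and true weights are illustrative assumptions, and in practice `np.linalg.pinv` or `np.linalg.lstsq` is numerically safer than an explicit inverse.

```python
import numpy as np

# A minimal sketch of the normal-equation solution above, on synthetic data.
rng = np.random.default_rng(0)
X_raw = rng.uniform(0, 1, size=(100, 3))            # 100 samples, 3 explanatory variables
y_train = X_raw @ np.array([2.0, -1.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=100)

X_train = np.hstack([np.ones((100, 1)), X_raw])     # leading column of ones estimates w_0
W_hat = np.linalg.inv(X_train.T @ X_train) @ X_train.T @ y_train
y_hat = X_train @ W_hat                              # predicted values

print("estimated [w_0, w_1, w_2, w_3]:", W_hat)     # close to [3.0, 2.0, -1.0, 0.5]
```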
[Exercise results] https://github.com/Tomo-Horiuchi/rabbit/blob/master/Part1/%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92/01_%E7%B7%9A%E5%9E%8B%E5%9B%9E%E5%B8%B0%E3%83%A2%E3%83%87%E3%83%AB/skl_regression.ipynb
Reference for model evaluation metrics: https://funatsu-lab.github.io/open-course-ware/basic-theory/accuracy-index/#r2
The data was split 7:3 into training and test sets, each model was fitted, and the coefficient of determination and mean squared error were obtained.
- Simple linear regression analysis
  Explanatory variable: number of rooms
  Objective variable: price
  MSE Train: 44.983, Test: 40.412
- Multiple regression analysis (2 variables)
  Explanatory variables: number of rooms, crime rate
  Objective variable: price
  MSE Train: 40.586, Test: 34.377
Since the MSE is closer to 0 and $R^2$ is closer to 1, the multiple regression analysis is the more accurate of the two.
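A rough sketch of this 7:3 comparison is given below. It assumes the Boston housing data can be fetched from OpenML (the original notebook loads it from scikit-learn), and the column names `RM` (number of rooms) and `CRIM` (crime rate) are assumptions about that dataset, so the exact numbers will differ from the notebook.

```python
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Hypothetical reproduction: simple regression (rooms) vs. multiple regression (rooms + crime rate).
boston = fetch_openml(name="boston", version=1, as_frame=True)
X, y = boston.data, boston.target.astype(float)

for cols in (["RM"], ["RM", "CRIM"]):
    X_sel = X[cols].astype(float)
    X_tr, X_te, y_tr, y_te = train_test_split(X_sel, y, test_size=0.3, random_state=0)
    model = LinearRegression().fit(X_tr, y_tr)
    print(cols,
          "MSE train:", mean_squared_error(y_tr, model.predict(X_tr)),
          "MSE test:", mean_squared_error(y_te, model.predict(X_te)),
          "R^2 test:", r2_score(y_te, model.predict(X_te)))
```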
Nonlinear regression modeling is required for phenomena that have a complex nonlinear structure.
- Phenomena that can be captured with linear models are limited
- A linear model fitted to non-linear data is not convincing
- A mechanism for capturing non-linear structure is needed
In the case of non-linearity, modeling is performed using a method called the basis expansion method.
- As the regression function, use a linear combination of the parameter vector and known nonlinear functions called basis functions
- The unknown parameters are estimated by the least squares method or maximum likelihood method, as in the linear regression model
- Frequently used basis functions:
  - Polynomial functions
  - Gaussian basis functions
  - Spline functions
y_i=f(x_i)+\epsilon_i\qquad y_i=\omega_0+\sum_{j=1}^{m}\omega_j\phi_j(x_i)+\epsilon_i
The input $x$ is first mapped non-linearly by $\phi$, and a linear combination of the results is then taken, so the model remains linear in the parameters. Here $\phi_j$ are the basis functions.
Polynomial basis function:

\phi_j(x)=x^j
Gaussian basis function:

\phi_j(x)=\exp\Biggl(-\frac{(x-\mu_j)^2}{2h_j}\Biggr)
The basis expansion method can be estimated within the same framework as linear regression.
Explanatory variable:
x_i=(x_{i1},x_{i2},\dots,x_{im})\in \mathbb{R}^m
Non-linear function vector:
\phi(x_i)=(\phi_1(x_i),\phi_2(x_i),\dots,\phi_k(x_i))^T\in\mathbb{R}^k
Design matrix for nonlinear functions:
\Phi^{(train)}=(\phi(x_1),\phi(x_2),\dots,\phi(x_n))^T\in \mathbb{R}^{n\times k}
Maximum Likelihood Prediction:
\hat{y}=\Phi(\Phi^{(train)T}\Phi^{(train)})^{-1}\Phi^{(train)T}y^{(train)}
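As an illustration, here is a minimal NumPy sketch that builds a Gaussian design matrix and reuses the same normal equation; the basis-function centers, width, and toy data are assumptions.

```python
import numpy as np

# Sketch: Gaussian basis expansion followed by ordinary least squares on the design matrix.
def gaussian_basis(x, mu, h):
    return np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * h))

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 50)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=50)        # nonlinear target

mu = np.linspace(0, 1, 9)                                         # basis-function centers
Phi = np.hstack([np.ones((50, 1)), gaussian_basis(x, mu, 0.02)])  # design matrix (n x k)
w_hat = np.linalg.pinv(Phi.T @ Phi) @ Phi.T @ y                   # same normal equation as before
y_hat = Phi @ w_hat                                               # fitted values
```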
Underfitting: a model that does not achieve a sufficiently small error even on the training data.
Countermeasures
- Use a more expressive model
Overfitting: a model whose training error is small but far from its test-set error; the error is small on the training data but large on the validation data.
Countermeasures
- Increase the number of training data
- Suppress expressiveness by removing unnecessary basis functions
- Suppress expressiveness using regularization methods, etc.
Minimize a function to which a regularization term (penalty term) has been added; the value of this term increases with the complexity of the model.
Regularization term: $\gamma R(w)$
There are several types depending on the norm used, and each yields estimators with different properties.
$\gamma$: regularization parameter; it adjusts the smoothness of the model's curve.
Ridge regression uses the L2 norm as the regularization term. It estimates the parameters closer to 0, which is called shrinkage estimation.
Lasso regression uses the L1 norm as the regularization term. It estimates some parameters as exactly 0, which is called sparse estimation.
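A short scikit-learn sketch contrasts the two; the polynomial degree and the alpha values are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import PolynomialFeatures

# Sketch: L2 (Ridge) vs. L1 (Lasso) regularization on degree-9 polynomial features.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, (30, 1))
y = np.sin(2 * np.pi * x).ravel() + rng.normal(scale=0.1, size=30)

X_poly = PolynomialFeatures(degree=9, include_bias=False).fit_transform(x)

ridge = Ridge(alpha=0.1).fit(X_poly, y)                     # shrinks coefficients toward 0
lasso = Lasso(alpha=0.01, max_iter=100000).fit(X_poly, y)   # sets some coefficients exactly to 0

print("Ridge coefficients:", ridge.coef_)
print("Lasso coefficients:", lasso.coef_)                   # note the exact zeros (sparse estimation)
```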
Holdout method: the data are divided into two parts, one for training and one for testing, and the test part is used to estimate prediction accuracy and error rate. When only a small amount of data is available, it has the drawback of not giving a reliable performance evaluation. In a nonlinear regression model based on the basis expansion method, the number, positions, and tuning of the basis functions are determined by the model that minimizes the holdout error.
Cross-validation: the data are divided into training and validation folds, the model is evaluated on each validation fold, and the average accuracy is called the CV value. The evaluation result is more reliable than that of the holdout method.
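For example, with scikit-learn on synthetic data (the model, fold count, and data are assumptions):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Sketch: 5-fold cross-validation; the CV value is the average score over the folds.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

scores = cross_val_score(Ridge(alpha=0.1), X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("fold scores:", scores)
print("CV value (mean):", scores.mean())
```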
The true function is $y = 1 - 48x + 218x^2 - 315x^3 + 145x^4$.
Data are generated by adding random noise to the true function, and the true function is then predicted from these data by linear regression and by nonlinear regression.
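A sketch of this experiment is shown below; using `KernelRidge` for the RBF fit is an assumption about the notebook, and the noise level is illustrative.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import LinearRegression

# Sketch: noisy samples from the true quartic, fitted with a linear model and an
# RBF kernel ridge model (regularization strength alpha = 0.0002, as in the text).
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, (100, 1))
true_y = 1 - 48 * x + 218 * x**2 - 315 * x**3 + 145 * x**4
y = (true_y + rng.normal(scale=0.5, size=(100, 1))).ravel()

linear = LinearRegression().fit(x, y)
nonlinear = KernelRidge(alpha=0.0002, kernel="rbf").fit(x, y)

print("linear R^2:", linear.score(x, y))
print("nonlinear R^2:", nonlinear.score(x, y))
```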
[Exercise results] https://github.com/Tomo-Horiuchi/rabbit/blob/master/Part1/%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92/02_%E9%9D%9E%E7%B7%9A%E5%BD%A2%E5%9B%9E%E5%B8%B0%E3%83%A2%E3%83%87%E3%83%AB/skl_nonlinear%20regression.ipynb
Linear regression accuracy: 0.3901811689763751
Nonlinear regression accuracy: 0.8824933990551088 (RBF basis functions, regularization strength 0.0002)
Linear regression does not represent the true function, while nonlinear regression reproduces it roughly, so nonlinear regression is the more accurate. Without regularization the accuracy rises to about 0.9999, but the figure shows that the model is overfitted and far from the true function.
Since Lasso regression makes the parameters sparse, checking the estimated coefficients shows that they are all 0.
The logistic regression model is for classification.
What is a classification problem? The problem of assigning a class to a given input.
Data handled in classification:
Input: $x = (x_1, x_2, \dots, x_m)^T \in \mathbb{R}^m$ (m-dimensional vector)
Output: $y \in \{0, 1\}$ (value of 0 or 1)
Example) Titanic data, Iris data
If such data are fed as-is into a regression model, the output has no meaning as a probability. The logistic regression model therefore feeds the linear combination of the input and the m-dimensional parameters into the sigmoid function.
A monotonically increasing function that always outputs a value between 0 and 1 for any real-valued input.
Sigmoid function:

\sigma(x) = \frac{1}{1 + \exp(-ax)}
Features:
- Increasing $a$ increases the slope of the curve near $x = 0$
- For extremely large $a$ it approaches the unit step function
- Changing the bias shifts the position of the step
- Its derivative can be expressed in terms of the function itself
- The likelihood function is easy to compute
Differentiating the sigmoid function gives

\frac{\partial \sigma(x)}{\partial x} = a\sigma(x)\bigl(1 - \sigma(x)\bigr)

so the derivative can be expressed using the sigmoid function itself.
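A quick numerical check of this identity (the gain $a$ and the test points are illustrative assumptions):

```python
import numpy as np

# Verify numerically that d/dx sigma(x) = a * sigma(x) * (1 - sigma(x)).
def sigmoid(x, a=1.0):
    return 1.0 / (1.0 + np.exp(-a * x))

x = np.linspace(-5, 5, 11)
a = 2.0
analytic = a * sigmoid(x, a) * (1.0 - sigmoid(x, a))
numeric = (sigmoid(x + 1e-6, a) - sigmoid(x - 1e-6, a)) / 2e-6   # central difference
print(np.allclose(analytic, numeric, atol=1e-6))                  # True
```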
The value we want (the probability that $Y = 1$) can be written as

P(Y = 1 \mid x) = \sigma(w_0 + w^T x)

A data point is classified as $Y = 1$ if this probability is 0.5 or more, and as $Y = 0$ if it is less than 0.5. Maximum likelihood estimation is used to determine the parameters of this formula.
The logistic regression model uses the Bernoulli distribution. For a given distribution, the data it generates change as its parameter changes. Maximum likelihood estimation estimates, from the data, the plausible distribution that would have generated them; that is, it selects the parameters that maximize the likelihood function.
In the likelihood, the data are fixed and the parameters are varied. The probability of $y = y_1$ in a single trial is

P(y_1) = p^{y_1}(1 - p)^{1 - y_1}

and the probability that $y_1, \dots, y_n$ occur together in $n$ trials (with $p$ fixed) is

L(p) = \prod_{i=1}^{n} p^{y_i}(1 - p)^{1 - y_i}

Here the given data $y$ are fixed and $p$ is treated as the variable to be estimated. The estimate is most plausible where the likelihood is maximal, so it suffices to solve the optimization problem for $p$.
- Since the probability $p$ is given by the sigmoid function, the parameters to be estimated are the weight parameters
- Find the plausible parameters that generate the objective variables from the explanatory variables
With the weights $W$ unknown, the likelihood function becomes

L(w) = \prod_{i=1}^{n} p_i^{y_i}(1 - p_i)^{1 - y_i}, \qquad p_i = \sigma(w_0 + w^T x_i)
- Because the likelihood involves many multiplications, taking the logarithm makes differentiation easier
- The point that maximizes the log-likelihood is the same as the point that maximizes the likelihood
- Taking the negative of the log-likelihood turns the problem into a minimization, matching the minimization of the least squares method
Therefore, take the logarithm of the above $L(w)$, multiply it by minus one, and solve the resulting minimization problem:

E(w) = -\log L(w) = -\sum_{i=1}^{n}\Bigl(y_i \log p_i + (1 - y_i)\log(1 - p_i)\Bigr)
In the case of logistic regression, the parameters that minimize this function cannot be obtained analytically: differentiating the log-likelihood with respect to the parameters and setting it to 0 has no closed-form solution, unlike the least squares case. The parameters are therefore updated sequentially by gradient descent, an iterative-learning approach in which the learning rate adjusts how readily the parameters converge. When the parameters stop being updated, the gradient has become 0, meaning the optimum within the range of the iterative learning has been found.

Plain gradient descent, however, has the drawback that all of the input data are needed for a single parameter update, so when the input data become huge, computation time and memory shortage become problems. Stochastic gradient descent, which updates the parameters using randomly chosen samples one at a time, addresses this.
An initial value is given to the parameters, they are updated little by little, and when the updates converge, the parameters at that point are adopted as the optimum. The learning rate $\eta$ represents the "step length" of each parameter update: if it is too small, convergence takes a long time; if it is too large, the optimum is "jumped over" (the point we actually want to find becomes hard to reach).
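A minimal sketch of stochastic gradient descent for logistic regression on synthetic data (the learning rate, number of epochs, and data are assumptions):

```python
import numpy as np

# Sketch: minimize the negative log-likelihood by SGD; the per-sample gradient is (p - y) * x.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(float)  # noisy labels

Xb = np.hstack([np.ones((200, 1)), X])     # bias column
w = np.zeros(3)
eta = 0.1                                  # learning rate ("step length")

for epoch in range(100):
    for i in rng.permutation(len(y)):      # SGD: one sample per parameter update
        p = sigmoid(Xb[i] @ w)
        w -= eta * (p - y[i]) * Xb[i]      # gradient of the NLL for a single sample

p_all = np.clip(sigmoid(Xb @ w), 1e-12, 1 - 1e-12)
nll = -np.sum(y * np.log(p_all) + (1 - y) * np.log(1 - p_all))
print("weights:", w, "negative log-likelihood:", nll)
```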
[Confusion matrix]

| | Validation data: positive | Validation data: negative |
|---|---|---|
| Predicted: positive | True Positive (TP) | False Positive (FP) |
| Predicted: negative | False Negative (FN) | True Negative (TN) |
[Precision, recall, F-value]

Accuracy: the proportion of all predictions that are correct, $(TP + TN)/(TP + FP + FN + TN)$
Recall: of the data that are actually positive, the proportion predicted positive, $TP/(TP + FN)$
Precision: of the data predicted positive, the proportion that are actually positive, $TP/(TP + FP)$
F-value: the harmonic mean of precision and recall
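A tiny sketch computing these metrics directly from confusion-matrix counts (the counts are illustrative):

```python
# Sketch: metrics computed directly from confusion-matrix counts (illustrative numbers).
TP, FP, FN, TN = 50, 10, 5, 35

accuracy  = (TP + TN) / (TP + FP + FN + TN)   # proportion of all predictions that are correct
recall    = TP / (TP + FN)                    # of the actual positives, how many were found
precision = TP / (TP + FP)                    # of the predicted positives, how many were correct
f_value   = 2 * precision * recall / (precision + recall)

print(accuracy, recall, precision, f_value)
```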
[Exercise results] https://github.com/Tomo-Horiuchi/rabbit/blob/master/Part1/%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92/04_%E3%83%AD%E3%82%B8%E3%82%B9%E3%83%86%E3%82%A3%E3%82%AF%E3%82%B9%E5%9B%9E%E5%B8%B0%E3%83%A2%E3%83%87%E3%83%AB/skl_logistic_regression.ipynb
[Discussion] The results show that the higher the fare, the higher the probability of survival, and the higher the passenger class, the higher the probability of survival. Most upper-class women survived.
- Dimensionality reduction
- Fewer variables with little loss of information
- Can be visualized in 2D or 3D
Training data: $x_i = (x_{i1}, x_{i2}, \dots, x_{im}) \in \mathbb{R}^m$
Mean vector: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

The amount of information is regarded as the size of the variance, and we search for the projection axis (coefficient vector $a_j$) that maximizes the variance of the projected data.
The variance after the linear transformation $s_j = \bar{X}a_j$ (where $\bar{X}$ is the mean-centered data matrix) is

\mathrm{Var}(s_j) = a_j^T \mathrm{Var}(\bar{X}) a_j

with the constraint $a_j^T a_j = 1$. Therefore, the objective function is

\arg\max_{a_j} \; a_j^T \mathrm{Var}(\bar{X}) a_j \quad \text{subject to} \quad a_j^T a_j = 1

Here, the following Lagrange function is used to solve the constrained problem:

E(a_j) = a_j^T \mathrm{Var}(\bar{X}) a_j - \lambda\,(a_j^T a_j - 1)

Differentiating and setting the derivative to 0 shows that the projection axes are the eigenvectors of the variance-covariance matrix, and the corresponding eigenvalues are the projected variances.
Contribution rate: how much of the original information is retained after compression; it is the ratio of the variance (eigenvalue) of the k-th principal component to the total variance:

c_k = \frac{\lambda_k}{\sum_{i=1}^{m}\lambda_i}
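For reference, a scikit-learn sketch of the contribution rate; using the breast-cancer dataset here is an assumption about the exercise, so the exact percentages will differ.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Sketch: contribution (explained variance) rate of the first principal components.
X = StandardScaler().fit_transform(load_breast_cancer().data)

pca = PCA(n_components=3).fit(X)
print("contribution rate per component:", pca.explained_variance_ratio_)
print("cumulative, 2 components:", pca.explained_variance_ratio_[:2].sum())
print("cumulative, 3 components:", pca.explained_variance_ratio_.sum())
```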
[Exercise results] https://github.com/Tomo-Horiuchi/rabbit/blob/master/Part1/%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92/03_%E4%B8%BB%E6%88%90%E5%88%86%E5%88%86%E6%9E%90/skl_pca.ipynb
[Discussion] The contribution rate is about 60% when compressing to 2 dimensions and about 70% when compressing to 3 dimensions. Even when compressed to 2 dimensions, the data can still be classified to some extent. Before compression the classification accuracy was 97%, and it drops when the data are compressed to 2 dimensions. Reducing the dimensions made the data easier to visualize and understand.
A machine learning method for classification problems: take the K nearest neighbors of a query point and assign the point to the class to which the majority of them belong. The larger k is, the smoother the decision boundary becomes.
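A minimal NumPy sketch of this procedure (the value of k and the toy data are assumptions):

```python
import numpy as np

# Sketch: k-nearest-neighbor classification by majority vote among the k closest points.
def knn_predict(X_train, y_train, X_query, k=3):
    preds = []
    for q in X_query:
        dists = np.linalg.norm(X_train - q, axis=1)    # distance to every training point
        nearest = y_train[np.argsort(dists)[:k]]       # labels of the k nearest points
        preds.append(np.bincount(nearest).argmax())    # majority vote
    return np.array(preds)

rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y_train = np.array([0] * 50 + [1] * 50)
print(knn_predict(X_train, y_train, np.array([[0.0, 0.0], [3.0, 3.0]]), k=5))  # [0 1]
```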
[Exercise results] https://github.com/Tomo-Horiuchi/rabbit/blob/master/Part1/%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92/05_knn/np_knn.ipynb
[Discussion] It was confirmed that the larger k is, the smoother the decision boundary becomes.
- Unsupervised learning
- Clustering method
- Classifies the given data into k clusters
1. Set the initial center value of each cluster
2. For each data point, calculate the distance to each cluster center and assign the point to the nearest cluster
3. Recalculate the mean vector (center) of each cluster
4. Repeat steps 2 and 3 until convergence (a NumPy sketch follows below)
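A minimal NumPy sketch of these steps (the value of k, the convergence check, and the toy data are assumptions):

```python
import numpy as np

# Sketch of the steps above: initialize centers, assign points, recompute means, repeat.
def kmeans(X, k=3, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]     # 1. initial cluster centers
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                          # 2. assign each point to the nearest center
        new_centers = np.array([X[labels == j].mean(axis=0)    # 3. recompute the mean of each cluster
                                for j in range(k)])
        if np.allclose(new_centers, centers):                  # 4. stop once the centers no longer move
            break
        centers = new_centers
    return labels, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.5, (50, 2)) for m in (0.0, 3.0, 6.0)])
labels, centers = kmeans(X, k=3)
print(centers)
```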
[Exercise results] https://github.com/Tomo-Horiuchi/rabbit/blob/master/Part1/%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92/06_k-%E5%B9%B3%E5%9D%87%E6%B3%95/np_kmeans.ipynb
[Discussion] The given data could be classified into three clusters. It was confirmed that the clustering result changes when the value of k is changed.