Following on from the algorithms covered in "Classification of Machine Learning", this post works step by step through the theory, an implementation in Python, and an analysis with scikit-learn. I'm writing it for my own study, so please forgive any mistakes.
So far we have dealt with two-class classification problems: the simple perceptron, logistic regression, and support vector machines (basic and advanced). Since those were all limited to two classes, let's consider how to extend them to multi-class classification.
At first I planned to cover multi-class logistic regression and multi-class support vector machines together, but the topic turned out to be surprisingly deep, and many sites don't explain it properly, so this article covers only the theoretical background. As usual, the following sites were helpful. Thank you very much.
Let's think in order.
One-vs-Rest (sometimes written One-vs-All), as the name implies, is a method of separating one class from all the rest. As an example, to classify the three classes apple, mandarin, and banana, we build three classifiers: (apple vs. the rest), (mandarin vs. the rest), and (banana vs. the rest).
In practice, near the decision boundaries there are regions where, say, both the apple classifier and the banana classifier fire; in such cases the class is chosen by comparing the output strength (confidence) of each classifier.
Since we only need as many classifiers as there are classes, the computational cost is lower than that of One-vs-One, explained next.
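As a concrete illustration (my own addition, not from the article), here is a minimal One-vs-Rest sketch using scikit-learn's `OneVsRestClassifier`; the iris dataset and logistic regression are arbitrary choices for the base classifier.

```python
# Minimal One-vs-Rest sketch: one binary classifier per class;
# the class whose classifier gives the strongest output wins.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)  # 3 classes

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X, y)

print(len(ovr.estimators_))  # 3 binary classifiers, one per class
print(ovr.predict(X[:5]))
```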
One-vs-One, unlike One-vs-Rest, picks every pair of classes and trains a two-class classifier for each pair. If the number of classes is $n$, $_{n}C_2$ classifiers are required. For example, to classify 10 classes, One-vs-Rest needed only 10 classifiers, whereas One-vs-One needs $_{10}C_2 = 45$. The final class is decided by a majority vote over all the classifiers. When the model uses the kernel method rather than a linear model, this strategy is often the one used.
In fact, according to the scikit-learn documentation, the estimators handle multi-class problems either natively or through one of these strategies, and the strategies can also be applied explicitly via the wrapper classes in `sklearn.multiclass`.
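For instance, the pairwise strategy can be applied with `OneVsOneClassifier`. The sketch below is my own, again on iris, with a kernel SVM chosen arbitrarily as the base estimator (scikit-learn's `SVC` already uses One-vs-One internally, so the wrapper here is purely illustrative).

```python
# Minimal One-vs-One sketch: with 3 classes, 3C2 = 3 pairwise classifiers
# are trained and the prediction is decided by majority vote.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # 3 classes

ovo = OneVsOneClassifier(SVC(kernel="rbf", gamma="scale"))
ovo.fit(X, y)

print(len(ovo.estimators_))  # 3 = 3C2 pairwise classifiers
print(ovo.predict(X[:5]))
```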
Multiclass softmax is the approach often used in neural networks these days. The **softmax function** is used so that the model learns, for each input, which class is the most likely. Before discussing the softmax function, let's first look at One-Hot Encoding.
One-Hot Encoding, to put it simply, refers to **a vector in which exactly one element is 1 and all the others are 0**. For example, suppose we have the following feature column.
fruit |
---|
Apple |
Orange |
Apple |
banana |
We rewrite it as follows.
fruit_Apple | fruit_Orange | fruit_banana |
---|---|---|
1 | 0 | 0 |
0 | 1 | 0 |
1 | 0 | 0 |
0 | 0 | 1 |
This form has the advantage that the data can be split straight into the One-vs-Rest (or One-vs-One) classifications mentioned earlier, and that it feeds naturally into the softmax calculation described below. In practice you can use pandas' get_dummies() function or scikit-learn's OneHotEncoder class.
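Here is a small sketch (my own) showing both routes on the fruit column from the table above.

```python
# One-hot encoding the fruit column with pandas and with scikit-learn.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"fruit": ["Apple", "Orange", "Apple", "banana"]})

# pandas: get_dummies expands the column into one 0/1 column per category
print(pd.get_dummies(df, columns=["fruit"]))

# scikit-learn: OneHotEncoder returns a sparse matrix by default,
# so convert it to a dense array for display
enc = OneHotEncoder()
print(enc.fit_transform(df[["fruit"]]).toarray())
print(enc.categories_)
```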
The softmax function transforms multiple outputs into a probability distribution that sums to 1 (100%). It has the following form:
y_i=\frac{e^{x_{i}}}{\sum_{j=1}^{n}e^{x_{j}}} \\
\sum_{i=1}^{n}y_i=1
The output $y$ is an $n$-dimensional vector, just like the input. In the earlier example, if the model outputs (apple, mandarin, banana) = [3, 8, 1], the softmax output is roughly [0.007, 0.992, 0.001] (about 0.7%, 99.2%, and 0.1%), so mandarin is the most likely class. Incidentally, two-class classification corresponds to $n = 2$ in the formula above:
\begin{align}
y_1&=\frac{e^{x_{1}}}{e^{x_1}+e^{x_2}} \\
&=\frac{\frac{e^{x_{1}}}{e^{x_{1}}}}{\frac{e^{x_{1}}}{e^{x_{1}}}+e^{x_2-x_1}} \\
&=\frac{1}{1+e^{x_2-x_1}}
\end{align}
This is exactly the sigmoid function.
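As a quick numerical check (my own addition), a tiny NumPy implementation of the softmax function reproduces the (apple, mandarin, banana) = [3, 8, 1] example above and confirms that the two-class case reduces to the sigmoid.

```python
import numpy as np

def softmax(x):
    # subtract the max before exponentiating for numerical stability;
    # this does not change the result
    e = np.exp(x - np.max(x))
    return e / e.sum()

# the [3, 8, 1] example: roughly [0.007, 0.992, 0.001], i.e. about 0.7% / 99.2% / 0.1%
print(softmax(np.array([3.0, 8.0, 1.0])))

# with n = 2, the softmax of (x1, x2) equals the sigmoid of x1 - x2
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x1, x2 = 2.0, -1.0
print(softmax(np.array([x1, x2]))[0], sigmoid(x1 - x2))  # both ~0.9526
```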
The derivative of the softmax function is
\dfrac{\partial y_i}{\partial x_j}=
\begin{cases}y_i(1-y_i)&i=j\\-y_iy_j&i\neq j\end{cases}
This result is used when differentiating the loss function later.
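As a sanity check of this formula (again my own addition, not part of the article), the analytic Jacobian can be compared with a finite-difference approximation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([0.5, -1.0, 2.0])
y = softmax(x)

# analytic Jacobian: dy_i/dx_j = y_i(1 - y_i) if i == j, else -y_i * y_j
jac_analytic = np.diag(y) - np.outer(y, y)

# central-difference approximation of the same Jacobian
eps = 1e-6
jac_numeric = np.zeros((3, 3))
for j in range(3):
    d = np.zeros(3)
    d[j] = eps
    jac_numeric[:, j] = (softmax(x + d) - softmax(x - d)) / (2 * eps)

print(np.allclose(jac_analytic, jac_numeric, atol=1e-6))  # True
```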
Let the number of classes be $c$ and the input be $\boldsymbol{x} = (x_0, x_1, \cdots, x_n)$, where $x_0$ is the bias term. Let the parameter matrix $\boldsymbol{W}$ have size $(n+1) \times c$. We then compute
\boldsymbol{z}=\boldsymbol{W}^T\boldsymbol{x}
and optimize $\boldsymbol{W}$ based on this $\boldsymbol{z}$; applying the softmax function to $\boldsymbol{z}$ gives the vector of predicted class probabilities $\varphi$.
To do this, we set up the **cross-entropy error function $E$**, just as in logistic regression. Writing the likelihood function as $l$, it can be expressed as a product over all classes and all samples. Let $\varphi_{ij}$ denote the softmax output for class $j$ of the $i$-th sample, and $t_{ij}$ the corresponding one-hot target. Then
l=\prod_{i=1}^{n}\prod_{j=1}^{c}\varphi_{ij}^{t_{ij}}
We want to maximize this likelihood, but instead we take the logarithm of $l$, multiply by $-1$, and average over the $n$ samples to obtain the cross-entropy error function:
\begin{align}
E&=-\frac{1}{n}\log(l) \\
&=-\frac{1}{n}\log\left(\prod_{i=1}^{n}\prod_{j=1}^{c}\varphi_{ij}^{t_{ij}}\right) \\
&= -\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{c}t_{ij}\log\varphi_{ij}
\end{align}
This is our loss function. Its derivative with respect to the weights is
\begin{align}
\frac{\partial E}{\partial w_{jl}} &= -\frac{1}{n}\sum_{i=1}^{n}(t_{il}-\varphi_{il})x_{ij} \\
\frac{\partial E}{\partial \boldsymbol{W}} &=-\frac{1}{n}\boldsymbol{X}^T(\boldsymbol{T}-\boldsymbol{\Phi})
\end{align}
(The intermediate calculation is omitted. Here $\boldsymbol{X}$ is the $n \times (n+1)$ matrix stacking the input vectors, and $\boldsymbol{T}$ and $\boldsymbol{\Phi}$ are the $n \times c$ matrices of one-hot targets and softmax outputs.)
The parameter $\boldsymbol{W}$ can then be found by minimizing $E$ with the gradient method.
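Putting the pieces together, here is a minimal NumPy sketch of softmax regression trained by batch gradient descent with the gradient derived above. This is my own illustration; the dataset (iris), standardization, learning rate, and iteration count are all arbitrary choices.

```python
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize so plain gradient descent behaves well
n, d = X.shape
c = 3

# add the bias term x_0 = 1, giving an n x (d + 1) design matrix
Xb = np.hstack([np.ones((n, 1)), X])

# one-hot targets T: n x c
T = np.eye(c)[y]

def softmax(Z):
    # row-wise softmax with max-subtraction for numerical stability
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

# W has shape (d + 1) x c, so z = W^T x, or Z = Xb W for the whole batch
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(d + 1, c))
lr = 0.1

for _ in range(1000):
    Phi = softmax(Xb @ W)             # n x c matrix of class probabilities
    grad = -(Xb.T @ (T - Phi)) / n    # (d + 1) x c, the gradient derived above
    W -= lr * grad

pred = np.argmax(Xb @ W, axis=1)      # argmax of z equals argmax of softmax(z)
print("training accuracy:", (pred == y).mean())
```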
We have summarized how to extend two-class classification to multi-class classification. One approach is simply to repeat two-class classification; the other is to use the softmax function to obtain a probability distribution over the classes.
This took a long time, partly because there aren't many pages that summarize these methods in detail (or perhaps I just didn't search well), and partly because the softmax-based classification turned out to be quite involved.
Next time, I'd like to turn this into Python code.