Machine learning algorithms (from two-class classification to multi-class classification)

Introduction

In this series I study, step by step, the theory, a Python implementation, and an analysis with scikit-learn of the algorithms previously taken up in "Classification of Machine Learning". I'm writing this for my own learning, so please overlook any mistakes.

So far we have dealt with two-class classification problems using the simple perceptron, logistic regression, and support vector machines (basic and advanced). However, since these only handled two classes, let's consider extending them to multi-class classification.

At first I planned to cover multi-class logistic regression and multi-class support vector machines together, but the topic turned out to be surprisingly deep and many sites do not explain it properly, so this article covers only the theoretical background. As usual, the following sites were helpful. Thank you very much.

Two-class classification → extension to multi-class classification

Consider classifying an output $y$ into $c$ classes given an input with $N$ kinds of features. For example, classifying (apple | orange | banana), or classifying the digits 0 through 9, that kind of thing. Some algorithms handle multi-class classification by themselves, but for adapting two-class classifiers such as logistic regression and support vector machines to this problem, the typical approaches are One-vs-Rest (All), One-vs-One, and multi-class softmax.

Let's look at them in order.

One-vs-Rest (All)

One-vs-Rest (sometimes written One-vs-All), as the name implies, is a method of classifying one class against all the rest. As an example, to classify the three classes apple, orange, and banana, we create three classifiers, (apple vs. other), (orange vs. other), and (banana vs. other), as shown in the figure below.

(Figure: multicalss_1.png)

In practice there are regions near the classifier boundaries where, say, both the apple and banana classifiers fire; in such cases the output strength (confidence) of each classifier is used to decide which class to assign.

Since only as many classifiers as there are classes need to be trained, the amount of computation is smaller than for One-vs-One, explained next.
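As a quick sketch of One-vs-Rest in practice (using scikit-learn's OneVsRestClassifier and the iris dataset purely as an illustration; neither appears in the original article):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy 3-class dataset (stand-in for apple / orange / banana)
X, y = load_iris(return_X_y=True)

# Wrap a two-class classifier so that one (class vs. rest) model is trained per class
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X, y)

print(len(ovr.estimators_))   # 3 -> one classifier per class
print(ovr.predict(X[:5]))     # predictions use the strongest per-class output
```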

One-vs-One

Unlike One-vs-Rest, One-vs-One picks every pair of two classes and trains a two-class classifier for each pair. If the number of classes is $n$, then $_{n}C_{2}$ classifiers are required. For example, when classifying 10 classes, 10 classifiers were enough for One-vs-Rest, but One-vs-One needs $_{10}C_{2} = 45$ classifiers. The final class is decided by a majority vote over all of them. This approach tends to be used when the model uses the kernel method rather than a linear model.

In fact, according to the scikit-learn documentation, SVC handles multi-class problems with a One-vs-One scheme, while LinearSVC uses One-vs-Rest.
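A corresponding One-vs-One sketch, again using scikit-learn and the iris dataset as an illustrative assumption:

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# One classifier per pair of classes: 3C2 = 3 here, 10C2 = 45 for ten classes.
# (SVC already does this internally; the wrapper just makes the scheme explicit.)
ovo = OneVsOneClassifier(SVC(kernel="rbf"))
ovo.fit(X, y)

print(len(ovo.estimators_))   # 3 pairwise classifiers
print(ovo.predict(X[:5]))     # final label decided by majority vote
```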

Multi-class softmax

Multi-class softmax is often used in neural networks these days. The **softmax function** is used so that the model learns which class is the most likely for a given input. Before we talk about the softmax function, let's first talk about One-Hot Encoding.

One-Hot Encoding

To put it simply, One-Hot Encoding represents a categorical value as **a vector in which exactly one element is 1 and all the others are 0**. As an example, suppose a certain feature looks like this:

| fruit |
| --- |
| Apple |
| Orange |
| Apple |
| banana |

We rewrite this as follows:

| fruit_Apple | fruit_Orange | fruit_banana |
| --- | --- | --- |
| 1 | 0 | 0 |
| 0 | 1 | 0 |
| 1 | 0 | 0 |
| 0 | 0 | 1 |

This form has the advantage that it splits naturally into the One-vs-Rest (or One-vs-One) classifications mentioned earlier, and that it feeds directly into the softmax calculation described below. One-hot encoding can be produced with pandas' get_dummies() function or scikit-learn's OneHotEncoder class.
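For instance, a minimal pandas sketch reproducing the table above (the column values and names are only illustrative):

```python
import pandas as pd

# The toy "fruit" column from the table above
df = pd.DataFrame({"fruit": ["Apple", "Orange", "Apple", "banana"]})

# pandas creates one indicator column per category:
# fruit_Apple, fruit_Orange, fruit_banana
one_hot = pd.get_dummies(df["fruit"], prefix="fruit")
print(one_hot)
```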

Softmax function

The softmax function transforms multiple outputs into a probability distribution that sums to 1 (100%). It has the following form:

y_i=\frac{e^{x_{i}}}{\sum_{j=1}^{n}e^{x_{j}}} \\
\sum_{i=1}^{n}y_i=1

$y$ is an $n$-dimensional vector, the same size as the output. In the earlier example, if the model outputs (apple, orange, banana) = [3, 8, 1], the softmax output is approximately [0.007, 0.992, 0.001] (0.7%, 99.2%, 0.1%), so orange is the most likely class. Incidentally, two-class classification corresponds to $n = 2$ in the above formula:

\begin{align}
y_1&=\frac{e^{x_{1}}}{e^{x_1}+e^{x_2}} \\
&=\frac{\frac{e^{x_{1}}}{e^{x_{1}}}}{\frac{e^{x_{1}}}{e^{x_{1}}}+e^{x_2-x_1}} \\
&=\frac{1}{1+e^{x_2-x_1}}
\end{align}

This is exactly the sigmoid function (applied to $x_1 - x_2$).
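A tiny NumPy check of the numbers quoted above (this snippet is only a sanity check, not part of the article's implementation):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))   # shift by the max for numerical stability
    return e / e.sum()

scores = np.array([3.0, 8.0, 1.0])   # (apple, orange, banana) from the text
print(softmax(scores))               # ≈ [0.0067, 0.9924, 0.0009]
print(softmax(scores).sum())         # 1.0
```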

Derivative of the softmax function

The derivative of the softmax function is

\dfrac{\partial y_i}{\partial x_j}= 
\begin{cases}y_i(1-y_i)&i=j\\-y_iy_j&i\neq j\end{cases}

is.
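For reference, both cases follow from the quotient rule; writing $S=\sum_{k=1}^{n}e^{x_k}$ so that $y_i = e^{x_i}/S$,

\begin{align}
\frac{\partial y_i}{\partial x_i} &= \frac{e^{x_i}S - e^{x_i}e^{x_i}}{S^2} = y_i - y_i^2 = y_i(1-y_i) \\
\frac{\partial y_i}{\partial x_j} &= \frac{-e^{x_i}e^{x_j}}{S^2} = -y_iy_j \qquad (i\neq j)
\end{align}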

Multiclass classification by softmax function

Let the number of classes be $c$ and the input be $\boldsymbol{x}=(x_0, x_1, \cdots, x_n)$ (where $x_0$ is the bias term). Let the parameter be $\boldsymbol{W}$, a matrix of size $(n+1) \times c$.

\boldsymbol{z}=\boldsymbol{W}^T\boldsymbol{x}

We compute $\boldsymbol{z}$ as above, pass it through the softmax function to obtain class probabilities, and optimize $\boldsymbol{W}$.

To do this, we derive the **cross-entropy error function $E$**, just as for logistic regression. Writing the likelihood function as $l$, $l$ can be expressed as a product over all classes and all samples. Let $\varphi_{ij}$ be the softmax output of class $j$ for the $i$-th sample and $t_{ij}$ the corresponding element of the one-hot target; then

l=\prod_{i=1}^{n}\prod_{j=1}^{c}\varphi_{ij}^{t_{ij}}

We want to maximize this likelihood, but instead we take the logarithm of $l$, multiply it by $-1$, and divide by the number of samples $n$ to obtain the cross-entropy error function.

\begin{align}
E&=-\frac{1}{n}\log(l) \\
&=-\frac{1}{n}\log\left(\prod_{i=1}^{n}\prod_{j=1}^{c}\varphi_{ij}^{t_{ij}}\right) \\
&= -\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{c}t_{ij}\log\varphi_{ij}
\end{align}

This is the loss function. Differentiating it with respect to the weight $w_{jl}$, and then writing the result in matrix form, gives

\begin{align}
\frac{\partial E}{\partial w_{jl}} &= -\frac{1}{n}\sum_{i=1}^n(t_{il}-\varphi_{il})x_{ij} \\
\frac{\partial E}{\partial \boldsymbol{W}} &=-\frac{1}{n}\boldsymbol{X}^T(\boldsymbol{T}-\boldsymbol{\Phi})
\end{align}

(The intermediate calculation is omitted.) Here $\boldsymbol{X}$, $\boldsymbol{T}$, and $\boldsymbol{\Phi}$ are the matrices whose rows are the samples' inputs, one-hot targets, and softmax outputs, respectively.

The parameter $\boldsymbol{W}$ can then be found by minimizing $E$ with the gradient method.
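As a rough sketch of what this gradient-descent update looks like in NumPy (assuming $\boldsymbol{X}$ already contains a bias column of ones and $\boldsymbol{T}$ is the one-hot target matrix; the proper implementation is left for the next article):

```python
import numpy as np

def softmax(z):
    # subtract the row-wise max for numerical stability
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_softmax_regression(X, T, lr=0.1, epochs=1000):
    """Batch gradient descent for multi-class softmax regression.

    X : (n, n+1) design matrix, first column assumed to be the bias (all ones)
    T : (n, c) one-hot target matrix
    """
    n, d = X.shape
    c = T.shape[1]
    W = np.zeros((d, c))
    for _ in range(epochs):
        Phi = softmax(X @ W)                 # z = XW, phi = softmax(z)
        grad = -(X.T @ (T - Phi)) / n        # dE/dW from the derivation above
        W -= lr * grad
    return W

# Prediction: pick the class with the highest softmax probability, e.g.
# y_pred = np.argmax(softmax(X @ W), axis=1)
```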

Summary

We have summarized how to extend two-class classification to multi-class classification. One approach is simply to repeat two-class classification (One-vs-Rest and One-vs-One); the other is to use the softmax function to obtain a probability distribution over the classes.

This took a while because there were few pages that summarize these methods in detail (or perhaps I just didn't search well), and the classification using the softmax function was fairly involved.

From next time, I'd like to turn this into Python code.
