The input data is classified into two classes using the linear discriminant function below.
f(x)=g(w^{T}x+\theta)\\
Here, if we absorb the bias into the input by setting x_0 = \theta, this becomes\\
f(x)=g\left(\sum _{i=0}^{d}w_ix_i\right)
Non-linear activation function: g(a) = \left\{
\begin{array}{ll}
+1 & (a \geq 0) \\
-1 & (a \lt 0)
\end{array}
\right.
Weight: w=(w_0,w_1,...,w_{d}) \\
Input data: x=(x_0=\theta,x_1,x_2,...,x_{d}) \\
d: number of dimensions \\
Bias: \theta (an arbitrary constant)
f(x) \geq 0 \Rightarrow x\in C_{1} \\
f(x) < 0 \Rightarrow x\in C_{2} \\
C_1: class 1,\quad C_2: class 2
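For example (with values chosen arbitrarily, not taken from any dataset), let d = 2, \theta = 1, and w = (w_0, w_1, w_2) = (1, 0.5, 0.8). For the input (x_1, x_2) = (1, 0), with x_0 = \theta:
a = w_0x_0 + w_1x_1 + w_2x_2 = 1 \cdot 1 + 0.5 \cdot 1 + 0.8 \cdot 0 = 1.5
Since a \geq 0, we get g(a) = +1 and this input is assigned to C_1; an input giving a < 0 would be assigned to C_2.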
Of the above, the process of determining the weight parameter w is called "learning". The bias θ can be left at its default value, but it is usually adjusted (tuned) manually for better results. Values that are adjusted manually like this bias are generally called hyperparameters (or tuning parameters), and adjusting them is sometimes called parameter tuning.
~ Example usage ~ "If you tune the hyperparameters of the simple perceptron, the results improve somewhat."
In addition, there are various tuning methods, such as simple manual trial and error, grid search, and random sampling (random search); a sketch of the latter two follows.
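As a minimal, self-contained sketch of grid search and random search (the train_and_evaluate function here is a hypothetical stand-in, not something from this article's perceptron code):

import random

def train_and_evaluate(mu, theta):
    # Hypothetical stand-in: a real version would train the perceptron
    # with this learning rate mu and bias theta and return its accuracy.
    return random.random()

best = None
# Grid search: try every combination of candidate values
for mu in [0.01, 0.1, 0.3, 1.0]:
    for theta in [0.5, 1.0, 2.0]:
        score = train_and_evaluate(mu, theta)
        if best is None or score > best[0]:
            best = (score, mu, theta)

# Random search: sample hyperparameter values at random instead
for _ in range(10):
    mu = random.uniform(0.01, 1.0)
    theta = random.uniform(0.1, 2.0)
    score = train_and_evaluate(mu, theta)
    if score > best[0]:
        best = (score, mu, theta)

print(best)  # (best score, best mu, best theta)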
The simple perceptron can be used for linearly separable two-class problems. Roughly speaking, a linearly separable problem is one in which the set of class 1 points and the set of class 2 points can be separated by a single straight line in the two-dimensional case. Generalizing this, when the two sets can be separated by an (n-1)-dimensional hyperplane in n-dimensional space, the problem is likewise called linearly separable. The figure below shows an example of a linearly separable case and a case that is not linearly separable.
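For instance, the AND function used later in this article is linearly separable: the line
x_1 + x_2 - 1.5 = 0
puts (1,1) on its positive side and (0,0), (0,1), (1,0) on its negative side. XOR, by contrast, is the classic non-separable case: no single line can separate (0,1) and (1,0) from (0,0) and (1,1).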
Spam email detection is often cited as a concrete application of supervised learning algorithms such as the perceptron. Roughly speaking, spam is detected by the following procedure: each email is converted into a feature vector, the perceptron learns the weights from labeled examples, and new emails are then classified with the learned weights.
The feature vector must quantify qualitative data while preserving the characteristics of the data. This step can change the accuracy dramatically, so it is a very important task. A concrete example of how to create a feature vector is given below.
[Example: creating a feature vector] Suppose we have the following data and want to determine from the subject alone whether an email is spam.
- Spam subject: "My boyfriend died fighting a giant anteater"
- Regular email subject: "Request to change the meeting time"
A technique often used in the field of natural language processing is morphological analysis. Well-known tools here include MeCab (morphological analysis) and CaboCha (dependency parsing). Morphological analysis divides an input sentence into morphemes, the smallest units that carry meaning in the language. Applying it gives the following. (The analysis below was done by hand by someone not confident in Japanese, so accuracy is not guaranteed.)
- Spam subject: boyfriend / but / giant anteater / when / fight / dead / done
- Regular email subject: meeting / time / change / of / request
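As a minimal sketch, this tokenization could be done with MeCab from Python as follows (this assumes the mecab-python3 package and a MeCab dictionary are installed; the Japanese sample sentence is only an illustrative guess at the original subject line):

import MeCab  # assumes the mecab-python3 package and a MeCab dictionary are installed

# "-Owakati" makes MeCab output the sentence as space-separated morphemes
tagger = MeCab.Tagger("-Owakati")

# Illustrative input: a guess at the original Japanese spam subject
subject = "彼氏がオオアリクイと戦って死んだ"
morphemes = tagger.parse(subject).split()
print(morphemes)  # e.g. ['彼氏', 'が', 'オオアリクイ', 'と', '戦っ', 'て', '死ん', 'だ']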
Next, a morpheme dictionary is created using the above data. As a result, the dictionary looks like this:
dictionary = \{boyfriend, but, giant anteater, when, fight, dead, done, meeting, time, change, of, request\}
Finally, creating the feature vectors with this dictionary gives the following.
Spam email feature vector: x_1=\{1,1,1,1,1,1,1,0,0,0,0,0\}, teacher label t_1=-1 \\
Normal email feature vector: x_2=\{0,0,0,0,0,0,0,1,1,1,1,1\}, teacher label t_2=+1
The feature vector stores, for each word in the dictionary, whether that word appears in the subject (1: yes, 0: no). This method is called Bag-of-Words. In some cases the number of times a word appears is stored instead of its mere presence or absence. As an extension of this, there is also a weighting scheme called tf-idf, which reduces the importance of words that appear in many documents and increases the importance of words that appear only in specific documents. The values of the teacher labels can be chosen freely; in general, 0 and 1 are used with probabilistic models, while -1 and +1, which match activation functions like the perceptron's, are often used, as here.
The above is the procedure for creating the feature vector.
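A minimal sketch of this Bag-of-Words procedure in Python (the token lists below stand in for real morphological-analysis output):

# Morpheme lists for each subject (stand-ins for real tokenizer output)
spam_tokens = ["boyfriend", "but", "giant anteater", "when", "fight", "dead", "done"]
normal_tokens = ["meeting", "time", "change", "of", "request"]

# Build the dictionary: every distinct morpheme, in order of first appearance
dictionary = []
for token in spam_tokens + normal_tokens:
    if token not in dictionary:
        dictionary.append(token)

# Bag-of-Words: 1 if the dictionary word appears in the subject, 0 otherwise
def bag_of_words(tokens):
    return [1 if word in tokens else 0 for word in dictionary]

x1 = bag_of_words(spam_tokens)    # [1,1,1,1,1,1,1,0,0,0,0,0]
x2 = bag_of_words(normal_tokens)  # [0,0,0,0,0,0,0,1,1,1,1,1]
t = [-1, +1]                      # teacher labels: spam = -1, normal = +1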
The reason the perceptron works on linearly separable problems can be explained by the perceptron convergence theorem. See the following for a proof: http://ocw.nagoya-u.jp/files/253/haifu%2804-4%29.pdf
The idea behind the learning method is simple: whenever a data point is misclassified, update the value of w using the formula described below, and keep updating until nothing is misclassified. The series of training data is defined as follows.
X=\{x_1,x_2,...,x_M\} \\
M: number of data points
Here, the set of misclassified training data is expressed as follows.
X_n=\{x_1,x_2,...,x_n\}
We want to find a weight w such that f(x) > 0 for class 1 data and f(x) < 0 for class 2 data. Using the teacher labels t \in \{+1, -1\}, this means that all correctly classified data satisfy the following.
w^Tx_nt_n>0
We can therefore consider the following error function E(w), which assigns an error of 0 to correctly classified data and penalizes only the misclassified data.
E(w) =-\sum_{n \in X_n} w^Tx_nt_n
If we choose the value of w so as to minimize this (drive it to 0), the data is classified correctly. Stochastic gradient descent is used as the minimization method. In stochastic gradient descent, when the error function is a sum over data points, as E(w) is here, w is updated by the following calculation each time a data point n is given.
w^{(r+1)}=w^{(r)}-\mu \nabla E_n \\
r: number of iterations,\ \mu: learning-rate parameter \\
w^{(r)}: the weight w after r updates
From the above, the update formula can be derived as follows.
w^{(r+1)}=w^{(r)}-\mu \nabla E_n(w) \\
=w^{(r)}+\mu x_nt_n
For a given w, the contribution of a misclassified data point to the error E(w) is a linear function of w, regardless of whether its label t is +1 or -1, while within the region of w-space where a data point is correctly classified, its contribution to E(w) is 0. E(w) is therefore a piecewise linear function, and its gradient for a single misclassified point n is computed as follows.
\nabla E_n(w) = \frac{\partial}{\partial w}\left(-w^Tx_nt_n\right) \\
= -x_nt_n
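As a concrete check (using the same values as the code below: \mu = 0.3, initial w = (1.0, 0.5, 0.8), bias component fixed at 1), take the AND input (0, 0) with teacher label t = -1, i.e. x = (0, 0, 1):
w^Tx = 1.0 \cdot 0 + 0.5 \cdot 0 + 0.8 \cdot 1 = 0.8 \geq 0
so the output is +1 while t = -1, a misclassification. The update then gives
w^{(1)} = w^{(0)} + \mu x t = (1.0, 0.5, 0.8) + 0.3 \cdot (0, 0, 1) \cdot (-1) = (1.0, 0.5, 0.5)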
The code below solves the AND operation with a perceptron; it is neither well designed nor tidy, and it plots the state of learning as it goes. It probably ought to be organized into a class or methods, but please bear with it; I will rewrite it when I feel like it. To run it, you need the numpy and matplotlib libraries installed.
# coding:utf-8
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.animation as animation

# Activation function g(a): returns +1 if w.x >= 0, otherwise -1
def predict(w, x):
    out = np.dot(w, x)
    if out >= 0:
        o = 1.0
    else:
        o = -1.0
    return o

# Plot the data points and animate the decision boundary after each update
def plot(wvec, x1, x2):
    x_fig = np.arange(-2, 5, 0.1)
    fig = plt.figure(figsize=(8, 8), dpi=100)
    ims = []
    plt.xlim(-1, 2.5)
    plt.ylim(-1, 2.5)
    # One frame per weight vector: the line w[0]*x + w[1]*y + w[2] = 0
    for w in wvec:
        y_fig = [-(w[0] / w[1]) * xi - (w[2] / w[1]) for xi in x_fig]
        plt.scatter(x1[:, 0], x1[:, 1], marker='o', color='g', s=100)
        plt.scatter(x2[:, 0], x2[:, 1], marker='s', color='b', s=100)
        ims.append(plt.plot(x_fig, y_fig, "r"))
    # Repeat the final boundary for a few frames so it stays visible
    for i in range(10):
        ims.append(plt.plot(x_fig, y_fig, "r"))
    ani = animation.ArtistAnimation(fig, ims, interval=1000)
    plt.show()

if __name__ == '__main__':
    wvec = [np.array([1.0, 0.5, 0.8])]  # initial weight vector (arbitrary values)
    mu = 0.3   # learning rate
    sita = 1   # bias component theta
    # AND function data (the last column is the bias component: 1)
    x1 = np.array([[0, 0], [0, 1], [1, 0]])  # class 1 (AND output is 0)
    x2 = np.array([[1, 1]])                  # class 2 (AND output is 1)
    bias = np.array([sita for i in range(len(x1))])
    x1 = np.c_[x1, bias]  # append the bias component to the class 1 data
    bias = np.array([sita for i in range(len(x2))])
    x2 = np.c_[x2, bias]  # append the bias component to the class 2 data
    class_x = np.r_[x1, x2]  # concatenate the two classes
    t = [-1, -1, -1, 1]  # teacher labels for the AND function
    # o: list of outputs for the current pass over the data
    o = []
    while t != o:
        o = []  # reset the outputs for this pass
        # learning phase: one pass over all data points
        for i in range(class_x.shape[0]):
            out = predict(wvec[-1], class_x[i, :])
            o.append(out)
            if t[i] * out < 0:  # output and teacher label disagree
                wvectmp = mu * class_x[i, :] * t[i]  # update amount for w
                wvec.append(wvec[-1] + wvectmp)      # update the weight
    plot(wvec, x1, x2)
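When run, this opens an animation in which the red decision boundary is redrawn after each weight update until it separates the class 2 point (1,1) from the three class 1 points; the last boundary is repeated for several frames so the final result stays visible.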
Related topics:
- SVM
- Multilayer perceptron
References:
- Pattern Recognition and Machine Learning
- First Pattern Recognition
- http://home.hiroshima-u.ac.jp/tkurita/lecture/prnn/node20.html
- https://speakerdeck.com/kirikisinya/xin-zhe-renaiprmlmian-qiang-hui-at-ban-zang-men-number-2
- http://tjo.hatenablog.com/entry/2013/05/01/190247
- http://nlp.dse.ibaraki.ac.jp/~shinnou/zemi1ppt/iwasaki.pdf
- http://ocw.nagoya-u.jp/files/253/haifu%2804-4%29.pdf