The input data is classified into two classes using the linear discriminant function below.
f(x)=g(w^{T}x+\theta)\\
Here, if we absorb the bias into the input by setting x_0 = \theta, this becomes\\
f(x)=g\left(\sum _{i=0}^{d}w_ix_i\right)
Non-linear activation function: g(a) = \left\{
\begin{array}{ll}
+1 & (a \geq 0) \\
-1 & (a \lt 0)
\end{array}
\right.
Weight: w=(w_0,w_1,...,w_{d}) \\
Input data: x=(x_0=\theta,x_1,x_2,...,x_{d}) \\
d: number of dimensions \\
Bias: \theta (an arbitrary constant)
f(x) \geq 0 \Rightarrow x\in C_{1} \\
f(x) < 0 \Rightarrow x\in C_{2} \\
C_1: class 1,\quad C_2: class 2
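For example (with values chosen arbitrarily, not taken from any dataset), let d = 2, \theta = 1, and w = (w_0, w_1, w_2) = (1, 0.5, 0.8). For the input (x_1, x_2) = (1, 0), with x_0 = \theta:
a = w_0x_0 + w_1x_1 + w_2x_2 = 1 \cdot 1 + 0.5 \cdot 1 + 0.8 \cdot 0 = 1.5
Since a \geq 0, we get g(a) = +1 and this input is assigned to C_1; an input giving a < 0 would be assigned to C_2.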
Of the above, the process of determining the weight parameter w is called "learning". The bias θ can be left at its default value, but it is usually adjusted (tuned) manually for better results. Values that are adjusted manually like this bias are generally called hyperparameters (or tuning parameters), and adjusting them is sometimes called parameter tuning.
~ Example usage ~ "If you tune the hyperparameters of the simple perceptron, the results improve somewhat."
In addition, there are various tuning methods, such as simple manual trial and error, grid search, and random sampling (random search); a sketch of the latter two follows.
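As a minimal, self-contained sketch of grid search and random search (the train_and_evaluate function here is a hypothetical stand-in, not something from this article's perceptron code):

import random

def train_and_evaluate(mu, theta):
    # Hypothetical stand-in: a real version would train the perceptron
    # with this learning rate mu and bias theta and return its accuracy.
    return random.random()

best = None
# Grid search: try every combination of candidate values
for mu in [0.01, 0.1, 0.3, 1.0]:
    for theta in [0.5, 1.0, 2.0]:
        score = train_and_evaluate(mu, theta)
        if best is None or score > best[0]:
            best = (score, mu, theta)

# Random search: sample hyperparameter values at random instead
for _ in range(10):
    mu = random.uniform(0.01, 1.0)
    theta = random.uniform(0.1, 2.0)
    score = train_and_evaluate(mu, theta)
    if score > best[0]:
        best = (score, mu, theta)

print(best)  # (best score, best mu, best theta)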
The simple perceptron can be used for linearly separable two-class problems. Roughly speaking, a linearly separable problem is one in which the set of class 1 points and the set of class 2 points can be separated by a single straight line in the two-dimensional case. Generalizing this, when the two sets can be separated by an (n-1)-dimensional hyperplane in n-dimensional space, the problem is likewise called linearly separable. The figure below shows an example of a linearly separable case and a case that is not linearly separable.
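For instance, the AND function used later in this article is linearly separable: the line
x_1 + x_2 - 1.5 = 0
puts (1,1) on its positive side and (0,0), (0,1), (1,0) on its negative side. XOR, by contrast, is the classic non-separable case: no single line can separate (0,1) and (1,0) from (0,0) and (1,1).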
Spam email detection is often cited as a concrete application of supervised learning algorithms such as the perceptron. Roughly speaking, spam is detected by the following procedure: each email is converted into a feature vector, the perceptron learns the weights from labeled examples, and new emails are then classified with the learned weights.
The feature vector must quantify qualitative data while preserving the characteristics of the data. This step can change the accuracy dramatically, so it is a very important task. A concrete example of how to create a feature vector is given below.
[Example: creating a feature vector] Suppose we have the following data and want to determine from the subject alone whether an email is spam.
- Spam subject: "My boyfriend died fighting a giant anteater"
- Regular email subject: "Request to change the meeting time"
A technique often used in the field of natural language processing is morphological analysis. Well-known tools here include MeCab (morphological analysis) and CaboCha (dependency parsing). Morphological analysis divides an input sentence into morphemes, the smallest units that carry meaning in the language. Applying it gives the following. (The analysis below was done by hand by someone not confident in Japanese, so accuracy is not guaranteed.)
- Spam subject: boyfriend / but / giant anteater / when / fight / dead / done
- Regular email subject: meeting / time / change / of / request
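As a minimal sketch, this tokenization could be done with MeCab from Python as follows (this assumes the mecab-python3 package and a MeCab dictionary are installed; the Japanese sample sentence is only an illustrative guess at the original subject line):

import MeCab  # assumes the mecab-python3 package and a MeCab dictionary are installed

# "-Owakati" makes MeCab output the sentence as space-separated morphemes
tagger = MeCab.Tagger("-Owakati")

# Illustrative input: a guess at the original Japanese spam subject
subject = "彼氏がオオアリクイと戦って死んだ"
morphemes = tagger.parse(subject).split()
print(morphemes)  # e.g. ['彼氏', 'が', 'オオアリクイ', 'と', '戦っ', 'て', '死ん', 'だ']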
Next, a morpheme dictionary is created using the above data. As a result, the dictionary looks like this:
dictionary = \{boyfriend, but, giant anteater, when, fight, dead, done, meeting, time, change, of, request\}
Finally, creating the feature vectors with this dictionary gives the following.
Spam email feature vector: x_1=\{1,1,1,1,1,1,1,0,0,0,0,0\}, teacher label t_1=-1 \\
Normal email feature vector: x_2=\{0,0,0,0,0,0,0,1,1,1,1,1\}, teacher label t_2=+1
The feature vector stores, for each word in the dictionary, whether that word appears in the subject (1: yes, 0: no). This method is called Bag-of-Words. In some cases the number of times a word appears is stored instead of its mere presence or absence. As an extension of this, there is also a weighting scheme called tf-idf, which reduces the importance of words that appear in many documents and increases the importance of words that appear only in specific documents. The values of the teacher labels can be chosen freely; in general, 0 and 1 are used with probabilistic models, while -1 and +1, which match activation functions like the perceptron's, are often used, as here.
The above is the procedure for creating the feature vector.
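A minimal sketch of this Bag-of-Words procedure in Python (the token lists below stand in for real morphological-analysis output):

# Morpheme lists for each subject (stand-ins for real tokenizer output)
spam_tokens = ["boyfriend", "but", "giant anteater", "when", "fight", "dead", "done"]
normal_tokens = ["meeting", "time", "change", "of", "request"]

# Build the dictionary: every distinct morpheme, in order of first appearance
dictionary = []
for token in spam_tokens + normal_tokens:
    if token not in dictionary:
        dictionary.append(token)

# Bag-of-Words: 1 if the dictionary word appears in the subject, 0 otherwise
def bag_of_words(tokens):
    return [1 if word in tokens else 0 for word in dictionary]

x1 = bag_of_words(spam_tokens)    # [1,1,1,1,1,1,1,0,0,0,0,0]
x2 = bag_of_words(normal_tokens)  # [0,0,0,0,0,0,0,1,1,1,1,1]
t = [-1, +1]                      # teacher labels: spam = -1, normal = +1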
The reason the perceptron works on linearly separable problems can be explained by the perceptron convergence theorem. See the following for a proof: http://ocw.nagoya-u.jp/files/253/haifu%2804-4%29.pdf
The idea behind the learning method is simple: whenever a data point is misclassified, update the value of w using the formula described below, and keep updating until nothing is misclassified. The series of training data is defined as follows.
X=\{x_1,x_2,...,x_M\} \\
M: number of data points
Here, the set of misclassified training data is expressed as follows.
X_n=\{x_1,x_2,...,x_n\}
We want to find a weight w such that f(x) > 0 for class 1 data and f(x) < 0 for class 2 data. Using the teacher labels t \in \{+1, -1\}, this means that all correctly classified data satisfy the following.
w^Tx_nt_n>0
We can therefore consider the following error function E(w), which assigns an error of 0 to correctly classified data and penalizes only the misclassified data.
E(w) =-\sum_{n \in X_n} w^Tx_nt_n
If we choose the value of w so as to minimize this (drive it to 0), the data is classified correctly. Stochastic gradient descent is used as the minimization method. In stochastic gradient descent, when the error function is a sum over data points, as E(w) is here, w is updated by the following calculation each time a data point n is given.
w^{(r+1)}=w^{(r)}-\mu \nabla E_n \\
r: number of iterations,\ \mu: learning-rate parameter \\
w^{(r)}: the weight w after r updates
From the above, the update formula can be derived as follows.
w^{(r+1)}=w^{(r)}-\mu \nabla E_n(w) \\
=w^{(r)}+\mu x_nt_n
For a given w, the contribution of a misclassified data point to the error E(w) is a linear function of w, regardless of whether its label t is +1 or -1, while within the region of w-space where a data point is correctly classified, its contribution to E(w) is 0. E(w) is therefore a piecewise linear function, and its gradient for a single misclassified point n is computed as follows.
\nabla E_n(w) = \frac{\partial}{\partial w}\left(-w^Tx_nt_n\right) \\
= -x_nt_n
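As a concrete check (using the same values as the code below: \mu = 0.3, initial w = (1.0, 0.5, 0.8), bias component fixed at 1), take the AND input (0, 0) with teacher label t = -1, i.e. x = (0, 0, 1):
w^Tx = 1.0 \cdot 0 + 0.5 \cdot 0 + 0.8 \cdot 1 = 0.8 \geq 0
so the output is +1 while t = -1, a misclassification. The update then gives
w^{(1)} = w^{(0)} + \mu x t = (1.0, 0.5, 0.8) + 0.3 \cdot (0, 0, 1) \cdot (-1) = (1.0, 0.5, 0.5)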
The code below solves the AND operation with a perceptron; it is neither well designed nor tidy, and it plots the state of learning as it goes. It probably ought to be organized into a class or methods, but please bear with it; I will rewrite it when I feel like it. To run it, you need the numpy and matplotlib libraries installed.
# coding:utf-8
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.animation as animation

# Activation function g(a): returns +1 if w.x >= 0, otherwise -1
def predict(w, x):
    out = np.dot(w, x)
    if out >= 0:
        o = 1.0
    else:
        o = -1.0
    return o

# Plot the data points and animate the decision boundary after each update
def plot(wvec, x1, x2):
    x_fig = np.arange(-2, 5, 0.1)
    fig = plt.figure(figsize=(8, 8), dpi=100)
    ims = []
    plt.xlim(-1, 2.5)
    plt.ylim(-1, 2.5)
    # One frame per weight vector: the line w[0]*x + w[1]*y + w[2] = 0
    for w in wvec:
        y_fig = [-(w[0] / w[1]) * xi - (w[2] / w[1]) for xi in x_fig]
        plt.scatter(x1[:, 0], x1[:, 1], marker='o', color='g', s=100)
        plt.scatter(x2[:, 0], x2[:, 1], marker='s', color='b', s=100)
        ims.append(plt.plot(x_fig, y_fig, "r"))
    # Repeat the final boundary for a few frames so it stays visible
    for i in range(10):
        ims.append(plt.plot(x_fig, y_fig, "r"))
    ani = animation.ArtistAnimation(fig, ims, interval=1000)
    plt.show()

if __name__ == '__main__':
    wvec = [np.array([1.0, 0.5, 0.8])]  # initial weight vector (arbitrary values)
    mu = 0.3   # learning rate
    sita = 1   # bias component theta
    # AND function data (the last column is the bias component: 1)
    x1 = np.array([[0, 0], [0, 1], [1, 0]])  # class 1 (AND output is 0)
    x2 = np.array([[1, 1]])                  # class 2 (AND output is 1)
    bias = np.array([sita for i in range(len(x1))])
    x1 = np.c_[x1, bias]  # append the bias component to the class 1 data
    bias = np.array([sita for i in range(len(x2))])
    x2 = np.c_[x2, bias]  # append the bias component to the class 2 data
    class_x = np.r_[x1, x2]  # concatenate the two classes
    t = [-1, -1, -1, 1]  # teacher labels for the AND function
    # o: list of outputs for the current pass over the data
    o = []
    while t != o:
        o = []  # reset the outputs for this pass
        # learning phase: one pass over all data points
        for i in range(class_x.shape[0]):
            out = predict(wvec[-1], class_x[i, :])
            o.append(out)
            if t[i] * out < 0:  # output and teacher label disagree
                wvectmp = mu * class_x[i, :] * t[i]  # update amount for w
                wvec.append(wvec[-1] + wvectmp)      # update the weight
    plot(wvec, x1, x2)
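When run, this opens an animation in which the red decision boundary is redrawn after each weight update until it separates the class 2 point (1,1) from the three class 1 points; the last boundary is repeated for several frames so the final result stays visible.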
Related topics:
- SVM
- Multilayer perceptron
References:
- Pattern Recognition and Machine Learning
- First Pattern Recognition
- http://home.hiroshima-u.ac.jp/tkurita/lecture/prnn/node20.html
- https://speakerdeck.com/kirikisinya/xin-zhe-renaiprmlmian-qiang-hui-at-ban-zang-men-number-2
- http://tjo.hatenablog.com/entry/2013/05/01/190247
- http://nlp.dse.ibaraki.ac.jp/~shinnou/zemi1ppt/iwasaki.pdf
- http://ocw.nagoya-u.jp/files/253/haifu%2804-4%29.pdf