An introduction to machine learning from a simple perceptron

1. What is a simple perceptron?

The input data is classified into two classes using the linear discriminant function below.

f(x) = g(w^{T}x + \theta) \\
\qquad = g\left(\sum_{i=0}^{d} w_i x_i\right) \quad (\text{setting } x_0 = \theta \text{ and absorbing the bias into the sum})

Non-linear activation function:

g(a) = \left\{
\begin{array}{ll}
+1 & (a \geq 0) \\
-1 & (a \lt 0)
\end{array}
\right.

Weight: w = (w_0, w_1, ..., w_d) \\
Input data: x = (x_0 = \theta, x_1, x_2, ..., x_d) \\
d: number of dimensions \\
Bias: \theta (an arbitrary number)

When f(x) \geq 0, x \in C_1 \\
When f(x) \lt 0, x \in C_2 \\
C_1: class 1, \quad C_2: class 2
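A minimal sketch of this discriminant function in Python (the weight and input values below are arbitrary, chosen only for illustration):

import numpy as np

def g(a):
    # Step activation: +1 if a >= 0, else -1
    return 1.0 if a >= 0 else -1.0

def f(w, x):
    # Linear discriminant with the bias absorbed as x_0 = theta
    return g(np.dot(w, x))

w = np.array([0.2, 0.5, -0.3])  # (w_0, w_1, w_2), arbitrary values
x = np.array([1.0, 0.7, 0.9])   # (x_0 = theta = 1, x_1, x_2), arbitrary values
print(f(w, x))  # 1.0, so this x is assigned to class C_1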

Of the above, the process of determining the weight parameter w is called "learning". The bias θ can be left at a default value, but it is adjusted (tuned) manually for better results. Quantities that are adjusted manually, such as this bias, are generally called hyperparameters (or tuning parameters), and adjusting them is sometimes called parameter tuning.

~ Usage example ~ "Tuning the parameters of the simple perceptron improved the results somewhat."

There are various tuning methods: a person can simply try values by hand, or use grid search or random sampling (random search), for example.
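For instance, a grid search simply tries every combination of candidate values. A minimal sketch (the candidate values and the train_and_evaluate stand-in below are hypothetical):

import itertools

def train_and_evaluate(mu, theta):
    # Hypothetical stand-in: in practice, train the perceptron with these
    # hyperparameters and return its accuracy on validation data.
    return -abs(mu - 0.1) - abs(theta - 1.0)  # dummy score for illustration

mus = [0.01, 0.1, 0.3]    # candidate learning rates (assumed values)
thetas = [0.5, 1.0, 2.0]  # candidate biases (assumed values)

best = max(itertools.product(mus, thetas),
           key=lambda p: train_and_evaluate(*p))
print(best)  # the (mu, theta) pair with the highest score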

1.1. What can it be used for?

It can be used for two-class problems that are linearly separable. Roughly speaking, a linearly separable problem is one in which, in the two-dimensional case, the set of class-1 points and the set of class-2 points can be separated by a single straight line. Generalizing, when the two sets can be separated by an (n-1)-dimensional hyperplane in n-dimensional space, they are likewise said to be linearly separable. The figure below shows an example of a linearly separable case and a case where linear separation is not possible. (Figure: senkeibunri.png)
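For example, the AND function is linearly separable while XOR is not. A minimal sketch checking this (the line x_1 + x_2 = 1.5 is just one separating choice, assumed for illustration):

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
f = X @ np.array([1.0, 1.0]) - 1.5   # signed value of the line x1 + x2 - 1.5

t_and = np.array([-1, -1, -1, 1])    # AND labels
t_xor = np.array([-1, 1, 1, -1])     # XOR labels
print(np.all(np.sign(f) == t_and))   # True: this line separates AND
print(np.all(np.sign(f) == t_xor))   # False: no single line separates XOR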

Spam email detection is often cited as a concrete application of supervised learning algorithms such as the perceptron. Roughly speaking, spam is detected by the following procedure.

  1. Assume two classes: spam email and regular email.
  2. Create a feature vector (also simply called features) and a teacher label (spam or regular) from emails of both classes.
  3. Feed the feature vectors and teacher labels into the learner (here, the simple perceptron) and train it.
  4. Classify new emails using the trained mathematical model (here, the linear discriminant function).

The feature vector must quantify qualitative data while preserving the characteristics of the data well. Since this can significantly change the accuracy, it is a very important task. A concrete example of how to create a feature vector is given below.

[Example: creating a feature vector] Suppose we want to judge whether an email is spam from its subject line, given the following data.

- Spam subject: My boyfriend died fighting a giant anteater
- Regular email subject: Request to change the meeting time

A technique often used in the field of natural language processing is morphological analysis. MeCab and CaboCha are well-known morphological analysis tools. Morphological analysis divides an input sentence into morphemes (the smallest units that carry meaning in the language). Applying it gives the following. (Since this morphological analysis was done by hand by someone who is not good at Japanese, its accuracy is not guaranteed.)

- Spam subject: boyfriend / but / giant anteater / when / fight / dead / done
- Regular email subject: meeting / time / change / of / request

(Stand-alone words such as "but", "when", "of", and "done" are glosses of Japanese particles and auxiliaries.)
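A minimal sketch of such a split using MeCab's Python binding (this assumes the mecab-python3 package and a Japanese dictionary are installed; the subject line is a back-translation of the regular email above, and the exact output depends on the dictionary used):

import MeCab

# -Owakati outputs the input split into morphemes separated by spaces
tagger = MeCab.Tagger("-Owakati")
print(tagger.parse("ミーティング時間変更の依頼").strip())
# e.g. "ミーティング 時間 変更 の 依頼"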

Next, a morpheme dictionary is created from the above data. The resulting dictionary looks like this:

dictionary = \{boyfriend, but, giant anteater, when, fight, dead, done, meeting, time, change, of, request\}

Finally, creating the feature vectors using this dictionary gives the following.

Spam email feature vector: x_1 = \{1,1,1,1,1,1,1,0,0,0,0,0\}, teacher label t = -1 \\
Regular email feature vector: x_2 = \{0,0,0,0,0,0,0,1,1,1,1,1\}, teacher label t = +1

As for how the feature vector is created, the presence or absence (1: present, 0: absent) of each dictionary word is stored in an array. This method is called Bag-of-Words. In some cases, the vector is built from the number of times each word appears rather than mere presence or absence. As an extension, there is also a weighting scheme called tf-idf, which lowers the importance of words that appear in many documents and raises the importance of words that appear only in specific documents. As for the teacher labels, 0 and 1 are generally used with probabilistic models, while -1 and +1, which match activation functions like the perceptron's, are often used, as here.
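One common formulation of tf-idf (a sketch; this particular variant is an assumption, as several definitions exist):

tfidf(t, d) = tf(t, d) \times \log \frac{N}{df(t)} \\
tf(t, d): number of occurrences of word t in document d, \\
N: total number of documents, \quad df(t): number of documents containing t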

The above is the procedure for creating the feature vector.
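As a minimal sketch, this procedure can be written in Python as follows (hand-rolled; the English glosses stand in for the real morphemes):

# Morpheme lists from the (hand-made) morphological analysis above
spam = ["boyfriend", "but", "giant anteater", "when", "fight", "dead", "done"]
normal = ["meeting", "time", "change", "of", "request"]

# Build the dictionary, preserving first-appearance order
dictionary = list(dict.fromkeys(spam + normal))

def bag_of_words(morphemes, dictionary):
    # 1 if the dictionary word appears in the text, 0 otherwise
    return [1 if word in morphemes else 0 for word in dictionary]

x1 = bag_of_words(spam, dictionary)    # [1,1,1,1,1,1,1,0,0,0,0,0], t = -1
x2 = bag_of_words(normal, dictionary)  # [0,0,0,0,0,0,0,1,1,1,1,1], t = +1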

The reason the perceptron can be used on linearly separable problems is given by the perceptron convergence theorem. See the following for a proof. http://ocw.nagoya-u.jp/files/253/haifu%2804-4%29.pdf

2. Perceptron learning (stochastic steepest descent algorithm)

The idea behind the learning method is simple: whenever a data point is misclassified, update the value of w using the formula described below, and keep updating until nothing is misclassified. The set of training data is defined as follows.

X = \{x_1, x_2, ..., x_M\} \\
M: number of data points

Here, the set of misclassified training data is written as follows.

X_n=\{x_1,x_2,...,x_n\}

We want to find a weight w such that f(x) > 0 for class-1 data and f(x) < 0 for class-2 data. Using the teacher labels t = \{+1, -1\}, this means all data should satisfy the following.

w^Tx_it_i>0

Consider the following error function E(w), which assigns an error of 0 to correctly classified data.

E(w) = -\sum_{n \in X_n} w^T x_n t_n

If w is chosen so as to minimize this (drive it to 0), the data can be classified correctly. Stochastic gradient descent is used for the minimization. When the error function is a sum over data points, as E(w) is here, stochastic gradient descent updates w as follows whenever a data point n is given.

w^{(r+1)} = w^{(r)} - \mu \nabla E_n \\
r: number of iterations, \quad \mu: learning rate parameter \\
w^{(r)}: the weight w after r updates

From the above, the update formula is derived as follows.

w^{(r+1)} = w^{(r)} - \mu \nabla E_n(w) \\
\qquad\quad = w^{(r)} + \mu x_n t_n

For a misclassified data point, its contribution to the error E(w) is a linear function of w regardless of whether its label t is +1 or -1, while within the regions of w-space where the point is correctly classified its contribution is 0. E(w) is therefore a piecewise linear function, and the gradient contributed by a single data point is computed as follows.

\nabla E_n(w) = \frac{\partial E_n(w)}{\partial w} \\
\qquad\qquad = -x_n t_n
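As a concrete example, take the initial weight w^{(0)} = (1.0, 0.5, 0.8) and learning rate \mu = 0.3 used in the implementation below. The point x = (0, 0, 1) (the AND input (0, 0) with the bias component appended) has t = -1, but w^{(0)T}x = 0.8 \geq 0, so it is misclassified and the update gives:

w^{(1)} = w^{(0)} + \mu x t = (1.0, 0.5, 0.8) + 0.3 \times (0, 0, 1) \times (-1) = (1.0, 0.5, 0.5)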

3. Implementation by Python

The code below solves the AND operation with a simple perceptron; it is neither well designed nor polished. The learning process is plotted as it proceeds. It probably ought to be organized into classes or methods, but please bear with it; I will rewrite it when I feel like it. To run it, you need the numpy and matplotlib libraries installed. (Animation: output.gif)

# coding:utf-8
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.animation as animation

# Activation function g(a): step function returning +1 or -1
def predict(w, x):
  out = np.dot(w, x)
  if out >= 0:
    o = 1.0
  else:
    o = -1.0
  return o

# Plot the data points and the decision boundary at each update
def plot(wvec, x1, x2):
  x_fig = np.arange(-2, 5, 0.1)
  fig = plt.figure(figsize=(8, 8), dpi=100)
  ims = []
  plt.xlim(-1, 2.5)
  plt.ylim(-1, 2.5)
  # One animation frame per weight vector
  for w in wvec:
    # Decision boundary: w[0]*x + w[1]*y + w[2] = 0 (with theta = 1)
    y_fig = [-(w[0] / w[1]) * xi - (w[2] / w[1]) for xi in x_fig]
    plt.scatter(x1[:,0], x1[:,1], marker='o', color='g', s=100)
    plt.scatter(x2[:,0], x2[:,1], marker='s', color='b', s=100)
    ims.append(plt.plot(x_fig, y_fig, "r"))
  # Repeat the final boundary for a few frames so it stays visible
  for i in range(10):
    ims.append(plt.plot(x_fig, y_fig, "r"))
  ani = animation.ArtistAnimation(fig, ims, interval=1000)
  plt.show()

if __name__=='__main__':
  wvec = [np.array([1.0, 0.5, 0.8])]  # Initial weight vector (arbitrary values)
  mu = 0.3  # Learning rate
  sita = 1  # Bias component theta

  # AND function data (the last column is the bias component: 1)
  x1 = np.array([[0,0],[0,1],[1,0]])  # Class 1 (AND output is 0)
  x2 = np.array([[1,1]])              # Class 2 (AND output is 1)
  bias = np.array([sita for i in range(len(x1))])
  x1 = np.c_[x1, bias]  # Append the bias component to the class 1 data
  bias = np.array([sita for i in range(len(x2))])
  x2 = np.c_[x2, bias]  # Append the bias component to the class 2 data
  class_x = np.r_[x1, x2]  # Concatenate the two classes

  t = [-1, -1, -1, 1]  # Teacher labels for the AND function
  # o: outputs produced during the current pass
  o = []
  while t != o:
    o = []  # Reset the outputs for this pass
    # Learning phase: one pass over all the data
    for i in range(class_x.shape[0]):
      out = predict(wvec[-1], class_x[i,:])
      o.append(out)
      if t[i]*out < 0:  # The output and the teacher label disagree
        wvectmp = mu*class_x[i,:]*t[i]  # Amount by which to change w
        wvec.append(wvec[-1] + wvectmp)  # Weight update
  plot(wvec, x1, x2)

4. Typical algorithms that extend the simple perceptron

- SVM
- Multilayer perceptron

5. References

Pattern Recognition and Machine Learning

First Pattern Recognition

http://home.hiroshima-u.ac.jp/tkurita/lecture/prnn/node20.html

https://speakerdeck.com/kirikisinya/xin-zhe-renaiprmlmian-qiang-hui-at-ban-zang-men-number-2

http://tjo.hatenablog.com/entry/2013/05/01/190247

http://nlp.dse.ibaraki.ac.jp/~shinnou/zemi1ppt/iwasaki.pdf

http://ocw.nagoya-u.jp/files/253/haifu%2804-4%29.pdf
