Simple neural network theory and implementation

I needed to explain some neural network (NN) code, so I took the opportunity to summarize the theory briefly. Below, basic knowledge of NNs is assumed.

Reference material

[Learning and Neural Network (Electronic Information and Communication Engineering Series)](http://www.amazon.co.jp/gp/product/4627702914/ref=as_li_tf_tl?ie=UTF8&camp=247&creative=1211&creativeASIN=4627702914&linkCode=as2&tag=shimashimao06-22)

What I did

BP (backpropagation) theory

Symbols

  • Input to the network: $(x_1, x_2, ..., x_N)$
  • Output from the network: $(y_1, y_2, ..., y_M)$
  • Network input-output relationship: $(y_1, y_2, ..., y_M) = F(x_1, x_2, ..., x_N)$
  • Connection weights: $\{w_{ij}\}$
  • Thresholds: $\{\theta_i\}$

Overview of BP

Benefits of BP

The reference book above gives the following description of why BP has come to be so widely used.

  • Computing the partial derivative of the evaluation function with respect to each parameter tends to be complicated in a general nonlinear system, but BP provides a systematic, mechanical way to calculate it. Moreover, the information needed to determine the correction amount for each parameter can be delivered to where it is needed through the connection structure the network already has. This means that no special communication path has to be provided for learning, which is convenient for keeping both the algorithm and the hardware simple.
  • The function approximation capability of the network is theoretically guaranteed. If a sufficient number of intermediate-layer units are used, any function can be approximated with high accuracy by choosing the parameter values (connection weights and thresholds) appropriately. BP will not necessarily find the parameter values that give the optimal approximation, but the proof of this kind of existence theorem is reassuring.

Function approximation

We proceed as follows.

Function approximation with a single neuron

Since we derive the sequential (online) update learning method, the gradient method is applied with the aim of reducing the following error measure for each training example $(x^{(l)}, y^{(l)})$.

$$E(w_0, w_1, ..., w_N) = |y^{(l)} - \hat{y}^{(l)}|^2$$

$$s = \sum_{n=0}^{N} w_n x_n^{(l)}$$

$$\hat{y}^{(l)} = sigmoid(s)$$

Differentiating the error measure $E(w_0, w_1, ..., w_N)$ with respect to $w_n$ gives the following equation.

$$\frac{\partial E(w_0, w_1, ..., w_N)}{\partial w_n} = -2(y^{(l)} - \hat{y}^{(l)})\,sigmoid'(s)\,x_n^{(l)}$$

Here,

$$sigmoid(s) = \frac{1}{1 + e^{-\alpha s}}$$

When this is differentiated,

$$sigmoid'(s) = \frac{\alpha e^{-\alpha s}}{(1 + e^{-\alpha s})^2}$$

Simplifying this further,

$$\frac{\alpha e^{-\alpha s}}{(1+e^{-\alpha s})^2} = \alpha \frac{1}{1+e^{-\alpha s}}\left(1 - \frac{1}{1+e^{-\alpha s}}\right) = \alpha\,sigmoid(s)\,(1 - sigmoid(s)) = \alpha\,\hat{y}^{(l)}(1 - \hat{y}^{(l)})$$

Using the value of $\partial E(w_0, w_1, ..., w_N) / \partial w_n$ obtained above, the connection weights are corrected repeatedly by the following equation.

$$w_n = w_n - \epsilon\,\frac{\partial E(w_0, w_1, ..., w_N)}{\partial w_n}$$
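To make this concrete, here is a minimal sketch (separate from the reference code further below) of the sequential update rule for a single sigmoid neuron. The toy training data, the gain `alpha`, and the learning rate `eps` are made up for the example; the gradient line uses the simplification $sigmoid'(s) = \alpha\,\hat{y}(1-\hat{y})$ derived above.

```python
import numpy as np

alpha, eps = 1.0, 0.5                       # sigmoid gain and learning rate (arbitrary here)
sigmoid = lambda s: 1.0 / (1.0 + np.exp(-alpha * s))

# toy training data: x includes x_0 = 1 so that w_0 acts as the bias term
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
y = np.array([0, 1, 1, 1], dtype=float)     # OR-like targets, just for illustration
w = np.zeros(3)

for epoch in range(1000):
    for x_l, y_l in zip(X, y):              # sequential (per-sample) updates
        s = np.dot(w, x_l)                  # s = sum_n w_n x_n
        y_hat = sigmoid(s)
        # dE/dw_n = -2 (y - y_hat) * alpha * y_hat * (1 - y_hat) * x_n
        grad = -2 * (y_l - y_hat) * alpha * y_hat * (1 - y_hat) * x_l
        w -= eps * grad                     # w_n <- w_n - eps * dE/dw_n

print(sigmoid(X.dot(w)))                    # outputs should move toward [0, 1, 1, 1]
```

Any small linearly separable data set would work equally well for this check.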

Derivation

Chain rule

(Figure: dependency graph in which $x$ feeds $z_1$ and $z_2$, which both feed $y$.)

A small change in $x$ spreads, in a chain, to changes in the other variables according to the dependencies in the figure above. By the chain rule, the following relations hold between the small changes $\Delta x, \Delta z_1, \Delta z_2, \Delta y$.

$$\Delta z_1 = \frac{\partial z_1}{\partial x}\Delta x$$

$$\Delta z_2 = \frac{\partial z_2}{\partial x}\Delta x$$

$$\Delta y = \frac{\partial y}{\partial z_1}\Delta z_1 + \frac{\partial y}{\partial z_2}\Delta z_2$$

Combining the above equations,

$$\frac{\partial y}{\partial x} = \frac{\partial y}{\partial z_1}\frac{\partial z_1}{\partial x} + \frac{\partial y}{\partial z_2}\frac{\partial z_2}{\partial x}$$
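As a quick sanity check (not part of the original article), the chain-rule expression can be compared against a finite-difference approximation for a concrete choice of functions, say $z_1 = x^2$, $z_2 = \sin x$, $y = z_1 z_2$:

```python
import numpy as np

x = 0.7
dx = 1e-6

def z1(x): return x**2
def z2(x): return np.sin(x)
def y(x):  return z1(x) * z2(x)

# chain rule: dy/dx = (dy/dz1)(dz1/dx) + (dy/dz2)(dz2/dx) = z2 * 2x + z1 * cos(x)
chain = z2(x) * 2 * x + z1(x) * np.cos(x)

# finite-difference approximation of dy/dx
numeric = (y(x + dx) - y(x)) / dx

print("chain rule: %f, numeric: %f" % (chain, numeric))   # the two values should agree closely
```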

3-layer NN

Based on the chain rule above, BP is derived for an NN consisting of an input layer, an intermediate layer, and an output layer. Each of the three layers has 3 units (4 units for the input and intermediate layers when the bias term is included). Counting from the output layer, number the units as output layer: {1, 2, 3}, intermediate layer: {4, 5, 6}, and input layer: {7, 8, 9}. We then consider two cases:

  1. When the weight $w_2$ from the 4th unit of the intermediate layer to the 2nd unit of the output layer changes slightly
  2. When the weight $w_4$ from the 5th unit of the input layer to the 4th unit of the intermediate layer changes slightly

BP is derived using these two cases as examples. An image of each propagation is shown below.

(Figure: propagation of the change for case 1.)

(Figure: propagation of the change for case 2.)

For case 1 above

The relationship between the change $\Delta w_2$ in $w_2$ and the change $\Delta s_2$ in $s_2$ is given by the following equation.

$$\Delta s_2 = \Delta w_2\,y_4$$

Also, since

$$y_2 = sigmoid(s_2)$$

we have

$$\Delta y_2 = sigmoid'(s_2)\,\Delta s_2$$

This change in $y_2$ changes the value of the error measure, and the following relation holds.

$$\Delta E = 2(y_2 - t_2)\,\Delta y_2$$

Here, $t_2$ denotes the target (correct) value for $y_2$. Putting these equations together,

$$\Delta E = 2(y_2 - t_2)\,sigmoid'(s_2)\,y_4\,\Delta w_2$$

$$\frac{\partial E}{\partial w_2} = 2(y_2 - t_2)\,sigmoid'(s_2)\,y_4$$

From the above equation, the partial derivative $\partial E / \partial w_2$ needed to correct $w_2$ by the gradient method has been obtained.

Also, from $\Delta s_2 = y_4\,\Delta w_2$,

$$\frac{\partial E}{\partial s_2} = 2(y_2 - t_2)\,sigmoid'(s_2)$$

is obtained at the same time. These values can be used as they are when calculating the partial derivatives for the weights one layer further back (case 2 below).
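A small numeric sketch of case 1 follows; the variable names mirror the unit numbering above, and all forward-pass values (`y4`, `w2`, `t2`) are invented for illustration:

```python
import numpy as np

alpha = 1.0
sigmoid = lambda s: 1.0 / (1.0 + np.exp(-alpha * s))

# made-up forward-pass values around output unit 2
y4 = 0.6            # output of intermediate unit 4
w2 = 0.3            # weight from unit 4 to output unit 2
s2 = w2 * y4        # other incoming terms omitted for brevity
y2 = sigmoid(s2)
t2 = 1.0            # target for output unit 2

dsig = alpha * y2 * (1 - y2)      # sigmoid'(s2) via the simplification above
dE_ds2 = 2 * (y2 - t2) * dsig     # reused later when going one layer back
dE_dw2 = dE_ds2 * y4              # = 2 (y2 - t2) sigmoid'(s2) y4
print("dE/dw2 = %f" % dE_dw2)
```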

For case 2 above

The relationships between the changes when $w_4$ is changed are given by the following equations.

$$\Delta s_4 = \Delta w_4\,y_5$$

$$\Delta y_4 = sigmoid'(s_4)\,\Delta s_4$$

The resulting change in $y_4$ affects $s_1, s_2, s_3$ at the connection destinations. Therefore, the following relations hold between the amounts of change.

$$\Delta s_1 = w_1\,\Delta y_4$$

$$\Delta s_2 = w_2\,\Delta y_4$$

$$\Delta s_3 = w_3\,\Delta y_4$$

These changes in turn change the error measure as in the following equation.

$$\Delta E = \frac{\partial E}{\partial s_1}\Delta s_1 + \frac{\partial E}{\partial s_2}\Delta s_2 + \frac{\partial E}{\partial s_3}\Delta s_3$$

Rearranging this and dividing both sides by $\Delta s_4$ gives

$$\frac{\partial E}{\partial s_4} = \left(\frac{\partial E}{\partial s_1}w_1 + \frac{\partial E}{\partial s_2}w_2 + \frac{\partial E}{\partial s_3}w_3\right)sigmoid'(s_4)$$

It can be seen that this formula is a kind of recurrence in which $\partial E / \partial s_i$ of the intermediate layer is obtained from the $\partial E / \partial s_j$ of the output layer. In the same way, for an NN with any number of layers, once $\partial E / \partial s$ of the final layer is found, the error can be propagated back to the lower layers.

Finally, substituting

$$\frac{\partial s_4}{\partial w_4} = y_5$$

into

$$\frac{\partial E}{\partial w_4} = \frac{\partial E}{\partial s_4}\frac{\partial s_4}{\partial w_4}$$

gives

$$\frac{\partial E}{\partial w_4} = \frac{\partial E}{\partial s_4}\,y_5$$

For $sigmoid'(s)$, the calculation is simplified by using $y = sigmoid(s)$, as shown in the single-neuron section.
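Putting cases 1 and 2 together, a minimal sketch of the recurrence for this 3-layer example could look as follows; all weights, activations, and targets are invented for illustration, and `dE_ds_out` holds $\partial E/\partial s_j$ for the output units:

```python
import numpy as np

alpha = 1.0
sigmoid = lambda s: 1.0 / (1.0 + np.exp(-alpha * s))
dsigmoid = lambda y: alpha * y * (1 - y)     # sigmoid'(s) written with y = sigmoid(s)

# invented values: y5 feeds intermediate unit 4, which feeds output units 1, 2, 3
y5 = 0.8
w4 = 0.2                                     # weight into intermediate unit 4
s4 = w4 * y5
y4 = sigmoid(s4)

w_out = np.array([0.5, -0.3, 0.1])           # w1, w2, w3: weights from unit 4 to units 1, 2, 3
s_out = w_out * y4
y_out = sigmoid(s_out)
t_out = np.array([1.0, 0.0, 1.0])            # targets t1, t2, t3

# output layer: dE/ds_j = 2 (y_j - t_j) sigmoid'(s_j)
dE_ds_out = 2 * (y_out - t_out) * dsigmoid(y_out)

# recurrence: dE/ds_4 = (sum_j dE/ds_j * w_j) * sigmoid'(s_4)
dE_ds4 = np.dot(dE_ds_out, w_out) * dsigmoid(y4)

# finally: dE/dw_4 = dE/ds_4 * y_5
dE_dw4 = dE_ds4 * y5
print("dE/dw4 = %f" % dE_dw4)
```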

Reference code

Sorry that the code is fairly messy, but it is pasted below. For simplicity, the threshold (bias) is fixed and shared by all neurons. When actually writing code, you would add an element with value 1 at the beginning of the input vector, add a corresponding bias element to the weight vector, and tune the bias as part of the weights.
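For reference, a minimal sketch of this bias-as-weight trick (which the reference code below does not use, since it keeps the fixed threshold) might look like this, with made-up values:

```python
import numpy as np

x = np.array([0.2, 0.7, 0.5])        # original input vector
x_aug = np.insert(x, 0, 1.0)         # prepend a constant 1 element

w = np.array([0.4, -0.1, 0.3])       # original weights
theta = 0.1                          # threshold (bias)
w_aug = np.insert(w, 0, -theta)      # the bias becomes just another weight

# both formulations give the same activation input s - theta
print(np.dot(w, x) - theta)
print(np.dot(w_aug, x_aug))
```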

```python
# coding: utf-8

import numpy as np
from numpy.random import randint
import sys


class NN:
    def __init__(self):
        self.alph = 0.04
        self.mu = 0.01
        self.theta = 0.1
        self.w = []
        self.output = []
        self.output_sigm = []
        self.T = 0

    def create_data(self, input_n_row, input_n_col, layer_sizes):
        self.x = randint(2, size=(input_n_row, input_n_col))
        self.y = randint(2, size=(input_n_row, layer_sizes[-1]))
        for i_layer, size in enumerate(layer_sizes):
            if i_layer == 0:
                self.w.append(np.random.randn(input_n_col, size))
                self.output.append(np.zeros((input_n_row, size)))
                self.output_sigm.append(np.zeros((input_n_row ,size)))
            else:
                self.w.append(np.random.randn(layer_sizes[i_layer-1], size))
                self.output.append(np.zeros((input_n_row, size)))
                self.output_sigm.append(np.zeros((input_n_row ,size)))

    def fit(self, eps=10e-6):
        error = sys.maxint
        self.forward()
        while error>eps:
            self.update( self.backword() )
            self.forward()
            error = self.calculate_error()
            self.T += 1
            print "T=", self.T
            print "error", error

    def calculate_error(self):
        return np.sum( np.power(self.y - self.output_sigm[-1], 2) )

    def forward(self):
        for i_layer in xrange(len(self.output)):
            if i_layer == 0:
                self.output[i_layer] = self.x.dot(self.w[i_layer])
                self.output_sigm[i_layer] = self.sigmoid(self.output[i_layer])
            else:
                self.output[i_layer] = self.output_sigm[i_layer-1].dot(self.w[i_layer])
                self.output_sigm[i_layer] = self.sigmoid(self.output[i_layer])

    def backword(self):
        result = []
        for i_layer in range(len(self.w))[::-1]:
            if i_layer==len(self.w)-1:
                result.insert(0, self.diff(self.output_sigm[i_layer], self.y) )
            else:
                result.insert(0, self.diff_mult( self.output_sigm[i_layer], result[0].dot(self.w[i_layer+1].T)) )
        return result

    def update(self, diff):
        for i_layer in range(len(self.w))[::-1]:
            if i_layer==0:
                for i_row in xrange(len(diff[i_layer])):
                    self.w[i_layer] -= self.get_incremental_update_value(
                                              self.x[i_row].reshape(len(self.w[i_layer]),1),
                                              diff[i_layer][i_row,:].reshape(1,self.w[i_layer].shape[1])
                                          )
            else:
                for i_row in xrange(len(diff[i_layer])):
                    self.w[i_layer] -= self.get_incremental_update_value(
                                              self.output_sigm[i_layer-1][i_row,:].reshape(len(self.w[i_layer]),1),
                                              diff[i_layer][i_row,:].reshape(1,self.w[i_layer].shape[1])
                                          )

    def get_incremental_update_value(self, input_data, diff):
        return np.kron(input_data, self.mu*diff)

    def diff(self, y, t):
        return self.alph * 2*(y - t) * self.dsigmoid(y)

    def diff_mult(self, y, prp_value):
        return self.alph * self.dsigmoid(y) * prp_value

    def sigmoid(self, s):
        # gain self.alph and fixed threshold self.theta are shared by all neurons
        return 1/(1+np.exp(-self.alph*(s-self.theta)))

    def dsigmoid(self, y):
        return y * (1 - y)

if __name__=='__main__':
    layer_sizes = (4,3)
    input_layer_size = 3 
    input_data_size = 1000

    nn = NN()
    nn.create_data(input_data_size, input_layer_size, layer_sizes)
    nn.fit()
```

If you notice any mistakes, I would appreciate it if you could point them out.
