A non-information graduate student studied machine learning from scratch #2: Neural networks

Introduction

A non-information graduate student studied machine learning from scratch, and writes articles to keep a record of what has been studied. I will decide how to proceed as I go, but for the time being I will step up gradually from the basics while following the well-known book "Deep Learning from Scratch". The code is run on Google Colab. Part 2 covers neural networks.

Table of contents

  1. Activation function
  2. Output layer activation function
  3. Actually build a neural network

1. Activation function

In Part 1, the output of the biased perceptron was expressed as follows.

y = \begin{cases} 0 & (w_1x_1+w_2x_2+b \leq 0) \\ 1 & (w_1x_1+w_2x_2+b > 0) \end{cases}

We extend this to a simple, general form by introducing a function $h(x)$:

y = h(a) = \begin{cases} 0 & (a \leq 0) \\ 1 & (a > 0) \end{cases}, \qquad a = w_1x_1+w_2x_2+b

This $h(x)$ is called the activation function. As the name implies, it determines how the neuron is activated (fires). The formula above is an example using a function whose output jumps discontinuously from 0 to 1 at the origin, which was also used in Part 1. Below are some functions commonly used as activation functions.
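To make the structure "compute the affine sum $a$, then apply $h$" concrete, here is a minimal sketch of my own in plain Python (the weights are the AND-gate-like values in the spirit of Part 1; the NumPy versions of the activation functions follow below).

# Biased perceptron rewritten as "affine sum a, then activation h(a)".
# A sketch of my own; h is a plain Python step function here.
def h(a):
    return 1 if a > 0 else 0

def perceptron(x1, x2, w1, w2, b):
    a = w1 * x1 + w2 * x2 + b   # a = w1*x1 + w2*x2 + b
    return h(a)                 # y = h(a)

print(perceptron(1, 1, 0.5, 0.5, -0.7))   # 1
print(perceptron(1, 0, 0.5, 0.5, -0.7))   # 0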

Step function

The function we have used so far, which switches its output depending on whether or not the input exceeds 0, is called a step function. In other words, it is expressed as

h(x) = \begin{cases} 0 & (x \leq 0) \\ 1 & (x > 0) \end{cases}

Step function


import numpy as np

def step_function(x):
    # Returns 1 where the input is greater than 0, and 0 otherwise
    return np.array(x > 0, dtype=int)

Sigmoid function

The Sigmoid function is one of the most commonly used activation functions for neural networks and is expressed by the following formula.

h(x) = \frac{1}{1+\exp(-x)}

The step function is discontinuous, so it is not differentiable at $x = 0$. Moreover, its derivative is 0 everywhere else, which means it cannot convey any change (in reality the value does change, but only as a discontinuous jump that the derivative cannot express). The Sigmoid function, on the other hand, likewise approaches 0 for small inputs and 1 for large inputs, but because it is continuous it is differentiable, and it can make continuous use of the values between 0 and 1. Because of these properties, the Sigmoid function is used as the activation function of neural networks instead of the step function.

Sigmoid function


def sigmoid(x):
    return 1 / (1 + np.exp(-x))
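As a quick check of the differentiability mentioned above, here is a minimal sketch of my own (not from the book) comparing the analytic derivative $h'(x) = h(x)(1-h(x))$ of the Sigmoid function with a numerical finite difference:

# Compare the analytic derivative of the sigmoid, h'(x) = h(x) * (1 - h(x)),
# with a central finite difference (a sketch of my own, not from the book).
x = np.linspace(-5.0, 5.0, 11)
eps = 1e-5
analytic = sigmoid(x) * (1 - sigmoid(x))
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
print(np.allclose(analytic, numeric))    # True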

ReLU function

The ReLU function is an activation function that has become popular in recent neural networks and is expressed by the following formula.

h(x) = \begin{cases} x & (x > 0) \\ 0 & (x \leq 0) \end{cases}

The ReLU function is said to involve simpler computation, faster learning, and often better performance than the step and Sigmoid functions. In addition, compared to the Sigmoid function, which has classically been used in neural networks, the vanishing gradient problem (which will probably come up later) is less likely to occur, and its derivative is easy to handle[^1], so it is now widely used.

ReLU function


def relu(x):
    return np.maximum(0, x)

[Figure: activation_fun.png (plots of the step, Sigmoid, and ReLU functions)]
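A minimal plotting sketch of my own (assuming Matplotlib is available, as it is on Google Colab, and that the three functions above are already defined) that reproduces a figure like the one above:

import matplotlib.pyplot as plt

# Plot the step, sigmoid, and ReLU functions on the same axes (a sketch).
x = np.linspace(-5.0, 5.0, 200)
plt.plot(x, step_function(x), label='step')
plt.plot(x, sigmoid(x), label='sigmoid')
plt.plot(x, relu(x), label='ReLU')
plt.ylim(-0.1, 1.5)   # clip the y-axis so the ReLU line does not dominate
plt.legend()
plt.show()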

2. Output layer activation function

Neural networks are based on the multilayer perceptron (MLP) that appeared at the end of Part 1. The activation functions seen in the previous chapter are used in the hidden layers inside the neural network. The final layer is special, and a different activation function $\sigma$ is often used there. Neural networks can be applied to both regression and classification problems, but the design of the output layer differs depending on the problem to be solved.

Regression problem

In regression problems, which predict (continuous) numerical values from the input data, the identity function is used as the activation function of the output layer. As the name implies, the identity function outputs its input unchanged. That is,

x \rightarrow \sigma(x) \rightarrow x
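In code this is trivial; a minimal sketch (the name identity_function is my own choice here):

def identity_function(x):
    # Output layer for regression: return the input as it is.
    return x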

Classification problem

The Softmax function is used as the activation function of the output layer in the classification problem that classifies the input data into several classes. The Softmax function is expressed as follows.

y_k = \frac{\exp(a_k)}{\displaystyle\sum_{i=1}^n\exp(a_i)}

Since the denominator is the sum over all elements and the numerator is one of them, this formula can be read as the probability that class $k$ is chosen out of the $n$ candidates $a_1, \dots, a_n$. In other words, the network computes the probability that the input belongs to each class, and the input is classified into the class with the highest probability. Incidentally, implementations use the following equivalent form to prevent overflow.

y_k = \frac{\exp(a_k-C)}{\displaystyle\sum_{i=1}^n\exp(a_i-C)}, \qquad C=\max_i a_i

Because the Softmax function is built from exponentials, subtracting the same constant $C$ inside every exponential is the same as multiplying both the numerator and the denominator by $\exp(-C)$, so the result does not change.
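A minimal implementation sketch of the overflow-safe Softmax for a 1-D NumPy array (batched inputs would need an axis argument, which is not covered here):

def softmax(a):
    # Subtract the maximum value for numerical stability; the output is unchanged.
    c = np.max(a)
    exp_a = np.exp(a - c)
    return exp_a / np.sum(exp_a)

print(softmax(np.array([0.3, 2.9, 4.0])))    # roughly [0.018 0.245 0.737]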

3. Actually build a neural network

Now that we have the ingredients, let's actually implement a neural network. Here we consider the three-layer neural network shown in the figure below; its structure is exactly that of an MLP.

[Figure: network.png (the three-layer network)]

To look at what happens inside, we describe the propagation from the 0th layer (the input layer) to the 1st layer. Here $x_n$, $a_m^{(k)}$, $z_m^{(k)}$, $b_m$, $w_{m,n}$, and $y_l$ denote, respectively, the $n$-th initial input, the input and output of the $m$-th neuron in the $k$-th layer, the bias of the $m$-th neuron, the weight from the $n$-th output to the $m$-th input, and the $l$-th final output.

[Figure: 01network.png (propagation from the input layer to the 1st layer)]

The inputs to the 1st layer are

a_1^{(1)} = w_{11}^{(1)}x_1 + w_{12}^{(1)}x_2 + b_1^{(1)},\\
a_2^{(1)} = w_{21}^{(1)}x_1 + w_{22}^{(1)}x_2 + b_2^{(1)},\\
a_3^{(1)} = w_{31}^{(1)}x_1 + w_{32}^{(1)}x_2 + b_3^{(1)}

Converting this to matrix notation gives

\boldsymbol{A}^{(1)} = \boldsymbol{X}\boldsymbol{W}^{(1)} + \boldsymbol{B}^{(1)},\qquad
\boldsymbol{A}^{(1)} = \left[a_1^{(1)}~a_2^{(1)}~a_3^{(1)}\right],\qquad
\boldsymbol{X} = \left[x_1~x_2\right],\\
\boldsymbol{W}^{(1)} = \left[ \begin{array}{rrr} w_{11}^{(1)} & w_{21}^{(1)} & w_{31}^{(1)} \\ w_{12}^{(1)} & w_{22}^{(1)} & w_{32}^{(1)} \end{array} \right],\qquad
\boldsymbol{B}^{(1)} = \left[b_1^{(1)}~b_2^{(1)}~b_3^{(1)}\right]

(Personally, given how $w_{m,n}$ is defined, I think it would be cleaner to write this with the transposed weight matrix, but I will keep the book's convention.)

The input to the 1st layer, $\boldsymbol{A}^{(1)}$, is passed through the activation function $h$ to give the 1st-layer output $\boldsymbol{Z}^{(1)}$:

\boldsymbol{Z}^{(1)} = h\left(\boldsymbol{A}^{(1)}\right),\qquad
\boldsymbol{Z}^{(1)} = \left[z_1^{(1)}~z_2^{(1)}~z_3^{(1)}\right]

The propagation from the 1st to the 2nd layer and from the 2nd to the 3rd layer is computed in the same way, but since the output layer uses a different activation function, only the output layer becomes

\boldsymbol{Y} = \sigma\left(\boldsymbol{A}^{(3)}\right),\qquad
\boldsymbol{Y} = \left[y_1~y_2\right]

Let's implement the formulas up to this point in code.

3-layer neural network


def init_network():
    network = {}
    network['W1'] = np.array([[0.1, 0.3, 0.5], [0.2, 0.4, 0.6]])
    network['b1'] = np.array([[0.1, 0.2, 0.3]])
    network['W2'] = np.array([[0.1, 0.4], [0.2, 0.5], [0.3, 0.6]])
    network['b2'] = np.array([[0.1, 0.2]])
    network['W3'] = np.array([[0.1, 0.3], [0.2, 0.4]])
    network['b3'] = np.array([[0.1, 0.2]])

    return network

def forward(network, x):
    W1, W2, W3 = network['W1'], network['W2'], network['W3']
    b1, b2, b3 = network['b1'], network['b2'], network['b3']

    # Input layer -> layer 1
    a1 = np.dot(x, W1) + b1
    z1 = sigmoid(a1)
    # Layer 1 -> layer 2
    a2 = np.dot(z1, W2) + b2
    z2 = sigmoid(a2)
    # Layer 2 -> output layer (identity function as the output activation)
    a3 = np.dot(z2, W3) + b3
    y = a3

    return y

network = init_network()
x = np.array([1.0, 0.5])
y = forward(network, x)
print(y)    #[[0.31682708 0.69627909]]

The first function, init_network(), defines the weight and bias values for each layer. The next function, forward(), performs the matrix calculations of the network. In the main code, we pass the network (the weights and biases) and the input x to forward() to compute the output y.
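As a quick sanity check of my own (not from the book), the array shapes match the matrices in the formulas above:

# Print the shape of each weight and bias array (a sketch of my own).
for key in ['W1', 'b1', 'W2', 'b2', 'W3', 'b3']:
    print(key, network[key].shape)
# W1 (2, 3), b1 (1, 3), W2 (3, 2), b2 (1, 2), W3 (2, 2), b3 (1, 2)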

You have now implemented a neural network. Next time, I will try the MNIST handwritten digit recognition problem as an example of using a neural network.

References

Deep Learning from Scratch
Deep Learning from Scratch: GitHub repository
Deep Learning (Machine Learning Professional Series)

[^1]: Strictly speaking, the ReLU function is also not differentiable at $x = 0$ mathematically, but in code its derivative is simply taken as $dh/dx = 1~(x > 0),~0~(x \leq 0)$, and unlike the Sigmoid function it can still express that a change occurred, so this seems to be acceptable.
