A non-information graduate student studied machine learning from scratch #2: Neural networks

Introduction

A non-information graduate student studied machine learning from scratch, and writes articles to keep a record of what has been studied. I will decide how to proceed as I go, but for the time being I will step up gradually from the basics while following the well-known book "Deep Learning from Scratch". The code is run on Google Colab. Part 2 covers neural networks.

Table of contents

  1. Activation function
  2. Output layer activation function
  3. Actually build a neural network

1. Activation function

In Part 1, the output of the biased perceptron was expressed as follows.

y = \begin{cases} 0 & (w_1x_1+w_2x_2+b \leq 0) \\ 1 & (w_1x_1+w_2x_2+b > 0) \end{cases}

We extend this to a simple, general form by introducing a function $h(x)$:

y = h(a) = \begin{cases} 0 & (a \leq 0) \\ 1 & (a > 0) \end{cases}, \qquad a = w_1x_1+w_2x_2+b

This $h(x)$ is called the activation function. As the name implies, it determines how the neuron is activated (fires). The formula above is an example using a function whose output jumps discontinuously from 0 to 1 at the origin, which was also used in Part 1. Below are some functions commonly used as activation functions.
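To make the structure "compute the affine sum $a$, then apply $h$" concrete, here is a minimal sketch of my own in plain Python (the weights are the AND-gate-like values in the spirit of Part 1; the NumPy versions of the activation functions follow below).

# Biased perceptron rewritten as "affine sum a, then activation h(a)".
# A sketch of my own; h is a plain Python step function here.
def h(a):
    return 1 if a > 0 else 0

def perceptron(x1, x2, w1, w2, b):
    a = w1 * x1 + w2 * x2 + b   # a = w1*x1 + w2*x2 + b
    return h(a)                 # y = h(a)

print(perceptron(1, 1, 0.5, 0.5, -0.7))   # 1
print(perceptron(1, 0, 0.5, 0.5, -0.7))   # 0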

Step function

The function we have used so far, which switches its output depending on whether or not the input exceeds 0, is called a step function. In other words, it is expressed as

h(x) = \begin{cases} 0 & (x \leq 0) \\ 1 & (x > 0) \end{cases}

Step function


import numpy as np

def step_function(x):
    # Returns 1 where the input is greater than 0, and 0 otherwise
    return np.array(x > 0, dtype=int)

Sigmoid function

The Sigmoid function is one of the most commonly used activation functions for neural networks and is expressed by the following formula.

h(x) = \frac{1}{1+\exp(-x)}

The step function is discontinuous, so it is not differentiable at $x = 0$. Moreover, its derivative is 0 everywhere else, which means it cannot convey any change (in reality the value does change, but only as a discontinuous jump that the derivative cannot express). The Sigmoid function, on the other hand, likewise approaches 0 for small inputs and 1 for large inputs, but because it is continuous it is differentiable, and it can make continuous use of the values between 0 and 1. Because of these properties, the Sigmoid function is used as the activation function of neural networks instead of the step function.

Sigmoid function


def sigmoid(x):
    return 1 / (1 + np.exp(-x))
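As a quick check of the differentiability mentioned above, here is a minimal sketch of my own (not from the book) comparing the analytic derivative $h'(x) = h(x)(1-h(x))$ of the Sigmoid function with a numerical finite difference:

# Compare the analytic derivative of the sigmoid, h'(x) = h(x) * (1 - h(x)),
# with a central finite difference (a sketch of my own, not from the book).
x = np.linspace(-5.0, 5.0, 11)
eps = 1e-5
analytic = sigmoid(x) * (1 - sigmoid(x))
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
print(np.allclose(analytic, numeric))    # True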

ReLU function

The ReLU function is an activation function that has become popular in recent neural networks and is expressed by the following formula.

h(x) = \begin{cases} x & (x > 0) \\ 0 & (x \leq 0) \end{cases}

The ReLU function is said to involve simpler computation, faster learning, and often better performance than the step and Sigmoid functions. In addition, compared to the Sigmoid function, which has classically been used in neural networks, the vanishing gradient problem (which will probably come up later) is less likely to occur, and its derivative is easy to handle[^1], so it is now widely used.

ReLU function


def relu(x):
    return np.maximum(0, x)

[Figure: activation_fun.png (plots of the step, Sigmoid, and ReLU functions)]
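A minimal plotting sketch of my own (assuming Matplotlib is available, as it is on Google Colab, and that the three functions above are already defined) that reproduces a figure like the one above:

import matplotlib.pyplot as plt

# Plot the step, sigmoid, and ReLU functions on the same axes (a sketch).
x = np.linspace(-5.0, 5.0, 200)
plt.plot(x, step_function(x), label='step')
plt.plot(x, sigmoid(x), label='sigmoid')
plt.plot(x, relu(x), label='ReLU')
plt.ylim(-0.1, 1.5)   # clip the y-axis so the ReLU line does not dominate
plt.legend()
plt.show()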

2. Output layer activation function

Neural networks are based on the multilayer perceptron (MLP) that appeared at the end of Part 1. The activation functions seen in the previous chapter are used in the hidden layers inside the neural network. The final layer is special, and a different activation function $\sigma$ is often used there. Neural networks can be applied to both regression and classification problems, but the design of the output layer differs depending on the problem to be solved.

Regression problem

In regression problems, which predict (continuous) numerical values from the input data, the identity function is used as the activation function of the output layer. As the name implies, the identity function outputs its input unchanged. That is,

x \rightarrow \sigma(x) \rightarrow x
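In code this is trivial; a minimal sketch (the name identity_function is my own choice here):

def identity_function(x):
    # Output layer for regression: return the input as it is.
    return x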

Classification problem

The Softmax function is used as the activation function of the output layer in the classification problem that classifies the input data into several classes. The Softmax function is expressed as follows.

y_k = \frac{\exp(a_k)}{\displaystyle\sum_{i=1}^n\exp(a_i)}

Since the denominator is the sum over all elements and the numerator is one of them, this formula can be read as the probability that class $k$ is chosen out of the $n$ candidates $a_1, \dots, a_n$. In other words, the network computes the probability that the input belongs to each class, and the input is classified into the class with the highest probability. Incidentally, implementations use the following equivalent form to prevent overflow.

y_k = \frac{\exp(a_k-C)}{\displaystyle\sum_{i=1}^n\exp(a_i-C)}, \qquad C=\max_i a_i

Because the Softmax function is built from exponentials, subtracting the same constant $C$ inside every exponential is the same as multiplying both the numerator and the denominator by $\exp(-C)$, so the result does not change.
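A minimal implementation sketch of the overflow-safe Softmax for a 1-D NumPy array (batched inputs would need an axis argument, which is not covered here):

def softmax(a):
    # Subtract the maximum value for numerical stability; the output is unchanged.
    c = np.max(a)
    exp_a = np.exp(a - c)
    return exp_a / np.sum(exp_a)

print(softmax(np.array([0.3, 2.9, 4.0])))    # roughly [0.018 0.245 0.737]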

3. Actually build a neural network

Now that we have the ingredients, let's actually implement a neural network. Here we consider the three-layer neural network shown in the figure below; its structure is exactly that of an MLP.

[Figure: network.png (the three-layer network)]

To look at what happens inside, we describe the propagation from the 0th layer (the input layer) to the 1st layer. Here $x_n$, $a_m^{(k)}$, $z_m^{(k)}$, $b_m$, $w_{m,n}$, and $y_l$ denote, respectively, the $n$-th initial input, the input and output of the $m$-th neuron in the $k$-th layer, the bias of the $m$-th neuron, the weight from the $n$-th output to the $m$-th input, and the $l$-th final output.

[Figure: 01network.png (propagation from the input layer to the 1st layer)]

The inputs to the 1st layer are

a_1^{(1)} = w_{11}^{(1)}x_1 + w_{12}^{(1)}x_2 + b_1^{(1)},\\
a_2^{(1)} = w_{21}^{(1)}x_1 + w_{22}^{(1)}x_2 + b_2^{(1)},\\
a_3^{(1)} = w_{31}^{(1)}x_1 + w_{32}^{(1)}x_2 + b_3^{(1)}

Converting this to matrix notation gives

\boldsymbol{A}^{(1)} = \boldsymbol{X}\boldsymbol{W}^{(1)} + \boldsymbol{B}^{(1)},\qquad
\boldsymbol{A}^{(1)} = \left[a_1^{(1)}~a_2^{(1)}~a_3^{(1)}\right],\qquad
\boldsymbol{X} = \left[x_1~x_2\right],\\
\boldsymbol{W}^{(1)} = \left[ \begin{array}{rrr} w_{11}^{(1)} & w_{21}^{(1)} & w_{31}^{(1)} \\ w_{12}^{(1)} & w_{22}^{(1)} & w_{32}^{(1)} \end{array} \right],\qquad
\boldsymbol{B}^{(1)} = \left[b_1^{(1)}~b_2^{(1)}~b_3^{(1)}\right]

(Personally, given how $w_{m,n}$ is defined, I think it would be cleaner to write this with the transposed weight matrix, but I will keep the book's convention.)

The input to the 1st layer, $\boldsymbol{A}^{(1)}$, is passed through the activation function $h$ to give the 1st-layer output $\boldsymbol{Z}^{(1)}$:

\boldsymbol{Z}^{(1)} = h\left(\boldsymbol{A}^{(1)}\right),\qquad
\boldsymbol{Z}^{(1)} = \left[z_1^{(1)}~z_2^{(1)}~z_3^{(1)}\right]

The propagation from the 1st to the 2nd layer and from the 2nd to the 3rd layer is computed in the same way, but since the output layer uses a different activation function, only the output layer becomes

\boldsymbol{Y} = \sigma\left(\boldsymbol{A}^{(3)}\right),\qquad
\boldsymbol{Y} = \left[y_1~y_2\right]

Let's implement the formulas up to this point in code.

3-layer neural network


def init_network():
    network = {}
    network['W1'] = np.array([[0.1, 0.3, 0.5], [0.2, 0.4, 0.6]])
    network['b1'] = np.array([[0.1, 0.2, 0.3]])
    network['W2'] = np.array([[0.1, 0.4], [0.2, 0.5], [0.3, 0.6]])
    network['b2'] = np.array([[0.1, 0.2]])
    network['W3'] = np.array([[0.1, 0.3], [0.2, 0.4]])
    network['b3'] = np.array([[0.1, 0.2]])

    return network

def forward(network, x):
    W1, W2, W3 = network['W1'], network['W2'], network['W3']
    b1, b2, b3 = network['b1'], network['b2'], network['b3']

    # Input layer -> layer 1
    a1 = np.dot(x, W1) + b1
    z1 = sigmoid(a1)
    # Layer 1 -> layer 2
    a2 = np.dot(z1, W2) + b2
    z2 = sigmoid(a2)
    # Layer 2 -> output layer (identity function as the output activation)
    a3 = np.dot(z2, W3) + b3
    y = a3

    return y

network = init_network()
x = np.array([1.0, 0.5])
y = forward(network, x)
print(y)    #[[0.31682708 0.69627909]]

The first function, init_network(), defines the weight and bias values for each layer. The next function, forward(), performs the matrix calculations of the network. In the main code, we pass the network (the weights and biases) and the input x to forward() to compute the output y.
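As a quick sanity check of my own (not from the book), the array shapes match the matrices in the formulas above:

# Print the shape of each weight and bias array (a sketch of my own).
for key in ['W1', 'b1', 'W2', 'b2', 'W3', 'b3']:
    print(key, network[key].shape)
# W1 (2, 3), b1 (1, 3), W2 (3, 2), b2 (1, 2), W3 (2, 2), b3 (1, 2)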

You have now implemented a neural network. Next time, I will try the MNIST handwritten digit recognition problem as an example of using a neural network.

References

Deep Learning from Scratch
Deep Learning from Scratch: GitHub repository
Deep Learning (Machine Learning Professional Series)

[^1]: Strictly speaking, the ReLU function is also not differentiable at $x = 0$ mathematically, but in code its derivative is simply taken as $dh/dx = 1~(x > 0),~0~(x \leq 0)$, and unlike the Sigmoid function it can still express that a change occurred, so this seems to be acceptable.
