[Learning memo] Deep Learning from Scratch [through Chapter 4]

Learning memo for the book "Deep Learning from Scratch":

https://www.amazon.co.jp/dp/4873117585/

Memo of functions that are likely to be used

Sigmoid function: sigmoid

Important properties: output is between 0 and 1, smooth, and monotonic (the last point is not mentioned in the book)

Sigmoid function


h(x) = \frac{1}{1+\exp(-x)}

python


import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

sigmoid.png

Softmax function

Because exp is monotonically increasing, applying softmax does not change which output is the largest; the class with the maximum score is the same whether or not softmax is applied, so it is common to omit the softmax function in the output layer at inference time (a quick check follows the code below).

y_k = \frac{\exp(a_k)}{\sum_{i=1}^{n}\exp(a_i)} = \frac{\exp(a_k + C')}{\sum_{i=1}^{n}\exp(a_i + C')}

python


def softmax(a):
    c = np.max(a)
    exp_a = np.exp(a - c)  # subtract the max for numerical stability (overflow countermeasure)
    sum_exp_a = np.sum(exp_a)
    y = exp_a / sum_exp_a
    return y
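A quick numerical check of the property above (a minimal sketch, assuming the softmax defined here; the input values are just an example): the outputs sum to 1 and the index of the maximum does not change.

python


import numpy as np

a = np.array([0.3, 2.9, 4.0])
y = softmax(a)

print(y)            # approx. [0.018 0.245 0.737]
print(np.sum(y))    # 1.0
print(np.argmax(a), np.argmax(y))  # 2 2 -> the largest element stays the largest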

Loss function

Why use a loss function rather than recognition accuracy as the index: if recognition accuracy were used as the index, the derivative with respect to the parameters would become 0 at most places, and learning would get stuck.

Sum of squared errors: mean_squared_error

E = \frac{1}{2}\sum_{k} (y_k - t_k)^2

python


def mean_squared_error(y,t):
    return 0.5 * np.sum((y-t)**2)
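A small usage example (a sketch with made-up numbers, assuming the mean_squared_error above and a one-hot label t): the closer y is to t, the smaller the error.

python


import numpy as np

t = np.array([0, 1, 0])           # one-hot label: class 1 is the correct answer
y = np.array([0.1, 0.6, 0.3])     # prediction that puts most weight on the correct class
print(mean_squared_error(y, t))   # 0.5 * (0.01 + 0.16 + 0.09) = 0.13

y = np.array([0.6, 0.1, 0.3])     # prediction that puts most weight on a wrong class
print(mean_squared_error(y, t))   # 0.5 * (0.36 + 0.81 + 0.09) = 0.63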

Cross entropy error: cross_entropy_error

Point: one-hot representation: only the element for the correct label is 1, the others are 0 (t is the label)

E = -\sum_{k} t_k \log y_k

python


def cross_entropy_error(y, t):
    delta = 1e-7  # tiny value to avoid log(0)
    return -np.sum(t * np.log(y + delta))
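The same comparison with the cross_entropy_error above (a sketch with the same made-up numbers): because t is one-hot, the result is just -log of the probability assigned to the correct class.

python


import numpy as np

t = np.array([0, 1, 0])
y = np.array([0.1, 0.6, 0.3])
print(cross_entropy_error(y, t))  # -log(0.6) ~= 0.51

y = np.array([0.6, 0.1, 0.3])
print(cross_entropy_error(y, t))  # -log(0.1) ~= 2.30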

Mini-batch compatible version: cross_entropy_error

Mini-batch (small chunk): select a part of the data and use it as an "approximation" of the whole. Point: with one-hot labels, the terms for incorrect labels become 0 (their error contribution is 0), so only the output for the correct label matters. Dividing by N gives a unified index regardless of the number of training samples.

E = -\frac{1}{N}\sum_{n}\sum_{k} t_{nk} \log y_{nk}

python


def cross_entropy_error(y, t):
    # Here t is given as class-label indices (not one-hot)
    if y.ndim == 1:
        t = t.reshape(1, t.size)
        y = y.reshape(1, y.size)

    batch_size = y.shape[0]
    return -np.sum(np.log(y[np.arange(batch_size), t] + 1e-7)) / batch_size
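A sketch of how this version might be used with a mini-batch. The training data here is random dummy data (an assumption; the book uses MNIST), and t is given as class-label indices as in the code above.

python


import numpy as np

# Dummy stand-ins for training data (the real data would be loaded from a dataset)
x_train = np.random.rand(1000, 784)        # 1000 samples, 784 features
t_train = np.random.randint(0, 10, 1000)   # integer class labels 0-9

train_size = x_train.shape[0]
batch_size = 10

# Randomly select batch_size samples as an "approximation" of the whole data set
batch_mask = np.random.choice(train_size, batch_size)
x_batch = x_train[batch_mask]
t_batch = t_train[batch_mask]

# Dummy predictions: a uniform distribution over 10 classes for every sample
y_batch = np.full((batch_size, 10), 0.1)
print(cross_entropy_error(y_batch, t_batch))  # -log(0.1) ~= 2.30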

Differentiation

Numerical differentiation

Point: set h to about 1e-4 so that rounding error does not become a problem

python


def numerical_diff(f, x):
    h = 1e-4
    return (f(x+h) - f(x-h)) / (2*h)  # central difference
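A quick sanity check (a sketch; function_1 is just an illustrative polynomial): for f(x) = 0.01x^2 + 0.1x the analytical derivative is 0.02x + 0.1, and the numerical result should be very close.

python


def function_1(x):
    return 0.01 * x**2 + 0.1 * x

print(numerical_diff(function_1, 5))   # ~= 0.2 (analytical: 0.02*5 + 0.1)
print(numerical_diff(function_1, 10))  # ~= 0.3 (analytical: 0.02*10 + 0.1)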

Partial derivatives

python


# Partial derivative with respect to x0 when x1 = 4
# (f(x0, x1) = x0**2 + x1**2, with x1 fixed at 4.0)

def function_tmp1(x0):
    return x0*x0 + 4.0**2.0

numerical_diff(function_tmp1, 3.0)  # ~= 6.0 (analytical value: 2*x0 = 6)

Gradient

Gradient: the vector that collects the partial derivatives with respect to all variables

python


def numerical_gradient(f, x):
    h = 1e-4  # 0.0001
    grad = np.zeros_like(x)  # Array of zeros with the same shape as x

    # Point: differentiate with respect to each variable, one at a time
    for idx in range(x.size):
        tmp_val = x[idx]
        x[idx] = float(tmp_val) + h
        fxh1 = f(x)  # f(x+h)

        x[idx] = tmp_val - h
        fxh2 = f(x)  # f(x-h)
        grad[idx] = (fxh1 - fxh2) / (2*h)

        x[idx] = tmp_val  # Restore the original value

    return grad
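A sanity check on f(x0, x1) = x0^2 + x1^2 (the same function_2 used in the gradient-method example below), whose analytical gradient is (2*x0, 2*x1). A minimal sketch:

python


import numpy as np

def function_2(x):
    return x[0]**2 + x[1]**2

print(numerical_gradient(function_2, np.array([3.0, 4.0])))  # ~= [6. 8.]
print(numerical_gradient(function_2, np.array([0.0, 2.0])))  # ~= [0. 4.]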

Gradient method

Gradient method: repeatedly move a small step along the gradient (in the direction that decreases the function) to gradually reduce its value. Point: the gradient method reaches a local minimum, which is not necessarily the global minimum. The picture on p. 31 of Lecture 9 in Week 5 of Andrew Ng's Coursera Machine Learning course makes this easy to visualize.

x_0 = x_0 - \eta\frac{\partial f}{\partial x_0} \\
x_1 = x_1 - \eta\frac{\partial f}{\partial x_1}

\eta: learning rate (how much to update the parameters in one learning step; it should be neither too large nor too small)

python


def gradient_descent(f, init_x, lr=0.01, step_num=100):
    x = init_x
    
    for i in range(step_num):
        grad = numerical_gradient(f, x)
        x -= lr * grad
        
    return x

def function_2(x):
    return x[0]**2 + x[1]**2

init_x = np.array([-3.0, 4.0])
gradient_descent(function_2, init_x=init_x, lr=0.1, step_num=100)
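A sketch of why the learning rate must be neither too large nor too small, reusing function_2 and gradient_descent from above: a learning rate that is too large makes the value blow up, while one that is too small leaves x almost unchanged.

python


import numpy as np

# Learning rate too large (lr=10.0): the updates overshoot and the value diverges
init_x = np.array([-3.0, 4.0])
print(gradient_descent(function_2, init_x=init_x, lr=10.0, step_num=100))

# Learning rate too small (lr=1e-10): x barely moves from its initial value
init_x = np.array([-3.0, 4.0])
print(gradient_descent(function_2, init_x=init_x, lr=1e-10, step_num=100))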

Parameters that are set manually, such as the learning rate above, are called hyperparameters.

Gradients for a neural network

W = \biggl(\begin{matrix}
w_{11} & w_{21} & w_{31} \\
w_{12} & w_{22} & w_{32} 
\end{matrix}\biggr)\\


\frac{\partial L}{\partial W} = \Biggl(\begin{matrix}
\frac{\partial L}{\partial w_{11}} & \frac{\partial L}{\partial w_{21}} & \frac{\partial L}{\partial w_{31}}\\
\frac{\partial L}{\partial w_{12}} & \frac{\partial L}{\partial w_{22}} & \frac{\partial L}{\partial w_{32}} 
\end{matrix}\Biggr)\\

\frac{\partial L}{\partial w_{11}}: represents how much the loss function L changes when w_{11} is changed slightly

python


# coding: utf-8
import sys, os
sys.path.append(os.pardir)  #Settings for importing files in the parent directory
import numpy as np
from common.functions import softmax, cross_entropy_error
from common.gradient import numerical_gradient


class simpleNet:
    def __init__(self):
        self.W = np.random.randn(2,3)

    def predict(self, x):
        return np.dot(x, self.W)

    def loss(self, x, t):
        z = self.predict(x)
        y = softmax(z)
        loss = cross_entropy_error(y, t)

        return loss

python


# Try it out
# Input data
x = np.array([0.6, 0.9])
# Label (one-hot)
t = np.array([0, 0, 1])

net = simpleNet()

f = lambda w: net.loss(x, t)
# Compute the gradient of the loss with respect to W; learning then repeatedly
# moves W against this gradient to search for weights that minimize the loss function.
dW = numerical_gradient(f, net.W)

print(dW)

[[ 0.10181684  0.35488728 -0.45670412]
 [ 0.15272526  0.53233092 -0.68505618]]

The result above shows that increasing w_11 by a small amount h increases the loss by roughly 0.10181684 * h. In terms of the magnitude of its contribution, w_23 is the largest.
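A quick check (a sketch; the step size 0.1 is an arbitrary small value, not from the book) that moving W a little against this gradient actually lowers the loss, which is what one update step of the gradient method does:

python


loss_before = net.loss(x, t)
net.W -= 0.1 * dW               # one small step against the gradient
loss_after = net.loss(x, t)
print(loss_before, loss_after)  # loss_after should be smaller than loss_before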

python


#Lambda expression
myfunc = lambda x: x ** 2 

myfunc(5)  # 25
myfunc(6)  # 36

#This is the same as below
def myfunc(x):
    return x ** 2

Implementation of learning algorithm

Learning overview

Neural network training: adjusting the weights and biases so that the network fits the training data

Procedure

Step 1: Mini-batch. Randomly select a portion of the training data (a mini-batch). The goal is to reduce the value of the loss function on this mini-batch.

Step 2: Gradient calculation. Compute the gradient of each weight parameter in order to reduce the loss function on the mini-batch. The gradient indicates the direction that reduces the value of the loss function the most.

Step 3: Update parameters. Update the weight parameters by a small amount in the gradient direction.

Step 4: Repeat. Repeat steps 1-3 (a loop sketch follows below).
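A minimal sketch of steps 1-4 as a loop. It assumes hypothetical x_train / t_train arrays that are already loaded and a network object like the TwoLayerNet implemented below; none of these names come from this section.

python


import numpy as np

# Assumptions: x_train, t_train are already loaded (e.g. MNIST), and
# network = TwoLayerNet(input_size=784, hidden_size=50, output_size=10)

iters_num = 1000
batch_size = 100
learning_rate = 0.1
train_size = x_train.shape[0]

for i in range(iters_num):
    # Step 1: mini-batch
    batch_mask = np.random.choice(train_size, batch_size)
    x_batch = x_train[batch_mask]
    t_batch = t_train[batch_mask]

    # Step 2: gradient calculation
    grad = network.numerical_gradient(x_batch, t_batch)

    # Step 3: update parameters (move a small amount against the gradient)
    for key in ('W1', 'b1', 'W2', 'b2'):
        network.params[key] -= learning_rate * grad[key]

    # Step 4: the for loop repeats steps 1-3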

Terminology

Stochastic gradient descent (SGD): "stochastic" because the mini-batch data is selected randomly; "gradient descent" because it descends along the gradient to find a minimum.

  • There is also a gradient ascent method, but reversing the sign of the loss function turns it into the same problem, so the distinction is not essential.

Epoch: one epoch corresponds to using up all of the training data once in learning. Example: with 10,000 training samples and mini-batches of 100, repeating stochastic gradient descent 100 times is one epoch.

Implementation and description

python


# coding: utf-8
import sys, os
sys.path.append(os.pardir)  # Settings for importing files in the parent directory
import numpy as np
from common.functions import *
from common.gradient import numerical_gradient


class TwoLayerNet:

    #Initialization
    def __init__(self, input_size, hidden_size, output_size, weight_init_std=0.01):
        #Weight initialization
        self.params = {}
        self.params['W1'] = weight_init_std * np.random.randn(input_size, hidden_size)
        self.params['b1'] = np.zeros(hidden_size)
        self.params['W2'] = weight_init_std * np.random.randn(hidden_size, output_size)
        self.params['b2'] = np.zeros(output_size)

    #Perform recognition (inference). The argument x is the image data
    def predict(self, x):
        W1, W2 = self.params['W1'], self.params['W2']
        b1, b2 = self.params['b1'], self.params['b2']
    
        a1 = np.dot(x, W1) + b1
        z1 = sigmoid(a1)
        a2 = np.dot(z1, W2) + b2
        y = softmax(a2)
        
        return y

    #Find the loss function
    # x:Input data, t:Teacher data
    def loss(self, x, t):
        y = self.predict(x)
        
        return cross_entropy_error(y, t)
    
    #Find recognition accuracy
    def accuracy(self, x, t):
        y = self.predict(x)
        y = np.argmax(y, axis=1)
        t = np.argmax(t, axis=1)
        
        accuracy = np.sum(y == t) / float(x.shape[0])
        return accuracy
        
    #Find the gradient for the weight parameter
    # x:Input data, t:Teacher data
    def numerical_gradient(self, x, t):
        loss_W = lambda W: self.loss(x, t)
        
        grads = {}
        grads['W1'] = numerical_gradient(loss_W, self.params['W1'])
        grads['b1'] = numerical_gradient(loss_W, self.params['b1'])
        grads['W2'] = numerical_gradient(loss_W, self.params['W2'])
        grads['b2'] = numerical_gradient(loss_W, self.params['b2'])
        
        return grads
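A quick instantiation to check the parameter shapes (a sketch; the sizes assume MNIST-style input with 784 pixels and 10 output classes):

python


net = TwoLayerNet(input_size=784, hidden_size=100, output_size=10)

print(net.params['W1'].shape)  # (784, 100)
print(net.params['b1'].shape)  # (100,)
print(net.params['W2'].shape)  # (100, 10)
print(net.params['b2'].shape)  # (10,)

# Inference on dummy input: 100 images of 784 pixels -> 100 rows of 10 scores
x = np.random.rand(100, 784)
y = net.predict(x)
print(y.shape)                 # (100, 10)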

The illustration in the book is hard to follow; the point is that this picture-like computation is done all at once as matrix operations. The picture on p. 13 of Lecture 9 in Week 5 of Andrew Ng's Coursera Machine Learning course is easier to understand.
memo.png

Mini-batch learning, evaluation with test data

Omitted here, because it only repeats the gradient method to improve accuracy. Evaluation with test data is also omitted, because it only plots the accuracy on the test data in order to judge whether the model is overfitting.
