https://www.amazon.co.jp/dp/4873117585/
Sigmoid function
Important properties: the output stays between 0 and 1, it is smooth, and it is monotonic (monotonicity is not mentioned in the book)
h(x) = \frac{1}{1+\exp(-x)}
python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))
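A quick check with the function above: applied to a NumPy array it works element-wise, and every output lies between 0 and 1.
python
sigmoid(np.array([-1.0, 1.0, 2.0]))  # array([0.26894142, 0.73105858, 0.88079708])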
Softmax function: the largest element stays the largest whether or not softmax is applied, so it is common to omit the softmax function of the output layer at inference time.
y_k = \frac{\exp(a_k)}{\sum_{i=1}^{n}\exp(a_i)} = \frac{\exp(a_k + C')}{\sum_{i=1}^{n}\exp(a_i + C')}
python
def softmax(a):
    c = np.max(a)             # subtract the maximum value to prevent overflow
    exp_a = np.exp(a - c)
    sum_exp_a = np.sum(exp_a)
    y = exp_a / sum_exp_a
    return y
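A quick check that softmax does not change which element is the largest (exp is monotonically increasing, so the order of the elements is preserved):
python
a = np.array([0.3, 2.9, 4.0])
np.argmax(a)           # 2
np.argmax(softmax(a))  # 2, the same element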
Why a loss function is needed: if recognition accuracy were used as the metric, the derivative with respect to the parameters would become 0 at most places, and learning would get stuck.
E = \frac{1}{2}\sum_{k} (y_k - t_k)^2
python
def mean_squared_error(y, t):
    return 0.5 * np.sum((y - t)**2)
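A small numerical check of the function above, where t is a one-hot label (explained in the next point) and the correct class is index 2:
python
t = np.array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0])
y = np.array([0.1, 0.05, 0.6, 0.0, 0.05, 0.1, 0.0, 0.1, 0.0, 0.0])
mean_squared_error(y, t)  # approximately 0.0975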
Point: one-hot representation: only the element for the correct label is 1 and all others are 0 (t is the label vector)
E = -\sum_{k} t_k \log y_k
python
def cross_entropy_error(y, t):
    delta = 1e-7  # avoid log(0) = -inf
    return -np.sum(t * np.log(y + delta))
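With the same y and t as in the mean-squared-error example, the error is just -log of the probability assigned to the correct class:
python
t = np.array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0])
y = np.array([0.1, 0.05, 0.6, 0.0, 0.05, 0.1, 0.0, 0.1, 0.0, 0.0])
cross_entropy_error(y, t)  # approximately 0.51, i.e. -log(0.6)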
Mini-batch (small chunk): select a subset of the training data and use it as an "approximation" of the whole.
Point: with one-hot labels, the terms for incorrect labels are 0 (they contribute no error), so they can be ignored. Dividing by N gives a unified metric that does not depend on the number of training samples.
E = -\frac{1}{N}\sum_{n}\sum_{k} t_{nk} \log y_{nk}
python
def cross_entropy_error(y, t):
    if y.ndim == 1:
        t = t.reshape(1, t.size)
        y = y.reshape(1, y.size)
    batch_size = y.shape[0]
    # t holds label indices (not one-hot), so pick out only the outputs for the correct labels
    return -np.sum(np.log(y[np.arange(batch_size), t] + 1e-7)) / batch_size
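How a mini-batch might be drawn (a minimal sketch; x_train and t_train are assumed to be the MNIST training images and labels loaded elsewhere):
python
train_size = x_train.shape[0]                          # e.g. 60000
batch_size = 10
batch_mask = np.random.choice(train_size, batch_size)  # 10 random indices
x_batch = x_train[batch_mask]
t_batch = t_train[batch_mask]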
Point: set h to about 1e-4 so that rounding error does not become a problem.
python
def numerical_diff(f, x):
    h = 1e-4
    return (f(x+h) - f(x-h)) / (2*h)  # central difference
python
# Partial derivative with respect to x0 at the point where x1 = 4 (f(x0, x1) = x0^2 + x1^2)
def function_tmp1(x0):
    return x0*x0 + 4.0**2.0
numerical_diff(function_tmp1, 3.0)  # approximately 6.0
Gradient: A vector of partial derivatives of all variables
python
def numerical_gradient(f, x):
    h = 1e-4  # 0.0001
    grad = np.zeros_like(x)  # generate an array with the same shape as x, filled with zeros
    # The point is that the variables are differentiated one at a time, in order
    for idx in range(x.size):
        tmp_val = x[idx]
        x[idx] = float(tmp_val) + h
        fxh1 = f(x)  # f(x+h)
        x[idx] = tmp_val - h
        fxh2 = f(x)  # f(x-h)
        grad[idx] = (fxh1 - fxh2) / (2*h)
        x[idx] = tmp_val  # restore the original value
    return grad
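For example, the gradient of f(x_0, x_1) = x_0^2 + x_1^2 at the point (3, 4) is (6, 8):
python
def f(x):
    return x[0]**2 + x[1]**2

numerical_gradient(f, np.array([3.0, 4.0]))  # approximately array([6., 8.])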
Gradient method: repeatedly move in the gradient direction and gradually reduce the value of the function. Point: the gradient method reaches a local minimum, not necessarily the global minimum. The picture in Coursera's Machine Learning (Andrew Ng), Week 5, Lecture 9, p. 31 makes this easy to visualize.
x_0 = x_0 - \eta\frac{\partial f}{\partial x_0} \\
x_1 = x_1 - \eta\frac{\partial f}{\partial x_1}
\eta: learning rate (how much to update in a single learning step; it should be neither too large nor too small)
python
def gradient_descent(f, init_x, lr=0.01, step_num=100):
    x = init_x
    for i in range(step_num):
        grad = numerical_gradient(f, x)
        x -= lr * grad
    return x

def function_2(x):
    return x[0]**2 + x[1]**2

init_x = np.array([-3.0, 4.0])
gradient_descent(function_2, init_x=init_x, lr=0.1, step_num=100)  # approximately array([0., 0.])
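The learning rate matters: too large and the value explodes, too small and the parameters barely move (an illustrative check using the functions above):
python
# Learning rate too large: the value diverges instead of converging
gradient_descent(function_2, init_x=np.array([-3.0, 4.0]), lr=10.0, step_num=100)
# Learning rate too small: the result stays almost at the initial value
gradient_descent(function_2, init_x=np.array([-3.0, 4.0]), lr=1e-10, step_num=100)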
Parameters that are set by hand, such as the learning rate above, are called hyperparameters.
W = \biggl(\begin{matrix}
w_{11} & w_{21} & w_{31} \\
w_{12} & w_{22} & w_{32}
\end{matrix}\biggr)\\
\frac{\partial L}{\partial W} = \Biggl(\begin{matrix}
\frac{\partial L}{\partial w_{11}} & \frac{\partial L}{\partial w_{21}} & \frac{\partial L}{\partial w_{31}}\\
\frac{\partial L}{\partial w_{12}} & \frac{\partial L}{\partial w_{22}} & \frac{\partial L}{\partial w_{32}}
\end{matrix}\Biggr)\\
\frac{\partial L}{\partial w_{11}}: represents how much the loss function L changes when w_{11} is changed slightly
python
# coding: utf-8
import sys, os
sys.path.append(os.pardir)  # settings for importing files in the parent directory
import numpy as np
from common.functions import softmax, cross_entropy_error
from common.gradient import numerical_gradient

class simpleNet:
    def __init__(self):
        self.W = np.random.randn(2, 3)  # initialize the weights with a Gaussian distribution

    def predict(self, x):
        return np.dot(x, self.W)

    def loss(self, x, t):
        z = self.predict(x)
        y = softmax(z)
        loss = cross_entropy_error(y, t)
        return loss
python
# Try it out
# input data
x = np.array([0.6, 0.9])
# label (one-hot)
t = np.array([0, 0, 1])

net = simpleNet()
f = lambda w: net.loss(x, t)
# In short, we run the gradient method to find the weights that minimize the loss function
dW = numerical_gradient(f, net.W)
print(dW)
[[ 0.10181684 0.35488728 -0.45670412] [ 0.15272526 0.53233092 -0.68505618]]
The result above shows that increasing w_{11} by h increases the loss by about 0.10181684 * h. In terms of magnitude, w_{23} contributes the most (and since its gradient is negative, increasing it decreases the loss).
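As a sanity check (a minimal sketch using the net, x, t, and dW above), moving W a small step against the gradient should reduce the loss:
python
lr = 0.1
print(net.loss(x, t))  # loss before the update
net.W -= lr * dW       # one gradient-descent step
print(net.loss(x, t))  # loss after the update: it should be smaller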
python
# Lambda expression
myfunc = lambda x: x ** 2
myfunc(5)  # 25
myfunc(6)  # 36

# This is the same as:
def myfunc(x):
    return x ** 2
Neural network training: Adjusting weights and biases to adapt to training data
Step 1: Mini-batch. Randomly select some of the training data (a mini-batch). The goal is to reduce the value of the loss function over this mini-batch.
Step 2: Gradient calculation. Find the gradient of each weight parameter in order to reduce the loss function of the mini-batch. The gradient indicates the direction that reduces the value of the loss function the most.
Step 3: Update parameters. Update the weight parameters by a small amount in the gradient direction.
Step 4: Repeat. Repeat steps 1 to 3 (a minimal sketch of this loop appears at the end of this article).
Stochastic gradient descent (SGD): "stochastic" because the mini-batch is chosen randomly (probabilistically); "gradient descent" because it descends along the gradient toward a minimum.
Epoch: one epoch corresponds to having used up all the training data once in learning. Example: with 10,000 training samples and a mini-batch size of 100, repeating stochastic gradient descent 100 times is one epoch.
python
# coding: utf-8
import sys, os
sys.path.append(os.pardir)  # settings for importing files in the parent directory
import numpy as np
from common.functions import *
from common.gradient import numerical_gradient

class TwoLayerNet:
    # Initialization
    def __init__(self, input_size, hidden_size, output_size, weight_init_std=0.01):
        # Weight initialization
        self.params = {}
        self.params['W1'] = weight_init_std * np.random.randn(input_size, hidden_size)
        self.params['b1'] = np.zeros(hidden_size)
        self.params['W2'] = weight_init_std * np.random.randn(hidden_size, output_size)
        self.params['b2'] = np.zeros(output_size)

    # Perform recognition (inference). The argument x is the image data
    def predict(self, x):
        W1, W2 = self.params['W1'], self.params['W2']
        b1, b2 = self.params['b1'], self.params['b2']
        a1 = np.dot(x, W1) + b1
        z1 = sigmoid(a1)
        a2 = np.dot(z1, W2) + b2
        y = softmax(a2)
        return y

    # Compute the loss function
    # x: input data, t: teacher data
    def loss(self, x, t):
        y = self.predict(x)
        return cross_entropy_error(y, t)

    # Compute recognition accuracy
    def accuracy(self, x, t):
        y = self.predict(x)
        y = np.argmax(y, axis=1)
        t = np.argmax(t, axis=1)
        accuracy = np.sum(y == t) / float(x.shape[0])
        return accuracy

    # Compute the gradient with respect to the weight parameters
    # x: input data, t: teacher data
    def numerical_gradient(self, x, t):
        loss_W = lambda W: self.loss(x, t)
        grads = {}
        grads['W1'] = numerical_gradient(loss_W, self.params['W1'])
        grads['b1'] = numerical_gradient(loss_W, self.params['b1'])
        grads['W2'] = numerical_gradient(loss_W, self.params['W2'])
        grads['b2'] = numerical_gradient(loss_W, self.params['b2'])
        return grads
The book's illustration here is hard to follow: it just performs this picture-like computation all at once with matrix operations. The picture in Coursera's Machine Learning (Andrew Ng), Week 5, Lecture 9, p. 13 is easier to understand.
The book's mini-batch training loop is not reproduced here, since it only repeats the gradient method to improve accuracy; a minimal sketch is given below for reference. Evaluation with test data is also omitted, since it merely plots the accuracy on test data to judge whether the model is overfitting.
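A minimal sketch of that loop (steps 1 to 4 above), assuming MNIST is loaded with the book's load_mnist helper; the hyperparameter values are only illustrative:
python
from dataset.mnist import load_mnist

(x_train, t_train), (x_test, t_test) = load_mnist(normalize=True, one_hot_label=True)

network = TwoLayerNet(input_size=784, hidden_size=50, output_size=10)
iters_num = 10000
train_size = x_train.shape[0]
batch_size = 100
learning_rate = 0.1

for i in range(iters_num):
    # Step 1: mini-batch
    batch_mask = np.random.choice(train_size, batch_size)
    x_batch = x_train[batch_mask]
    t_batch = t_train[batch_mask]
    # Step 2: gradient calculation (numerical_gradient is very slow;
    # the book later replaces it with backpropagation)
    grad = network.numerical_gradient(x_batch, t_batch)
    # Step 3: update parameters
    for key in ('W1', 'b1', 'W2', 'b2'):
        network.params[key] -= learning_rate * grad[key]
    # Step 4: repeat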