https://www.amazon.co.jp/dp/4873117585/
Sigmoid function
Important properties: the output stays between 0 and 1, it is smooth, and it is monotonic (monotonicity is not mentioned in the book)
h(x) = \frac{1}{1+\exp(-x)}
python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))
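A quick check with the function above: applied to a NumPy array it works element-wise, and every output lies between 0 and 1.
python
sigmoid(np.array([-1.0, 1.0, 2.0]))  # array([0.26894142, 0.73105858, 0.88079708])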
Softmax function: the largest element stays the largest whether or not softmax is applied, so it is common to omit the softmax function of the output layer at inference time.
y_k = \frac{\exp(a_k)}{\sum_{i=1}^{n}\exp(a_i)} = \frac{\exp(a_k + C')}{\sum_{i=1}^{n}\exp(a_i + C')}
python
def softmax(a):
    c = np.max(a)             # subtract the maximum value to prevent overflow
    exp_a = np.exp(a - c)
    sum_exp_a = np.sum(exp_a)
    y = exp_a / sum_exp_a
    return y
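A quick check that softmax does not change which element is the largest (exp is monotonically increasing, so the order of the elements is preserved):
python
a = np.array([0.3, 2.9, 4.0])
np.argmax(a)           # 2
np.argmax(softmax(a))  # 2, the same element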
Why a loss function is needed: if recognition accuracy were used as the metric, the derivative with respect to the parameters would become 0 at most places, and learning would get stuck.
E = \frac{1}{2}\sum_{k} (y_k - t_k)^2
python
def mean_squared_error(y, t):
    return 0.5 * np.sum((y - t)**2)
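A small numerical check of the function above, where t is a one-hot label (explained in the next point) and the correct class is index 2:
python
t = np.array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0])
y = np.array([0.1, 0.05, 0.6, 0.0, 0.05, 0.1, 0.0, 0.1, 0.0, 0.0])
mean_squared_error(y, t)  # approximately 0.0975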
Point: one-hot representation: only the element for the correct label is 1 and all others are 0 (t is the label vector)
E = -\sum_{k} t_k \log y_k
python
def cross_entropy_error(y, t):
    delta = 1e-7  # avoid log(0) = -inf
    return -np.sum(t * np.log(y + delta))
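With the same y and t as in the mean-squared-error example, the error is just -log of the probability assigned to the correct class:
python
t = np.array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0])
y = np.array([0.1, 0.05, 0.6, 0.0, 0.05, 0.1, 0.0, 0.1, 0.0, 0.0])
cross_entropy_error(y, t)  # approximately 0.51, i.e. -log(0.6)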
Mini-batch (small chunk): select a subset of the training data and use it as an "approximation" of the whole.
Point: with one-hot labels, the terms for incorrect labels are 0 (they contribute no error), so they can be ignored. Dividing by N gives a unified metric that does not depend on the number of training samples.
E = -\frac{1}{N}\sum_{n}\sum_{k} t_{nk} \log y_{nk}
python
def cross_entropy_error(y, t):
    if y.ndim == 1:
        t = t.reshape(1, t.size)
        y = y.reshape(1, y.size)
    batch_size = y.shape[0]
    # t holds label indices (not one-hot), so pick out only the outputs for the correct labels
    return -np.sum(np.log(y[np.arange(batch_size), t] + 1e-7)) / batch_size
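How a mini-batch might be drawn (a minimal sketch; x_train and t_train are assumed to be the MNIST training images and labels loaded elsewhere):
python
train_size = x_train.shape[0]                          # e.g. 60000
batch_size = 10
batch_mask = np.random.choice(train_size, batch_size)  # 10 random indices
x_batch = x_train[batch_mask]
t_batch = t_train[batch_mask]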
Point: set h to about 1e-4 so that rounding error does not become a problem.
python
def numerical_diff(f, x):
    h = 1e-4
    return (f(x+h) - f(x-h)) / (2*h)  # central difference
python
# Partial derivative with respect to x0 at the point where x1 = 4 (f(x0, x1) = x0^2 + x1^2)
def function_tmp1(x0):
    return x0*x0 + 4.0**2.0
numerical_diff(function_tmp1, 3.0)  # approximately 6.0
Gradient: A vector of partial derivatives of all variables
python
def numerical_gradient(f, x):
    h = 1e-4  # 0.0001
    grad = np.zeros_like(x)  # generate an array with the same shape as x, filled with zeros
    # The point is that the variables are differentiated one at a time, in order
    for idx in range(x.size):
        tmp_val = x[idx]
        x[idx] = float(tmp_val) + h
        fxh1 = f(x)  # f(x+h)
        x[idx] = tmp_val - h
        fxh2 = f(x)  # f(x-h)
        grad[idx] = (fxh1 - fxh2) / (2*h)
        x[idx] = tmp_val  # restore the original value
    return grad
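For example, the gradient of f(x_0, x_1) = x_0^2 + x_1^2 at the point (3, 4) is (6, 8):
python
def f(x):
    return x[0]**2 + x[1]**2

numerical_gradient(f, np.array([3.0, 4.0]))  # approximately array([6., 8.])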
Gradient method: repeatedly move in the gradient direction and gradually reduce the value of the function. Point: the gradient method reaches a local minimum, not necessarily the global minimum. The picture in Coursera's Machine Learning (Andrew Ng), Week 5, Lecture 9, p. 31 makes this easy to visualize.
x_0 = x_0 - \eta\frac{\partial f}{\partial x_0} \\
x_1 = x_1 - \eta\frac{\partial f}{\partial x_1}
\eta: learning rate (how much to update in a single learning step; it should be neither too large nor too small)
python
def gradient_descent(f, init_x, lr=0.01, step_num=100):
    x = init_x
    for i in range(step_num):
        grad = numerical_gradient(f, x)
        x -= lr * grad
    return x

def function_2(x):
    return x[0]**2 + x[1]**2

init_x = np.array([-3.0, 4.0])
gradient_descent(function_2, init_x=init_x, lr=0.1, step_num=100)  # approximately array([0., 0.])
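The learning rate matters: too large and the value explodes, too small and the parameters barely move (an illustrative check using the functions above):
python
# Learning rate too large: the value diverges instead of converging
gradient_descent(function_2, init_x=np.array([-3.0, 4.0]), lr=10.0, step_num=100)
# Learning rate too small: the result stays almost at the initial value
gradient_descent(function_2, init_x=np.array([-3.0, 4.0]), lr=1e-10, step_num=100)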
Parameters that are set by hand, such as the learning rate above, are called hyperparameters.
W = \biggl(\begin{matrix}
w_{11} & w_{21} & w_{31} \\
w_{12} & w_{22} & w_{32}
\end{matrix}\biggr)\\
\frac{\partial L}{\partial W} = \Biggl(\begin{matrix}
\frac{\partial L}{\partial w_{11}} & \frac{\partial L}{\partial w_{21}} & \frac{\partial L}{\partial w_{31}}\\
\frac{\partial L}{\partial w_{12}} & \frac{\partial L}{\partial w_{22}} & \frac{\partial L}{\partial w_{32}}
\end{matrix}\Biggr)\\
\frac{\partial L}{\partial w_{11}}: represents how much the loss function L changes when w_{11} is changed slightly
python
# coding: utf-8
import sys, os
sys.path.append(os.pardir)  # settings for importing files in the parent directory
import numpy as np
from common.functions import softmax, cross_entropy_error
from common.gradient import numerical_gradient

class simpleNet:
    def __init__(self):
        self.W = np.random.randn(2, 3)  # initialize the weights with a Gaussian distribution

    def predict(self, x):
        return np.dot(x, self.W)

    def loss(self, x, t):
        z = self.predict(x)
        y = softmax(z)
        loss = cross_entropy_error(y, t)
        return loss
python
# Try it out
# input data
x = np.array([0.6, 0.9])
# label (one-hot)
t = np.array([0, 0, 1])

net = simpleNet()
f = lambda w: net.loss(x, t)
# In short, we run the gradient method to find the weights that minimize the loss function
dW = numerical_gradient(f, net.W)
print(dW)
[[ 0.10181684 0.35488728 -0.45670412] [ 0.15272526 0.53233092 -0.68505618]]
The result above shows that increasing w_{11} by h increases the loss by about 0.10181684 * h. In terms of magnitude, w_{23} contributes the most (and since its gradient is negative, increasing it decreases the loss).
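As a sanity check (a minimal sketch using the net, x, t, and dW above), moving W a small step against the gradient should reduce the loss:
python
lr = 0.1
print(net.loss(x, t))  # loss before the update
net.W -= lr * dW       # one gradient-descent step
print(net.loss(x, t))  # loss after the update: it should be smaller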
python
# Lambda expression
myfunc = lambda x: x ** 2
myfunc(5)  # 25
myfunc(6)  # 36

# This is the same as:
def myfunc(x):
    return x ** 2
Neural network training: Adjusting weights and biases to adapt to training data
Step 1: Mini-batch. Randomly select some of the training data (a mini-batch). The goal is to reduce the value of the loss function over this mini-batch.
Step 2: Gradient calculation. Find the gradient of each weight parameter in order to reduce the loss function of the mini-batch. The gradient indicates the direction that reduces the value of the loss function the most.
Step 3: Update parameters. Update the weight parameters by a small amount in the gradient direction.
Step 4: Repeat. Repeat steps 1 to 3 (a minimal sketch of this loop appears at the end of this article).
Stochastic gradient descent (SGD): "stochastic" because the mini-batch is chosen randomly (probabilistically); "gradient descent" because it descends along the gradient toward a minimum.
Epoch: one epoch corresponds to having used up all the training data once in learning. Example: with 10,000 training samples and a mini-batch size of 100, repeating stochastic gradient descent 100 times is one epoch.
python
# coding: utf-8
import sys, os
sys.path.append(os.pardir)  # settings for importing files in the parent directory
import numpy as np
from common.functions import *
from common.gradient import numerical_gradient

class TwoLayerNet:
    # Initialization
    def __init__(self, input_size, hidden_size, output_size, weight_init_std=0.01):
        # Weight initialization
        self.params = {}
        self.params['W1'] = weight_init_std * np.random.randn(input_size, hidden_size)
        self.params['b1'] = np.zeros(hidden_size)
        self.params['W2'] = weight_init_std * np.random.randn(hidden_size, output_size)
        self.params['b2'] = np.zeros(output_size)

    # Perform recognition (inference). The argument x is the image data
    def predict(self, x):
        W1, W2 = self.params['W1'], self.params['W2']
        b1, b2 = self.params['b1'], self.params['b2']
        a1 = np.dot(x, W1) + b1
        z1 = sigmoid(a1)
        a2 = np.dot(z1, W2) + b2
        y = softmax(a2)
        return y

    # Compute the loss function
    # x: input data, t: teacher data
    def loss(self, x, t):
        y = self.predict(x)
        return cross_entropy_error(y, t)

    # Compute recognition accuracy
    def accuracy(self, x, t):
        y = self.predict(x)
        y = np.argmax(y, axis=1)
        t = np.argmax(t, axis=1)
        accuracy = np.sum(y == t) / float(x.shape[0])
        return accuracy

    # Compute the gradient with respect to the weight parameters
    # x: input data, t: teacher data
    def numerical_gradient(self, x, t):
        loss_W = lambda W: self.loss(x, t)
        grads = {}
        grads['W1'] = numerical_gradient(loss_W, self.params['W1'])
        grads['b1'] = numerical_gradient(loss_W, self.params['b1'])
        grads['W2'] = numerical_gradient(loss_W, self.params['W2'])
        grads['b2'] = numerical_gradient(loss_W, self.params['b2'])
        return grads
The book's illustration here is hard to follow: it just performs this picture-like computation all at once with matrix operations. The picture in Coursera's Machine Learning (Andrew Ng), Week 5, Lecture 9, p. 13 is easier to understand.
The book's mini-batch training loop is not reproduced here, since it only repeats the gradient method to improve accuracy; a minimal sketch is given below for reference. Evaluation with test data is also omitted, since it merely plots the accuracy on test data to judge whether the model is overfitting.
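A minimal sketch of that loop (steps 1 to 4 above), assuming MNIST is loaded with the book's load_mnist helper; the hyperparameter values are only illustrative:
python
from dataset.mnist import load_mnist

(x_train, t_train), (x_test, t_test) = load_mnist(normalize=True, one_hot_label=True)

network = TwoLayerNet(input_size=784, hidden_size=50, output_size=10)
iters_num = 10000
train_size = x_train.shape[0]
batch_size = 100
learning_rate = 0.1

for i in range(iters_num):
    # Step 1: mini-batch
    batch_mask = np.random.choice(train_size, batch_size)
    x_batch = x_train[batch_mask]
    t_batch = t_train[batch_mask]
    # Step 2: gradient calculation (numerical_gradient is very slow;
    # the book later replaces it with backpropagation)
    grad = network.numerical_gradient(x_batch, t_batch)
    # Step 3: update parameters
    for key in ('W1', 'b1', 'W2', 'b2'):
        network.params[key] -= learning_rate * grad[key]
    # Step 4: repeat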