This is the material used when presenting at an in-house study session.
For a neural network to make "good inferences", its parameters must be set to optimal values.
Training a neural network requires a metric that shows how well learning is going → **Loss**
A **loss function** is used to compute the loss of a neural network.
- Loss functions
  - Squared error (in the first "Deep Learning from scratch" book)
    - @ohakutsu: Is it used for regression?
  - Cross entropy error
    - Often used for multi-class classification
In this section, the loss is computed with a layer structure that combines the Softmax and Cross Entropy Error layers into a single Softmax with Loss layer.
What is Softmax? → **Softmax function**
y_k = \frac{\exp(s_k)}{\displaystyle \sum_{i=1}^{n} \exp(s_i)}
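As a rough illustration of the softmax formula above, here is a minimal NumPy sketch. The function name `softmax` and the sample scores are chosen for illustration, and subtracting the maximum score before `exp` is a common numerical-stability trick assumed here; the formula itself does not require it.

```python
import numpy as np

def softmax(s):
    """Softmax over the last axis of s."""
    # Assumption: shift by the max so exp() cannot overflow; the result is unchanged.
    s = s - np.max(s, axis=-1, keepdims=True)
    e = np.exp(s)
    return e / np.sum(e, axis=-1, keepdims=True)

scores = np.array([1.0, 2.0, 3.0])  # illustrative scores s_k
y = softmax(scores)
print(y)        # probabilities, largest for the largest score
print(y.sum())  # sums to 1 up to floating-point rounding
```

Because each output is divided by the same sum, the outputs are positive and add up to 1, so they can be read as class probabilities.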
What is Cross Entropy Error? → **Cross entropy error**
L = - \sum_{k} t_k \log y_k
With mini-batch processing, the loss is averaged over the N samples in the batch, so we use
L = - \frac{1}{N} \sum_{n}\sum_{k} t_{nk} \log y_{nk}
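The mini-batch formula can be sketched as follows, assuming `y` holds probabilities of shape (N, K) and `t` holds one-hot labels of the same shape. The small constant `eps` is an assumption added to avoid `log(0)`, and the sample values are made up for illustration.

```python
import numpy as np

def cross_entropy_error(y, t, eps=1e-7):
    """Mean cross entropy over a mini-batch.

    y: (N, K) predicted probabilities, t: (N, K) one-hot labels.
    eps is an assumed safeguard against log(0).
    """
    n = y.shape[0]
    return -np.sum(t * np.log(y + eps)) / n

y = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])
t = np.array([[1, 0, 0],
              [0, 1, 0]])
loss = cross_entropy_error(y, t)
print(loss)  # roughly 0.29, i.e. -(log 0.7 + log 0.8) / 2
```

Since `t` is one-hot, only the probability assigned to the correct class contributes to each sample's loss.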
The goal of training a neural network is to find the parameters that minimize the loss. What matters here are **differentiation** and the **gradient**.
Derivative → the amount of change at a given instant (@ohakutsu: "Introduction to Mathematics for AI (Artificial Intelligence) Starting from Junior High School Mathematics", YouTube)
For a function
y = f(x)
the derivative of y with respect to x is written as
\frac{dy}{dx}
Derivatives can also be taken when there are multiple variables. With x as a vector,
L = f(x)
\frac{\partial L}{\partial x} = \left( \frac{\partial L}{\partial x_1}, \frac{\partial L}{\partial x_2}, ..., \frac{\partial L}{\partial x_n} \right)
The vector that collects the partial derivatives with respect to each element of x is called the **gradient**.
For a matrix, the gradient can be defined in the same way. Let W be an m × n matrix:
L = g(W)
\frac{\partial L}{\partial W} = \left(
\begin{array}{ccc}
\frac{\partial L}{\partial w_{11}} & \cdots & \frac{\partial L}{\partial w_{1n}} \\
\vdots & \ddots & \vdots \\
\frac{\partial L}{\partial w_{m1}} & \cdots & \frac{\partial L}{\partial w_{mn}}
\end{array}
\right)
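One way to get a feel for these gradients is to approximate each partial derivative numerically. The sketch below uses central differences; `numerical_gradient`, the step size `h`, and the example function are illustrative assumptions, not part of the original material. Because it walks over the elements with `.flat`, the same function handles a vector x or a matrix W and returns a gradient of the same shape.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-4):
    """Approximate dL/dx element by element with central differences."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        orig = x.flat[i]
        x.flat[i] = orig + h
        f_plus = f(x)
        x.flat[i] = orig - h
        f_minus = f(x)
        x.flat[i] = orig  # restore the original value
        grad.flat[i] = (f_plus - f_minus) / (2 * h)
    return grad

def f(x):
    return np.sum(x ** 2)  # example scalar function; its analytic gradient is 2x

x = np.array([3.0, -1.0, 0.5])
g = numerical_gradient(f, x)

W = np.array([[1.0, 2.0], [3.0, 4.0]])
gW = numerical_gradient(f, W)  # same shape as W
```

Numerical differentiation is slow for large networks, which is why backpropagation is used in practice, but it is handy for checking an analytic gradient.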
During training, the neural network outputs the loss when it is given training data. Once the gradient of the loss with respect to each parameter is obtained, it can be used to update the parameters.
How to find the gradient of a neural network → **Error backpropagation**
The key to understanding error backpropagation is the **chain rule**.
- Chain rule
  - The rule for differentiating composite functions
↓ Like this
y = f(x) \\
z = g(y)
These can be rewritten as
z = g(f(x))
The derivative of z with respect to x is
\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y}\frac{\partial y}{\partial x}
No matter how complex a composite function is, its derivative can be computed from the derivatives of the individual functions it is built from.
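A quick numerical check can make the chain rule concrete. The sketch below picks f(x) = x² and g(y) = sin(y) purely as example functions (they are assumptions, not from the original) and compares the chain-rule product with a central-difference estimate.

```python
import math

def f(x):
    return x ** 2        # inner function y = f(x)

def g(y):
    return math.sin(y)   # outer function z = g(y)

def df(x):
    return 2 * x         # dy/dx

def dg(y):
    return math.cos(y)   # dz/dy

x = 1.5
chain = dg(f(x)) * df(x)  # dz/dx = dz/dy * dy/dx

h = 1e-6
numeric = (g(f(x + h)) - g(f(x - h))) / (2 * h)  # direct numerical estimate

print(chain, numeric)  # the two values agree closely
```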
A calculation graph is a visual representation of a calculation.
Example:
z = x + y
Propagation in the reverse direction is called "backpropagation".
Typical operation nodes are listed below.
- Addition node
- Multiplication node
- Branch node
- Repeat node
- Sum node
- MatMul node
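The backward pass of each node listed above can be sketched in a few lines of NumPy. The helper names, array sizes, and random inputs below are illustrative assumptions; the point is only how the upstream gradient is transformed by each node.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, H = 3, 4, 2  # illustrative sizes

# Addition node z = x + y: the upstream gradient passes through unchanged.
def add_backward(dz):
    return dz, dz

# Multiplication node z = x * y: each input receives dz times the other input.
def mul_backward(dz, x, y):
    return dz * y, dz * x

# Branch node: one value is copied to two outputs, so backward adds both gradients.
def branch_backward(dz1, dz2):
    return dz1 + dz2

# Repeat node: copy a (1, D) row N times; backward sums the N gradients.
x = rng.standard_normal((1, D))
y = np.repeat(x, N, axis=0)              # forward: (1, D) -> (N, D)
dy = rng.standard_normal((N, D))
dx = np.sum(dy, axis=0, keepdims=True)   # backward: (N, D) -> (1, D)

# Sum node: the mirror image of Repeat.
xs = rng.standard_normal((N, D))
ys = np.sum(xs, axis=0, keepdims=True)   # forward: (N, D) -> (1, D)
dys = rng.standard_normal((1, D))
dxs = np.repeat(dys, N, axis=0)          # backward: (1, D) -> (N, D)

# MatMul node y = xW: dx = dy W^T and dW = x^T dy.
xm = rng.standard_normal((N, D))
W = rng.standard_normal((D, H))
ym = xm @ W
dym = rng.standard_normal((N, H))
dxm = dym @ W.T
dW = xm.T @ dym
```

In every case the gradient has the same shape as the input it belongs to, which is a useful sanity check when implementing layers.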
Implement each layer
The sigmoid function is
y = \frac{1}{1 + \exp(-x)}
The derivative of the sigmoid function is
\frac{\partial y}{\partial x} = y(1 - y)
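The expression y(1 - y) follows from differentiating the sigmoid directly. A short derivation, using only the definition of y above:

```latex
\begin{aligned}
\frac{\partial y}{\partial x}
  &= \frac{\exp(-x)}{(1 + \exp(-x))^2} \\
  &= \frac{1}{1 + \exp(-x)} \cdot \frac{\exp(-x)}{1 + \exp(-x)} \\
  &= y (1 - y)
\end{aligned}
```

The last step uses 1 - y = \exp(-x) / (1 + \exp(-x)). This is why the backward pass only needs the stored forward output and never has to recompute the exponential.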
[Figure: calculation graph of the Sigmoid layer]
Implemented in Python:
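import numpy as np

class Sigmoid:
    def __init__(self):
        self.params, self.grads = [], []  # no learnable parameters
        self.out = None

    def forward(self, x):
        out = 1 / (1 + np.exp(-x))
        self.out = out  # keep the output for use in backward
        return out

    def backward(self, dout):
        # dy/dx = y * (1 - y), scaled by the upstream gradient
        dx = dout * (1.0 - self.out) * self.out
        return dx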
The forward propagation of the Affine layer is
y = np.dot(x, W) + b
The bias b is broadcast when it is added.
Implemented in Python:
class Affine:
    def __init__(self, W, b):
        self.params = [W, b]
        self.grads = [np.zeros_like(W), np.zeros_like(b)]
        self.x = None

    def forward(self, x):
        W, b = self.params
        out = np.dot(x, W) + b
        self.x = x  # keep the input for use in backward
        return out

    def backward(self, dout):
        W, b = self.params
        dx = np.dot(dout, W.T)
        dW = np.dot(self.x.T, dout)
        db = np.sum(dout, axis=0)  # b was broadcast in forward, so sum over the batch
        # Overwrite in place so references to self.grads stay valid
        self.grads[0][...] = dW
        self.grads[1][...] = db
        return dx
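To see why the Affine layer's backward pass has this form, it can help to follow the array shapes. The sketch below repeats the same three formulas on raw arrays; the sizes, the random inputs, and the choice of `out.sum()` as a stand-in loss are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, H = 5, 3, 4                    # batch size, input size, output size
x = rng.standard_normal((N, D))
W = rng.standard_normal((D, H))
b = rng.standard_normal(H)

out = np.dot(x, W) + b               # b of shape (H,) is broadcast over the N rows
dout = np.ones_like(out)             # gradient of the stand-in loss out.sum()

dx = np.dot(dout, W.T)               # (N, H) x (H, D) -> (N, D)
dW = np.dot(x.T, dout)               # (D, N) x (N, H) -> (D, H)
db = np.sum(dout, axis=0)            # broadcast in forward -> sum in backward

print(dx.shape, dW.shape, db.shape)  # (5, 3) (3, 4) (4,)

# Numerical check of one entry of dW for the stand-in loss
h = 1e-5
Wp, Wm = W.copy(), W.copy()
Wp[0, 0] += h
Wm[0, 0] -= h
numeric = ((np.dot(x, Wp) + b).sum() - (np.dot(x, Wm) + b).sum()) / (2 * h)
print(numeric, dW[0, 0])             # the two values agree closely
```

The bias gradient is a sum over the batch axis because broadcasting b is a Repeat node, and the backward pass of Repeat is Sum.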
The Softmax with Loss layer, implemented in Python:
# softmax() and cross_entropy_error() are helper functions assumed to be defined elsewhere
class SoftmaxWithLoss:
    def __init__(self):
        self.params, self.grads = [], []
        self.y = None  # softmax output
        self.t = None  # teacher labels

    def forward(self, x, t):
        self.t = t
        self.y = softmax(x)
        # If the teacher labels are one-hot vectors, convert them to class indices
        if self.t.size == self.y.size:
            self.t = self.t.argmax(axis=1)
        loss = cross_entropy_error(self.y, self.t)
        return loss

    def backward(self, dout=1):
        batch_size = self.t.shape[0]
        dx = self.y.copy()
        dx[np.arange(batch_size), self.t] -= 1  # y - t for a one-hot t
        dx *= dout
        dx = dx / batch_size  # average over the mini-batch
        return dx
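The layer above relies on `softmax` and `cross_entropy_error`, which are not shown in this material. Below is a minimal sketch of what such helpers might look like, assuming integer class labels and a small constant inside the log to avoid `log(0)`. It also checks numerically that the backward formula (y - t) / N matches the gradient of the loss; the random scores and labels are made up for the check.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=1, keepdims=True)  # assumed stabilisation step
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy_error(y, t):
    """y: (N, K) probabilities, t: (N,) integer class indices."""
    n = y.shape[0]
    return -np.sum(np.log(y[np.arange(n), t] + 1e-7)) / n

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3))   # scores for 4 samples, 3 classes
t = np.array([0, 2, 1, 2])        # correct class index per sample

# Analytic gradient, computed the same way as in backward(): (y - t) / N
dx = softmax(x)
dx[np.arange(4), t] -= 1
dx /= 4

# Numerical gradient of the loss with respect to every score
h = 1e-5
numeric = np.zeros_like(x)
for idx in np.ndindex(*x.shape):
    xp, xm = x.copy(), x.copy()
    xp[idx] += h
    xm[idx] -= h
    numeric[idx] = (cross_entropy_error(softmax(xp), t)
                    - cross_entropy_error(softmax(xm), t)) / (2 * h)

max_diff = np.abs(dx - numeric).max()
print(max_diff)  # small; the two gradients agree
```

Combining Softmax and cross entropy into one layer is what gives the backward pass this simple "output minus label" form.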
The gradients obtained by backpropagation are used to update the neural network's parameters.
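Training a neural network follows the procedure below.
1. Mini-batch: select a subset of the training data
2. Gradient calculation: use backpropagation to compute the gradient of the loss with respect to each weight parameter
3. Parameter update: update the weight parameters using the gradients
4. Repeat steps 1 to 3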
Using the gradients obtained in step 2 (gradient calculation), update the parameters in the direction opposite to the gradient, which is the direction that reduces the loss. → **Gradient descent**
Here we use **SGD**, the simplest weight update method (several others are described in the first "Deep Learning from scratch" book).
W \leftarrow W - \eta \frac{\partial L}{\partial W}
\eta: learning rate
Implemented in Python:
class SGD:
    def __init__(self, lr=0.01):
        self.lr = lr  # learning rate

    def update(self, params, grads):
        for i in range(len(params)):
            params[i] -= self.lr * grads[i]  # in-place update of each parameter array
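As a small usage sketch, the SGD class can be tried on a toy problem. The loss L(w) = sum(w²), its gradient 2w, the starting point, and the learning rate are all assumptions chosen for illustration; the class itself is repeated so the example runs on its own.

```python
import numpy as np

class SGD:
    def __init__(self, lr=0.01):
        self.lr = lr

    def update(self, params, grads):
        for i in range(len(params)):
            params[i] -= self.lr * grads[i]

w = np.array([1.0, -2.0])       # toy parameter
params = [w]
optimizer = SGD(lr=0.1)

for _ in range(100):
    grads = [2 * params[0]]     # gradient of sum(w^2)
    optimizer.update(params, grads)

print(w)  # both elements end up very close to 0, the minimum of the toy loss
```

Note that `params[i] -= ...` modifies the NumPy array in place, so `w` itself changes. This is what lets a model and an optimizer share the same parameter arrays.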
In practice, the neural network's parameters are updated as follows:
model = TwoLayerNet( ... )
optimizer = SGD()
for i in range(10000):
    ...
    x_batch, t_batch = get_mini_batch( ... )  # Get a mini-batch
    loss = model.forward(x_batch, t_batch)
    model.backward()
    optimizer.update(model.params, model.grads)
    ...
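To make the loop above concrete, here is a self-contained sketch that runs end to end. Everything the original elides is an assumption here: the `TwoLayerNet` definition (Affine, Sigmoid, Affine, Softmax with Loss) and its constructor arguments are hypothetical, the data are two synthetic Gaussian blobs, random index sampling stands in for `get_mini_batch`, and the layer classes are condensed versions of the ones in this material with the loss helpers inlined.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=1, keepdims=True)  # assumed stabilisation step
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)

class Sigmoid:
    def __init__(self):
        self.params, self.grads = [], []
    def forward(self, x):
        self.out = 1 / (1 + np.exp(-x))
        return self.out
    def backward(self, dout):
        return dout * (1.0 - self.out) * self.out

class Affine:
    def __init__(self, W, b):
        self.params = [W, b]
        self.grads = [np.zeros_like(W), np.zeros_like(b)]
    def forward(self, x):
        W, b = self.params
        self.x = x
        return np.dot(x, W) + b
    def backward(self, dout):
        W, b = self.params
        self.grads[0][...] = np.dot(self.x.T, dout)
        self.grads[1][...] = np.sum(dout, axis=0)
        return np.dot(dout, W.T)

class SoftmaxWithLoss:
    def __init__(self):
        self.params, self.grads = [], []
    def forward(self, x, t):  # t: integer class indices
        self.t, self.y = t, softmax(x)
        n = x.shape[0]
        return -np.sum(np.log(self.y[np.arange(n), t] + 1e-7)) / n
    def backward(self, dout=1):
        n = self.t.shape[0]
        dx = self.y.copy()
        dx[np.arange(n), self.t] -= 1
        return dx * dout / n

class SGD:
    def __init__(self, lr=0.01):
        self.lr = lr
    def update(self, params, grads):
        for i in range(len(params)):
            params[i] -= self.lr * grads[i]

class TwoLayerNet:  # hypothetical definition, not taken from the original
    def __init__(self, input_size, hidden_size, output_size, rng):
        W1 = 0.01 * rng.standard_normal((input_size, hidden_size))
        W2 = 0.01 * rng.standard_normal((hidden_size, output_size))
        self.layers = [Affine(W1, np.zeros(hidden_size)), Sigmoid(),
                       Affine(W2, np.zeros(output_size))]
        self.loss_layer = SoftmaxWithLoss()
        # Collect references to every layer's parameter and gradient arrays
        self.params, self.grads = [], []
        for layer in self.layers:
            self.params += layer.params
            self.grads += layer.grads
    def forward(self, x, t):
        for layer in self.layers:
            x = layer.forward(x)
        return self.loss_layer.forward(x, t)
    def backward(self, dout=1):
        dout = self.loss_layer.backward(dout)
        for layer in reversed(self.layers):
            dout = layer.backward(dout)
        return dout

# Synthetic data: two Gaussian blobs, one per class
rng = np.random.default_rng(0)
x_train = np.vstack([rng.standard_normal((100, 2)) - 2.0,
                     rng.standard_normal((100, 2)) + 2.0])
t_train = np.array([0] * 100 + [1] * 100)

model = TwoLayerNet(2, 8, 2, rng)
optimizer = SGD(lr=0.5)
loss_before = model.forward(x_train, t_train)
for i in range(300):
    idx = rng.choice(len(x_train), size=32, replace=False)  # 1. mini-batch
    loss = model.forward(x_train[idx], t_train[idx])
    model.backward()                                         # 2. gradients
    optimizer.update(model.params, model.grads)              # 3. update
loss_after = model.forward(x_train, t_train)
print(loss_before, loss_after)  # the loss on the full data set goes down
```

The design relies on shared references: `model.params` and `model.grads` hold the very arrays owned by the layers, so the in-place writes in `Affine.backward` and `SGD.update` are visible everywhere without copying.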
A neural network is actually trained in section 1.4.
- O'Reilly Japan: Deep Learning from scratch ❷
- [oreilly-japan/deep-learning-from-scratch-2: "Deep Learning from scratch ❷" (O'Reilly Japan, 2018)](https://github.com/oreilly-japan/deep-learning-from-scratch-2)