What is this

This is the material used when presenting at an in-house study session.

1.3 Neural network learning

1.3.1 Loss function

To make "good inferences" with a neural network, you first have to find optimal parameters.

Neural network learning needs an index that tells how well learning is going → **Loss**

A **loss function** is used to compute the loss of a neural network.

- Loss functions
  - Squared error (covered in the first volume of Deep Learning from scratch)
    - @ohakutsu: is this the one used for regression?
  - Cross entropy error
    - Often used for multi-class classification

In this section, the following layer structure is used to find the loss.

Combining the Softmax layer and the Cross Entropy Error layer into a single layer gives the Softmax with Loss layer.

What is Softmax? → **Softmax function**

y_k = \frac{\exp(s_k)}{\displaystyle \sum_{i=1}^{n} \exp(s_i)}

Features

- Each output is a real number between 0.0 and 1.0
- All outputs sum to 1.0
- The outputs can be interpreted as probabilities
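The properties above can be checked with a small NumPy sketch. The function name `softmax` and the sample scores are assumptions made for illustration, and subtracting the maximum score is a common numerical-stability trick rather than part of the formula itself.

```python
import numpy as np

def softmax(s):
    """Softmax of a 1-D array of scores."""
    # Subtracting the max leaves the result unchanged but avoids overflow in exp.
    s = s - np.max(s)
    exp_s = np.exp(s)
    return exp_s / np.sum(exp_s)

scores = np.array([1.0, 2.0, 3.0])  # hypothetical scores s_1..s_3
y = softmax(scores)
print(y)        # every element lies between 0.0 and 1.0
print(y.sum())  # sums to 1.0 up to floating-point error
```

The largest score receives the largest probability, which is why the output can be read as a distribution over classes.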

What is Cross Entropy Error? → **Cross entropy error**

L = - \sum_{k} t_k \log y_k

Features

- t is the teacher label in one-hot representation (each t_k is 0 or 1), so the loss is simply the negative natural logarithm of the output y_k for which the label is 1.
- The closer that y_k is to 0, the larger the loss becomes; the closer it is to 1, the closer the loss gets to 0.
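As a rough illustration of this behaviour, the sketch below evaluates the loss for a confident correct prediction and for a poor one. The function name `cross_entropy`, the sample vectors, and the small constant `eps` (added to avoid `log(0)`) are assumptions of this example.

```python
import numpy as np

def cross_entropy(y, t):
    """L = -sum_k t_k * log(y_k) for one sample; t is a one-hot vector."""
    eps = 1e-7  # assumption of this sketch: guards against log(0)
    return -np.sum(t * np.log(y + eps))

t = np.array([0.0, 0.0, 1.0])        # the correct class is index 2
good = np.array([0.05, 0.05, 0.90])  # high probability on the correct class
bad = np.array([0.60, 0.30, 0.10])   # low probability on the correct class

loss_good = cross_entropy(good, t)
loss_bad = cross_entropy(bad, t)
print(loss_good, loss_bad)  # the second value is the larger one
```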

For mini-batch processing, we use the loss averaged over the N samples in the batch:

L = - \frac{1}{N} \sum_{n}\sum_{k} t_{nk} \log y_{nk}
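A minimal sketch of the mini-batch version, assuming `y` and `t` are both arrays of shape (N, K) with one-hot rows in `t`. The name `cross_entropy_batch` and the sample values are hypothetical.

```python
import numpy as np

def cross_entropy_batch(y, t):
    """Mean cross entropy over a batch; y and t have shape (N, K)."""
    n = y.shape[0]
    eps = 1e-7  # assumption of this sketch: guards against log(0)
    return -np.sum(t * np.log(y + eps)) / n

y_batch = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.8, 0.1]])
t_batch = np.array([[1, 0, 0],
                    [0, 1, 0]])

batch_loss = cross_entropy_batch(y_batch, t_batch)
print(batch_loss)  # average of -log(0.7) and -log(0.8)
```

Dividing by N keeps the loss on the same scale regardless of the batch size.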

1.3.2 Derivatives and gradients

The goal of neural network learning is to find the parameters that minimize the loss. What matters here are the **derivative** and the **gradient**.

Derivative → the amount of change at a given instant (@ohakutsu: "Introduction to Mathematics for AI (Artificial Intelligence) Starting from Junior High School Mathematics", YouTube)

y = f(x)

The derivative of y with respect to x is written as

\frac{dy}{dx}

Derivatives can also be taken when there are multiple variables. With x as a vector,

L = f(x)

\frac{\partial L}{\partial x} = \left( \frac{\partial L}{\partial x_1}, \frac{\partial L}{\partial x_2}, ..., \frac{\partial L}{\partial x_n} \right)

The vector that collects the partial derivatives with respect to each element of x is called the **gradient**.
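One way to see that the gradient has one entry per element of x is to approximate it numerically. The sketch below uses central differences; the helper `numerical_gradient`, the step size `h`, and the test function are assumptions for illustration, not the method used later in these notes.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-4):
    """Approximate dL/dx by central differences; x is a 1-D float array."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        orig = x[i]
        x[i] = orig + h
        f_plus = f(x)
        x[i] = orig - h
        f_minus = f(x)
        grad[i] = (f_plus - f_minus) / (2 * h)
        x[i] = orig  # restore the original value
    return grad

def f(x):
    # L = x_1^2 + ... + x_n^2, so dL/dx_i = 2 * x_i
    return np.sum(x ** 2)

x = np.array([3.0, -1.0, 0.5])
grad = numerical_gradient(f, x)
print(grad)  # close to 2 * x, with the same shape as x
```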

In the case of a matrix, the gradient can be considered in the same way. Let W be an m × n matrix

L = g(W)

\frac{\partial L}{\partial W} = \left(
  \begin{array}{ccc}
    \frac{\partial L}{\partial w_{11}} & \cdots & \frac{\partial L}{\partial w_{1n}} \\
    \vdots & \ddots & \\
    \frac{\partial L}{\partial w_{m1}} & & \frac{\partial L}{\partial w_{mn}}
  \end{array}
\right)

1.3.3 Chain rule

During training, a neural network outputs the loss when it is given training data. Once the gradient of the loss with respect to each parameter is obtained, it can be used to update the parameters.

How do we find the gradients of a neural network? → **Error backpropagation**

The key to understanding error backpropagation is the **chain rule**.

- Chain rule
  - The rule for differentiating composite functions

For example:

y = f(x) \\
z = g(y)

These can be rewritten as

z = g(f(x))

The derivative of z with respect to x is

\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y}\frac{\partial y}{\partial x}

However complex a composite function is, its derivative can be computed as the product of the derivatives of the individual functions that compose it.
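The chain rule can be checked numerically. In the sketch below, the concrete choices f(x) = x^2 and g(y) = sin(y) are assumptions picked only for illustration; the product of the two local derivatives is compared with a central-difference estimate of the derivative of g(f(x)).

```python
import math

def f(x):   # y = f(x) = x^2
    return x ** 2

def g(y):   # z = g(y) = sin(y)
    return math.sin(y)

def df(x):  # dy/dx
    return 2 * x

def dg(y):  # dz/dy
    return math.cos(y)

x = 0.7
y = f(x)
chain = dg(y) * df(x)  # dz/dx = dz/dy * dy/dx

h = 1e-6
numeric = (g(f(x + h)) - g(f(x - h))) / (2 * h)
print(chain, numeric)  # the two values agree closely
```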

1.3.4 Calculation graph

A calculation graph is a visual representation of a calculation.

Example:

z = x + y

Propagation in the reverse direction through the graph is called "backpropagation".

Typical operation nodes are listed below.

- Addition node
- Multiplication node
- Branch node
- Repeat node
- Sum node
- MatMul node
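The backward pass of each node listed above can be sketched as a small function. The function names and the array shapes (N samples, D inputs, H outputs) are assumptions of this example; `dout` stands for the gradient arriving from the output side.

```python
import numpy as np

# Addition node z = x + y: the gradient passes through unchanged to both inputs.
def add_backward(dout):
    return dout, dout

# Multiplication node z = x * y: each input receives dout times the other input.
def mul_backward(dout, x, y):
    return dout * y, dout * x

# Branch node (x is copied to two outputs): the two gradients are summed.
def branch_backward(dout1, dout2):
    return dout1 + dout2

# Repeat node ((1, D) copied N times into (N, D)): sum over the copies.
def repeat_backward(dout):
    return np.sum(dout, axis=0, keepdims=True)

# Sum node ((N, D) summed over axis 0 into (1, D)): repeat the gradient N times.
def sum_backward(dout, n):
    return np.repeat(dout, n, axis=0)

# MatMul node y = x @ W.
def matmul_backward(dout, x, W):
    dx = dout @ W.T
    dW = x.T @ dout
    return dx, dW

N, D, H = 4, 3, 2
x = np.random.randn(N, D)
W = np.random.randn(D, H)
dout = np.random.randn(N, H)
dx, dW = matmul_backward(dout, x, W)
print(dx.shape, dW.shape)  # same shapes as x and W
```

Note that the Repeat and Sum nodes are mirror images of each other: the forward pass of one has the same form as the backward pass of the other.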

1.3.5 Gradient derivation and backpropagation implementation

Next, we implement each layer.

Sigmoid layer

The sigmoid function is

y = \frac{1}{1 + \exp(-x)}

The derivative of the sigmoid function is

\frac{\partial y}{\partial x} = y(1 - y)
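For reference, this result follows by differentiating the sigmoid directly and using the fact that 1 - y = \exp(-x) / (1 + \exp(-x)). A short derivation, written out as a sketch:

```latex
\frac{\partial y}{\partial x}
  = \frac{\exp(-x)}{\left(1 + \exp(-x)\right)^{2}}
  = \frac{1}{1 + \exp(-x)} \cdot \frac{\exp(-x)}{1 + \exp(-x)}
  = y (1 - y)
```

Because the derivative depends only on the output y, the forward output can be cached and reused in the backward pass.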

(Figure: calculation graph of the Sigmoid layer)

Implemented in Python:

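import numpy as np


class Sigmoid:
  def __init__(self):
    # The Sigmoid layer has no learnable parameters
    self.params, self.grads = [], []
    self.out = None  # forward output, cached for use in backward

  def forward(self, x):
    out = 1 / (1 + np.exp(-x))
    self.out = out
    return out

  def backward(self, dout):
    # dy/dx = y * (1 - y), multiplied by the upstream gradient dout
    dx = dout * (1.0 - self.out) * self.out
    return dx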

Affine layer

The forward propagation of the Affine layer is

y = np.dot(x, W) + b

The addition of the bias b is broadcast over the samples in the batch.

Implemented in Python:

class Affine:
  def __init__(self, W, b):
    self.params = [W, b]
    self.grads = [np.zeros_like(W), np.zeros_like(b)]
    self.x = None  # forward input, cached for use in backward

  def forward(self, x):
    W, b = self.params
    out = np.dot(x, W) + b
    self.x = x
    return out

  def backward(self, dout):
    W, b = self.params
    dx = np.dot(dout, W.T)
    dW = np.dot(self.x.T, dout)
    db = np.sum(dout, axis=0)  # the bias was broadcast, so sum over the batch

    # Overwrite the existing arrays in place so references to grads stay valid
    self.grads[0][...] = dW
    self.grads[1][...] = db
    return dx
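A quick way to gain confidence in the backward pass is to compare it with a numerical estimate. The sketch below repeats the Affine layer so that it runs on its own, and uses L = sum(out) as a stand-in loss; that loss, the random shapes, and the step size `h` are assumptions made only for this check.

```python
import numpy as np

class Affine:
    # Same layer as above, repeated so that this sketch is self-contained.
    def __init__(self, W, b):
        self.params = [W, b]
        self.grads = [np.zeros_like(W), np.zeros_like(b)]
        self.x = None

    def forward(self, x):
        W, b = self.params
        self.x = x
        return np.dot(x, W) + b

    def backward(self, dout):
        W, b = self.params
        dx = np.dot(dout, W.T)
        self.grads[0][...] = np.dot(self.x.T, dout)
        self.grads[1][...] = np.sum(dout, axis=0)
        return dx

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 3))  # batch of N=5 inputs with D=3 features
W = rng.standard_normal((3, 2))
b = rng.standard_normal(2)

layer = Affine(W, b)
out = layer.forward(x)
# With the stand-in loss L = sum(out), dL/dout is an array of ones.
dx = layer.backward(np.ones_like(out))
dW, db = layer.grads

# Central-difference estimate of dL/dW[0, 0]; the layer holds a reference
# to W, so modifying W in place changes what forward() computes.
h = 1e-5
W[0, 0] += h
loss_plus = layer.forward(x).sum()
W[0, 0] -= 2 * h
loss_minus = layer.forward(x).sum()
W[0, 0] += h  # restore
numeric_dW00 = (loss_plus - loss_minus) / (2 * h)
print(dW[0, 0], numeric_dW00)  # the two values agree closely
```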

Softmax with Loss layer

class SoftmaxWithLoss:
  # softmax and cross_entropy_error are helper functions assumed to be defined elsewhere
  def __init__(self):
    self.params, self.grads = [], []
    self.y = None  # softmax output
    self.t = None  # teacher label

  def forward(self, x, t):
    self.t = t
    self.y = softmax(x)

    # If the teacher label is a one-hot vector, convert it to the index of the correct class
    if self.t.size == self.y.size:
      self.t = self.t.argmax(axis=1)

    loss = cross_entropy_error(self.y, self.t)
    return loss

  def backward(self, dout=1):
    batch_size = self.t.shape[0]

    # Gradient of softmax + cross entropy: y - t (with t as one-hot)
    dx = self.y.copy()
    dx[np.arange(batch_size), self.t] -= 1
    dx *= dout
    dx = dx / batch_size  # the loss is averaged over the batch

    return dx
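The class above calls `softmax` and `cross_entropy_error` without defining them. The sketch below shows one possible pair of helpers, assuming scores of shape (N, K) and labels given as class indices of shape (N,); these are stand-ins written for illustration and may differ from the versions in the book's repository.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax for an (N, K) score array (also accepts a 1-D array)."""
    x = x - x.max(axis=-1, keepdims=True)  # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy_error(y, t):
    """Mean cross entropy; y is (N, K) probabilities, t is (N,) class indices."""
    batch_size = y.shape[0]
    # Pick the probability assigned to the correct class of each sample.
    return -np.sum(np.log(y[np.arange(batch_size), t] + 1e-7)) / batch_size

scores = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.3]])
t = np.array([0, 1])

y = softmax(scores)
loss = cross_entropy_error(y, t)

# Gradient of the mean loss with respect to the scores, as in backward():
dx = y.copy()
dx[np.arange(2), t] -= 1
dx /= 2
print(loss)
print(dx)  # each row sums to zero
```

Each row of `dx` sums to zero because the softmax outputs of a row sum to 1 and exactly 1 is subtracted at the correct class.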

1.3.6 Weight update

The parameters of the neural network are updated using the gradients obtained by backpropagation.

Neural network learning follows the procedure below.

1. Mini-batch
   - When there is a lot of data, processing all of it takes time, so a part of the data is used as an approximation of the whole (from the first volume of Deep Learning from scratch)
2. Gradient calculation
   - Find the gradient of the loss function with respect to each weight parameter using backpropagation
3. Parameter update
4. Repeat steps 1 to 3

3. Parameter update

Using the gradient obtained in `2. Gradient calculation`, update the parameters in the direction opposite to the gradient (the direction that reduces the loss). → **Gradient descent**

Here we use **SGD**, the simplest weight update method (several other methods are described in the first volume of Deep Learning from scratch).

W \leftarrow W - \eta \frac{\partial L}{\partial W}

where \eta is the learning rate.

Implemented in Python:

class SGD:
  def __init__(self, lr=0.01):
    self.lr = lr

  def update(self, params, grads):
    for i in range(len(params)):
      params[i] -= self.lr * grads[i]
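To see the update rule in action, the sketch below applies the SGD class to a toy problem. The loss L(W) = sum(W^2), whose gradient is 2W, and the values of `lr` and the step count are assumptions chosen for illustration; in a real network the gradients would come from backpropagation.

```python
import numpy as np

class SGD:
    # Same optimizer as above, repeated so that this sketch is self-contained.
    def __init__(self, lr=0.01):
        self.lr = lr

    def update(self, params, grads):
        for i in range(len(params)):
            params[i] -= self.lr * grads[i]

# Toy problem: minimize L(W) = sum(W^2), whose gradient is dL/dW = 2W.
W = np.array([[1.0, -2.0], [3.0, 0.5]])
params = [W]
optimizer = SGD(lr=0.1)

initial_loss = np.sum(W ** 2)
for _ in range(50):
    grads = [2 * params[0]]
    optimizer.update(params, grads)
final_loss = np.sum(params[0] ** 2)
print(initial_loss, final_loss)  # the loss shrinks toward 0
```

Because `params[i] -= ...` modifies the NumPy array in place, the array `W` held by the caller is updated as well.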

In practice, the parameters of a neural network are updated as follows:

model = TwoLayerNet( ... )
optimizer = SGD()

for i in range(10000):
  ...
  x_batch, t_batch = get_mini_batch( ... )  # Get a mini-batch
  loss = model.forward(x_batch, t_batch)
  model.backward()
  optimizer.update(model.params, model.grads)
  ...

We will actually train a neural network in section 1.4.

The end

Link

- O'Reilly Japan - Deep Learning from scratch ❷
- [oreilly-japan / deep-learning-from-scratch-2: "Deep Learning from scratch ❷" (O'Reilly Japan, 2018)](https://github.com/oreilly-japan/deep-learning-from-scratch-2)

Deep Learning 2 Made from Zero Natural Language Processing 1.3 Summary
