Deep Learning from Scratch 2 (Natural Language Processing): 1.3 Summary

What is this?

This is the material I used for a presentation at an in-house study session.


1.3 Neural network learning


1.3.1 Loss function

For a neural network to make "good inferences", its parameters must be set to optimal values.

Training a neural network requires a metric for how well learning is going → the **loss**

A **loss function** is used to compute the loss of the neural network.


- Loss functions
  - Squared error (covered in Deep Learning from Scratch 1)
    - @ohakutsu: used for regression?
  - Cross-entropy error
    - Often used for multi-class classification


In this section, the following layer structure is used to find the loss.

1e49abb0-2cb2-dd5d-d881-5ee1cf946b46.png


Putting the Softmax and Cross Entropy Error layers together gives the Softmax with Loss layer:

bd2831b7-351e-9951-be8e-115321114a48.png


What is Softmax? → the **softmax function**

y_k = \frac{\exp(s_k)}{\displaystyle \sum_{i=1}^{n} \exp(s_i)}

What is Cross Entropy Error? → the **cross-entropy error**

L = -\sum_{k} t_k \log y_k

logx.png


For mini-batch processing, the loss averaged over the N samples in the batch is used:

L = -\frac{1}{N} \sum_{n}\sum_{k} t_{nk} \log y_{nk}
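As a reference, here is a minimal NumPy sketch of how the softmax and the mini-batch cross-entropy error could be computed. The book ships its own `softmax` and `cross_entropy_error` (in common/functions.py of the official repository), so this is only an illustration, not the book's exact code.

import numpy as np

def softmax(x):
  # subtract the row-wise max before exponentiating, for numerical stability
  x = x - x.max(axis=1, keepdims=True)
  exp_x = np.exp(x)
  return exp_x / exp_x.sum(axis=1, keepdims=True)

def cross_entropy_error(y, t):
  # y: softmax output of shape (N, K), t: correct-class indices of shape (N,)
  batch_size = y.shape[0]
  # average of -log(probability assigned to the correct class) over the batch
  return -np.sum(np.log(y[np.arange(batch_size), t] + 1e-7)) / batch_size

scores = np.array([[0.3, 2.9, 4.0],
                   [0.1, 0.2, 0.7]])
t = np.array([2, 2])  # indices of the correct classes
y = softmax(scores)
print(cross_entropy_error(y, t))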


1.3.2 Derivatives and gradients

The goal of neural network training is to find the parameters that minimize the loss. The key concepts here are **derivatives** and **gradients**.


Derivative → the amount of change at a given instant (@ohakutsu: see "Introduction to Mathematics for AI (Artificial Intelligence) Starting from Junior High School Mathematics" on YouTube)

y = f(x)

The derivative of y with respect to x can be written as

\frac{dy}{dx}


Derivatives can also be taken when there are multiple variables. Treating x as a vector,

L = f(x)
\frac{\partial L}{\partial x} = \left( \frac{\partial L}{\partial x_1}, \frac{\partial L}{\partial x_2}, ..., \frac{\partial L}{\partial x_n} \right)

The vector that collects the partial derivatives with respect to each element is called the **gradient**.

In the case of a matrix, the gradient can be considered in the same way. Let W be an m × n matrix

L = g(W)
\frac{\partial L}{\partial W} = \left(
  \begin{array}{ccc}
    \frac{\partial L}{\partial w_{11}} & \cdots & \frac{\partial L}{\partial w_{1n}} \\
    \vdots & \ddots & \\
    \frac{\partial L}{\partial w_{m1}} & & \frac{\partial L}{\partial w_{mn}}
  \end{array}
\right)
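As a sanity check (my own addition, not part of the book's text here), a gradient like ∂L/∂W can be approximated numerically with central differences and compared against the analytic result:

import numpy as np

def numerical_gradient(f, x, h=1e-4):
  # approximate each element of dL/dx with the central difference (f(x+h) - f(x-h)) / 2h
  grad = np.zeros_like(x)
  it = np.nditer(x, flags=['multi_index'])
  while not it.finished:
    idx = it.multi_index
    tmp = x[idx]
    x[idx] = tmp + h
    fxh1 = f(x)
    x[idx] = tmp - h
    fxh2 = f(x)
    grad[idx] = (fxh1 - fxh2) / (2 * h)
    x[idx] = tmp  # restore the original value
    it.iternext()
  return grad

# L = sum(W ** 2), so the analytic gradient is 2W
W = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
print(numerical_gradient(lambda w: np.sum(w ** 2), W))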

1.3.3 Chain rule

During training, the neural network outputs a loss when given training data. Once the gradient of the loss with respect to each parameter is obtained, it can be used to update the parameters.

How do we find the gradients of a neural network? → the **error backpropagation method**

The key to understanding backpropagation is the **chain rule**.


- Chain rule
  - The differentiation rule for composite functions

↓ Something like this:

y = f(x) \\
z = g(y)

Rewriting this gives

z = g(f(x))

and the derivative of z with respect to x is

\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y}\frac{\partial y}{\partial x}

No matter how complex a function is, its derivative can be obtained from the derivatives of the individual functions that compose it.
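A quick numerical check of this (my own example, not from the book), taking f(x) = x^2 and g(y) = exp(y):

import numpy as np

x = 1.5
y = x ** 2     # y = f(x)
z = np.exp(y)  # z = g(y)

# chain rule: dz/dx = dz/dy * dy/dx = exp(y) * 2x
dz_dx_chain = np.exp(y) * 2 * x

# central-difference approximation of dz/dx
h = 1e-4
dz_dx_num = (np.exp((x + h) ** 2) - np.exp((x - h) ** 2)) / (2 * h)

print(dz_dx_chain, dz_dx_num)  # the two values agree closely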


1.3.4 Calculation graph

A visual representation of the calculation

Example)

z = x + y

097c0e7d-5245-f3a6-7445-c9e8b198efd7.png e22cfed3-f846-7607-fb78-0c8299b16700.png

Propagating in the reverse direction is called "backpropagation".

6981c2fe-1898-d29b-0ba1-aee35c912493.png


Typical computation-graph nodes are shown below (a code sketch of the MatMul node follows the list).

- Addition node
  30e860a6-3518-5221-b139-f82bb44febe2.png
- Multiplication node
  951e8bee-fef4-df5b-d9a2-d7de611e25bd.png
- Branch node
  e4686bbe-ad63-968c-c415-7ee499f24a39.png
- Repeat node
  03e79e0c-7f3e-bbf4-b838-232a872dc3eb.png
- Sum node
  99fe71b2-03f3-1de4-0a7a-5c416584f6ee.png
- MatMul node
  744aada9-deea-379a-a8de-eb76267e1066.png
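As a code sketch of the last of these, a MatMul node can be written with the same interface as the layers in the next section (a sketch based on the figure above; details may differ from the book's own MatMul layer):

import numpy as np

class MatMul:
  def __init__(self, W):
    self.params = [W]
    self.grads = [np.zeros_like(W)]
    self.x = None

  def forward(self, x):
    W, = self.params
    out = np.dot(x, W)
    self.x = x
    return out

  def backward(self, dout):
    W, = self.params
    dx = np.dot(dout, W.T)       # gradient with respect to the input
    dW = np.dot(self.x.T, dout)  # gradient with respect to the weight
    self.grads[0][...] = dW      # overwrite the gradient array in place
    return dx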


1.3.5 Gradient derivation and backpropagation implementation

Implement each layer

bd2831b7-351e-9951-be8e-115321114a48.png


Sigmoid layer

The sigmoid function is

y = \frac{1}{1 + \exp(-x)}

The derivative of the sigmoid function is

\frac{\partial y}{\partial x} = y(1 - y)

The calculation graph of the Sigmoid layer is

c725a5fc-e5b6-c377-ff33-df7bc4451a17.png


When implemented in Python:

import numpy as np

class Sigmoid:
  def __init__(self):
    self.params, self.grads = [], []  # the Sigmoid layer has no parameters
    self.out = None

  def forward(self, x):
    out = 1 / (1 + np.exp(-x))
    self.out = out                    # keep the output for use in backward
    return out

  def backward(self, dout):
    dx = dout * (1.0 - self.out) * self.out  # dy/dx = y(1 - y)
    return dx
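For example, it can be exercised like this (my own quick check, not from the book); the backward output equals y(1 - y) when the upstream gradient is all ones:

sigmoid = Sigmoid()
x = np.array([[-1.0, 0.0, 2.0]])
y = sigmoid.forward(x)
dx = sigmoid.backward(np.ones_like(y))  # upstream gradient of ones
print(dx)  # same as y * (1 - y)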

Affine layer

Forward propagation of Affine layer

y = np.dot(x, W) + b

The bias addition uses broadcasting.

df1bab77-9c16-6624-2670-7e30a02b4abd.png


When implemented in Python

class Affine:
  def __init__(self, W, b):
    self.params = [W, b]
    self.grads = [np.zeros_like(W), np.zeros_like(b)]
    self.x = None

  def forward(self, x):
    W, b = self.params
    out = np.dot(x, W) + b            # y = xW + b (b is broadcast)
    self.x = x
    return out

  def backward(self, dout):
    W, b = self.params
    dx = np.dot(dout, W.T)            # gradient with respect to the input
    dW = np.dot(self.x.T, dout)       # gradient with respect to the weight
    db = np.sum(dout, axis=0)         # sum over the batch axis for the broadcast bias

    self.grads[0][...] = dW           # write into the existing gradient arrays
    self.grads[1][...] = db
    return dx
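A quick shape check (my own addition, not from the book), using a hypothetical batch size N=2, input dimension D=3, and output dimension H=4:

N, D, H = 2, 3, 4
affine = Affine(np.random.randn(D, H), np.zeros(H))
x = np.random.randn(N, D)
out = affine.forward(x)                  # shape (N, H)
dx = affine.backward(np.ones_like(out))  # shape (N, D)
print(out.shape, dx.shape)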

Softmax with Loss layer

f9240a28-a5d5-0b1f-03aa-e22edfed5c13.png

class SoftmaxWithLoss:
  def __init__(self):
    self.params, self.grads = [], []
    self.y = None  # output of softmax
    self.t = None  # teacher labels

  def forward(self, x, t):
    self.t = t
    self.y = softmax(x)

    # if the teacher labels are one-hot vectors, convert them to the indices of the correct classes
    if self.t.size == self.y.size:
      self.t = self.t.argmax(axis=1)

    loss = cross_entropy_error(self.y, self.t)
    return loss

  def backward(self, dout=1):
    batch_size = self.t.shape[0]

    dx = self.y.copy()
    dx[np.arange(batch_size), self.t] -= 1  # (y - t) for the correct classes
    dx *= dout
    dx = dx / batch_size                    # average over the batch

    return dx
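Usage example (my own addition; it assumes the `softmax` / `cross_entropy_error` sketches from 1.3.1, or the ones the book provides in common/functions.py):

x = np.array([[0.3, 2.9, 4.0],
              [0.1, 0.2, 0.7]])  # scores for a batch of 2
t = np.array([[0, 0, 1],
              [0, 0, 1]])        # one-hot teacher labels

layer = SoftmaxWithLoss()
loss = layer.forward(x, t)
dx = layer.backward()            # (y - t) / batch_size
print(loss, dx.shape)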

1.3.6 Weight update

Update neural network parameters using the gradient obtained by the backpropagation method

To train the neural network, follow the procedure below.

  1. Mini-batch
     - If there is a lot of data, processing all of it takes time, so a part of the data is used as an approximation of the whole (from Deep Learning from Scratch 1)
  2. Gradient calculation
     - Find the gradient of the loss function with respect to each weight parameter using backpropagation
  3. Parameter update
  4. Repeat steps 1 to 3

3. Parameter update

Update the parameters in the direction opposite to the gradient (the direction that reduces the loss), using the gradient obtained in `2. Gradient calculation`. → the **gradient descent method**

sample.png


Here we use **SGD**, the simplest way to update the weights (several other methods are covered in Deep Learning from Scratch 1).

W \leftarrow W - \eta \frac{\partial L}{\partial W} \\
\eta: learning rate

When implemented in Python

class SGD:
  def __init__(self, lr=0.01):
    self.lr = lr

  def update(self, params, grads):
    for i in range(len(params)):
      params[i] -= self.lr * grads[i]

The actual neural network parameter update is as follows

model = TwoLayerNet( ... )
optimizer = SGD()

for i in range(10000):
  ...
  x_batch, t_batch = get_mini_batch( ... )  # get a mini-batch
  loss = model.forward(x_batch, t_batch)
  model.backward()
  optimizer.update(model.params, model.grads)
  ...

We actually train a neural network in section 1.4.


The end


Link

- O'Reilly Japan: Deep Learning from Scratch ❷
- [oreilly-japan/deep-learning-from-scratch-2: "Deep Learning from Scratch ❷" (O'Reilly Japan, 2018)](https://github.com/oreilly-japan/deep-learning-from-scratch-2)
