[Deep Learning: Day1 NN](https://qiita.com/matsukura04583/items/6317c57bc21de646da8e) [Deep Learning: Day2 CNN](https://qiita.com/matsukura04583/items/29f0dcc3ddeca4bf69a2) [Deep Learning: Day3 RNN](https://qiita.com/matsukura04583/items/9b77a238da4441e0f973) [Deep Learning: Day4 Reinforcement Learning / TensorFlow](https://qiita.com/matsukura04583/items/50806b750c8d77f2305d)
What you can do with neural networks (NN)

・Regression
  ・Result prediction
    ・Stock price forecast
    ・Sales forecast
  ・Ranking
    ・Horse racing ranking forecast
    ・Popularity ranking forecast
・Classification
  ・Identification of cat photos
  ・Handwriting recognition
  ・Flower type classification
Neural network: Regression (approximating a function that outputs continuous real values)
[Regression analysis] • Linear regression • Regression tree • Random forest • Neural network (NN)
Neural network: Classification (predicting discrete results such as gender (male or female) or animal type)
[Classification analysis] • Bayesian classification • Logistic regression • Random forest • Neural network (NN)
Formula (step function)
f(x) = \left\{
\begin{array}{ll}
1 & (x \geq 0) \\
0 & (x \lt 0)
\end{array}
\right.
python
def step_function(x):
    # Returns 1 for non-negative input, 0 otherwise (matches the formula above)
    if x >= 0:
        return 1
    else:
        return 0
Formula (sigmoid function)
f(u) = \frac{1}{1+e^{-u}}
python
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
The sigmoid is a function that varies smoothly between 0 and 1. Whereas the step function can only convey ON/OFF, the sigmoid can convey the strength of a signal, which helped trigger the spread of neural networks for prediction. Issue: for large input values the change in output is very small, which can cause the vanishing gradient problem.
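As a small illustrative sketch (my own addition, not from the handout): the derivative of the sigmoid, f'(u) = f(u)(1 - f(u)), is at most 0.25 and approaches 0 for large |u|, which is why gradients vanish when many sigmoid layers are stacked.

python
import numpy as np

def sigmoid(u):
    return 1 / (1 + np.exp(-u))

def d_sigmoid(u):
    # Derivative of the sigmoid: f'(u) = f(u) * (1 - f(u))
    s = sigmoid(u)
    return s * (1 - s)

for u in [0.0, 2.0, 5.0, 10.0]:
    # The gradient peaks at 0.25 (at u = 0) and shrinks rapidly for large |u|
    print(u, d_sigmoid(u))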
f(x) = \left\{
\begin{array}{ll}
x & (x \gt 0) \\
0 & (x \leq 0)
\end{array}
\right.
python
def relu(x):
    return np.maximum(0, x)
ReLU is the most widely used activation function today. It contributes to avoiding the vanishing gradient problem and to sparsification, and has produced good results.
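A minimal sketch (my own addition, not from the handout) of why ReLU helps: its derivative is 1 for positive inputs, so gradients pass through without shrinking, and exactly 0 for non-positive inputs, which zeroes out (sparsifies) part of the network.

python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def d_relu(x):
    # Gradient is 1 where x > 0 and 0 elsewhere (no shrinking, built-in sparsity)
    return np.where(x > 0, 1.0, 0.0)

u = np.array([-2.0, -0.5, 0.0, 0.5, 3.0])
print(relu(u))    # [0.  0.  0.  0.5 3. ]
print(d_relu(u))  # [0. 0. 0. 1. 1.]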
Error calculation. Error function: squared error

E_n(w)=\frac{1}{2}\sum_{j=1}^{I} (y_j-d_j)^2 = \frac{1}{2}\|y-d\|^2

Error function: cross-entropy error

E_n(w)=-\sum_{i=1}^{I} d_i \log y_i
python
#Cross entropy
def cross_entropy_error(d, y):
    if y.ndim == 1:
        d = d.reshape(1, d.size)
        y = y.reshape(1, y.size)

    # If the teacher data is a one-hot vector, convert it to the index of the correct label
    if d.size == y.size:
        d = d.argmax(axis=1)

    batch_size = y.shape[0]
    return -np.sum(np.log(y[np.arange(batch_size), d] + 1e-7)) / batch_size
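The squared-error formula above has no accompanying code in the handout; a minimal sketch of it (my own addition, mirroring the style of the cross-entropy helper) would be:

python
import numpy as np

def mean_squared_error(d, y):
    # E_n(w) = 1/2 * sum((y - d)^2), matching the squared-error formula above
    return 0.5 * np.sum((y - d) ** 2)

y = np.array([0.1, 0.8, 0.1])
d = np.array([0.0, 1.0, 0.0])
print(mean_squared_error(d, y))  # 0.5 * (0.01 + 0.04 + 0.01) = 0.03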
A one-hot vector is a vector such as (0, 1, 0, 0, 0, 0) in which one component is 1 and all remaining components are 0. (Reference) Site consulted: "What is a one-hot vector?"
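A quick illustrative example (my own addition, using the same np.identity trick that appears in the IRIS exercise later) of turning class indices into one-hot vectors:

python
import numpy as np

labels = np.array([2, 0, 1])      # class indices
one_hot = np.identity(3)[labels]  # row i of the identity matrix is the one-hot vector for class i
print(one_hot)
# [[0. 0. 1.]
#  [1. 0. 0.]
#  [0. 1. 0.]]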
(Reference) Gradient descent method commentary site
Stochastic gradient descent

$W^{(t+1)} = W^{(t)} - \varepsilon \nabla E$ ($\varepsilon$ is the learning rate) ・・・ gradient descent

Gradient descent uses the error averaged over all samples.

$W^{(t+1)} = W^{(t)} - \varepsilon \nabla E_n$ ($\varepsilon$ is the learning rate) ・・・ stochastic gradient descent (SGD)

Stochastic gradient descent uses the error of a single randomly sampled example (a minimal code sketch follows after the reference link below).

Advantages of stochastic gradient descent
・Reduced computation cost when the data is redundant
・Reduced risk of converging to an undesired local minimum
・Online learning is possible

[(Reference) Stochastic gradient descent explanation site](https://qiita.com/YudaiSadakuni/items/ece07b04c685a64eaa01#%E7%A2%BA%E7%8E%87%E7%9A%84%E5%8B%BE%E9%85%8D%E9%99%8D%E4%B8%8B%E6%B3%95)
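A minimal sketch (my own, with made-up data and a hypothetical squared-error loss) of the SGD update $W^{(t+1)} = W^{(t)} - \varepsilon \nabla E_n$, using one randomly chosen sample per step:

python
import numpy as np

# Toy data: d = 3*x + noise (assumed for illustration only)
np.random.seed(0)
X = np.random.rand(100, 1)
D = 3 * X[:, 0] + 0.01 * np.random.randn(100)

W = np.zeros(1)   # parameter to learn
epsilon = 0.1     # learning rate

for t in range(1000):
    i = np.random.randint(len(X))   # pick one sample at random
    y = X[i] @ W                    # prediction for that sample
    grad = (y - D[i]) * X[i]        # gradient of E_n = 1/2 * (y - d)^2 w.r.t. W
    W = W - epsilon * grad          # SGD update

print(W)  # should approach [3.]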
Mini-batch gradient descent

$W^{(t+1)} = W^{(t)} - \varepsilon \nabla E_n$ ($\varepsilon$ is the learning rate) ・・・ stochastic gradient descent

Stochastic gradient descent uses the error of a single randomly sampled example.

$W^{(t+1)} = W^{(t)} - \varepsilon \nabla E_t$ ($\varepsilon$ is the learning rate) ・・・ mini-batch gradient descent

Mini-batch gradient descent uses the average error over the samples belonging to a randomly extracted data set $D_t$ (a mini-batch).

Advantages of mini-batch gradient descent: effective use of computing resources without losing the advantages of stochastic gradient descent → thread parallelization on CPUs and SIMD parallelization on GPUs.
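A minimal sketch (my own continuation of the toy SGD example above) in which each update uses the average gradient over a randomly drawn mini-batch $D_t$:

python
import numpy as np

np.random.seed(0)
X = np.random.rand(100, 1)
D = 3 * X[:, 0] + 0.01 * np.random.randn(100)

W = np.zeros(1)
epsilon = 0.1
batch_size = 10

for t in range(1000):
    idx = np.random.choice(len(X), batch_size, replace=False)  # random mini-batch D_t
    y = X[idx] @ W                                             # predictions for the batch
    grad = X[idx].T @ (y - D[idx]) / batch_size                # average gradient over D_t
    W = W - epsilon * grad                                     # mini-batch update

print(W)  # should approach [3.]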
Error gradient calculation: the error backpropagation method. [Error backpropagation] The computed error is differentiated starting from the output layer and propagated backward to each preceding layer. This allows the derivative of every parameter to be obtained analytically with minimal calculation: by propagating derivatives backward from the computation result (= the error), unnecessary recursive calculations are avoided.
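As a brief sketch of the chain rule that the code below applies (my own notation, chosen to match the variables delta2, delta1, z1, W2 in the code): the output-layer delta is the derivative of the error with respect to that layer's total input, and every weight gradient reuses it instead of recomputing.

\delta_2 = \frac{\partial E}{\partial u_2}, \qquad
\frac{\partial E}{\partial W_2} = z_1^{T}\,\delta_2, \qquad
\delta_1 = \left(\delta_2 W_2^{T}\right) \odot f'(u_1), \qquad
\frac{\partial E}{\partial W_1} = x^{T}\,\delta_1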
python
#Error backpropagation
def backward(x, d, z1, y):
    print("\n##### Error backpropagation start #####")

    grad = {}

    W1, W2 = network['W1'], network['W2']
    b1, b2 = network['b1'], network['b2']
    # Delta at the output layer
    delta2 = functions.d_sigmoid_with_loss(d, y)
    # Gradient of b2
    grad['b2'] = np.sum(delta2, axis=0)
    # Gradient of W2
    grad['W2'] = np.dot(z1.T, delta2)
    # Delta at the middle layer
    delta1 = np.dot(delta2, W2.T) * functions.d_relu(z1)
    # Gradient of b1
    grad['b1'] = np.sum(delta1, axis=0)
    # Gradient of W1
    grad['W1'] = np.dot(x.T, delta1)

    print_vec("Partial derivative_dE/du2", delta2)
    print_vec("Partial derivative_dE/du1", delta1)

    print_vec("Partial derivative_Weight 1", grad["W1"])
    print_vec("Partial derivative_Weight 2", grad["W2"])
    print_vec("Partial derivative_Bias 1", grad["b1"])
    print_vec("Partial derivative_Bias 2", grad["b2"])

    return grad
[P10] In deep learning, describe what you are trying to do in two lines or less. Also, which of the following values is the ultimate goal of optimization? Choose all. ① Input value [X] ② Output value [Y] ③ Weight [W] ④ Bias [b] ⑤ Total input [u] ⑥ Intermediate layer input [z] ⑦ Learning rate [ρ]
⇒ [Discussion] Ultimately, deep learning aims to determine the parameters that minimize the error. The values that are the final goals of optimization are ③ the weight [W] and ④ the bias [b].
[P12] Put the following network on paper.
⇒ [Discussion] It's easy to understand if you write it yourself.
[P19] Confirmation test: Fill in an example of animal classification in this diagram. ![P19.gif](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/357717/6a0b680d-9466-598d-67e9-d9156a754193.gif)
⇒ [Discussion]
[P21] Confirmation test
Write this expression in python
u = w_1x_1 + w_2x_2 + w_3x_3 + w_4x_4 + b = Wx + b \qquad (1.2)
⇒ [Discussion]
python
u1=np.dot(x,W1)+b1
[P23] Confirmation test Extract the code that represents the middle layer
⇒ [Discussion]
python
#Total input of hidden layers
u1 = np.dot(x, W1) + b1
#Total output of hidden layer
z1 = functions.relu(u1)
[P26] Confirmation test Explain the difference between linear and non-linear with a diagram.
[P34] Confirmation test: Fully connected NN, single layer with multiple nodes. Extract the relevant part from the distributed source code. ⇒ [Discussion] Since the activation function f(u) is the sigmoid function, this is the relevant part:
python
z1 = functions.sigmoid(u)
[P34] Confirmation test: Error calculation. Error function = squared error

E_n(w)=\frac{1}{2}\sum_{j=1}^{I} (y_j-d_j)^2 = \frac{1}{2}\|y-d\|^2

・Describe why the difference is squared rather than simply subtracted.
・Describe what the 1/2 in the formula above means.
⇒ [Discussion]
・Squaring makes every error term positive, so positive and negative differences do not cancel out.
・The 1/2 cancels the factor of 2 that appears when the squared error is differentiated, which simplifies the gradient used in learning.
(Reference) This site was easy to understand: "Meaning and calculation method of the least squares method – how to find the regression line".
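A one-line check (my own addition) of why the 1/2 is convenient when differentiating:

\frac{\partial}{\partial y_j}\left[\frac{1}{2}\sum_{j'}(y_{j'}-d_{j'})^2\right] = y_j - d_j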
[P51] Confirmation test (S3_2 output layer_activation function) Softmax function
①f(i,u)=\frac{e^{u_i}②}{\sum_{k=1}^{K}e^{u_k}③}
Show the source code corresponding to the formulas (1) to (3) and explain line by line.
python
def softmax(x):
    if x.ndim == 2:  # If the input is two-dimensional (a batch)
        x = x.T
        x = x - np.max(x, axis=0)
        y = np.exp(x) / np.sum(np.exp(x), axis=0)
        return y.T

    x = x - np.max(x)  # Overflow countermeasure
    return np.exp(x) / np.sum(np.exp(x))
・① … y (the value returned as a transposition by return y.T)
・② … the np.exp(x) part
・③ … the np.sum(np.exp(x), axis=0) part
(Learning reference) What does NumPy's axis and dimension mean?
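A short usage check (my own addition) showing that the softmax above turns a vector of scores into probabilities that sum to 1, and handles a 2-D batch row by row:

python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
p = softmax(u)        # uses the softmax defined above
print(p)              # [0.09003057 0.24472847 0.66524096]
print(p.sum())        # 1.0

batch = np.array([[1.0, 2.0, 3.0],
                  [0.0, 0.0, 0.0]])
print(softmax(batch)) # each row sums to 1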
[P53] Confirmation test(S3_2 Output layer_Activation function)
Cross entropy
Show the source code corresponding to parts ① and ② of the formula and explain the process line by line.
```math
E_n(w)=-\sum_{i=1}^{I} d_i \log y_i
```
⇒ [Discussion] ・return -np.sum(np.log(y[np.arange(batch_size), d] + 1e-7)) / batch_size ・Dividing by batch_size takes the average over the batch, and the 1e-7 prevents taking log(0).
python
# Cross entropy
def cross_entropy_error(d, y):
    if y.ndim == 1:
        d = d.reshape(1, d.size)
        y = y.reshape(1, y.size)

    # If the teacher data is a one-hot vector, convert it to the index of the correct label
    if d.size == y.size:
        d = d.argmax(axis=1)

    batch_size = y.shape[0]
    return -np.sum(np.log(y[np.arange(batch_size), d] + 1e-7)) / batch_size
[P56] Confirmation test(S4 gradient descent method) Find the appropriate source code for the gradient descent function.
⇒ [Discussion]
python
# Error
loss = functions.cross_entropy_error(d, y)

grad = backward(x, d, z1, y)  # Corresponds to part ②
for key in ('W1', 'W2', 'b1', 'b2'):
    network[key] -= learning_rate * grad[key]  # Corresponds to part ①
[P65] Confirmation test (S4 gradient descent method): Summarize what online learning is. ⇒ [Discussion] Online learning means the model is updated each time using only newly acquired data, without reusing the whole existing data set.
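A minimal sketch (my own, reusing the toy linear model from the SGD example above) of online learning: each newly arriving sample updates the parameters immediately and is then discarded rather than stored.

python
import numpy as np

np.random.seed(1)
W = np.zeros(1)
epsilon = 0.1

# Samples arrive one at a time (streamed), e.g. from a sensor or a log
for t in range(1000):
    x = np.random.rand(1)           # newly acquired input
    d = 3 * x[0]                    # its target value (toy rule, assumed for illustration)
    y = x @ W
    W = W - epsilon * (y - d) * x   # update immediately; the sample is not kept

print(W)  # approaches [3.]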
[P69] Confirmation test (S4 gradient descent method)
Explain the meaning of the formula $W^{(t+1)} = W^{(t)} - \varepsilon \nabla E_t$ with a diagram.
⇒ [Discussion]
(〇〇〇) (〇〇〇) (〇〇〇)
Set 1   Set 2   Set 3
In this case, take any one of the sets as the mini-batch $D_t$, add up the errors of the samples in it, and multiply by 1/3 (the average over the three samples) to get $E_t$; one update step then uses $\nabla E_t$. (See the sketch below.)
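A tiny sketch (my own addition) of the same idea in code: split the per-sample errors into three sets, pick one as the mini-batch $D_t$, and average its errors.

python
import numpy as np

errors = np.arange(9, dtype=float)    # per-sample errors E_n for 9 samples (toy values)
sets = np.array_split(errors, 3)      # Set 1, Set 2, Set 3 (three samples each)

t = np.random.randint(3)              # choose one set as the mini-batch D_t
E_t = np.sum(sets[t]) / len(sets[t])  # average error over D_t (the "1/3" in the diagram)
print(t, E_t)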
[P78] Confirmation test(S5 error back propagation method)
The error back propagation method can avoid unnecessary recursive processing. Extract the source code that holds the calculation results that have already been performed.
python
# Error backpropagation
def backward(x, d, z1, y):
    print("\n##### Error backpropagation start #####")

    grad = {}

    W1, W2 = network['W1'], network['W2']
    b1, b2 = network['b1'], network['b2']
    # Delta at the output layer  ## Here the derivative of the combined sigmoid + cross-entropy function is computed and assigned to "delta2"
    delta2 = functions.d_sigmoid_with_loss(d, y)
    # Gradient of b2  ## reuses "delta2"
    grad['b2'] = np.sum(delta2, axis=0)
    # Gradient of W2  ## reuses "delta2"
    grad['W2'] = np.dot(z1.T, delta2)
    # Delta at the middle layer  ## reuses "delta2"
    delta1 = np.dot(delta2, W2.T) * functions.d_relu(z1)
    # Gradient of b1
    grad['b1'] = np.sum(delta1, axis=0)
    # Gradient of W1
    grad['W1'] = np.dot(x.T, delta1)

    print_vec("Partial derivative_dE/du2", delta2)
    print_vec("Partial derivative_dE/du1", delta1)

    print_vec("Partial derivative_weight 1", grad["W1"])
    print_vec("Partial derivative_weight 2", grad["W2"])
    print_vec("Partial derivative_bias 1", grad["b1"])
    print_vec("Partial derivative_bias 2", grad["b2"])

    return grad
[P83] Find the source code that corresponds to the two blanks.(S5 error back propagation method)
python
# Delta at the output layer
delta2 = functions.d_mean_squared_error(d, y)
python
# Gradient of W2
grad['W2'] = np.dot(z1.T, delta2)
#Exercise
python
# Let's try #
# Forward propagation (single layer / single unit)

# Weight
W = np.array([[0.1], [0.2]])

## Let's try _ array initialization
# W = np.zeros(2)
W = np.ones(2)  # Selected here
# W = np.random.rand(2)
# W = np.random.randint(5, size=(2))

print_vec("weight", W)

# Bias
b = 0.5

## Let's try _ numerical initialization
b = np.random.rand()  # Random number from 0 to 1  # Selected this
# b = np.random.rand() * 10 - 5  # Random number from -5 to 5

print_vec("bias", b)

# Input value
x = np.array([2, 3])
print_vec("input", x)

# Total input
u = np.dot(x, W) + b
print_vec("total input", u)

# Intermediate layer output
z = functions.relu(u)
print_vec("intermediate layer output", z)
weight [1. 1.]
bias 0.15691869859919338
input [2 3]
Total input 5.156918698599194
Intermediate layer output 5.156918698599194
python
# Let's try #
# Forward propagation (single layer / multiple units)

# Weight
W = np.array([
    [0.1, 0.2, 0.3],
    [0.2, 0.3, 0.4],
    [0.3, 0.4, 0.5],
    [0.4, 0.5, 0.6]
])

## Let's try _ array initialization
# W = np.zeros((4, 3))
W = np.ones((4, 3))  # Selected here
# W = np.random.rand(4, 3)
# W = np.random.randint(5, size=(4, 3))

print_vec("weight", W)

# Bias
b = np.array([0.1, 0.2, 0.3])
print_vec("bias", b)

# Input value
x = np.array([1.0, 5.0, 2.0, -1.0])
print_vec("input", x)

# Total input
u = np.dot(x, W) + b
print_vec("total input", u)

# Intermediate layer output
z = functions.sigmoid(u)
print_vec("intermediate layer output", z)
weight [[1. 1. 1.] [1. 1. 1.] [1. 1. 1.] [1. 1. 1.]]
bias [0.1 0.2 0.3]
input [ 1. 5. 2. -1.]
Total input [7.1 7.2 7.3]
Intermediate layer output [0.99917558 0.99925397 0.99932492]
python
# Let's try #
# Multi-class classification
# 2-3-4 network

# !! Let's try _ change the node configuration to 3-5-4

# Set weights and biases
# Create the network
def init_network():
    print("##### Network initialization #####")

    # Let's try
    #_ Display the shape of each parameter
    #_ Randomly generate the network's initial values

    network = {}
    network['W1'] = np.array([
        [0.1, 0.4, 0.7, 0.1, 0.3],
        [0.2, 0.5, 0.8, 0.1, 0.4],
        [0.3, 0.6, 0.9, 0.2, 0.5]
    ])
    network['W2'] = np.array([
        [0.1, 0.6, 0.1, 0.6],
        [0.2, 0.7, 0.2, 0.7],
        [0.3, 0.8, 0.3, 0.8],
        [0.4, 0.9, 0.4, 0.9],
        [0.5, 0.1, 0.5, 0.1]
    ])
    network['b1'] = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
    network['b2'] = np.array([0.1, 0.2, 0.3, 0.4])

    print_vec("weight 1", network['W1'])
    print_vec("weight 2", network['W2'])
    print_vec("bias 1", network['b1'])
    print_vec("bias 2", network['b2'])

    return network
# Create the forward-propagation process
# x: input value
def forward(network, x):
    print("##### Start forward propagation #####")

    W1, W2 = network['W1'], network['W2']
    b1, b2 = network['b1'], network['b2']

    # Layer 1 total input
    u1 = np.dot(x, W1) + b1
    # Layer 1 total output
    z1 = functions.relu(u1)
    # Layer 2 total input
    u2 = np.dot(z1, W2) + b2
    # Output value
    y = functions.softmax(u2)

    print_vec("total input 1", u1)
    print_vec("intermediate layer output 1", z1)
    print_vec("total input 2", u2)
    print_vec("output 1", y)
    print("total output: " + str(np.sum(y)))

    return y, z1
## Preliminary data
# Input value
x = np.array([1., 2., 3.])
# Target output
d = np.array([0, 0, 0, 1])
# Network initialization
network = init_network()
# output
y, z1 = forward(network, x)
# error
loss = functions.cross_entropy_error(d, y)
## Display
print("\n##### Result display #####")
print_vec("output", y)
print_vec("training data", d)
print_vec("error", loss)
#####Network initialization##### Weight 1 [[0.1 0.4 0.7 0.1 0.3] [0.2 0.5 0.8 0.1 0.4] [0.3 0.6 0.9 0.2 0.5]]
Weight 2 [[0.1 0.6 0.1 0.6] [0.2 0.7 0.2 0.7] [0.3 0.8 0.3 0.8] [0.4 0.9 0.4 0.9] [0.5 0.1 0.5 0.1]]
Bias 1 [0.1 0.2 0.3 0.4 0.5]
Bias 2 [0.1 0.2 0.3 0.4]
#####Start forward propagation##### Total input 1 [1.5 3.4 5.3 1.3 3.1]
Intermediate layer output 1 [1.5 3.4 5.3 1.3 3.1]
Total input 2 [4.59 9.2 4.79 9.4 ]
Output 1 [0.00443583 0.44573018 0.00541793 0.54441607]
Output total: 1.0
#####Result display##### output [0.00443583 0.44573018 0.00541793 0.54441607]
Training data [0 0 0 1]
error 0.6080413107681358
python
# Let's try #
# Regression
# 2-3-2 Network
# !! Let's try _ Let's change the node configuration to 3-5-4
# Set weights and biases
# Create the network
def init_network():
    print("##### Network initialization #####")

    network = {}
    network['W1'] = np.array([
        [0.1, 0.4, 0.7, 0.1, 0.3],
        [0.2, 0.5, 0.8, 0.1, 0.4],
        [0.3, 0.6, 0.9, 0.2, 0.5]
    ])
    network['W2'] = np.array([
        [0.1, 0.6, 0.1, 0.6],
        [0.2, 0.7, 0.2, 0.7],
        [0.3, 0.8, 0.3, 0.8],
        [0.4, 0.9, 0.4, 0.9],
        [0.5, 0.1, 0.5, 0.1]
    ])
    network['b1'] = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
    network['b2'] = np.array([0.1, 0.2, 0.3, 0.4])

    print_vec("weight 1", network['W1'])
    print_vec("weight 2", network['W2'])
    print_vec("bias 1", network['b1'])
    print_vec("bias 2", network['b2'])

    return network
# Create the forward-propagation process
def forward(network, x):
    print("##### Start forward propagation #####")

    W1, W2 = network['W1'], network['W2']
    b1, b2 = network['b1'], network['b2']
    # Total input of the hidden layer
    u1 = np.dot(x, W1) + b1
    # Total output of the hidden layer
    z1 = functions.relu(u1)
    # Total input of the output layer
    u2 = np.dot(z1, W2) + b2
    # Total output of the output layer
    y = u2

    print_vec("total input 1", u1)
    print_vec("intermediate layer output 1", z1)
    print_vec("total input 2", u2)
    print_vec("output 1", y)
    print("total output: " + str(np.sum(z1)))

    return y, z1
# Input value
x = np.array([1., 2., 3.])
network = init_network()
y, z1 = forward(network, x)
# Target output
d = np.array([2., 3.,4.,5.])
# error
loss = functions.mean_squared_error(d, y)
## Display
print("\n##### Result display #####")
print_vec("intermediate layer output", z1)
print_vec("output", y)
print_vec("training data", d)
print_vec("error", loss)
#####Network initialization##### Weight 1 [[0.1 0.4 0.7 0.1 0.3] [0.2 0.5 0.8 0.1 0.4] [0.3 0.6 0.9 0.2 0.5]]
Weight 2 [[0.1 0.6 0.1 0.6] [0.2 0.7 0.2 0.7] [0.3 0.8 0.3 0.8] [0.4 0.9 0.4 0.9] [0.5 0.1 0.5 0.1]]
Bias 1 [0.1 0.2 0.3 0.4 0.5]
Bias 2 [0.1 0.2 0.3 0.4]
#####Start forward propagation##### Total input 1 [1.5 3.4 5.3 1.3 3.1]
Intermediate layer output 1 [1.5 3.4 5.3 1.3 3.1]
Total input 2 [4.59 9.2 4.79 9.4 ]
Output 1 [4.59 9.2 4.79 9.4 ]
Output total: 14.6
#####Result display##### Intermediate layer output [1.5 3.4 5.3 1.3 3.1]
output [4.59 9.2 4.79 9.4 ]
Training data [2. 3. 4. 5.]
error 8.141525
python
# Let's try #
# Binary classification
# 2-3-1 Network
# !! Let's try _ Let's change the node configuration to 5-10-1
# Set weights and biases
# Create the network
def init_network():
    print("##### Network initialization #####")

    network = {}
    network['W1'] = np.array([
        [0.1, 0.3, 0.5, 0.1, 0.3, 0.5, 0.1, 0.3, 0.5, 0.6],
        [0.2, 0.4, 0.6, 0.2, 0.4, 0.6, 0.2, 0.4, 0.6, 0.7],
        [0.2, 0.4, 0.6, 0.2, 0.4, 0.6, 0.2, 0.4, 0.6, 0.7],
        [0.2, 0.4, 0.6, 0.2, 0.4, 0.6, 0.2, 0.4, 0.6, 0.7],
        [0.2, 0.4, 0.6, 0.2, 0.4, 0.6, 0.2, 0.4, 0.6, 0.7]
    ])
    network['W2'] = np.array([
        [0.1],
        [0.1],
        [0.1],
        [0.1],
        [0.1],
        [0.1],
        [0.1],
        [0.1],
        [0.1],
        [0.1]
    ])
    network['b1'] = np.array([0.1, 0.3, 0.5, 0.1, 0.3, 0.5, 0.1, 0.3, 0.5, 0.6])
    network['b2'] = np.array([0.1])

    return network
# Create the forward-propagation process
def forward(network, x):
    print("##### Start forward propagation #####")

    W1, W2 = network['W1'], network['W2']
    b1, b2 = network['b1'], network['b2']

    # Total input of the hidden layer
    u1 = np.dot(x, W1) + b1
    # Total output of the hidden layer
    z1 = functions.relu(u1)
    # Total input of the output layer
    u2 = np.dot(z1, W2) + b2
    # Total output of the output layer
    y = functions.sigmoid(u2)

    print_vec("total input 1", u1)
    print_vec("intermediate layer output 1", z1)
    print_vec("total input 2", u2)
    print_vec("output 1", y)
    print("total output: " + str(np.sum(z1)))

    return y, z1
# Input value
x = np.array([1., 2., 3., 4., 5.])
# Target output
d = np.array([1])
network = init_network()
y, z1 = forward(network, x)
# error
loss = functions.cross_entropy_error(d, y)
## Display
print("\n##### Result display #####")
print_vec("intermediate layer output", z1)
print_vec("output", y)
print_vec("training data", d)
print_vec("error", loss)
#####Network initialization##### #####Start forward propagation##### Total input 1 [ 3. 6.2 9.4 3. 6.2 9.4 3. 6.2 9.4 11. ]
Intermediate layer output 1 [ 3. 6.2 9.4 3. 6.2 9.4 3. 6.2 9.4 11. ]
Total input 2 [6.78]
Output 1 [0.99886501]
Output total: 66.8
#####Result display##### Intermediate layer output [ 3. 6.2 9.4 3. 6.2 9.4 3. 6.2 9.4 11. ]
output [0.99886501]
Training data [1]
error 0.0011355297129812408
⇒ [Discussion] The output changes greatly depending on the size of the intermediate layer. How should the number of units be decided? I found it difficult to decide on the middle layer. The number of units in the input layer must match the dimensionality of the data, so there is no ambiguity there, and the number of units in the output layer simply equals the number of classes. I would like to find out whether the approximation still works if the number of middle-layer units is increased enormously. I also found it difficult to set the weights and biases.
python
# Let's try #
# Sample function
# AI that predicts the value of y
def f(x):
    y = 3 * x[0] + 2 * x[1]
    return y
# Initial settings
def init_network():
    # print("##### Network initialization #####")
    network = {}
    nodesNum = 10
    network['W1'] = np.random.randn(2, nodesNum)
    network['W2'] = np.random.randn(nodesNum)
    network['b1'] = np.random.randn(nodesNum)
    network['b2'] = np.random.randn()

    # print_vec("weight 1", network['W1'])
    # print_vec("weight 2", network['W2'])
    # print_vec("bias 1", network['b1'])
    # print_vec("bias 2", network['b2'])

    return network
# Forward propagation
def forward(network, x):
    # print("##### Forward propagation start #####")

    W1, W2 = network['W1'], network['W2']
    b1, b2 = network['b1'], network['b2']
    u1 = np.dot(x, W1) + b1
    # z1 = functions.relu(u1)

    ## Let's try
    z1 = functions.sigmoid(u1)  # sigmoid selected here instead of ReLU

    u2 = np.dot(z1, W2) + b2
    y = u2

    # print_vec("total input 1", u1)
    # print_vec("intermediate layer output 1", z1)
    # print_vec("total input 2", u2)
    # print_vec("output 1", y)
    # print("total output: " + str(np.sum(y)))

    return z1, y
# Error backpropagation
def backward(x, d, z1, y):
    # print("\n##### Error backpropagation start #####")

    grad = {}

    W1, W2 = network['W1'], network['W2']
    b1, b2 = network['b1'], network['b2']
    # Delta at the output layer
    delta2 = functions.d_mean_squared_error(d, y)
    # Gradient of b2
    grad['b2'] = np.sum(delta2, axis=0)
    # Gradient of W2
    grad['W2'] = np.dot(z1.T, delta2)
    # Delta at the middle layer
    # delta1 = np.dot(delta2, W2.T) * functions.d_relu(z1)

    ## Let's try: d_sigmoid selected here to match the sigmoid forward pass
    delta1 = np.dot(delta2, W2.T) * functions.d_sigmoid(z1)

    delta1 = delta1[np.newaxis, :]
    # Gradient of b1
    grad['b1'] = np.sum(delta1, axis=0)
    x = x[np.newaxis, :]
    # Gradient of W1
    grad['W1'] = np.dot(x.T, delta1)

    # print_vec("Partial derivative_weight 1", grad["W1"])
    # print_vec("Partial derivative_weight 2", grad["W2"])
    # print_vec("Partial derivative_bias 1", grad["b1"])
    # print_vec("Partial derivative_bias 2", grad["b2"])

    return grad
# Create sample data
data_sets_size = 100000
data_sets = [0 for i in range(data_sets_size)]

for i in range(data_sets_size):
    data_sets[i] = {}

    # Set a random value
    # data_sets[i]['x'] = np.random.rand(2)

    ## Let's try _ input value setting  # Selected this to try
    data_sets[i]['x'] = np.random.rand(2) * 10 - 5  # Random numbers from -5 to 5

    # Set the target output
    data_sets[i]['d'] = f(data_sets[i]['x'])

losses = []
# Learning rate
learning_rate = 0.07

# Number of extractions
epoch = 1000

# Parameter initialization
network = init_network()
# Random extraction of data
random_datasets = np.random.choice(data_sets, epoch)

# Repeated gradient descent
for dataset in random_datasets:
    x, d = dataset['x'], dataset['d']
    z1, y = forward(network, x)
    grad = backward(x, d, z1, y)
    # Apply the gradient to the parameters
    for key in ('W1', 'W2', 'b1', 'b2'):
        network[key] -= learning_rate * grad[key]

    # Error
    loss = functions.mean_squared_error(d, y)
    losses.append(loss)

print("##### Result display #####")
lists = range(epoch)
plt.plot(lists, losses, '.')
# Graph display
plt.show()
⇒ [Discussion]
By changing from the ReLU function to the sigmoid function, the spread of the points near 0 in the graph widened. In addition, the input values were set to random numbers from -5 to 5.

(Problem) Create deep learning using the IRIS data
### Design
Create a model that learns from the IRIS data by splitting the 150 samples into training data and test data (50 training / 100 test, as set by TRAIN_DATA_SIZE below) and then predicts the flower type.
python
import numpy as np
# Hyperparameters
INPUT_SIZE = 4 # number of input nodes
HIDDEN_SIZE = 6 # Number of neurons in the middle layer (hidden layer)
OUTPUT_SIZE = 3 # Number of neurons in the output layer
TRAIN_DATA_SIZE = 50 # Use TRAIN_DATA_SIZE samples out of the 150 as training data; the rest are used as test data.
LEARNING_RATE = 0.1 # Learning rate
EPOCH = 1000 # Number of repeated learnings (number of epochs)
# Read data
# The Iris dataset is obtained from the URL below. Since the original data is sorted by type, a CSV was prepared in which the 150 rows are shuffled so that the 3 types are mixed (10 at a time).
# https://gist.github.com/netj/8836201
x = np.loadtxt('/content/drive/My Drive/DNN_code/data/iris.csv', delimiter=',',skiprows=1, usecols=(0, 1, 2, 3))
raw_t = np.loadtxt('/content/drive/My Drive/DNN_code/data/iris.csv', delimiter=',',skiprows=1,dtype="unicode", usecols=(4,))
t = np.zeros([150])
for i in range(0, 150):
    vari = raw_t[i]
    # print(vari, raw_t[i], i)
    if "Setosa" in vari:
        t[i] = int(0)
    elif "Versicolor" in vari:
        t[i] = int(1)
    elif "Virginica" in vari:
        t[i] = int(2)
    else:
        print("error", i)
a = [3, 0, 8, 1, 9]
a = t.tolist()
a_int = [int(n) for n in a]
print(a_int)
a_one_hot = np.identity(10)[a_int]
a_one_hot = np.identity(len(np.unique(a)))[a_int]
print(a_one_hot)
train_x = x[:TRAIN_DATA_SIZE]
train_t = a_one_hot[:TRAIN_DATA_SIZE]
test_x = x[TRAIN_DATA_SIZE:]
test_t = a_one_hot[TRAIN_DATA_SIZE:]
print("train=",TRAIN_DATA_SIZE,train_x,train_t)
print("test=",test_x,test_t)
# Weight / bias initialization #He initial value (for using ReLU)
W1 = np.random.randn(INPUT_SIZE, HIDDEN_SIZE) / np.sqrt(INPUT_SIZE) * np.sqrt(2)
W2 = np.random.randn(HIDDEN_SIZE, OUTPUT_SIZE)/ np.sqrt(HIDDEN_SIZE) * np.sqrt(2)
# Adjust from initial value zero
b1 = np.zeros(HIDDEN_SIZE)
b2 = np.zeros(OUTPUT_SIZE)
# ReLU function
def relu(x):
    return np.maximum(x, 0)

# Softmax function
def softmax(x):
    if x.ndim == 2:
        x = x.T
        x = x - np.max(x, axis=0)
        y = np.exp(x) / np.sum(np.exp(x), axis=0)
        return y.T

    x = x - np.max(x)  # Overflow countermeasure
    return np.exp(x) / np.sum(np.exp(x))
# Cross-entropy error
def cross_entropy_error(y, t):
    if y.shape != t.shape:
        raise ValueError
    if y.ndim == 1:
        return -(t * np.log(y)).sum()
    elif y.ndim == 2:
        return -(t * np.log(y)).sum() / y.shape[0]
    else:
        raise ValueError

# Forward propagation
def forward(x):
    global W1, W2, b1, b2
    return softmax(np.dot(relu(np.dot(x, W1) + b1), W2) + b2)
# Test data results
test_y = forward(test_x)
print ("Before learning =", (test_y.argmax (axis = 1) == test_t.argmax (axis = 1)). Sum (),'/', 150 --TRAIN_DATA_SIZE)
# Learning loop
for i in range(EPOCH):
    # Forward propagation, keeping the intermediate values
    y1 = np.dot(train_x, W1) + b1
    y2 = relu(y1)
    train_y = softmax(np.dot(y2, W2) + b2)

    # Loss function calculation
    L = cross_entropy_error(train_y, train_t)
    if i % 100 == 0:  # every 100 epochs
        print("L=", L)

    # Gradient calculation
    a1 = (train_y - train_t) / TRAIN_DATA_SIZE
    b2_gradient = a1.sum(axis=0)
    W2_gradient = np.dot(y2.T, a1)
    a2 = np.dot(a1, W2.T)
    a2[y1 <= 0.0] = 0
    b1_gradient = a2.sum(axis=0)
    W1_gradient = np.dot(train_x.T, a2)

    # Parameter update
    W1 = W1 - LEARNING_RATE * W1_gradient
    W2 = W2 - LEARNING_RATE * W2_gradient
    b1 = b1 - LEARNING_RATE * b1_gradient
    b2 = b2 - LEARNING_RATE * b2_gradient
# Result display
# L value for the final training data
L = cross_entropy_error(forward(train_x), train_t)
print("L value of final training data =", L)

# Test data results
test_y = forward(test_x)
print("After learning =", (test_y.argmax(axis=1) == test_t.argmax(axis=1)).sum(), '/', 150 - TRAIN_DATA_SIZE)
(Result)
Before learning = 42 / 100
L= 4.550956552060894
L= 0.3239415165787326
L= 0.2170679838829666
L= 0.04933110713361697
L= 0.0273865499319481
L= 0.018217122389043848
L= 0.013351028977015358
L= 0.010399165844496665
L= 0.008444934117102367
L= 0.007068429052588092
L value of final training data = 0.0060528995955394386
After learning = 89 / 100
⇒ [Discussion]