Day2
** Section1) Vanishing Gradient Problem Overview **
1-1 Activation functions
・ ReLU function
・ Sigmoid (logistic) function
・ Hyperbolic tangent function
1-2 How to set the initial weight values
・ Xavier: each weight element is initialized with a random value divided by the square root of the number of nodes in the previous layer. ⇒ Target activation functions: ReLU, sigmoid, hyperbolic tangent
# Xavier initial values (assumes numpy imported as np and the layer sizes defined elsewhere)
network['W1'] = np.random.randn(input_layer_size, hidden_layer_1_size) / (np.sqrt(input_layer_size))
network['W2'] = np.random.randn(hidden_layer_1_size, hidden_layer_2_size) / (np.sqrt(hidden_layer_1_size))
network['W3'] = np.random.randn(hidden_layer_2_size, output_layer_size) / (np.sqrt(hidden_layer_2_size))
・ He: each weight element is initialized with a random value divided by the square root of the number of nodes in the previous layer and multiplied by $\sqrt{2}$. ⇒ Target activation function: ReLU
# He initial values: Xavier scaling multiplied by sqrt(2)
network['W1'] = np.random.randn(input_layer_size, hidden_layer_1_size) / np.sqrt(input_layer_size) * np.sqrt(2)
network['W2'] = np.random.randn(hidden_layer_1_size, hidden_layer_2_size) / np.sqrt(hidden_layer_1_size) * np.sqrt(2)
network['W3'] = np.random.randn(hidden_layer_2_size, output_layer_size) / np.sqrt(hidden_layer_2_size) * np.sqrt(2)
1-3 Batch normalization
Batch normalization is a method of suppressing the bias of the input data in mini-batch units. It is typically applied immediately before or after passing values to the activation function.
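As a minimal sketch (not course code; `gamma`, `beta`, and `eps` are assumed scale/shift/stability parameters), batch normalization over a mini-batch can be written as:

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-8):
    """Normalize a mini-batch feature-wise, then scale and shift.
    gamma/beta are the learnable scale/shift parameters (assumed scalars here)."""
    mu = x.mean(axis=0)                 # per-feature mean over the mini-batch
    var = x.var(axis=0)                 # per-feature variance over the mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# usage: normalize the pre-activation values of a layer
x = np.random.randn(32, 100)            # mini-batch of 32 samples, 100 features
out = batch_norm(x)
```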
** Section2) Learning rate optimization methods **
If the learning rate is too large, the optimum value is never reached and learning diverges. If the learning rate is small, learning does not diverge, but if it is too small it takes a long time to converge, and it becomes harder to converge to the global optimum.
◆ Learning rate optimization guidelines:
・ Set the initial learning rate to a large value and gradually decrease it.
・ Use a variable learning rate for each parameter.
2-1 Momentum
After subtracting the product of the learning rate and the gradient of the error with respect to the parameters (gradient descent), add the product of the inertia coefficient and the difference between the current and previous weights.
◆ Benefits of Momentum
・ It tends to reach the global optimum rather than getting stuck in a local optimum.
・ Once in a valley, it reaches the lowest point (the optimum) quickly.
2-2 AdaGrad
Subtract the product of a redefined (per-parameter) learning rate and the gradient of the error with respect to the parameters.
◆ Advantage of AdaGrad
・ It approaches the optimum value well even on error surfaces with gentle slopes.
◆ Issue
・ Because the learning rate keeps decreasing, it can get stuck at saddle points.
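A minimal sketch of the Momentum and AdaGrad updates described above (variable and parameter names are illustrative, not from the course code):

```python
import numpy as np

# Momentum: add the inertia term (momentum * previous update) to the gradient step
def momentum_update(w, grad, v, lr=0.01, momentum=0.9):
    v = momentum * v - lr * grad       # v keeps the previous update direction
    return w + v, v

# AdaGrad: divide the learning rate by the accumulated squared gradients
def adagrad_update(w, grad, h, lr=0.01, eps=1e-7):
    h = h + grad * grad                # accumulate all past squared gradients equally
    return w - lr * grad / (np.sqrt(h) + eps), h
```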
2-3 RMSProp
Subtract the product of a redefined learning rate and the gradient of the error with respect to the parameters; the per-parameter scaling uses an exponential moving average of the squared gradients.
◆ Advantages of RMSProp
・ It tends to reach the global optimum rather than getting stuck in a local optimum.
・ There are few cases where the hyperparameters need to be tuned.
2-4 Adam
An optimization method that incorporates Momentum (an exponential decay average of past gradients) and RMSProp (an exponential decay average of past squared gradients).
◆ Advantage of Adam
・ It is an optimization algorithm that combines the advantages of Momentum and RMSProp.
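Likewise, a minimal sketch of the RMSProp and Adam updates (the hyperparameter values shown are common defaults, assumed here):

```python
import numpy as np

# RMSProp: exponential moving average of squared gradients instead of a full sum
def rmsprop_update(w, grad, h, lr=0.01, decay=0.99, eps=1e-7):
    h = decay * h + (1 - decay) * grad * grad
    return w - lr * grad / (np.sqrt(h) + eps), h

# Adam: combines Momentum (1st moment) and RMSProp (2nd moment) with bias correction
def adam_update(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # t is the 1-based step count
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)       # bias correction for the first moment
    v_hat = v / (1 - beta2 ** t)       # bias correction for the second moment
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```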
** Section3) About overfitting **
Overfitting is when the learning curves for the test error and the training error diverge from each other. ⇒ It occurs when the degrees of freedom of the network (number of layers, number of nodes, parameter values, etc.) are high and the network has specialized in learning the specific training samples, for reasons such as:
・ A large number of parameters
・ Inappropriate parameter values
・ Many nodes, etc.
3-1 L1 regularization and L2 regularization ⇒ Lasso and Ridge regularization: a penalty on the weights (the sum of absolute values for L1, the sum of squares for L2) is added to the loss.
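A minimal sketch of adding the L1/L2 penalty term to a loss (the `rate` parameter and the list-of-weight-matrices interface are assumptions for illustration):

```python
import numpy as np

# Add the regularization term to the loss; rate is the regularization strength.
def regularized_loss(loss, weights, rate=0.01, norm='L2'):
    if norm == 'L1':                    # Lasso: sum of absolute values -> sparse weights
        penalty = sum(np.sum(np.abs(W)) for W in weights)
    else:                               # Ridge: sum of squared values -> small weights
        penalty = sum(np.sum(W ** 2) for W in weights) / 2
    return loss + rate * penalty
```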
3-2 Dropout
Learning while randomly dropping (deleting) nodes.
◆ Merit
・ It can be interpreted as training different models without changing the amount of data.
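A minimal sketch of dropout (the `drop_ratio` and the inference-time scaling convention are assumptions for illustration):

```python
import numpy as np

def dropout(x, drop_ratio=0.5, train=True):
    """Randomly zero out nodes during training; scale at inference instead."""
    if train:
        mask = np.random.rand(*x.shape) > drop_ratio   # keep each node with prob. 1 - drop_ratio
        return x * mask
    return x * (1.0 - drop_ratio)                      # expected scaling at test time
```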
** Section4) Conceptual overview of convolutional neural networks (CNN structure) **
4-1 Convolution layer
Outputs the result of the convolution operation for each filter.
◆ Merit
・ By preserving spatial information, it overcomes the shortcomings of the fully connected layer.
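A minimal sketch of the convolution operation for a single channel and a single filter (no padding; naive loops for clarity, not the course implementation):

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Naive 2D convolution (no padding) of a single-channel image with one filter."""
    H, W = image.shape
    FH, FW = kernel.shape
    out_h = (H - FH) // stride + 1
    out_w = (W - FW) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            region = image[i*stride:i*stride+FH, j*stride:j*stride+FW]
            out[i, j] = np.sum(region * kernel)        # element-wise product, then sum
    return out
```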
4-2 Pooling layer
Applied after the convolution layer. It transforms the input data into a more manageable form by compressing the information (down-sampling).
◆ Merits
・ Robust against small positional shifts
・ Suppresses overfitting to some extent
・ Reduces computational cost
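A minimal sketch of max pooling for a single-channel feature map (window and stride values are illustrative):

```python
import numpy as np

def max_pool(feature_map, pool=2, stride=2):
    """Down-sample a single-channel feature map by taking the max in each window."""
    H, W = feature_map.shape
    out_h = (H - pool) // stride + 1
    out_w = (W - pool) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i*stride:i*stride+pool, j*stride:j*stride+pool]
            out[i, j] = window.max()                   # keep only the strongest response
    return out
```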
4-3 Other commonly used layers
・ Fully connected layer
・ Dropout layer
・ Batch Normalization layer, etc.
** Section 5) Latest CNN **
5-1 AlexNet
・ Model structure ⇒ Consists of 5 convolution layers and pooling layers followed by 3 fully connected layers.
・ Measures against overfitting ⇒ Dropout is applied to the outputs of the fully connected layers of size 4096.
[P20] Find $\frac{dz}{dx}$ using the chain rule, where $z = t^2$ and $t = x + y$.
** ● Consideration: ** $ \frac{dz}{dx}=\frac{dz}{dt}\frac{dt}{dx} $, where $ \frac{dz}{dt} = 2t $, $ \frac{dt}{dx} = 1 $, and $ t = x+y $. Substituting: $ \frac{dz}{dx} = 2(x+y) $
[P12] When the sigmoid function is differentiated, the maximum value is taken when the input value is 0. Select the correct value from the options. (1)0.15 (2)0.25 (3)0.35 (4)0.4
** ● Consideration: ** $ sigmoid'(x) = sigmoid(x)(1-sigmoid(x)) $ $ sigmoid(0) = 0.5 $
Substitution result of the above formula: $ sigmoid'(0) = 0.5*(1-0.5) = 0.25 $ Therefore, the correct answer is (2)
[P28] What kind of problem occurs when the initial value of the weight is set to 0? Explain briefly.
** ● Consideration: ** If all weights are initialized to 0, the product of weight and input is 0 everywhere, so every node passes the same value to the next layer; all weights then receive identical updates and can never be tuned (the symmetry is never broken).
[P31] List two commonly considered effects of batch normalization. ** ● Consideration: ** ・ It stabilizes the learning process as a whole. ・ It speeds up learning.
[P47] Briefly explain the characteristics of Momentum, AdaGrad, and RMSProp.
** ● Consideration: **
・ Characteristics of momentum
The parameters are updated by adding α times the previous update amount, taking inertia into account (in effect, the step size along consistently repeated gradient directions is adjusted automatically).
・ Features of AdaGrad
The learning rate is adjusted automatically by accumulating all past gradient information equally; the effective learning rate η becomes smaller and smaller as learning is repeated.
・ Features of RMSProp
The effective learning rate η is adjusted as learning is repeated by taking an exponential moving average of the squared past gradients, so recent gradients count more than old ones.
[P68] Regarding the figure below, answer which graph shows L1 regularization. ** ● Consideration: ** The graph in which the estimated weights are drawn onto the axes (a sparse solution) corresponds to L1 regularization, since the L1 (Lasso) constraint region is diamond-shaped.
[P100] Answer the size of the output image when an input image of size 6×6 is convolved with a filter of size 2×2. The stride and padding are both 1. ** ● Consideration: ** Using output size = (W + 2P − F)/S + 1: Height: (6 + 2×1 − 2)/1 + 1 = 7, Width: (6 + 2×1 − 2)/1 + 1 = 7. Answer: 7×7.
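A small helper to double-check such output sizes (the formula is the standard one used above):

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Output size of a convolution: (W + 2P - F) / S + 1."""
    return (input_size + 2 * padding - filter_size) // stride + 1

print(conv_output_size(6, 2, stride=1, padding=1))   # -> 7
```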
(Figure: vanishing gradient)
================================================================================================================= Day3
** AlexNet ** AlexNet is the model that won the image recognition competition (ILSVRC) held in 2012, beating the second-place entry by a large margin. With the advent of AlexNet, deep learning attracted a great deal of attention.
** Section1) Concept of recurrent neural network **
・ Overview of RNN
An RNN is a neural network that can handle time-series data: data observed at regular intervals whose elements have statistical dependencies on one another.
For example: voice data, text data, etc.
・ About RNN
An RNN has a recursive structure in which the hidden layer holds the initial state and the state at the previous time t-1, and the state at the next time t is computed recursively from them.
・ RNN mathematical description
$ u^t = W_{(in)} x^t + W z^{t-1} + b $
$ z^t = f(W_{(in)} x^t + W z^{t-1} + b) $
$ v^t = W_{(out)} z^t + c $
$ y^t = g(W_{(out)} z^t + c) $
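A minimal sketch of one forward step following these equations (tanh and softmax are assumed choices for f and g; names are illustrative, not course code):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def rnn_step(x_t, z_prev, W_in, W, W_out, b, c):
    """One forward step of a simple RNN, following the equations above."""
    u_t = W_in.dot(x_t) + W.dot(z_prev) + b   # pre-activation of the hidden state
    z_t = np.tanh(u_t)                        # f = tanh (an assumed choice)
    v_t = W_out.dot(z_t) + c
    y_t = softmax(v_t)                        # g = softmax (an assumed choice)
    return z_t, y_t
```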
1-2 BPTT
・ BPTT (backpropagation through time) is a parameter adjustment method for RNNs
⇒ a variant of error backpropagation applied through the time steps
** Section2) LSTM overview (the flow so far and its issues) **
2-1 CEC
The CEC stores past information; it is the part that addresses the vanishing and exploding gradient problems. We want to be able to propagate the information stored in the CEC to other nodes at an arbitrary timing, or to forget it at an arbitrary timing.
2-2 Input gate and output gate
By adding input and output gates, the weights applied to the values entering each gate can be changed through the weight matrices W and U.
2-3 Forget gate
Deletes the past information stored in the CEC when it is no longer needed.
2-4 Peephole connection
A structure that lets the gates look directly at (mask over) the information in the CEC, so that it can be propagated to other nodes or forgotten at an arbitrary timing.
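A minimal sketch of one LSTM step with the gates above (biases are omitted and the weight matrices W_*/U_* are assumed to be defined elsewhere; this is an illustration, not the course code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, Wf, Uf, Wi, Ui, Wo, Uo, Wc, Uc):
    """One LSTM step: the CEC (cell state c) is controlled by forget/input/output gates."""
    f = sigmoid(Wf.dot(x_t) + Uf.dot(h_prev))   # forget gate: how much of the CEC to keep
    i = sigmoid(Wi.dot(x_t) + Ui.dot(h_prev))   # input gate: how much new information to store
    o = sigmoid(Wo.dot(x_t) + Uo.dot(h_prev))   # output gate: how much of the CEC to expose
    c_tilde = np.tanh(Wc.dot(x_t) + Uc.dot(h_prev))
    c_t = f * c_prev + i * c_tilde              # update the CEC
    h_t = o * np.tanh(c_t)
    return h_t, c_t
```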
** Section3) GRU ** In the conventional LSTM, the computational load is heavy because there are many parameters. The GRU significantly reduces the number of parameters while accuracy equal to or better than the LSTM can be expected.
** Section4) Bidirectional RNN ** A model for improving accuracy by adding future information as well as past information
** Section5) Seq2Seq Overview ** Seq2seq is a type of Encoder-Decoder model. It is used for machine dialogue and machine translation. 5-1 Encoder RNN A structure that takes the text data input by the user, splits it into tokens such as words, and passes them in. 5-2 Decoder RNN A structure in which the system generates the output data token by token (for example, word by word). 5-3 HRED Generates responses that take the history of past utterances into account. 5-4 VHRED HRED with the concept of VAE latent variables added. 5-5 VAE VAE makes it possible to compress the data into a probability distribution over a latent variable z. 5-5-1 Autoencoder A form of unsupervised learning. Therefore, only the input data is used at training time, and no teacher (label) data is used.
** Section6) Word2vec ** Word2vec builds a vocabulary from the training data and makes learning distributed representations from large-scale data feasible with realistic computation time and memory.
** Section7) Attention Mechanism ** A mechanism for learning the degree of relevance between the words in the input and the words in the output.
[P11] Answer the size of the output image when an input image of size 5×5 is convolved with a filter of size 3×3. The stride is 2 and the padding is 1.
** ● Consideration: **
Answer: Height: (5 + 2×1 − 3) / 2 + 1 = 3
Width: (5 + 2×1 − 3) / 2 + 1 = 3 ⇒ 3×3
[P23] An RNN network has broadly three kinds of weights. One is the weight applied when computing the current middle (hidden) layer from the input, and another is the weight applied when computing the output from the middle layer. Explain the remaining weight.
** ● Consideration: ** Answer: The weight applied to the hidden-layer state at the previous time (t-1) when recursively computing the hidden-layer state at the current time t.
[P35] Find $\frac{dz}{dx}$ using the chain rule, where $z = t^2$ and $t = x + y$.
** ● Consideration: ** $ \frac{dz}{dx}=\frac{dz}{dt}\frac{dt}{dx} $, where $ \frac{dz}{dt} = 2t $, $ \frac{dt}{dx} = 1 $, and $ t = x+y $. Substituting: $ \frac{dz}{dx} = 2(x+y) $
[P44] Express $y_1$ in the figure below as a mathematical formula using $x$, $s_0$, $s_1$, $W_{(in)}$, $W$, and $W_{(out)}$.
** ● Consideration: **
$ z_1 = sigmoid(W_{(in)} x_1 + W s_0 + b) $
$ y_1 = sigmoid(W_{(out)} z_1 + c) $
[P61] When the sigmoid function is differentiated, the maximum value is taken when the input value is 0. Select the correct value from the options.
(1)0.15
(2)0.25
(3)0.35
(4)0.4
** ● Consideration: **
$
sigmoid'(x) = sigmoid(x)(1-sigmoid(x))
$
$
sigmoid(0) = 0.5
$
Substitution result of the above formula:
$
sigmoid'(0) = 0.5*(1-0.5) = 0.25
$
Therefore, the correct answer is (2)
[P71] Suppose you want to input the following sentence into an LSTM and predict the word that fills the blank. The word "very" in the text is considered to have no effect on the blank prediction even if it is forgotten. Which gate is thought to act in such a case?
"The movie was interesting. By the way, I'm so hungry that something ____."
** ● Consideration: ** The forget gate has the function of forgetting information at the timing when past information is no longer needed. ⇒ Therefore, the answer is the forget gate.
[P87] Briefly describe the issues facing LSTM and CEC.
** ● Consideration: **
・ Issue of the LSTM: the number of parameters is large, so the computational load is high.
※ Solution: the GRU significantly reduces the parameters of the conventional LSTM while keeping a structure from which equal or better accuracy can be expected, which lowers the computational load.
・ Issue of the CEC: the weights applied to the input data are uniform regardless of time dependence. ⇒ By itself it has none of the learning characteristics of a neural network.
※ Solution: a peephole connection is a structure that makes it possible to propagate the value of the CEC itself via a weight matrix, so that the information can be forgotten at the timing when the past information is no longer needed.
[P91] Briefly describe the difference between LSTM and GRU.
** ● Consideration: ** ・ In the GRU, the forget gate and the input gate are not separate gates; they are merged into a single update gate. ・ The GRU has fewer parameters than the LSTM.
[P108] From the options below, select the one that describes seq2seq. (1) RNNs in the forward and reverse directions with respect to time are constructed, and the two intermediate-layer representations are used as features. (2) A type of Encoder-Decoder model that uses RNNs, used for models such as machine translation. (3) A neural network that recursively creates expression vectors (phrases) from adjacent words on a tree structure such as a syntax tree (using the same weights) and obtains the expression vector of the entire sentence. (4) A type of RNN that solves the vanishing gradient problem of simple RNNs by introducing the concepts of the CEC and gates.
** ● Consideration: ** Answer: (2)
[P118] Briefly describe the difference between seq2seq and HRED, and between HRED and VHRED. ** ● Consideration: ** ・ The difference between seq2seq and HRED is that seq2seq has no context beyond a single question-response exchange, whereas HRED generates responses that follow the flow of the preceding utterances. ・ The difference between HRED and VHRED is that HRED produces the same output for the same input, whereas VHRED can produce varied outputs for the same input by adding a latent variable with stochastic noise to the context layer.
[P127] Answer the word that fills the blank in the explanation of VAE below. "Introducing ____ into the latent variable of the autoencoder."
** ● Consideration: ** Answer: Probability distribution
[P136] Briefly describe the difference between RNN and word2vec, and seq2seq and Attention.
** ● Consideration: ** ・ The difference between RNN and word2vec is that an RNN cannot directly take a variable-length string such as a word as input to the NN, whereas word2vec can represent a word as a fixed-length vector. ・ The difference between seq2seq and Attention is that seq2seq has difficulty handling long sentences, whereas Attention learns the degree of relevance between the words in the input and output and therefore handles long sentences more easily.
================================================================================================================= Day4
** Section1) TensorFlow implementation exercise **
** Section2) Reinforcement learning **
** 2-1 What is reinforcement learning **
A field of machine learning that aims to create agents that can choose actions in an environment so that the reward is maximized in the long run. ⇒ It is a mechanism for improving the principle of deciding actions, based on the profit (reward) given as a result of those actions.
** 2-2 Application example of reinforcement learning **
・ Environment: a company's marketing (promotion) department
・ Agent: software that decides which customers to send campaign emails to, based on their profiles and purchase histories
・ Action: for each customer, choose between two actions, send or do not send
・ Reward: a negative reward equal to the cost of the campaign and a positive reward equal to the sales estimated to be generated by the campaign
** 2-3 Trade-off between exploration and exploitation **
With perfect knowledge of the environment in advance, it would be possible to predict and determine the optimal behavior. In reinforcement learning, data is collected while acting on incomplete knowledge, and the best actions are found along the way.
・ If you always take only the actions that were best in the past data, you cannot discover other, possibly better, actions (insufficient exploration).
・ If you keep taking only unknown actions, you cannot make use of past experience (insufficient exploitation).
Exploration and exploitation are therefore in a trade-off.
** 2-4 Image of reinforcement learning **
** 2-5 How reinforcement learning differs **
The difference between reinforcement learning and supervised / unsupervised learning is, in short, a difference of goals.
・ In supervised and unsupervised learning, the goal is to find the patterns contained in the data and to make predictions from the data.
・ In reinforcement learning, the goal is to find an excellent policy.
** 2-6 Action value function **
There are two types of value functions: the state value function and the action value function. When focusing on the value of a state, use the state value function $V(s)$; when focusing on the value of a combination of state and action, use the action value function $Q(s, a)$.
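As a minimal, hypothetical illustration of this distinction (the numbers of states and actions are made up), the two value functions can be represented as tables:

```python
import numpy as np

# Hypothetical tabular example: V looks only at states, Q at state-action pairs.
n_states, n_actions = 5, 2
V = np.zeros(n_states)                 # state value function V(s)
Q = np.zeros((n_states, n_actions))    # action value function Q(s, a)

# a greedy policy derived from Q picks, in each state, the action with the highest value
greedy_actions = Q.argmax(axis=1)
```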
** 2-7 Policy function **
A policy function is a function that, in a policy-based reinforcement learning method, gives the probability of taking each action in a given state.
** 2-8 Policy gradient method **
A policy iteration technique that models the policy directly and optimizes it.
$ \theta^{t+1} = \theta^{t} + \epsilon \nabla J(\theta) $
※ The update is evaluated using the goodness $J(\theta)$ of the defined policy.
◆ How to define $J(\theta)$ in the policy gradient method:
・ Average reward
・ Discounted reward sum
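A minimal sketch of the update rule above, assuming an externally estimated gradient of $J(\theta)$ (names and sizes are illustrative, not from the course code):

```python
import numpy as np

def policy_gradient_step(theta, grad_J, epsilon=0.01):
    """Gradient ascent on the policy objective J(theta): theta <- theta + epsilon * grad J."""
    return theta + epsilon * grad_J

# usage with a hypothetical estimate of the policy gradient
theta = np.zeros(10)
grad_J = np.random.randn(10)            # stand-in for an estimated gradient of J(theta)
theta = policy_gradient_step(theta, grad_J)
```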