This is the content of Course 1, Week 3 (C1W3) of the Deep Learning Specialization.
(C1W3L01) Neural Network Overview
- Week 3 explains the implementation of a neural network
- About the first layer of the neural network: $W^{[1]}$, $b^{[1]}$ are its parameters
(C1W3L02) Neural Network Representation
- Explanation of a network with a single hidden layer (= a 2-layer neural network; when counting layers, the input layer is not counted, only the hidden layer and the output layer)
(C1W3L03) Computing a Neural Network Output
- Explanation of how to compute the output of a neural network
- $a_i^{[l]}$: the $i$-th node of layer $l$
- Vectorize and compute (see the sketch after the equations):
z^{[1]} = W^{[1]} x + b^{[1]} \\
a^{[1]} = \sigma(z^{[1]}) \\
z^{[2]} = W^{[2]} a^{[1]} + b^{[2]} \\
a^{[2]} = \sigma(z^{[2]}) \\
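A minimal NumPy sketch of this forward pass for a single example; the layer sizes and the random parameter values here are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Made-up sizes for illustration: 3 input features, 4 hidden units, 1 output unit
n_x, n_1, n_2 = 3, 4, 1

x = np.random.randn(n_x, 1)             # one training example, shape (n_x, 1)
W1 = np.random.randn(n_1, n_x) * 0.01   # parameters of layer 1
b1 = np.zeros((n_1, 1))
W2 = np.random.randn(n_2, n_1) * 0.01   # parameters of layer 2
b2 = np.zeros((n_2, 1))

z1 = W1 @ x + b1      # z^[1] = W^[1] x + b^[1]
a1 = sigmoid(z1)      # a^[1] = sigma(z^[1])
z2 = W2 @ a1 + b2     # z^[2] = W^[2] a^[1] + b^[2]
a2 = sigmoid(z2)      # a^[2] = sigma(z^[2]), shape (1, 1)
```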
(C1W3L04) Vectorizing Across Multiple Examples
- How to compute over multiple training examples (a NumPy sketch follows at the end of this section)
- $X = \left[ x^{(1)} \, x^{(2)} \, \cdots \, x^{(m)} \right]$: an $(n_x, m)$ matrix, where $m$ is the number of training examples
Z^{[1]} = W^{[1]} X + b^{[1]} \\
A^{[1]} = \sigma\left(Z^{[1]}\right) \\
Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]} \\
A^{[2]} = \sigma\left(Z^{[2]}\right)
- $Z^{[1]}$, $A^{[1]}$: rows correspond to the hidden units, columns to the $m$ training examples
Z^{[1]} = \left[ z^{[1](1)}\,z^{[1](2)}\,\cdots z^{[1](m)} \right] \\
A^{[1]} = \left[ a^{[1](1)}\,a^{[1](2)}\,\cdots a^{[1](m)} \right]
- Explained very slowly and carefully. This part is important; if you stumble here, you will have a hard time later.
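A sketch of the vectorized forward pass over $m$ examples, with made-up sizes; each column of $X$ is one training example:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Made-up sizes: features, hidden units, output units, number of examples
n_x, n_1, n_2, m = 3, 4, 1, 5

X = np.random.randn(n_x, m)             # X = [x^(1) x^(2) ... x^(m)], shape (n_x, m)
W1 = np.random.randn(n_1, n_x) * 0.01
b1 = np.zeros((n_1, 1))
W2 = np.random.randn(n_2, n_1) * 0.01
b2 = np.zeros((n_2, 1))

Z1 = W1 @ X + b1    # shape (n_1, m); column j is z^[1](j)
A1 = sigmoid(Z1)    # shape (n_1, m); column j is a^[1](j)
Z2 = W2 @ A1 + b2   # shape (n_2, m)
A2 = sigmoid(Z2)    # shape (n_2, m)
```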
(C1W3L05) Explanation For Vectorized Implementation
X = \left[x^{(1)} \, x^{(2)} \, \cdots x^{(m)}\right] \\
Z^{[1]} = \left[z^{[1](1)}\,z^{[1](2)}\,\cdots z^{[1](m)}\right] \\
Z^{[1]} = W^{[1]} X + b^{[1]}
- $b^{[1]}$ is expanded into a matrix by Python (NumPy) broadcasting
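A small check of how the $(n^{[1]}, 1)$ column vector $b^{[1]}$ is broadcast across the $m$ columns (the sizes are made up):

```python
import numpy as np

n_1, m = 4, 5                  # made-up sizes
WX = np.random.randn(n_1, m)   # stands in for W^[1] X, shape (n_1, m)
b1 = np.random.randn(n_1, 1)   # shape (n_1, 1)

Z1 = WX + b1                   # b1 is broadcast to every column; result shape (n_1, m)
assert np.allclose(Z1[:, 2], WX[:, 2] + b1[:, 0])   # each column gets the same b^[1]
```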
(C1W3L06) Activation functions
- sigmoid function: $g(z) = \dfrac{1}{1 + e^{-z}}$
- tanh function: $g(z) = \tanh(z)$
- ReLU function: $g(z) = \max(0, z)$
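A minimal NumPy sketch of these activation functions (Leaky ReLU, whose derivative appears in C1W3L08 below, is included as well):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z, slope=0.01):
    # slope of 0.01 on the negative side, as in the Leaky ReLU used later
    return np.maximum(slope * z, z)
```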
(C1W3L07) Why do you need non-linear activation functions?
- Why use a non-linear activation function? → If the activation is linear, then no matter how many hidden layers you add, the network as a whole is still just a linear function of the input, so the hidden layers are useless.
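As a quick check: with the identity activation $g(z) = z$, two layers collapse into one linear map, because the composition of linear functions is linear.

a^{[1]} = W^{[1]} x + b^{[1]} \\
a^{[2]} = W^{[2]} a^{[1]} + b^{[2]} = \left( W^{[2]} W^{[1]} \right) x + \left( W^{[2]} b^{[1]} + b^{[2]} \right) = W^\prime x + b^\prime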
(C1W3L08) Derivatives of activation functions
g(z) = \frac{1}{1+e^{-z}} \\
g^\prime(z) = g(z) \left( 1-g(z) \right)
g(z) = \tanh (z) \\
g^\prime(z) = 1-\left( \tanh(z) \right)^2
g(z) = \max\left(0, z\right) \\
g^\prime(z) = 0 \ (\text{if}\ z \lt 0) \\
g^\prime(z) = 1 \ (\text{if}\ z \ge 0)
g(z) = \max\left(0.01z, z\right) \\
g^\prime(z) = 0.01 \ (\textrm{if}\ z \lt 0) \\
g^\prime(z) = 1 \ (\textrm{if}\ z \ge 0)
- The derivative at $z = 0$ for ReLU and Leaky ReLU can be taken as 0, 1, or left undefined (the probability that $z$ is exactly 0 during the computation is negligible)
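A NumPy sketch of these derivatives; the value chosen at $z = 0$ follows the note above:

```python
import numpy as np

def d_sigmoid(z):
    g = 1 / (1 + np.exp(-z))
    return g * (1 - g)                  # g'(z) = g(z)(1 - g(z))

def d_tanh(z):
    return 1 - np.tanh(z) ** 2          # g'(z) = 1 - tanh(z)^2

def d_relu(z):
    return np.where(z >= 0, 1.0, 0.0)   # 0 if z < 0, 1 if z >= 0

def d_leaky_relu(z, slope=0.01):
    return np.where(z >= 0, 1.0, slope)  # 0.01 if z < 0, 1 if z >= 0
```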
(C1W3L09) Gradient descent for neural networks
- $n^{[0]} = n_x$, $n^{[1]}$, $n^{[2]} (= 1)$: the number of units in each layer
- The parameters are $W^{[1]}$ (an $(n^{[1]}, n^{[0]})$ matrix), $b^{[1]}$ (an $(n^{[1]}, 1)$ matrix), $W^{[2]}$ (an $(n^{[2]}, n^{[1]})$ matrix), and $b^{[2]}$ (an $(n^{[2]}, 1)$ matrix)
- Forward propagation:
Z^{[1]} = W^{[1]} X + b^{[1]} \\
A^{[1]} = g^{[1]}\left( Z^{[1]} \right) \\
Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]} \\
A^{[2]} = g^{[2]}\left( Z^{[2]} \right) = \sigma \left( Z^{[2]} \right)
- Backpropagation:
dZ^{[2]} = A^{[2]} - Y \ \ \left( Y = \left[ y^{(1)} \, y^{(2)} \, \cdots y^{(m)} \right] \right) \\
dW^{[2]} = \frac{1}{m} dZ^{[2]} A^{[1]\textrm{T}}\\
db^{[2]} = \frac{1}{m} \textrm{np.sum} \left( dZ^{[2]} \textrm{, axis=1, keepdims=True} \right)\\
dZ^{[1]} = W^{[2]\textrm{T}}dZ^{[2]} \ast g^{[1]\prime} \left(Z^{[1]}\right) \\
dW^{[1]} = \frac{1}{m}dZ^{[1]} X^{\text{T}} \\
db^{[1]} = \frac{1}{m} \textrm{np.sum} \left( dZ^{[1]} \textrm{, axis=1, keepdims=True} \right)\\
- If you do not pass `keepdims=True` to `np.sum`, the result is a vector of shape $(n^{[i]},)$; with `keepdims=True`, it is a matrix of shape $(n^{[i]}, 1)$
- If you do not use `keepdims=True`, call `reshape` afterwards
- The $\ast$ in the expression for $dZ^{[1]}$ denotes the element-wise product
- Tips about `np.sum` are casually woven into the lecture (it is important to be aware of the dimensions)
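A minimal end-to-end sketch that puts the forward propagation, backpropagation, and gradient descent update above together, assuming tanh in the hidden layer, sigmoid at the output, and made-up sizes, data, and learning rate:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Made-up sizes and data for illustration
n_x, n_1, n_2, m = 3, 4, 1, 100
X = np.random.randn(n_x, m)
Y = (np.random.rand(1, m) > 0.5).astype(float)
learning_rate = 1.0

W1 = np.random.randn(n_1, n_x) * 0.01
b1 = np.zeros((n_1, 1))
W2 = np.random.randn(n_2, n_1) * 0.01
b2 = np.zeros((n_2, 1))

for _ in range(1000):
    # Forward propagation
    Z1 = W1 @ X + b1
    A1 = np.tanh(Z1)                 # g^[1] = tanh
    Z2 = W2 @ A1 + b2
    A2 = sigmoid(Z2)                 # g^[2] = sigma

    # Backpropagation
    dZ2 = A2 - Y
    dW2 = (dZ2 @ A1.T) / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m    # keepdims=True keeps shape (n_2, 1)
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)              # * is element-wise; g^[1]'(Z1) = 1 - tanh(Z1)^2
    dW1 = (dZ1 @ X.T) / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m

    # Gradient descent update
    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1
    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2
```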
(C1W3L10) Backpropagation Intuition (optional)
- Intuitive explanation of the vectorized implementation of backpropagation, starting from logistic regression
- "The most mathematically difficult part of the neural network"
(C1W3L11) Random Initialization
- For logistic regression, it is OK to initialize the weights to 0
- In a neural network, initializing the weights $W$ to 0 does not work
- If all elements of $W^{[1]}$ are 0 and all elements of $b^{[1]}$ are 0, every hidden unit performs exactly the same computation regardless of how many hidden units there are. In that case there is no point in having multiple units; the network behaves the same as if it had only one unit.
- Initialization method:
W^{[1]} = \textrm{np.random.randn(2, 2)} \ast 0.01 \\
b^{[1]} = \textrm{np.zeros((2, 1))}
- $b^{[1]}$ can be 0, because the symmetry is already broken once $W^{[1]}$ is initialized randomly
- The initial values of $W$ should be small. If $W$ is large, $Z = Wx + b$ becomes large, and where sigmoid and $\tanh$ take large input values their slope is small, so gradient descent learns slowly.
- For a shallow neural network such as one with a single hidden layer, 0.01 is fine; for deep neural networks, a value other than 0.01 may work better
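A sketch that generalizes this initialization to arbitrary layer sizes; the helper name `initialize_parameters` and the `scale` argument are my own, for illustration:

```python
import numpy as np

def initialize_parameters(n_x, n_h, n_y, scale=0.01):
    """Random initialization for a 2-layer network; the 0.01 scale follows the lecture."""
    W1 = np.random.randn(n_h, n_x) * scale   # small random values break symmetry
    b1 = np.zeros((n_h, 1))                  # biases may start at 0
    W2 = np.random.randn(n_y, n_h) * scale
    b2 = np.zeros((n_y, 1))
    return W1, b1, W2, b2
```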
- Deep Learning Specialization (Coursera) self-study record (table of contents)