This is the content of Course 1, Week 3 (C1W3) of the Deep Learning Specialization.
(C1W3L01) Neural Network Overview
- Week 3 explains the implementation of a neural network
- About the first layer of the neural network: $W^{[1]}$, $b^{[1]}$ are its parameters
(C1W3L02) Neural Network Representation
- Explanation of a network with a single hidden layer (= a 2-layer neural network; when counting layers, the input layer is not counted, only the hidden layer and the output layer)
(C1W3L03) Computing a Neural Network Output
- Explanation of how to compute the output of a neural network
- $a_i^{[l]}$: the $i$-th node of layer $l$
- Vectorize and compute (see the sketch after the equations):
z^{[1]} = W^{[1]} x + b^{[1]} \\
a^{[1]} = \sigma(z^{[1]}) \\
z^{[2]} = W^{[2]} a^{[1]} + b^{[2]} \\
a^{[2]} = \sigma(z^{[2]}) \\
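A minimal NumPy sketch of this forward pass for a single example; the layer sizes and the random parameter values here are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Made-up sizes for illustration: 3 input features, 4 hidden units, 1 output unit
n_x, n_1, n_2 = 3, 4, 1

x = np.random.randn(n_x, 1)             # one training example, shape (n_x, 1)
W1 = np.random.randn(n_1, n_x) * 0.01   # parameters of layer 1
b1 = np.zeros((n_1, 1))
W2 = np.random.randn(n_2, n_1) * 0.01   # parameters of layer 2
b2 = np.zeros((n_2, 1))

z1 = W1 @ x + b1      # z^[1] = W^[1] x + b^[1]
a1 = sigmoid(z1)      # a^[1] = sigma(z^[1])
z2 = W2 @ a1 + b2     # z^[2] = W^[2] a^[1] + b^[2]
a2 = sigmoid(z2)      # a^[2] = sigma(z^[2]), shape (1, 1)
```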
(C1W3L04) Vectorizing Across Multiple Examples
- How to compute over multiple training examples (a NumPy sketch follows at the end of this section)
- $X = \left[ x^{(1)} \, x^{(2)} \, \cdots \, x^{(m)} \right]$: an $(n_x, m)$ matrix, where $m$ is the number of training examples
Z^{[1]} = W^{[1]} X + b^{[1]} \\
A^{[1]} = \sigma\left(Z^{[1]}\right) \\
Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]} \\
A^{[2]} = \sigma\left(Z^{[2]}\right)
- $Z^{[1]}$, $A^{[1]}$: rows correspond to the hidden units, columns to the $m$ training examples
Z^{[1]} = \left[ z^{[1](1)}\,z^{[1](2)}\,\cdots z^{[1](m)} \right] \\
A^{[1]} = \left[ a^{[1](1)}\,a^{[1](2)}\,\cdots a^{[1](m)} \right]
- Explained very slowly and carefully. This part is important; if you stumble here, you will have a hard time later.
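A sketch of the vectorized forward pass over $m$ examples, with made-up sizes; each column of $X$ is one training example:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Made-up sizes: features, hidden units, output units, number of examples
n_x, n_1, n_2, m = 3, 4, 1, 5

X = np.random.randn(n_x, m)             # X = [x^(1) x^(2) ... x^(m)], shape (n_x, m)
W1 = np.random.randn(n_1, n_x) * 0.01
b1 = np.zeros((n_1, 1))
W2 = np.random.randn(n_2, n_1) * 0.01
b2 = np.zeros((n_2, 1))

Z1 = W1 @ X + b1    # shape (n_1, m); column j is z^[1](j)
A1 = sigmoid(Z1)    # shape (n_1, m); column j is a^[1](j)
Z2 = W2 @ A1 + b2   # shape (n_2, m)
A2 = sigmoid(Z2)    # shape (n_2, m)
```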
(C1W3L05) Explanation For Vectorized Implementation
X = \left[x^{(1)} \, x^{(2)} \, \cdots x^{(m)}\right] \\
Z^{[1]} = \left[z^{[1](1)}\,z^{[1](2)}\,\cdots z^{[1](m)}\right] \\
Z^{[1]} = W^{[1]} X + b^{[1]}
- $b^{[1]}$ is expanded into a matrix by Python (NumPy) broadcasting
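A small check of how the $(n^{[1]}, 1)$ column vector $b^{[1]}$ is broadcast across the $m$ columns (the sizes are made up):

```python
import numpy as np

n_1, m = 4, 5                  # made-up sizes
WX = np.random.randn(n_1, m)   # stands in for W^[1] X, shape (n_1, m)
b1 = np.random.randn(n_1, 1)   # shape (n_1, 1)

Z1 = WX + b1                   # b1 is broadcast to every column; result shape (n_1, m)
assert np.allclose(Z1[:, 2], WX[:, 2] + b1[:, 0])   # each column gets the same b^[1]
```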
(C1W3L06) Activation functions
- sigmoid function: $g(z) = \dfrac{1}{1 + e^{-z}}$
- tanh function: $g(z) = \tanh(z)$
- ReLU function: $g(z) = \max(0, z)$
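A minimal NumPy sketch of these activation functions (Leaky ReLU, whose derivative appears in C1W3L08 below, is included as well):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z, slope=0.01):
    # slope of 0.01 on the negative side, as in the Leaky ReLU used later
    return np.maximum(slope * z, z)
```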
(C1W3L07) Why do you need non-linear activation functions?
- Why use a non-linear activation function? → If the activation is linear, then no matter how many hidden layers you add, the network as a whole is still just a linear function of the input, so the hidden layers are useless.
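As a quick check: with the identity activation $g(z) = z$, two layers collapse into one linear map, because the composition of linear functions is linear.

a^{[1]} = W^{[1]} x + b^{[1]} \\
a^{[2]} = W^{[2]} a^{[1]} + b^{[2]} = \left( W^{[2]} W^{[1]} \right) x + \left( W^{[2]} b^{[1]} + b^{[2]} \right) = W^\prime x + b^\prime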
(C1W3L08) Derivatives of activation functions
g(z) = \frac{1}{1+e^{-z}} \\
g^\prime(z) = g(z) \left( 1-g(z) \right)
g(z) = \tanh (z) \\
g^\prime(z) = 1-\left( \tanh(z) \right)^2
g(z) = \max\left(0, z\right) \\
g^\prime(z) = 0 \ (\text{if}\ z \lt 0) \\
g^\prime(z) = 1 \ (\text{if}\ z \ge 0)
g(z) = \max\left(0.01z, z\right) \\
g^\prime(z) = 0.01 \ (\textrm{if}\ z \lt 0) \\
g^\prime(z) = 1 \ (\textrm{if}\ z \ge 0)
- The derivative at $z = 0$ for ReLU and Leaky ReLU can be taken as 0, 1, or left undefined (the probability that $z$ is exactly 0 during the computation is negligible)
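A NumPy sketch of these derivatives; the value chosen at $z = 0$ follows the note above:

```python
import numpy as np

def d_sigmoid(z):
    g = 1 / (1 + np.exp(-z))
    return g * (1 - g)                  # g'(z) = g(z)(1 - g(z))

def d_tanh(z):
    return 1 - np.tanh(z) ** 2          # g'(z) = 1 - tanh(z)^2

def d_relu(z):
    return np.where(z >= 0, 1.0, 0.0)   # 0 if z < 0, 1 if z >= 0

def d_leaky_relu(z, slope=0.01):
    return np.where(z >= 0, 1.0, slope)  # 0.01 if z < 0, 1 if z >= 0
```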
(C1W3L09) Gradient descent for neural networks
- $n^{[0]} = n_x$, $n^{[1]}$, $n^{[2]} (= 1)$: the number of units in each layer
- The parameters are $W^{[1]}$ (an $(n^{[1]}, n^{[0]})$ matrix), $b^{[1]}$ (an $(n^{[1]}, 1)$ matrix), $W^{[2]}$ (an $(n^{[2]}, n^{[1]})$ matrix), and $b^{[2]}$ (an $(n^{[2]}, 1)$ matrix)
- Forward propagation:
Z^{[1]} = W^{[1]} X + b^{[1]} \\
A^{[1]} = g^{[1]}\left( Z^{[1]} \right) \\
Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]} \\
A^{[2]} = g^{[2]}\left( Z^{[2]} \right) = \sigma \left( Z^{[2]} \right)
- Backpropagation:
dZ^{[2]} = A^{[2]} - Y \ \ \left( Y = \left[ y^{(1)} \, y^{(2)} \, \cdots y^{(m)} \right] \right) \\
dW^{[2]} = \frac{1}{m} dZ^{[2]} A^{[1]\textrm{T}}\\
db^{[2]} = \frac{1}{m} \textrm{np.sum} \left( dZ^{[2]} \textrm{, axis=1, keepdims=True} \right)\\
dZ^{[1]} = W^{[2]\textrm{T}}dZ^{[2]} \ast g^{[1]\prime} \left(Z^{[1]}\right) \\
dW^{[1]} = \frac{1}{m}dZ^{[1]} X^{\text{T}} \\
db^{[1]} = \frac{1}{m} \textrm{np.sum} \left( dZ^{[1]} \textrm{, axis=1, keepdims=True} \right)\\
- If you do not pass `keepdims=True` to `np.sum`, the result is a vector of shape $(n^{[i]},)$; with `keepdims=True`, it is a matrix of shape $(n^{[i]}, 1)$
- If you do not use `keepdims=True`, call `reshape` afterwards
- The $\ast$ in the expression for $dZ^{[1]}$ denotes the element-wise product
- Tips about `np.sum` are casually woven into the lecture (it is important to be aware of the dimensions)
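A minimal end-to-end sketch that puts the forward propagation, backpropagation, and gradient descent update above together, assuming tanh in the hidden layer, sigmoid at the output, and made-up sizes, data, and learning rate:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Made-up sizes and data for illustration
n_x, n_1, n_2, m = 3, 4, 1, 100
X = np.random.randn(n_x, m)
Y = (np.random.rand(1, m) > 0.5).astype(float)
learning_rate = 1.0

W1 = np.random.randn(n_1, n_x) * 0.01
b1 = np.zeros((n_1, 1))
W2 = np.random.randn(n_2, n_1) * 0.01
b2 = np.zeros((n_2, 1))

for _ in range(1000):
    # Forward propagation
    Z1 = W1 @ X + b1
    A1 = np.tanh(Z1)                 # g^[1] = tanh
    Z2 = W2 @ A1 + b2
    A2 = sigmoid(Z2)                 # g^[2] = sigma

    # Backpropagation
    dZ2 = A2 - Y
    dW2 = (dZ2 @ A1.T) / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m    # keepdims=True keeps shape (n_2, 1)
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)              # * is element-wise; g^[1]'(Z1) = 1 - tanh(Z1)^2
    dW1 = (dZ1 @ X.T) / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m

    # Gradient descent update
    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1
    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2
```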
(C1W3L10) Backpropagation Intuition (optional)
- Intuitive explanation of the vectorized implementation of backpropagation, starting from logistic regression
- "The most mathematically difficult part of the neural network"
(C1W3L11) Random Initialization
- For logistic regression, it is OK to initialize the weights to 0
- In a neural network, initializing the weights $W$ to 0 does not work
- If all elements of $W^{[1]}$ are 0 and all elements of $b^{[1]}$ are 0, every hidden unit performs exactly the same computation regardless of how many hidden units there are. In that case there is no point in having multiple units; the network behaves the same as if it had only one unit.
- Initialization method:
W^{[1]} = \textrm{np.random.randn(2, 2)} \ast 0.01 \\
b^{[1]} = \textrm{np.zeros((2, 1))}
- $b^{[1]}$ can be 0, because the symmetry is already broken once $W^{[1]}$ is initialized randomly
- The initial values of $W$ should be small. If $W$ is large, $Z = Wx + b$ becomes large, and where sigmoid and $\tanh$ take large input values their slope is small, so gradient descent learns slowly.
- For a shallow neural network such as one with a single hidden layer, 0.01 is fine; for deep neural networks, a value other than 0.01 may work better
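A sketch that generalizes this initialization to arbitrary layer sizes; the helper name `initialize_parameters` and the `scale` argument are my own, for illustration:

```python
import numpy as np

def initialize_parameters(n_x, n_h, n_y, scale=0.01):
    """Random initialization for a 2-layer network; the 0.01 scale follows the lecture."""
    W1 = np.random.randn(n_h, n_x) * scale   # small random values break symmetry
    b1 = np.zeros((n_h, 1))                  # biases may start at 0
    W2 = np.random.randn(n_y, n_h) * scale
    b2 = np.zeros((n_y, 1))
    return W1, b1, W2, b2
```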
- Deep Learning Specialization (Coursera) self-study record (table of contents)