Deep Learning Specialization (Coursera) Self-study record (C1W3)

Introduction

This is my study record for Course 1, Week 3 (C1W3) of the Deep Learning Specialization.

(C1W3L01) Neural Network Overview

Contents

- Week 3 explains the implementation of a neural network
- First layer of the neural network
  - Parameters: $W^{[1]}$, $b^{[1]}$
  - $z^{[1]} = W^{[1]} x + b^{[1]}$
  - $a^{[1]} = \sigma(z^{[1]})$
- Second layer of the neural network
  - $z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}$
  - $a^{[2]} = \sigma(z^{[2]})$
- Compute the loss $L(a^{[2]}, y)$
- Back propagation: $da^{[2]} \rightarrow dz^{[2]} \rightarrow dW^{[2]}, db^{[2]} \rightarrow \cdots$

(C1W3L02) Neural Network Representation

Contents

- Explanation of a network with a single hidden layer (= a 2-layer neural network; when counting layers, the input layer is not counted, only the hidden layer and the output layer are)

(C1W3L03) Computing a Neural Network Output

Contents

- How to compute the output of a neural network
- $a_i^{[l]}$: the $i$-th node of layer $l$
- Vectorize the computation:

z^{[1]} = W^{[1]} x + b^{[1]} \\
a^{[1]} = \sigma(z^{[1]}) \\
z^{[2]} = W^{[2]} a^{[1]} + b^{[2]} \\
a^{[2]} = \sigma(z^{[2]}) \\
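As a shape check, here is a minimal NumPy sketch of this two-layer forward pass for a single example (the layer sizes and the `sigmoid` helper are my own assumptions, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# assumed layer sizes: 3 input features, 4 hidden units, 1 output unit
n_x, n_1, n_2 = 3, 4, 1

x = np.random.randn(n_x, 1)             # single example, shape (n_x, 1)
W1 = np.random.randn(n_1, n_x) * 0.01   # (n^[1], n^[0])
b1 = np.zeros((n_1, 1))
W2 = np.random.randn(n_2, n_1) * 0.01   # (n^[2], n^[1])
b2 = np.zeros((n_2, 1))

z1 = W1 @ x + b1        # (n^[1], 1)
a1 = sigmoid(z1)
z2 = W2 @ a1 + b2       # (n^[2], 1)
a2 = sigmoid(z2)        # prediction y-hat
```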

(C1W3L04) Vectorizing Across Multiple Examples

Contents

- How to compute over multiple training examples
- $X = \left[ x^{(1)} \, x^{(2)} \, \cdots \, x^{(m)} \right]$ (an $(n_x, m)$ matrix, where $m$ is the number of training examples)

Z^{[1]} = W^{[1]} X + b^{[1]} \\
A^{[1]} = \sigma\left(Z^{[1]}\right) \\
Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]} \\
A^{[2]} = \sigma\left(Z^{[2]}\right)

- In $Z^{[1]}$ and $A^{[1]}$, the rows correspond to the hidden units and the columns to the $m$ training examples

Z^{[1]} = \left[ z^{[1](1)} \, z^{[1](2)} \, \cdots \, z^{[1](m)} \right] \\
A^{[1]} = \left[ a^{[1](1)} \, a^{[1](2)} \, \cdots \, a^{[1](m)} \right]
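A minimal sketch of the vectorized forward pass, stacking the $m$ examples as columns of $X$ (the layer sizes and the `sigmoid` helper are assumptions for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

n_x, n_1, n_2, m = 3, 4, 1, 5           # assumed sizes: features, hidden, output, examples

X = np.random.randn(n_x, m)             # columns are the m training examples
W1 = np.random.randn(n_1, n_x) * 0.01
b1 = np.zeros((n_1, 1))
W2 = np.random.randn(n_2, n_1) * 0.01
b2 = np.zeros((n_2, 1))

Z1 = W1 @ X + b1        # (n^[1], m); b1 is broadcast across the m columns
A1 = sigmoid(Z1)
Z2 = W2 @ A1 + b2       # (n^[2], m)
A2 = sigmoid(Z2)        # one prediction per column/example
```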

Impressions

- The lecture explains this very slowly and carefully. This part is important: if you stumble here, you will be in great trouble later.

(C1W3L05) Explanation For Vectorized Implementation

Contents

X = \left[x^{(1)} \, x^{(2)} \, \cdots x^{(m)}\right] \\
Z^{[1]} = \left[ z^{[1](1)} \, z^{[1](2)} \, \cdots \, z^{[1](m)} \right] \\
Z^{[1]} = W^{[1]} X + b^{[1]}

- $b^{[1]}$ is expanded to a matrix via Python (NumPy) broadcasting
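A quick sketch of the broadcasting behavior being referred to (the shapes here are chosen arbitrarily for illustration):

```python
import numpy as np

W1_X = np.zeros((4, 5))            # stand-in for W^[1] X, shape (n^[1], m)
b1 = np.arange(4).reshape(4, 1)    # (n^[1], 1) column vector

Z1 = W1_X + b1                     # b1 is implicitly tiled across the 5 columns
print(Z1.shape)                    # (4, 5)
```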

(C1W3L06) Activation functions

Contents

- Sigmoid function
  - $a = \frac{1}{1+e^{-z}}$
  - Now used essentially only for binary classification (the output layer)

- tanh function
  - $a = \tanh z = \frac{e^z - e^{-z}}{e^z + e^{-z}}$
  - Better than the sigmoid function, because its mean activation is zero
  - However, a drawback common to the sigmoid and tanh functions is that the slope approaches 0 where $|z|$ is large, which slows the convergence of gradient descent

- ReLU function
  - $a = \max(0, z)$
  - The derivative is not defined at $z = 0$, but this is not a problem in practice because $z$ is essentially never exactly 0 during computation
  - ReLU is the default choice of activation in a neural network (tanh is sometimes used)
  - The disadvantage is that the slope becomes 0 for $z \lt 0$
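A minimal NumPy sketch of the activation functions mentioned above (the function names are my own):

```python
import numpy as np

def sigmoid(z):
    """a = 1 / (1 + e^{-z})"""
    return 1 / (1 + np.exp(-z))

def tanh(z):
    """a = (e^z - e^{-z}) / (e^z + e^{-z}); zero-centered, unlike sigmoid"""
    return np.tanh(z)

def relu(z):
    """a = max(0, z); slope is 0 for z < 0 and 1 for z > 0"""
    return np.maximum(0, z)
```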

(C1W3L07) Why do you need non-linear activation functions?

Contents

- Why use a non-linear function as the activation function? → If the activation is linear, then no matter how many hidden layers you add, the whole network still computes only a linear function of the input, so the hidden layers are pointless.
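A small numerical sketch of this point, assuming identity (linear) activations: composing two linear layers collapses into a single linear map (all matrices here are arbitrary random examples):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((1, 4)), rng.standard_normal((1, 1))

x = rng.standard_normal((3, 1))

# two "layers" with linear (identity) activation
a2 = W2 @ (W1 @ x + b1) + b2

# equivalent single linear layer: W' = W2 W1, b' = W2 b1 + b2
W_prime, b_prime = W2 @ W1, W2 @ b1 + b2
print(np.allclose(a2, W_prime @ x + b_prime))  # True
```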

(C1W3L08) Derivatives of activation functions

Contents

- Sigmoid function:

g(z) = \frac{1}{1+e^{-z}} \\
g^\prime(z) = g(z) \left( 1-g(z) \right)

- tanh function:

g(z) = \tanh (z) \\
g^\prime(z) = 1 - \left( \tanh(z) \right)^2

- ReLU function:

g(z) = \max\left(0, z\right) \\
g^\prime(z) = 0 \ (\text{if}\ z \lt 0) \\
g^\prime(z) = 1 \ (\text{if}\ z \ge 0)

- Leaky ReLU function:

g(z) = \max\left(0.01z, z\right) \\
g^\prime(z) = 0.01 \ (\text{if}\ z \lt 0) \\
g^\prime(z) = 1 \ (\text{if}\ z \ge 0)

- For ReLU and Leaky ReLU, the derivative at $z = 0$ can be taken as 0, 1, or left undefined (the probability that $z$ is exactly 0 during computation is negligible).
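A minimal NumPy sketch of these derivatives (the function names are mine; at $z = 0$ I arbitrarily return the right-hand value, as noted above):

```python
import numpy as np

def dsigmoid(z):
    a = 1 / (1 + np.exp(-z))
    return a * (1 - a)                   # g'(z) = g(z)(1 - g(z))

def dtanh(z):
    return 1 - np.tanh(z) ** 2           # g'(z) = 1 - tanh(z)^2

def drelu(z):
    return np.where(z < 0, 0.0, 1.0)     # 0 for z < 0, 1 for z >= 0

def dleaky_relu(z):
    return np.where(z < 0, 0.01, 1.0)    # 0.01 for z < 0, 1 for z >= 0
```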

(C1W3L09) Gradient descent for neural networks

Contents

- Layer sizes: $n^{[0]} = n_x$, $n^{[1]}$, $n^{[2]} (= 1)$
- Parameters: $W^{[1]}$ (an $(n^{[1]}, n^{[0]})$ matrix), $b^{[1]}$ (an $(n^{[1]}, 1)$ matrix), $W^{[2]}$ (an $(n^{[2]}, n^{[1]})$ matrix), $b^{[2]}$ (an $(n^{[2]}, 1)$ matrix)
- Forward propagation:

Z^{[1]} = W^{[1]} X + b^{[1]} \\
A^{[1]} = g^{[1]}\left( Z^{[1]} \right) \\
Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]} \\
A^{[2]} = g^{[2]}\left( Z^{[2]} \right) = \sigma \left( Z^{[2]} \right) 

- Back propagation:

dZ^{[2]} = A^{[2]} - Y \ \ \left( Y = \left[ y^{(1)} \, y^{(2)} \, \cdots y^{(m)} \right] \right) \\
dW^{[2]} = \frac{1}{m} dZ^{[2]} A^{[1]\textrm{T}}\\
db^{[2]} = \frac{1}{m} \textrm{np.sum} \left( dZ^{[2]} \textrm{, axis=1, keepdims=True} \right)\\

dZ^{[1]} = W^{[2]\textrm{T}}dZ^{[2]} \ast g^{[1]\prime} \left(Z^{[1]}\right) \\
dW^{[1]} = \frac{1}{m}dZ^{[1]} X^{\text{T}} \\
db^{[1]} = \frac{1}{m} \textrm{np.sum} \left( dZ^{[1]} \textrm{, axis=1, keepdims=True} \right)\\

- Without `keepdims=True`, `np.sum` returns an $(n^{[i]},)$ vector; with `keepdims=True` it returns an $(n^{[i]}, 1)$ vector
- If you do not add `keepdims=True`, apply `reshape` afterwards
- $\ast$ in the expression for $dZ^{[1]}$ denotes the element-wise product
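Putting the forward and backward passes together, here is a minimal NumPy sketch of one gradient descent step for this 2-layer network (the layer sizes, learning rate, dummy data, and the use of tanh for the hidden layer are assumptions for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

n_x, n_1, n_2, m = 3, 4, 1, 200
alpha = 0.01                                     # assumed learning rate

X = np.random.randn(n_x, m)
Y = (np.random.rand(1, m) > 0.5).astype(float)   # dummy binary labels

W1 = np.random.randn(n_1, n_x) * 0.01
b1 = np.zeros((n_1, 1))
W2 = np.random.randn(n_2, n_1) * 0.01
b2 = np.zeros((n_2, 1))

# forward propagation (tanh hidden layer, sigmoid output)
Z1 = W1 @ X + b1
A1 = np.tanh(Z1)
Z2 = W2 @ A1 + b2
A2 = sigmoid(Z2)

# back propagation
dZ2 = A2 - Y
dW2 = (1 / m) * dZ2 @ A1.T
db2 = (1 / m) * np.sum(dZ2, axis=1, keepdims=True)
dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)               # g^[1]'(Z1) = 1 - tanh(Z1)^2
dW1 = (1 / m) * dZ1 @ X.T
db1 = (1 / m) * np.sum(dZ1, axis=1, keepdims=True)

# gradient descent update
W1 -= alpha * dW1; b1 -= alpha * db1
W2 -= alpha * dW2; b2 -= alpha * db2
```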

Impressions

- Tips about `np.sum` are casually woven in (it is important to stay aware of the array dimensions).

(C1W3L10) Backpropagation Intuition (optional)

Contents

- Intuitive explanation of the vectorized implementation of backpropagation, building on logistic regression
- "The most mathematically difficult part of neural networks"

(C1W3L11) Random Initialization

Contents

- For logistic regression, it is fine to initialize the weights to 0
- In a neural network, initializing the weights $W$ to 0 is not OK
- If all elements of $W^{[1]}$ and all elements of $b^{[1]}$ are 0, every hidden unit performs exactly the same computation, no matter how many units the hidden layer has. In that case there is no point in having multiple units; it is the same as having just one
- Initialization method:

W^{[1]} = \textrm{np.random.randn(2, 2)} \ast 0.01 \\
b^{[1]} = \textrm{np.zeros((2, 1))}

- $b^{[1]}$ can be initialized to 0, because the symmetry is already broken once $W^{[1]}$ is initialized randomly
- The initial values of $W$ should be small. If $W$ is large, $Z = Wx + b$ becomes large, and where the input to sigmoid or $\tanh$ is large the slope is small, so gradient descent learns slowly
- For a shallow neural network such as one with a single hidden layer, a factor of 0.01 is fine; for deep neural networks, a value other than 0.01 may be better
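A small sketch of this initialization for a hidden layer with 4 units and 3 inputs (the sizes here are my own; the lecture uses a (2, 2) example):

```python
import numpy as np

n_x, n_1, n_2 = 3, 4, 1                  # assumed layer sizes

# small random weights break symmetry; biases can start at zero
W1 = np.random.randn(n_1, n_x) * 0.01
b1 = np.zeros((n_1, 1))
W2 = np.random.randn(n_2, n_1) * 0.01
b2 = np.zeros((n_2, 1))
```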

Reference

- Deep Learning Specialization (Coursera) Self-study record (table of contents)
