Deep Learning Specialization (Coursera) Self-study record (C2W2)

Introduction

These are my notes on Course 2, Week 2 (C2W2) of the Deep Learning Specialization.

(C2W2L01) Mini-batch gradient descent

Contents

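- When the training set is large (e.g. $ m = 5000000 $), split it into mini-batches and run forward propagation and back propagation on each mini-batch in turn.
- Because the parameters are updated after every mini-batch instead of once per pass over the whole training set, mini-batch gradient descent makes progress faster.
- For example, with 1000 examples per mini-batch: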

X^{\{1\}} = \left[ x^{(1)} \, x^{(2)} \, \cdots \, x^{(1000)}\right] \\
Y^{\{1\}} = \left[ y^{(1)} \, y^{(2)} \, \cdots \, y^{(1000)}\right] \\
X^{\{2\}} = \left[ x^{(1001)} \, x^{(1002)} \, \cdots \, x^{(2000)}\right] \\
Y^{\{2\}} = \left[ y^{(1001)} \, y^{(1002)} \, \cdots \, y^{(2000)}\right]
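
As a rough sketch of the split above (my own illustration, not code from the course), the NumPy snippet below cuts `X` and `Y` column-wise into mini-batches. The function name `make_mini_batches`, the `shuffle` flag, and the toy data sizes are assumptions made for this example; with `shuffle=False` it reproduces the in-order partition written above.

```python
import numpy as np

def make_mini_batches(X, Y, batch_size=1000, shuffle=True, seed=0):
    """Split X (n_x, m) and Y (n_y, m) column-wise into mini-batches.

    Each column is one training example. The last mini-batch may be
    smaller when m is not a multiple of batch_size.
    """
    m = X.shape[1]
    if shuffle:
        # Apply the same permutation to X and Y so pairs stay aligned.
        perm = np.random.default_rng(seed).permutation(m)
        X, Y = X[:, perm], Y[:, perm]
    return [(X[:, s:s + batch_size], Y[:, s:s + batch_size])
            for s in range(0, m, batch_size)]

# Toy stand-in for the m = 5,000,000 case (hypothetical data).
X = np.arange(3 * 2500, dtype=float).reshape(3, 2500)
Y = np.arange(2500, dtype=float).reshape(1, 2500)

mini_batches = make_mini_batches(X, Y, batch_size=1000)
print(len(mini_batches), [xb.shape[1] for xb, _ in mini_batches])  # 3 [1000, 1000, 500]
```

One gradient-descent update is then performed per `(X_batch, Y_batch)` pair, and looping over the whole list once is one pass through the training set.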

(C2W2L02) Understanding Mini-batch Gradient Descent

Contents

- With mini-batch gradient descent, the cost $ J^{\{t\}} $ computed on mini-batch $ t $ oscillates from one iteration to the next while trending downward.
- The mini-batch size should be large enough to benefit from the efficiency of vectorized computation.
  - Typical sizes are $ 2^6 $, $ 2^7 $, $ 2^8 $, $ 2^9 $, etc. (powers of 2, to use memory efficiently).
- Training is inefficient if a mini-batch does not fit in CPU / GPU memory.
- Try several powers of 2 to find a size that computes efficiently.

(C2W2L03) Exponentially Weighted Average

Contents

- Exponentially weighted (moving) average
- Transform the original data ($ \theta_1 $, $ \theta_2 $, $ \cdots $) as follows:

V_0 = 0 \\
V_t = \beta V_{t-1} + \left( 1-\beta \right) \theta_t

- $ V_t $ can be regarded as an average over approximately the last $ \frac{1}{1-\beta} $ data points.
- If $ \beta $ is large, the curve is smooth because more data points are averaged.
- If $ \beta $ is small, the curve is noisy and sensitive to outliers.
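
To see where the $ \frac{1}{1-\beta} $ rule of thumb comes from, the recursion can be unrolled. This is a quick derivation; $ \beta = 0.9 $ and $ t = 100 $ are chosen only as an example.

V_t = \left( 1-\beta \right) \sum_{k=0}^{t-1} \beta^{k} \theta_{t-k} \\
V_{100} = 0.1 \, \theta_{100} + 0.1 \cdot 0.9 \, \theta_{99} + 0.1 \cdot 0.9^2 \, \theta_{98} + \cdots

The weights decay geometrically, and $ \beta^{\frac{1}{1-\beta}} \approx \frac{1}{e} \approx 0.37 $ (for example $ 0.9^{10} \approx 0.35 $), so data older than about $ \frac{1}{1-\beta} $ steps contributes little to $ V_t $.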

(C2W2L04) Understanding exponentially weighted average

Contents

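- How to implement an exponentially weighted average: keep a single running value $ V $ and overwrite it with $ V := \beta V + \left( 1-\beta \right) \theta_t $ for each new data point, so only one number has to be kept in memory.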
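
A minimal sketch of such an implementation in Python is shown below. The function name and the synthetic noisy data are assumptions made for this example, not something from the lecture.

```python
import numpy as np

def exponentially_weighted_average(theta, beta=0.9):
    """Return V_1..V_T for V_t = beta * V_{t-1} + (1 - beta) * theta_t, with V_0 = 0."""
    v = 0.0
    out = []
    for x in theta:
        # Only the single running value v is carried between steps.
        v = beta * v + (1.0 - beta) * x
        out.append(v)
    return np.array(out)

# Hypothetical noisy data fluctuating around 20.
rng = np.random.default_rng(0)
theta = 20.0 + rng.normal(scale=3.0, size=200)

v_smooth = exponentially_weighted_average(theta, beta=0.98)  # averages ~50 points
v_noisy = exponentially_weighted_average(theta, beta=0.5)    # averages ~2 points
```

Comparing `v_smooth` and `v_noisy` shows the trade-off in the notes: the larger $ \beta $ gives a smoother curve, the smaller $ \beta $ follows the noise more closely.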

(C2W2L05) Bias correction in exponentially weighted average

Contents

- In an exponentially weighted average, $ V_t $ is very small during the initial stage because $ V_0 = 0 $. For example, with $ \beta = 0.98 $:

V_0 = 0 \\
V_1 = 0.98 V_0 + 0.02 \theta_1 = 0.02 \theta_1 \\
V_2 = 0.98 V_1 + 0.02 \theta_2 = 0.0196\theta_1 + 0.02\theta_2

- Therefore, correct $ V_t $ by using $ \frac{V_t}{1-\beta^t} $ instead. When $ t $ becomes large, $ \beta^t \sim 0 $ and the correction has almost no effect.
- In many cases bias correction is not implemented; the values from the initial stage are simply skipped and only the later ones are used.
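
The effect of the correction can be checked with a small sketch like the one below. The function name and the constant input of 10 are assumptions chosen so the expected behaviour is easy to see: the corrected value should sit at 10 from the first step, while the raw value starts near 0.2.

```python
def ewa_with_bias_correction(theta, beta=0.98):
    """Return (raw, corrected) lists, where corrected[t-1] = V_t / (1 - beta**t)."""
    v = 0.0
    raw, corrected = [], []
    for t, x in enumerate(theta, start=1):  # t starts at 1, as in the formula
        v = beta * v + (1.0 - beta) * x
        raw.append(v)
        corrected.append(v / (1.0 - beta ** t))
    return raw, corrected

# Constant data: the "true" average is 10 at every step.
raw, corrected = ewa_with_bias_correction([10.0] * 200, beta=0.98)
print(round(raw[0], 4), round(corrected[0], 4))    # raw starts far below 10
print(round(raw[-1], 4), round(corrected[-1], 4))  # the gap has mostly closed by t = 200
```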

(C2W2L06) Gradient descent with momentum

Contents

- In iteration $ t $:
  - Compute $ dW $ and $ db $ on the current mini-batch, then update as follows.

V_{dW} = \beta V_{dW} + \left( 1-\beta \right) dW \\
V_{db} = \beta V_{db} + \left( 1-\beta \right) db \\
W := W - \alpha V_{dW} \\
b := b - \alpha V_{db}

- Momentum smooths out the oscillations of gradient descent.
- $ \beta $ and $ \alpha $ are hyperparameters; $ \beta = 0.9 $ works well in practice.
- Bias correction is rarely used. With $ \beta = 0.9 $ the moving average has warmed up after about 10 iterations ($ \approx \frac{1}{1-\beta} $), so the initial bias matters little.
- Some references use $ V_{dW} = \beta V_{dW} + dW $ instead. With the same $ \alpha $, this is equivalent to scaling the learning rate by $ \frac{1}{1-\beta} $, so $ \alpha $ has to be retuned accordingly.
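
The update can be sketched in NumPy as follows. This is only an illustration: the function name, the elongated quadratic cost, and the value `alpha=0.02` are assumptions for this example. The same function can be applied to `b` with `db`.

```python
import numpy as np

def momentum_step(W, dW, v_dW, alpha, beta=0.9):
    """One gradient-descent-with-momentum update; returns the new (W, v_dW)."""
    v_dW = beta * v_dW + (1.0 - beta) * dW  # moving average of the gradients
    W = W - alpha * v_dW
    return W, v_dW

# Hypothetical elongated bowl: J(W) = 0.5 * sum(curvature * W**2), minimum at 0.
curvature = np.array([1.0, 50.0])
W = np.array([1.0, 1.0])
v_dW = np.zeros_like(W)

for _ in range(500):
    dW = curvature * W  # gradient of J at the current W
    W, v_dW = momentum_step(W, dW, v_dW, alpha=0.02)

print(np.linalg.norm(W))  # distance from the minimum after 500 updates
```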

(C2W2L07) RMSProp

Contents

S_{dW} = \beta S_{dW} + \left( 1-\beta \right) dW^2 \ (\textrm{Element by element}) \\
S_{db} = \beta S_{db} + \left( 1-\beta \right) db^2 \ (\textrm{Element by element}) \\
W := W -\alpha \frac{dW}{\sqrt{S_{dW}} + \epsilon} \\
b := b -\alpha \frac{db}{\sqrt{S_{db}} + \epsilon} \\

- Add $ \epsilon = 10^{-8} $ to the denominator so that it never becomes 0.
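
A minimal NumPy sketch of one RMSProp update is given below. The function name is my own, and the default `beta=0.9` is an assumption, since these notes do not give a value for it. The example shows the main effect: two gradient components that differ by five orders of magnitude end up with steps of the same size.

```python
import numpy as np

def rmsprop_step(W, dW, s_dW, alpha, beta=0.9, eps=1e-8):
    """One RMSProp update; returns the new (W, s_dW)."""
    s_dW = beta * s_dW + (1.0 - beta) * dW ** 2  # element-wise square
    W = W - alpha * dW / (np.sqrt(s_dW) + eps)   # eps keeps the denominator away from 0
    return W, s_dW

W0 = np.array([1.0, 1.0])
dW = np.array([1e-3, 1e2])  # hypothetical gradient: one tiny and one huge component
W1, s_dW = rmsprop_step(W0, dW, np.zeros_like(W0), alpha=0.01)

step = W0 - W1
print(step)  # both components are close to alpha / sqrt(1 - beta)
```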

(C2W2L08) Adam optimization algorithm

Contents

V_{dW} = \beta_1 V_{dW} + \left( 1-\beta_1 \right) dW \\
V_{db} = \beta_1 V_{db} + \left( 1-\beta_1 \right) db \\
S_{dW} = \beta_2 S_{dW} + \left( 1-\beta_2 \right) dW^2  \\
S_{db} = \beta_2 S_{db} + \left( 1-\beta_2 \right) db^2  \\
V^{corrected}_{dW} = \frac{V_{dW}}{1-\beta_1^t} \\
V^{corrected}_{db} = \frac{V_{db}}{1-\beta_1^t} \\
S^{corrected}_{dW} = \frac{S_{dW}}{1-\beta_2^t} \\
S^{corrected}_{db} = \frac{S_{db}}{1-\beta_2^t} \\
W := W -\alpha \frac{V^{corrected}_{dW}}{\sqrt{S^{corrected}_{dW}}+\epsilon} \\
b := b -\alpha \frac{V^{corrected}_{db}}{\sqrt{S^{corrected}_{db}}+\epsilon}

- Hyperparameters
  - $ \alpha $: needs to be tuned
  - $ \beta_1 $: 0.9
  - $ \beta_2 $: 0.999
  - $ \epsilon $: $ 10^{-8} $ (has little effect, so $ 10^{-8} $ is usually used as is)
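
Putting the equations together, one Adam update might look like the sketch below. The function name, the quadratic toy cost, and `alpha=0.05` are assumptions for this example; `t` is the iteration count starting from 1, and the same function can be reused for `b`.

```python
import numpy as np

def adam_step(W, dW, v_dW, s_dW, t, alpha, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at iteration t (t >= 1); returns the new (W, v_dW, s_dW)."""
    v_dW = beta1 * v_dW + (1.0 - beta1) * dW       # momentum term
    s_dW = beta2 * s_dW + (1.0 - beta2) * dW ** 2  # RMSProp term (element-wise)
    v_hat = v_dW / (1.0 - beta1 ** t)              # bias correction
    s_hat = s_dW / (1.0 - beta2 ** t)
    W = W - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return W, v_dW, s_dW

# Hypothetical cost J(W) = sum((W - target)**2), with gradient 2 * (W - target).
target = np.array([1.0, 2.0])
W = np.array([5.0, -3.0])
v_dW, s_dW = np.zeros_like(W), np.zeros_like(W)
loss_initial = np.sum((W - target) ** 2)

for t in range(1, 1001):
    dW = 2.0 * (W - target)
    W, v_dW, s_dW = adam_step(W, dW, v_dW, s_dW, t, alpha=0.05)

loss_final = np.sum((W - target) ** 2)
print(loss_initial, loss_final)
```

Thanks to the bias correction, the very first step has magnitude close to `alpha` in every component, regardless of the scale of the gradient.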

(C2W2L09) Learning rate decay

Contents

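- With mini-batch gradient descent and a constant $ \alpha $, the parameters keep wandering around the minimum without settling, because the gradient is noisy from one mini-batch to the next. If $ \alpha $ is reduced gradually, they settle into a tighter region near the minimum.
- Epoch: one pass through the data (when the data is divided into mini-batches, processing every mini-batch once is one epoch). The learning rate can be decayed per epoch as follows: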

\alpha = \frac{1}{1 + \textrm{decay_rate} \ast \textrm{epoch_num}} \alpha_0

- Other methods include the following ($ k $ is a constant):

\alpha = 0.95^{\textrm{epoch_num}} \alpha_0\\
\alpha = \frac{k}{\sqrt{\textrm{epoch_num}}} \alpha_0
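
The three schedules above can be written as small Python functions, as in the sketch below. The function names and the values $ \alpha_0 = 0.2 $, `decay_rate = 1` are assumptions picked for illustration. The square-root schedule assumes `epoch_num >= 1`, since it divides by zero at epoch 0.

```python
import math

def inverse_decay(alpha0, epoch_num, decay_rate=1.0):
    """alpha = alpha0 / (1 + decay_rate * epoch_num)"""
    return alpha0 / (1.0 + decay_rate * epoch_num)

def exponential_decay(alpha0, epoch_num, base=0.95):
    """alpha = base**epoch_num * alpha0"""
    return (base ** epoch_num) * alpha0

def sqrt_decay(alpha0, epoch_num, k=1.0):
    """alpha = k / sqrt(epoch_num) * alpha0, defined for epoch_num >= 1"""
    return k / math.sqrt(epoch_num) * alpha0

inverse_schedule = [round(inverse_decay(0.2, n), 4) for n in range(1, 5)]
print(inverse_schedule)  # [0.1, 0.0667, 0.05, 0.04]
```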

Reference

- Deep Learning Specialization (Coursera) Self-study record (table of contents)
