Deep Learning Specialization (Coursera) Self-study record (C2W1)

Introduction

This is a summary of the content of Course 2, Week 1 (C2W1) of the Deep Learning Specialization.

(C2W1L01) Train / Dev / Test sets

Contents

--Applied ML is a highly iterative process. It is important to go through the Idea → Code → Experiment → Idea … loop efficiently.

(C2W1L02) Bias / Variance

Contents

--High bias and high variance can be visualized in 2D, but not in high dimensions. Instead, diagnose them from the train-set and dev-set errors, as in the table below.

train set error   dev set error   diagnosis
1%                11%             high variance
15%               16%             high bias
15%               30%             high bias & high variance
0.5%              1%              low bias & low variance

--This diagnosis assumes that the optimal error (Bayes error), e.g. the error achieved by a human, is roughly 0%.

(C2W1L03) Basic "recipe" for machine learning

Contents

--In case of high bias (check the training set performance):
  - bigger network
  - train longer
  - (NN architecture search; may not be useful)
  - repeat until the high bias is resolved
--In case of high variance (check the dev set performance):
  - more data
  - regularization
  - (NN architecture search; may not be useful)
--The bias-variance trade-off was a problem in early neural networks
--Now that we can get more data and train bigger networks, variance can be reduced without hurting bias

(C2W1L04) Regularization

Contents

J\left(w, b\right) = \frac{1}{m} \sum^{m}_{i=1}L\left(\hat{y}^{(i)}, y^{(i)}\right) + \frac{\lambda}{2m}\|w\|^2_2

--$\lambda$: regularization parameter (one of the hyperparameters)

J\left(w^{[1]}, b^{[1]}, \cdots , w^{[L]}, b^{[L]}\right) = \frac{1}{m} \sum^{m}_{i=1} L\left(\hat{y}^{(i)}, y^{(i)}\right) + \frac{\lambda}{2m} \sum^{L}_{l=1} \|w^{[l]}\|^2\\
\|w^{[l]}\|^2 = \sum^{n^{[l-1]}}_{i=1}\sum^{n^{[l]}}_{j=1}\left(w_{ij}^{[l]}\right)^2

--The dimension of $w^{[l]}$ is $(n^{[l-1]}, n^{[l]})$

dw^{[l]} = \left( \textrm{from backprop} \right) + \frac{\lambda}{m}w^{[l]} \\
w^{[l]} = w^{[l]} - \alpha dw^{[l]} = \left(1 - \alpha \frac{\lambda}{m} \right)w^{[l]} - \alpha \left( \textrm{from backprop} \right)

--This is called weight decay: because of the regularization term, each update multiplies $w^{[l]}$ by the factor $\left(1 - \alpha \frac{\lambda}{m}\right) < 1$, so the weights shrink. A minimal sketch of the regularized cost and update follows.
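The following is a minimal numpy sketch of the formulas above, not code from the course; the helper and variable names (l2_regularized_cost, update_with_weight_decay, lambd) are my own.

import numpy as np

def l2_regularized_cost(cross_entropy_cost, weights, lambd, m):
    # Add the L2 penalty (lambda / 2m) * sum_l ||W^[l]||^2 to the unregularized cost.
    # weights: list of weight matrices [W1, ..., WL]; lambd: regularization parameter.
    l2_penalty = (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)
    return cross_entropy_cost + l2_penalty

def update_with_weight_decay(W, dW_from_backprop, lambd, m, alpha):
    # One gradient-descent step with L2 regularization:
    # dW = (from backprop) + (lambda / m) * W, then W := W - alpha * dW,
    # which is the same as first shrinking W by the factor (1 - alpha * lambda / m).
    dW = dW_from_backprop + (lambd / m) * W
    return W - alpha * dW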

(C2W1L05) Why Regularization Reduces Overfitting

Contents

--If $\lambda$ is large, $w^{[l]} \approx 0$. The influence of many hidden units is then suppressed and the network effectively becomes simpler, so it moves toward high bias
--A very large $\lambda$ brings the network close to logistic regression
--If $g(z) = \tanh(z)$ and $z$ is small, only the (nearly) linear region of $g(z)$ is used
--When every activation function behaves like a linear function, the network cannot represent complex functions, so it moves toward high bias
--When using gradient descent and checking that $J$ decreases at every iteration, compute $J$ including the second (regularization) term

(C2W1L06) Dropout Regularization

Contents

--Drop out each unit with a certain probability (i.e. remove the unit) and train the resulting smaller neural network
--Example for layer $l = 3$: let keep_prob (= 0.8) be the probability that a unit survives (so each unit is dropped with probability 1 - keep_prob), and let $d3$ be the dropout vector

d3 = \mathrm{np.random.rand(} a3 \mathrm{.shape[0], }\, a3 \mathrm{.shape[1])} < \mathrm{keep\_prob} \\
a3 = \mathrm{np.multiply(} a3, d3 \mathrm{)} \\
a3\  /= \mathrm{keep\_prob} \\
a^{[4]} = W^{[4]} a^{[3]} + b^{[4]}

--Dividing by keep_prob at the end keeps the expected value of $a^{[3]}$ unchanged (inverted dropout); a runnable sketch follows this list
--Change the dropout vector $d3$ at every iteration of gradient descent
--Do not apply dropout when evaluating the test set (dropout at test time only adds noise to the predictions)
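A minimal runnable numpy version of the inverted-dropout step above, as a sketch only; the random a3 array is a stand-in for the real layer-3 activations.

import numpy as np

keep_prob = 0.8                  # probability that a unit is kept
a3 = np.random.randn(5, 10)      # stand-in for layer-3 activations, shape (n3, m)

d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob  # boolean dropout mask
a3 = np.multiply(a3, d3)         # zero out the dropped units
a3 /= keep_prob                  # rescale so the expected value of a3 is unchanged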

(C2W1L07) Understanding dropout

Contents

(C2W1L08) Other Regularization Methods

Contents

(C2W1L09) Normalizing inputs

Contents

--Normalize the input features when their scales differ significantly. Gradient descent then converges much faster.

\mu = \frac{1}{m} \sum^{m}_{i=1} x^{(i)} \\
x := x - \mu \\
\sigma^2 = \frac{1}{m} \sum^{m}_{i=1} \left( x^{(i)} \right)^{2} \quad \textrm{(element-wise square, after subtracting } \mu \textrm{)} \\
x \ /= \sigma

--When normalizing the dev (and test) set, use the $\mu$ and $\sigma$ computed on the train set. A sketch follows.
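A minimal numpy sketch of this procedure (my own helper names, assuming the course's (n_features, m) data layout); note that it divides by the standard deviation $\sigma$.

import numpy as np

def fit_normalization(X_train):
    # Compute per-feature mean and standard deviation on the train set only.
    # X_train has shape (n_features, m).
    mu = np.mean(X_train, axis=1, keepdims=True)
    sigma = np.std(X_train, axis=1, keepdims=True)
    return mu, sigma

def apply_normalization(X, mu, sigma):
    # Normalize any split (train, dev, test) with the train-set mu and sigma.
    return (X - mu) / sigma

X_train = np.random.randn(3, 1000) * 5 + 2   # toy data with mean 2, std 5
mu, sigma = fit_normalization(X_train)
X_train_norm = apply_normalization(X_train, mu, sigma)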

(C2W1L10) Vanishing / exploding gradients

Contents

--When training a very deep neural network, the derivatives can become exponentially small or exponentially large. When they become small in particular, gradient descent takes a long time; a toy illustration follows
--Modern neural networks can be on the order of 150 layers deep
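A toy numpy illustration of the effect (my own example, not from the lecture): with linear activations and weight matrices slightly above or below identity scale, the activations, and hence the gradients, grow or shrink exponentially with depth.

import numpy as np

np.random.seed(0)
x = np.random.randn(4, 1)                 # toy input
for scale, label in [(1.5, "exploding"), (0.5, "vanishing")]:
    W = scale * np.eye(4)                 # same hypothetical weight matrix at every layer
    a = x
    for _ in range(50):                   # 50 layers deep
        a = W @ a
    print(label, np.linalg.norm(a))       # roughly scale**50 times the input norm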

(C2W1L11) Weight initialization for deep networks

Contents

--The more input features there are, the larger $z = wx + b$ tends to become. Therefore, when a layer has many inputs, initialize $w$ with smaller values.

W^{[l]} = \mathrm{np.random.randn} \left( \cdots \right) \ast \mathrm{np.sqrt} \left( \frac{2}{n^{[l-1]}} \right)

--For ReLU, a variance of $\frac{2}{n^{[l-1]}}$ (the factor $\sqrt{\frac{2}{n^{[l-1]}}}$ above) works well
--For $\tanh$, use $\sqrt{\frac{1}{n^{[l-1]}}}$ instead (Xavier initialization); a sketch of both follows
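A minimal numpy sketch of both initializations (my own helper, assuming $W^{[l]}$ has shape $(n^{[l]}, n^{[l-1]})$ as in the earlier courses).

import numpy as np

def initialize_weights(layer_dims, activation="relu"):
    # layer_dims = [n0, n1, ..., nL]; scale each W[l] by sqrt(2/n^[l-1]) for ReLU
    # or sqrt(1/n^[l-1]) for tanh (Xavier initialization).
    params = {}
    for l in range(1, len(layer_dims)):
        fan_in = layer_dims[l - 1]
        scale = np.sqrt((2.0 if activation == "relu" else 1.0) / fan_in)
        params["W" + str(l)] = np.random.randn(layer_dims[l], fan_in) * scale
        params["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return params

params = initialize_weights([3, 5, 5, 1], activation="relu")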

(C2W1L12) Numerical Approximation of Gradients

Contents

--With $\epsilon$ a small number, the two-sided approximation of the derivative is $\frac{f(\theta + \epsilon) - f(\theta - \epsilon)}{2\epsilon}$
--The order of the error is $O(\epsilon^2)$; a small sketch follows
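A small Python sketch of the two-sided difference (my own example function, $f(\theta) = \theta^3$).

def numerical_derivative(f, theta, epsilon=1e-7):
    # Two-sided difference approximation with O(epsilon^2) error.
    return (f(theta + epsilon) - f(theta - epsilon)) / (2 * epsilon)

# f(theta) = theta**3, so the true derivative at theta = 1 is 3
print(numerical_derivative(lambda t: t ** 3, 1.0))  # approximately 3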

(C2W1L13) Gradient checking

Contents

d\theta_{approx}^{[i]} = \frac{J(\theta_1, \cdots, \theta_i+\epsilon, \cdots) - J(\theta_1, \cdots, \theta_i-\epsilon, \cdots)}{2\epsilon} \sim d\theta^{[i]}

--Check the following ratio (with $\epsilon = 10^{-7}$):

\frac{\|d\theta_{approx} - d\theta\|_2}{\|d\theta_{approx}\|_2 + \|d\theta\|_2}
value      judgment
10^{-7}    great!
10^{-5}    may be OK, but check
10^{-3}    possible bug

--If it looks like a bug, look for the specific components $i$ where the difference between $d\theta_{approx}$ and $d\theta$ is large. A sketch of the whole check follows.
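A minimal numpy sketch of the check (my own helper names; J is assumed to take the flattened parameter vector $\theta$).

import numpy as np

def gradient_check(J, theta, dtheta, epsilon=1e-7):
    # Compare backprop gradients dtheta with two-sided numerical approximations
    # and return the relative difference defined above.
    dtheta_approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus, theta_minus = theta.copy(), theta.copy()
        theta_plus[i] += epsilon
        theta_minus[i] -= epsilon
        dtheta_approx[i] = (J(theta_plus) - J(theta_minus)) / (2 * epsilon)
    return (np.linalg.norm(dtheta_approx - dtheta)
            / (np.linalg.norm(dtheta_approx) + np.linalg.norm(dtheta)))

# Example with J(theta) = sum(theta**2), whose gradient is 2 * theta
theta = np.array([1.0, -2.0, 3.0])
print(gradient_check(lambda t: np.sum(t ** 2), theta, 2 * theta))  # should be well below 1e-7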

(C2W1L14) Gradient Checking Implementation Notes

Contents

--Notes on how to use gradient checking and what to do when $d\theta_{approx}$ and $d\theta$ differ.

Reference

-Deep Learning Specialization (Coursera) Self-study record (table of contents)
