These are notes on Course 2, Week 1 (C2W1) of the Deep Learning Specialization.
(C2W1L01) Train / Dev / Test sets
--Applied ML is a highly iterative process. It is important to cycle efficiently through Idea → Code → Experiment → Idea …
(C2W1L02) Bias / Variance
--High bias and high variance can be visualized with a 2D decision boundary, but not easily in high-dimensional input spaces.
| Train set error | Dev set error | Diagnosis |
|---|---|---|
| 1% | 11% | high variance |
| 15% | 16% | high bias |
| 15% | 30% | high bias & high variance |
| 0.5% | 1% | low bias & low variance |
--This diagnosis assumes that the optimal error (Bayes error), roughly human-level error, is close to 0%.
(C2W1L03) Basic "recipe" for machine learning
--In case of high bias (check training set performance): use a bigger network, train longer, or try a different NN architecture (may or may not help). Repeat until the high bias is resolved.
--In case of high variance (check dev set performance): get more data, use regularization, or try a different NN architecture (may or may not help).
--The bias/variance trade-off was a real constraint in the earlier era of machine learning.
--Now, with bigger networks and more data, we can usually reduce variance without hurting bias, and vice versa.
(C2W1L04) Regularization
J\left(w, b\right) = \frac{1}{m} \sum^{m}_{i=1}L\left(\hat{y}^{(i)}, y^{(i)}\right) + \frac{\lambda}{2m}\|w\|^2_2
--$\lambda$: the regularization parameter (one of the hyperparameters)
--In Python code, use the variable name `lambd` (because `lambda` is a reserved word)
--For an $L$-layer network, the regularized cost is

J\left(w^{[1]}, b^{[1]}, \cdots , w^{[L]}, b^{[L]}\right) = \frac{1}{m} \sum^{m}_{i=1} L\left(\hat{y}^{(i)}, y^{(i)}\right) + \frac{\lambda}{2m} \sum^{L}_{l=1} \|w^{[l]}\|^2_F
\|w^{[l]}\|^2_F = \sum^{n^{[l]}}_{i=1}\sum^{n^{[l-1]}}_{j=1}\left(w_{ij}^{[l]}\right)^2

--This is the Frobenius norm; the dimension of $w^{[l]}$ is $(n^{[l]}, n^{[l-1]})$
dw^{[l]} = \left( \textrm{from backprop} \right) + \frac{\lambda}{m}w^{[l]} \\
w^{[l]} = w^{[l]} - \alpha dw^{[l]} = \left(1 - \alpha \frac{\lambda}{m} \right)w^{[l]} - \alpha \left( \textrm{from backprop} \right)
--This is called "weight decay": the update multiplies $w^{[l]}$ by $\left(1 - \alpha\frac{\lambda}{m}\right) < 1$, so regularization shrinks the weights a little on every step
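As a rough illustration of the cost and update above, here is a minimal NumPy sketch. The names (`params`, `grads`, `lambd`, `AL`) are assumptions for illustration, not the course's assignment code:

```python
import numpy as np

def l2_regularized_cost(AL, Y, params, lambd):
    """Cross-entropy cost plus the L2 (Frobenius) penalty on every W[l]."""
    m = Y.shape[1]
    cross_entropy = -np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)) / m
    l2_term = sum(np.sum(np.square(W)) for name, W in params.items()
                  if name.startswith("W"))
    return cross_entropy + (lambd / (2 * m)) * l2_term

def update_with_weight_decay(params, grads, lambd, alpha, m):
    """Gradient step where each dW[l] from backprop excludes the L2 term;
    the (lambd/m) * W[l] term is added here, which shrinks W[l] every step."""
    for name in params:
        if name.startswith("W"):
            params[name] -= alpha * (grads["d" + name] + (lambd / m) * params[name])
        else:  # biases are typically not regularized
            params[name] -= alpha * grads["d" + name]
    return params
```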
(C2W1L05) Why Regularization Reduces Overfitting
--If $\lambda$ is large, $w^{[l]} \approx 0$ for many units. Their influence is reduced and the network effectively becomes simpler, moving it toward high bias
--With a very large $\lambda$, the network behaves more like logistic regression
--If $g(z) = \tanh(z)$ and $z$ stays small (because $w$ is small), only the nearly linear region of $g(z)$ is used
--When every activation function is effectively linear, the network cannot represent complex functions, so it moves toward high bias
--When plotting $J$ at each iteration of gradient descent to confirm it decreases, compute $J$ including the regularization term
(C2W1L06) Dropout Regularization
--Drop each unit with a certain probability (remove it from the network) and train the resulting thinned network
--Inverted dropout, illustrated for layer $l = 3$: let keep_prob (= 0.8) be the probability that a unit is kept (so a unit is dropped with probability 1 - keep_prob). With dropout mask $d3$:
d3 = \mathrm{np.random.rand(} a3 \mathrm{.shape[0], }\, a3 \mathrm{.shape[1])} < \mathrm{keep\_prob} \\
a3 = \mathrm{np.multiply(} a3, d3 \mathrm{)} \\
a3\ /= \mathrm{keep\_prob} \\
z^{[4]} = W^{[4]} a^{[3]} + b^{[4]}
--Dividing by keep_prob keeps the expected value of $a^{[3]}$ unchanged (this is why it is called inverted dropout)
--Resample the dropout mask $d3$ on every iteration of gradient descent
--Do not apply dropout at test time (it would only add noise to the predictions)
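A minimal NumPy sketch of the equations above (the names `A3`, `W4`, `b4` and the function itself are illustrative, not the assignment code):

```python
import numpy as np

def dropout_layer3_forward(A3, W4, b4, keep_prob=0.8):
    """Inverted dropout on the activations of layer 3 (training time only)."""
    # D3: 0/1 mask with P(keep) = keep_prob, resampled on every iteration
    D3 = (np.random.rand(*A3.shape) < keep_prob).astype(float)
    A3 = A3 * D3            # drop units
    A3 = A3 / keep_prob     # scale up so E[A3] is unchanged (inverted dropout)
    Z4 = np.dot(W4, A3) + b4
    return Z4, D3           # keep D3 so backprop can apply the same mask
```

At test time this function would not be used; the full activations are fed forward with no mask and no scaling.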
(C2W1L07) Understanding dropout
(C2W1L08) Other Regularization Methods
(C2W1L09) Normalizing inputs
--Normalize the input features when their scales differ significantly. Gradient descent then converges much faster
\mu = \frac{1}{m} \sum^{m}_{i=1} x^{(i)} \\
x := x - \mu \\
\sigma^2 = \frac{1}{m} \sum^{m}_{i=1} \left(x^{(i)}\right)^{2} \quad (\textrm{element-wise square, after subtracting } \mu) \\
x \ /= \sigma
--When normalizing the dev/test sets, use the $\mu$ and $\sigma$ computed on the training set
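A minimal sketch of this normalization, assuming `X_train` and `X_dev` have shape (n_features, m):

```python
import numpy as np

def fit_normalizer(X_train):
    """Compute per-feature mean and standard deviation on the training set."""
    mu = np.mean(X_train, axis=1, keepdims=True)
    sigma = np.std(X_train, axis=1, keepdims=True)
    return mu, sigma

def normalize(X, mu, sigma, eps=1e-8):
    """Apply the *training-set* statistics to any split (train, dev, test)."""
    return (X - mu) / (sigma + eps)
```

The same `mu` and `sigma` fitted on the training set are reused for the dev/test sets so that every split goes through the exact same transformation.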
(C2W1L10) Vanishing / exploding gradients
--When training a very deep neural network, gradients can become extremely small or extremely large. When they vanish, gradient descent makes tiny updates and training takes a long time
--Modern networks can be on the order of 150 layers deep
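A small NumPy sketch of the intuition (a toy deep linear network, an assumed setup rather than anything from the course assignments): with weights slightly above or below 1, activations grow or shrink exponentially with depth, and the gradients behave the same way.

```python
import numpy as np

def forward_depth(x, scale, n_layers=50, width=4):
    """Pass x through n_layers linear layers whose weight matrix is scale * I."""
    W = scale * np.eye(width)
    a = x
    for _ in range(n_layers):
        a = W @ a
    return a

x = np.ones((4, 1))
print(np.linalg.norm(forward_depth(x, 1.5)))  # explodes, roughly 1.5**50
print(np.linalg.norm(forward_depth(x, 0.5)))  # vanishes, roughly 0.5**50
```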
(C2W1L11) Weight initialization for deep networks
--The more input features a unit has, the larger $z = w_1 x_1 + \cdots + w_n x_n + b$ tends to become. Therefore, when a layer has many inputs, initialize each weight to a smaller value
W^{[l]} = \mathrm{np.random.randn} \left( \cdots \right) \ast \mathrm{np.sqrt} \left( \frac{2}{n^{[l-1]}} \right)
--For ReLU, scaling by $\sqrt{\frac{2}{n^{[l-1]}}}$ works well (He initialization)
--For $\tanh$, use $\sqrt{\frac{1}{n^{[l-1]}}}$ (Xavier initialization)
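A minimal sketch of this scaled initialization, assuming `layer_dims = [n_x, n_1, ..., n_L]` (the function name and interface are illustrative):

```python
import numpy as np

def initialize_parameters(layer_dims, activation="relu"):
    """He initialization for ReLU layers, Xavier initialization for tanh layers."""
    params = {}
    for l in range(1, len(layer_dims)):
        fan_in = layer_dims[l - 1]                      # n^[l-1]
        scale = np.sqrt(2.0 / fan_in) if activation == "relu" else np.sqrt(1.0 / fan_in)
        params[f"W{l}"] = np.random.randn(layer_dims[l], fan_in) * scale
        params[f"b{l}"] = np.zeros((layer_dims[l], 1))
    return params
```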
(C2W1L12) Numerical Approximation of Gradients
--The two-sided approximation of the derivative is $\frac{f(\theta + \epsilon) - f(\theta - \epsilon)}{2\epsilon}$ for a small $\epsilon$
--Its error is $O(\epsilon^2)$, compared with $O(\epsilon)$ for the one-sided difference
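A tiny illustration of the two-sided difference for $f(\theta) = \theta^3$, whose true derivative at $\theta = 1$ is 3:

```python
def two_sided_derivative(f, theta, eps=1e-7):
    """Two-sided finite-difference approximation of f'(theta)."""
    return (f(theta + eps) - f(theta - eps)) / (2 * eps)

print(two_sided_derivative(lambda t: t**3, 1.0))  # ~3.0, error on the order of eps**2
```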
(C2W1L13) Gradient checking
d\theta_{approx}^{[i]} = \frac{J(\theta_1, \cdots, \theta_i+\epsilon, \cdots) - J(\theta_1, \cdots, \theta_i-\epsilon, \cdots)}{2\epsilon} \sim d\theta^{[i]}
--Check the following ratio (with $\epsilon = 10^{-7}$):
\frac{\|d\theta_{approx} - d\theta\|_2}{\|d\theta_{approx}\|_2 + \|d\theta\|_2}
| Value | Judgement |
|---|---|
| $\approx 10^{-7}$ | Great! |
| $\approx 10^{-5}$ | May be OK, but check |
| $\approx 10^{-3}$ | Possible bug |
--If it looks like a bug, find the specific components $i$ where the difference between $d\theta_{approx}$ and $d\theta$ is large
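A minimal sketch of the check itself, assuming the parameters have been flattened into a vector `theta` and that `cost_fn(theta)` returns $J(\theta)$ (these names are assumptions for illustration):

```python
import numpy as np

def gradient_check(cost_fn, theta, dtheta, eps=1e-7):
    """Compare the analytic gradient dtheta with the two-sided numerical gradient."""
    dtheta_approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus, theta_minus = theta.copy(), theta.copy()
        theta_plus[i] += eps
        theta_minus[i] -= eps
        dtheta_approx[i] = (cost_fn(theta_plus) - cost_fn(theta_minus)) / (2 * eps)
    diff = (np.linalg.norm(dtheta_approx - dtheta)
            / (np.linalg.norm(dtheta_approx) + np.linalg.norm(dtheta)))
    # if diff is large, inspect the indices where |dtheta_approx - dtheta| is biggest
    return diff, dtheta_approx
```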
(C2W1L14) Gradient Checking Implementation Notes
--Practical notes on how to use gradient checking and how to debug when $d\theta_{approx}$ and $d\theta$ differ
-Deep Learning Specialization (Coursera) Self-study record (table of contents)