These are notes on Course 2, Week 1 (C2W1) of the Deep Learning Specialization.
(C2W1L01) Train / Dev / Test sets
--Applied ML is a highly iterative process. It is important to cycle efficiently through Idea → Code → Experiment → Idea …
(C2W1L02) Bias / Variance
--High bias and high variance can be visualized with a 2D decision boundary, but not easily in high-dimensional input spaces.
| Train set error | Dev set error | Diagnosis |
|---|---|---|
| 1% | 11% | high variance |
| 15% | 16% | high bias |
| 15% | 30% | high bias & high variance |
| 0.5% | 1% | low bias & low variance |
--This diagnosis assumes that the optimal error (Bayes error), roughly human-level error, is close to 0%.
(C2W1L03) Basic "recipe" for machine learning
--In case of high bias (check training set performance): use a bigger network, train longer, or try a different NN architecture (may or may not help). Repeat until the high bias is resolved.
--In case of high variance (check dev set performance): get more data, use regularization, or try a different NN architecture (may or may not help).
--The bias/variance trade-off was a real constraint in the earlier era of machine learning.
--Now, with bigger networks and more data, we can usually reduce variance without hurting bias, and vice versa.
(C2W1L04) Regularization
J\left(w, b\right) = \frac{1}{m} \sum^{m}_{i=1}L\left(\hat{y}^{(i)}, y^{(i)}\right) + \frac{\lambda}{2m}\|w\|^2_2
--$\lambda$: the regularization parameter (one of the hyperparameters)
--In Python code, use the variable name `lambd` (because `lambda` is a reserved word)
--For an $L$-layer network, the regularized cost is

J\left(w^{[1]}, b^{[1]}, \cdots , w^{[L]}, b^{[L]}\right) = \frac{1}{m} \sum^{m}_{i=1} L\left(\hat{y}^{(i)}, y^{(i)}\right) + \frac{\lambda}{2m} \sum^{L}_{l=1} \|w^{[l]}\|^2_F
\|w^{[l]}\|^2_F = \sum^{n^{[l]}}_{i=1}\sum^{n^{[l-1]}}_{j=1}\left(w_{ij}^{[l]}\right)^2

--This is the Frobenius norm; the dimension of $w^{[l]}$ is $(n^{[l]}, n^{[l-1]})$
dw^{[l]} = \left( \textrm{from backprop} \right) + \frac{\lambda}{m}w^{[l]} \\
w^{[l]} = w^{[l]} - \alpha dw^{[l]} = \left(1 - \alpha \frac{\lambda}{m} \right)w^{[l]} - \alpha \left( \textrm{from backprop} \right)
--This is called "weight decay": the update multiplies $w^{[l]}$ by $\left(1 - \alpha\frac{\lambda}{m}\right) < 1$, so regularization shrinks the weights a little on every step
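As a rough illustration of the cost and update above, here is a minimal NumPy sketch. The names (`params`, `grads`, `lambd`, `AL`) are assumptions for illustration, not the course's assignment code:

```python
import numpy as np

def l2_regularized_cost(AL, Y, params, lambd):
    """Cross-entropy cost plus the L2 (Frobenius) penalty on every W[l]."""
    m = Y.shape[1]
    cross_entropy = -np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)) / m
    l2_term = sum(np.sum(np.square(W)) for name, W in params.items()
                  if name.startswith("W"))
    return cross_entropy + (lambd / (2 * m)) * l2_term

def update_with_weight_decay(params, grads, lambd, alpha, m):
    """Gradient step where each dW[l] from backprop excludes the L2 term;
    the (lambd/m) * W[l] term is added here, which shrinks W[l] every step."""
    for name in params:
        if name.startswith("W"):
            params[name] -= alpha * (grads["d" + name] + (lambd / m) * params[name])
        else:  # biases are typically not regularized
            params[name] -= alpha * grads["d" + name]
    return params
```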
(C2W1L05) Why Regularization Reduces Overfitting
--If $\lambda$ is large, $w^{[l]} \approx 0$ for many units. Their influence is reduced and the network effectively becomes simpler, moving it toward high bias
--With a very large $\lambda$, the network behaves more like logistic regression
--If $g(z) = \tanh(z)$ and $z$ stays small (because $w$ is small), only the nearly linear region of $g(z)$ is used
--When every activation function is effectively linear, the network cannot represent complex functions, so it moves toward high bias
--When plotting $J$ at each iteration of gradient descent to confirm it decreases, compute $J$ including the regularization term
(C2W1L06) Dropout Regularization
--Drop each unit with a certain probability (remove it from the network) and train the resulting thinned network
--Inverted dropout, illustrated for layer $l = 3$: let keep_prob (= 0.8) be the probability that a unit is kept (so a unit is dropped with probability 1 - keep_prob). With dropout mask $d3$:
d3 = \mathrm{np.random.rand(} a3 \mathrm{.shape[0], }\, a3 \mathrm{.shape[1])} < \mathrm{keep\_prob} \\
a3 = \mathrm{np.multiply(} a3, d3 \mathrm{)} \\
a3\ /= \mathrm{keep\_prob} \\
z^{[4]} = W^{[4]} a^{[3]} + b^{[4]}
--Dividing by keep_prob keeps the expected value of $a^{[3]}$ unchanged (this is why it is called inverted dropout)
--Resample the dropout mask $d3$ on every iteration of gradient descent
--Do not apply dropout at test time (it would only add noise to the predictions)
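A minimal NumPy sketch of the equations above (the names `A3`, `W4`, `b4` and the function itself are illustrative, not the assignment code):

```python
import numpy as np

def dropout_layer3_forward(A3, W4, b4, keep_prob=0.8):
    """Inverted dropout on the activations of layer 3 (training time only)."""
    # D3: 0/1 mask with P(keep) = keep_prob, resampled on every iteration
    D3 = (np.random.rand(*A3.shape) < keep_prob).astype(float)
    A3 = A3 * D3            # drop units
    A3 = A3 / keep_prob     # scale up so E[A3] is unchanged (inverted dropout)
    Z4 = np.dot(W4, A3) + b4
    return Z4, D3           # keep D3 so backprop can apply the same mask
```

At test time this function would not be used; the full activations are fed forward with no mask and no scaling.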
(C2W1L07) Understanding dropout
(C2W1L08) Other Regularization Methods
(C2W1L09) Normalizing inputs
--Normalize the input features when their scales differ significantly. Gradient descent then converges much faster
\mu = \frac{1}{m} \sum^{m}_{i=1} x^{(i)} \\
x := x - \mu \\
\sigma^2 = \frac{1}{m} \sum^{m}_{i=1} \left(x^{(i)}\right)^{2} \quad (\textrm{element-wise square, after subtracting } \mu) \\
x \ /= \sigma
--When normalizing the dev/test sets, use the $\mu$ and $\sigma$ computed on the training set
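A minimal sketch of this normalization, assuming `X_train` and `X_dev` have shape (n_features, m):

```python
import numpy as np

def fit_normalizer(X_train):
    """Compute per-feature mean and standard deviation on the training set."""
    mu = np.mean(X_train, axis=1, keepdims=True)
    sigma = np.std(X_train, axis=1, keepdims=True)
    return mu, sigma

def normalize(X, mu, sigma, eps=1e-8):
    """Apply the *training-set* statistics to any split (train, dev, test)."""
    return (X - mu) / (sigma + eps)
```

The same `mu` and `sigma` fitted on the training set are reused for the dev/test sets so that every split goes through the exact same transformation.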
(C2W1L10) Vanishing / exploding gradients
--When training a very deep neural network, gradients can become extremely small or extremely large. When they vanish, gradient descent makes tiny updates and training takes a long time
--Modern networks can be on the order of 150 layers deep
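A small NumPy sketch of the intuition (a toy deep linear network, an assumed setup rather than anything from the course assignments): with weights slightly above or below 1, activations grow or shrink exponentially with depth, and the gradients behave the same way.

```python
import numpy as np

def forward_depth(x, scale, n_layers=50, width=4):
    """Pass x through n_layers linear layers whose weight matrix is scale * I."""
    W = scale * np.eye(width)
    a = x
    for _ in range(n_layers):
        a = W @ a
    return a

x = np.ones((4, 1))
print(np.linalg.norm(forward_depth(x, 1.5)))  # explodes, roughly 1.5**50
print(np.linalg.norm(forward_depth(x, 0.5)))  # vanishes, roughly 0.5**50
```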
(C2W1L11) Weight initialization for deep networks
--The more input features a unit has, the larger $z = w_1 x_1 + \cdots + w_n x_n + b$ tends to become. Therefore, when a layer has many inputs, initialize each weight to a smaller value
W^{[l]} = \mathrm{np.random.randn} \left( \cdots \right) \ast \mathrm{np.sqrt} \left( \frac{2}{n^{[l-1]}} \right)
--For ReLU, scaling by $\sqrt{\frac{2}{n^{[l-1]}}}$ works well (He initialization)
--For $\tanh$, use $\sqrt{\frac{1}{n^{[l-1]}}}$ (Xavier initialization)
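A minimal sketch of this scaled initialization, assuming `layer_dims = [n_x, n_1, ..., n_L]` (the function name and interface are illustrative):

```python
import numpy as np

def initialize_parameters(layer_dims, activation="relu"):
    """He initialization for ReLU layers, Xavier initialization for tanh layers."""
    params = {}
    for l in range(1, len(layer_dims)):
        fan_in = layer_dims[l - 1]                      # n^[l-1]
        scale = np.sqrt(2.0 / fan_in) if activation == "relu" else np.sqrt(1.0 / fan_in)
        params[f"W{l}"] = np.random.randn(layer_dims[l], fan_in) * scale
        params[f"b{l}"] = np.zeros((layer_dims[l], 1))
    return params
```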
(C2W1L12) Numerical Approximation of Gradients
--The two-sided approximation of the derivative is $\frac{f(\theta + \epsilon) - f(\theta - \epsilon)}{2\epsilon}$ for a small $\epsilon$
--Its error is $O(\epsilon^2)$, compared with $O(\epsilon)$ for the one-sided difference
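A tiny illustration of the two-sided difference for $f(\theta) = \theta^3$, whose true derivative at $\theta = 1$ is 3:

```python
def two_sided_derivative(f, theta, eps=1e-7):
    """Two-sided finite-difference approximation of f'(theta)."""
    return (f(theta + eps) - f(theta - eps)) / (2 * eps)

print(two_sided_derivative(lambda t: t**3, 1.0))  # ~3.0, error on the order of eps**2
```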
(C2W1L13) Gradient checking
d\theta_{approx}^{[i]} = \frac{J(\theta_1, \cdots, \theta_i+\epsilon, \cdots) - J(\theta_1, \cdots, \theta_i-\epsilon, \cdots)}{2\epsilon} \sim d\theta^{[i]}
--Check the following ratio (with $\epsilon = 10^{-7}$):
\frac{\|d\theta_{approx} - d\theta\|_2}{\|d\theta_{approx}\|_2 + \|d\theta\|_2}
| Value | Judgement |
|---|---|
| $\approx 10^{-7}$ | Great! |
| $\approx 10^{-5}$ | May be OK, but check |
| $\approx 10^{-3}$ | Possible bug |
--If it looks like a bug, find the specific components $i$ where the difference between $d\theta_{approx}$ and $d\theta$ is large
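A minimal sketch of the check itself, assuming the parameters have been flattened into a vector `theta` and that `cost_fn(theta)` returns $J(\theta)$ (these names are assumptions for illustration):

```python
import numpy as np

def gradient_check(cost_fn, theta, dtheta, eps=1e-7):
    """Compare the analytic gradient dtheta with the two-sided numerical gradient."""
    dtheta_approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus, theta_minus = theta.copy(), theta.copy()
        theta_plus[i] += eps
        theta_minus[i] -= eps
        dtheta_approx[i] = (cost_fn(theta_plus) - cost_fn(theta_minus)) / (2 * eps)
    diff = (np.linalg.norm(dtheta_approx - dtheta)
            / (np.linalg.norm(dtheta_approx) + np.linalg.norm(dtheta)))
    # if diff is large, inspect the indices where |dtheta_approx - dtheta| is biggest
    return diff, dtheta_approx
```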
(C2W1L14) Gradient Checking Implementation Notes
--Practical notes on how to use gradient checking and how to debug when $d\theta_{approx}$ and $d\theta$ differ
-Deep Learning Specialization (Coursera) Self-study record (table of contents)