Introduction
These notes cover Course 4, Week 2 (C4W2) of the Deep Learning Specialization.
(C4W2L01) Why look at case studies?
Contents
- This week's outline
- Classic Networks
- LeNet-5
- AlexNet
- VGG
- ResNet (152 layers)
- Inception
(C4W2L02) Classic Networks
Contents
LeNet-5 (1998)
- Input ; 32 x 32 x 1
- CONV (5x5, s=1) ; 28 x 28 x 6
- Avg POOL (f=2, s=2) ; 14 x 14 x 6
- CONV (5x5, s=1) ; 10 x 10 x 16
- Avg POOL (f=2, s=2) ; 5 x 5 x 16
- FC; 120 units
- FC; 84 units
- \hat{y}
- Number of parameters; 60k
- $n_H$, $n_W$ become smaller and $n_C$ becomes larger
- Typical layer pattern: CONV, POOL, CONV, POOL, FC, FC (a tf.keras sketch follows this list)
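A minimal tf.keras sketch of this LeNet-5-style layer pattern. Assumptions not in the notes: ReLU activations and a 10-way softmax output are used instead of the original activations and output layer.

```python
# Sketch of the LeNet-5 layer pattern (assumption: ReLU activations, 10-way softmax).
import tensorflow as tf
from tensorflow.keras import layers, models

lenet5 = models.Sequential([
    layers.Input(shape=(32, 32, 1)),
    layers.Conv2D(6, kernel_size=5, strides=1, activation='relu'),   # 28 x 28 x 6
    layers.AveragePooling2D(pool_size=2, strides=2),                 # 14 x 14 x 6
    layers.Conv2D(16, kernel_size=5, strides=1, activation='relu'),  # 10 x 10 x 16
    layers.AveragePooling2D(pool_size=2, strides=2),                 # 5 x 5 x 16
    layers.Flatten(),
    layers.Dense(120, activation='relu'),
    layers.Dense(84, activation='relu'),
    layers.Dense(10, activation='softmax'),                          # \hat{y}
])
lenet5.summary()  # roughly 60k parameters in total
```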
AlexNet (2012)
- Input ; 227x227x3
- CONV (11x11, s=4) ; 55 x 55 x 96
- Max POOL (3x3, s=2) ; 27 x 27 x 96
- CONV (5x5, same) ; 27 x 27 x 256
- Max POOL (3x3, s=2) ; 13 x 13 x 256
- CONV (3x3, same) ; 13 x 13 x 384
- CONV (3x3, same) ; 13 x 13 x 384
- CONV (3x3, same) ; 13 x 13 x 256
- Max POOL (3x3, s=2) ; 6 x 6 x 256
- FC; 4096 units
- FC; 4096 units
- Softmax; 1000 units
- Similar to LeNet, but much bigger ($\sim$ 60M parameters)
- ReLU
- Multiple GPUs
- Local Response Normalization
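A minimal tf.keras sketch following the layer sizes listed above. Assumptions not in the notes: single-GPU version, Local Response Normalization omitted, ReLU throughout.

```python
# Sketch of AlexNet (assumptions: single GPU, no Local Response Normalization).
import tensorflow as tf
from tensorflow.keras import layers, models

alexnet = models.Sequential([
    layers.Input(shape=(227, 227, 3)),
    layers.Conv2D(96, 11, strides=4, activation='relu'),        # 55 x 55 x 96
    layers.MaxPooling2D(pool_size=3, strides=2),                 # 27 x 27 x 96
    layers.Conv2D(256, 5, padding='same', activation='relu'),   # 27 x 27 x 256
    layers.MaxPooling2D(pool_size=3, strides=2),                 # 13 x 13 x 256
    layers.Conv2D(384, 3, padding='same', activation='relu'),   # 13 x 13 x 384
    layers.Conv2D(384, 3, padding='same', activation='relu'),   # 13 x 13 x 384
    layers.Conv2D(256, 3, padding='same', activation='relu'),   # 13 x 13 x 256
    layers.MaxPooling2D(pool_size=3, strides=2),                 # 6 x 6 x 256
    layers.Flatten(),
    layers.Dense(4096, activation='relu'),
    layers.Dense(4096, activation='relu'),
    layers.Dense(1000, activation='softmax'),
])
alexnet.summary()  # ~60M parameters
```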
VGG-16
- CONV = 3x3 filter, s=1, same
- Max POOL = 2x2, s=2
- Input ; 224 x 224 x 3
- CONV64 x 2 ; 224 x 224 x 64
- POOL ; 112 x 112 x 64
- CONV128 x 2 ; 112 x 112 x 128
- POOL ; 56 x 56 x 128
- CONV256 x 3; 56 x 56 x 256
- POOL ; 28 x 28 x 256
- CONV512 x 3; 28 x 28 x 512
- POOL ; 14 x 14 x 512
- CONV512 x 3 ; 14 x 14 x 512
- POOL ; 7 x 7 x 512
- FC; 4096 units
- FC; 4096 units
- Softmax; 1000 units
- Number of parameters; $\sim$ 138M
- Relatively uniform structure (a tf.keras sketch follows this list)
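A minimal tf.keras sketch that makes the uniform block structure explicit: repeated 3x3 "same" convolutions followed by a 2x2 max pool. The `vgg_block` helper is just for illustration, not part of the notes.

```python
# Sketch of VGG-16, emphasizing its uniform CONV(3x3, same) + POOL(2x2, s=2) blocks.
import tensorflow as tf
from tensorflow.keras import layers, models

def vgg_block(model, num_convs, num_filters):
    """Add num_convs CONV(3x3, s=1, same) layers, then one 2x2/s=2 max pool."""
    for _ in range(num_convs):
        model.add(layers.Conv2D(num_filters, 3, padding='same', activation='relu'))
    model.add(layers.MaxPooling2D(pool_size=2, strides=2))

vgg16 = models.Sequential([layers.Input(shape=(224, 224, 3))])
for num_convs, num_filters in [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]:
    vgg_block(vgg16, num_convs, num_filters)
vgg16.add(layers.Flatten())
vgg16.add(layers.Dense(4096, activation='relu'))
vgg16.add(layers.Dense(4096, activation='relu'))
vgg16.add(layers.Dense(1000, activation='softmax'))
vgg16.summary()  # ~138M parameters
```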
Impressions
- It shows how fast the field moves that networks from around 2015 are already described as "classic"
(C4W2L03) Residual Networks (ResNet)
Contents
- Residual block (a tf.keras sketch follows this list)
- $a^{[l]}$
- $z^{[l+1]} = W^{[l+1]}a^{[l]} + b^{[l+1]}$
- $a^{[l+1]} = g(z^{[l+1]})$
- $z^{[l+2]} = W^{[l+2]}a^{[l+1]} + b^{[l+2]}$
- $a^{[l+2]} = g(z^{[l+2]} + a^{[l]})$
- In a plain network, the deeper the network, the larger the training error tends to become.
- With ResNet, training error keeps decreasing even beyond 100 layers, so it is effective for training very deep networks.
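A minimal tf.keras sketch of a residual block implementing $a^{[l+2]} = g(z^{[l+2]} + a^{[l]})$. One assumption: convolutional layers with "same" padding and matching filter counts, so the shortcut and the main path have the same shape and no 1x1 projection is needed.

```python
# Sketch of a residual block with a skip connection (shortcut) from a^{[l]}.
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(a_l, num_filters):
    """Two CONV(3x3, same) layers plus a shortcut: a^{[l+2]} = g(z^{[l+2]} + a^{[l]})."""
    z1 = layers.Conv2D(num_filters, 3, padding='same')(a_l)    # z^{[l+1]}
    a1 = layers.Activation('relu')(z1)                         # a^{[l+1]} = g(z^{[l+1]})
    z2 = layers.Conv2D(num_filters, 3, padding='same')(a1)     # z^{[l+2]}
    return layers.Activation('relu')(layers.Add()([z2, a_l]))  # g(z^{[l+2]} + a^{[l]})

# Example usage on a hypothetical 28x28x64 activation map:
inputs = layers.Input(shape=(28, 28, 64))
outputs = residual_block(inputs, num_filters=64)
model = tf.keras.Model(inputs, outputs)
```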
(C4W2L04) Why ResNets work
Contents
- If $W^{[l+2]} = 0$ and $b^{[l+2]} = 0$, then $a^{[l+2]} = g(a^{[l]}) = a^{[l]}$ (with ReLU, $a^{[l]} \ge 0$, so $g(a^{[l]}) = a^{[l]}$)
- The identity function is easy for a residual block to learn, so adding the block does not hurt performance
(C4W2L05) Network in network and 1x1 convolutions
Contents
- What does a 1x1 convolution do?
- $(6 \times 6 \times 32) \ast (1 \times 1 \times 32) = (6 \times 6 \times \textrm{\#filters})$
- Think of it as applying a fully connected (FC) layer to each pixel of the input
- Also known as Network in Network
- Using 1x1 convolutions
- Input ; 28 \times 28 \times 192
- ReLU, CONV 1x1, 32 filters → Output ; 28 \times 28 \times 32
- $n_C$ can be reduced
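A minimal tf.keras sketch of the example above: a 1x1 convolution reducing $n_C$ from 192 to 32 (ReLU assumed).

```python
# Sketch: 1x1 convolution used to shrink the channel dimension, 28x28x192 -> 28x28x32.
import tensorflow as tf
from tensorflow.keras import layers

x = layers.Input(shape=(28, 28, 192))
y = layers.Conv2D(32, kernel_size=1, activation='relu')(x)  # 28 x 28 x 32
model = tf.keras.Model(x, y)
print(model.output_shape)  # (None, 28, 28, 32)
```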
(C4W2L06) Inception network motivation
Contents
- Apply all of the following to the input (28x28x192) and concatenate the outputs (see the sketch after this list)
- 1x1 → Output ; 28x28x64
- 3x3 → Output ; 28x28x128
- 5x5 → Output ; 28x28x32
- Max POOL → Output ; 28x28x32
- Total: 28x28x256
- Apply all the different filter sizes and pooling, and let the network learn which ones to use
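A minimal tf.keras sketch of this motivation example: parallel 1x1, 3x3, 5x5 and pooling branches concatenated along the channel axis. Two assumptions not in the list above: ReLU activations, and a 1x1 convolution after the max-pooling branch so its output has 32 channels and the totals match; the bottleneck layers before the 3x3 and 5x5 branches (discussed next) are omitted.

```python
# Sketch of an inception-style module: parallel branches, channel-wise concatenation.
import tensorflow as tf
from tensorflow.keras import layers

x = layers.Input(shape=(28, 28, 192))
b1 = layers.Conv2D(64, 1, padding='same', activation='relu')(x)   # 28 x 28 x 64
b2 = layers.Conv2D(128, 3, padding='same', activation='relu')(x)  # 28 x 28 x 128
b3 = layers.Conv2D(32, 5, padding='same', activation='relu')(x)   # 28 x 28 x 32
pool = layers.MaxPooling2D(pool_size=3, strides=1, padding='same')(x)
b4 = layers.Conv2D(32, 1, activation='relu')(pool)                # 28 x 28 x 32
out = layers.Concatenate()([b1, b2, b3, b4])                      # 28 x 28 x 256
print(tf.keras.Model(x, out).output_shape)  # (None, 28, 28, 256)
```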
- The problem of computational cost
- Input ; 28x28x192
- CONV 5x5, same, 32 → Output ; 28x28x32
- Computational cost; 28x28x32x5x5x192 ≈ 120M multiplications
- Using 1x1 convolution
- Input ; 28x28x192
- CONV 1x1, 16 → Output ; 28x28x16 (bottleneck layer)
- CONV 5x5, 32 → Output ; 28x28x32
- Computational cost; 28x28x16x192 + 28x28x32x5x5x16 ≈ 12.4M (about 1/10 of the above)
- If the bottleneck layer is sized sensibly, the computational cost can be reduced without hurting performance (see the check below)
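A quick check of the cost figures above, counting multiplications only:

```python
# Verifying the computational-cost comparison (number of multiplications).
direct = 28 * 28 * 32 * 5 * 5 * 192           # CONV 5x5 applied directly to 28x28x192
bottleneck = (28 * 28 * 16 * 192              # CONV 1x1 (bottleneck layer)
              + 28 * 28 * 32 * 5 * 5 * 16)    # then CONV 5x5 on 28x28x16
print(direct)      # 120,422,400  (~120M)
print(bottleneck)  # 12,443,648   (~12.4M, about 1/10)
```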
(C4W2L07) Inception network
Contents
- Description of the full Inception network built from inception modules with bottleneck layers
- Also known as GoogLeNet
(C4W2L08) Using open-source implementation
Contents
- Description of downloading source code from GitHub (`git clone`)
(C4W2L09) Transfer Learning
Contents
- $x$ → layer → layer → $\cdots$ → layer → softmax → $\hat{y}$
- With little data, train only the softmax layer (keep the other parameters frozen)
- With more data, train, for example, the later layers and keep the earlier layers frozen
- With very large data sets, train the entire network (a tf.keras sketch follows this list)
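A minimal tf.keras sketch of these regimes, assuming a pre-trained VGG16 from `tf.keras.applications`; `num_classes` is a hypothetical value for your own task.

```python
# Sketch of transfer learning: freeze pre-trained layers, train only the new softmax.
import tensorflow as tf
from tensorflow.keras import layers, models

num_classes = 5  # hypothetical number of classes in your data set

base = tf.keras.applications.VGG16(include_top=False, weights='imagenet',
                                   input_shape=(224, 224, 3), pooling='avg')
base.trainable = False  # little data: keep all pre-trained parameters fixed

model = models.Sequential([
    base,
    layers.Dense(num_classes, activation='softmax'),  # train only this softmax
])

# With more data, unfreeze the later layers and keep the earlier ones frozen:
# base.trainable = True
# for layer in base.layers[:-4]:
#     layer.trainable = False
model.compile(optimizer='adam', loss='categorical_crossentropy')
```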
(C4W2L10) Data augmentation
Contents
- Common augmentation methods
- mirroring
- random cropping
- The following are rarely used
- rotation
- shearing
- local warping
- color shifting
- Add (or subtract) values to the R, G, B channels
- The AlexNet paper describes PCA color augmentation
- Implementing distortions during training
- Load images, apply the distortions, and feed the results to training as mini-batches (a sketch follows)
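A minimal sketch of these distortions applied on the fly with `tf.image` and `tf.data`; the `dataset` pipeline, crop size, and distortion magnitudes are assumptions for illustration.

```python
# Sketch of on-the-fly augmentation: mirroring, random cropping, color shifting.
import tensorflow as tf

def distort(image, label):
    image = tf.image.random_flip_left_right(image)            # mirroring
    image = tf.image.random_crop(image, size=[224, 224, 3])   # random cropping
    image = tf.image.random_hue(image, max_delta=0.05)        # simple color shifting
    image = tf.image.random_brightness(image, max_delta=0.1)
    return image, label

# `dataset` is assumed to yield (image, label) pairs with images larger than 224x224:
# dataset = dataset.map(distort, num_parallel_calls=tf.data.AUTOTUNE).batch(32)
```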
(C4W2L11) The state of computer vision
Contents
- Roughly speaking, the amount of data currently available: speech recognition $\gt$ image recognition $\gt$ object detection (which also localizes where objects are)
- With a lot of data, simpler algorithms and less hand-engineering are enough
- With little data, more hand-engineering and "hacks" are needed
- Two sources of knowledge
- labeled data
- hand-engineering features / network architecture / other components
- Transfer learning is useful when there is little data
- Tips for doing well on benchmark / winning competitions
- Ensemble
- Train several networks independently and average their outputs (3-15 networks); a sketch follows this list
- Multi-crop at test time
- Run the classifier on multiple versions of the test images and average the results (e.g., 10 crops)
- Use open source code
- Use architecture of network published in the literature
- Use open source implementations if possible
- Use pre-trained models and fine-tune on your data set
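A minimal sketch of test-time ensembling and multi-crop averaging; the `models` list and the `crops` tensor are hypothetical.

```python
# Sketch: average softmax outputs over several trained models (and several crops).
import numpy as np

def ensemble_predict(models, images):
    """Average the softmax outputs of all models over a batch of images."""
    preds = [m.predict(images, verbose=0) for m in models]  # each: (batch, num_classes)
    return np.mean(preds, axis=0)

# Hypothetical usage: `models` is a list of 3-15 trained Keras models and
# `crops` holds 10 crops of one test image with shape (10, H, W, 3).
# probs = ensemble_predict(models, crops).mean(axis=0)  # multi-crop average
```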
Note
- This week's programming exercise is an implementation of ResNet