"Deep Learning from scratch" self-study memo (unreadable glossary)

While reading "Deep Learning from scratch" (written by Yasuki Saito, published by O'Reilly Japan), I will make a note of the sites I referred to.

It is natural that new words and terms appear in a new field, but when they are acronyms built from the first letters of English words, people who are not native English speakers have no idea how to pronounce them. A long time ago, when I taught COBOL at a vocational school, the first thing students stumbled over was the pronunciation and meaning of the COBOL statements. You can't remember what you can't pronounce, and you can't use what you can't remember.

With that in mind, here are some of the words I've looked up in this book. Since I'm still reading, they are listed in the order they appear.

Chapter 1

Python  Originally the English word for a python (a large snake), so it is read that way. Why a snake? Well, there is that TV program called Monty Python ... (the story is well known, so I'll omit it).

NumPy  "num-pie"  This is a Python library for numerical computation, so I figured "num" from number plus "py" read as "pie" from Python would do.

SciPy  "sci-pie"  Because it is for scientific and technical computation (science), it is "sci-pie". A long time ago there was a debate about whether to call science fiction SF or SciFi, but that is a completely different story.

TensorFlow  "tensor flow"  In Japan some people read tensor the German way ("ten-sol"), but Google people pronounce it the English way ("ten-ser"). The German-style reading is still the mainstream in Japanese science and engineering literature; it may be a reading that became established before the war.

Matplotlib  "mat-plot-lib"  A library that lets you do in Python the same kind of thing as the numerical analysis software MATLAB. "Mat" apparently comes from matrix, and "plot" is the English word for drawing a graph: a library that plots matrices (arrays) as graphs.

Anaconda  Originally the name of a South American snake. Since Python is a snake, the distribution was presumably also named after a snake.

Chapter 2

AND  NAND  OR  XOR  "and", "nand", "or", "ex-or"  The basis of logical operations.
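
As a reminder to myself: Chapter 2 builds XOR by combining AND, NAND and OR. A minimal sketch with plain Python booleans (not the book's weighted-perceptron implementation):

def AND(x1, x2):  return int(x1 and x2)
def NAND(x1, x2): return int(not (x1 and x2))
def OR(x1, x2):   return int(x1 or x2)
def XOR(x1, x2):  return AND(NAND(x1, x2), OR(x1, x2))  # XOR combined from the other three gates

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, AND(a, b), NAND(a, b), OR(a, b), XOR(a, b))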

Chapter 3

activation function  The word activation (meaning the output data after the activation function) also comes up later.

sigmoid function  "sigmoid" means S-shaped.

exp()  exponential function  An exponential function with e (Napier's number) as its base; exp(x) is e to the x-th power. Some people apparently read it out in full as "exponential x", but "e to the x" gets the meaning across.

ReLU (Rectified Linear Unit)  "re-lu"  A unit that uses a rectified linear function. Stringing together the readings of the original words gives "re-lu". There also seems to be another name, the ramp function, but switching to that reading now would feel unnatural.

softmax  A "soft" version of a function that indicates which of its arguments is the maximum. A function called argmax comes up later.
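
A minimal sketch (my own, not from this part of the book) of how softmax relates to argmax:

import numpy as np

def softmax(a):
    a = a - np.max(a)              # subtract the max for numerical stability
    exp_a = np.exp(a)
    return exp_a / np.sum(exp_a)

a = np.array([0.3, 2.9, 4.0])
print(softmax(a))                  # roughly [0.018 0.245 0.737]: a "soft" ranking of the candidates
print(np.argmax(a))                # 2: just the index of the hard maximum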

MNIST (Modified National Institute of Standards and Technology database)  "em-nist"  NIST is the National Institute of Standards and Technology, the US government agency that standardizes technology and industry. So this is a modified version of a database created by NIST?

Chapter 4

Loss function

The loss function is an indicator of the "badness" of a neural network's performance. It expresses how poorly the current network fits the teacher data, in other words how much it fails to match the teacher data.
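
For reference, a minimal sketch of one such loss, the cross-entropy error used in the book (assuming t is a one-hot teacher label):

import numpy as np

def cross_entropy_error(y, t):
    delta = 1e-7                        # avoid log(0)
    return -np.sum(t * np.log(y + delta))

t = np.array([0, 0, 1, 0])              # the correct answer is index 2
y = np.array([0.1, 0.1, 0.7, 0.1])      # the network is mostly right
print(cross_entropy_error(y, t))        # about 0.36: a small value means little "badness"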

Gradient

Learning rate

The learning rate $\eta$ (eta) → How to read Greek letters
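
A minimal sketch of how the gradient and the learning rate η fit together (the numerical gradient follows the book's central-difference approach; the function f and the values are my own example):

import numpy as np

def numerical_gradient(f, x):
    h = 1e-4                            # small step for the central difference
    grad = np.zeros_like(x)
    for i in range(x.size):
        tmp = x[i]
        x[i] = tmp + h
        fxh1 = f(x)
        x[i] = tmp - h
        fxh2 = f(x)
        grad[i] = (fxh1 - fxh2) / (2 * h)
        x[i] = tmp                      # restore the original value
    return grad

def f(x):                               # example function: f(x0, x1) = x0^2 + x1^2
    return np.sum(x ** 2)

x = np.array([3.0, 4.0])
eta = 0.1                               # learning rate (eta)
for _ in range(100):
    x -= eta * numerical_gradient(f, x)
print(x)                                # approaches the minimum at (0, 0)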

differential

\frac{dy}{dx}  read as "dy dx"

Partial differential

\frac{\partial f}{\partial x}  read as "del f del x"

Chapter 5

Layer

layer  Something laid down flat and stacked up; a layer.

Affine  There seem to be related terms such as affine transformation and affine space.

Chapter 6

SGD (stochastic gradient descent)  Acronyms like this that can't be read as a word are, if anything, easier to deal with. Stochastic gradient descent is one of the methods for finding the optimal parameters. Rather than "probability", it feels more like poking at the stock (stochastic) and searching for the direction in which the gradient goes down. (A rough sketch of the update rules follows after the Adam entry below.)

Momentum  "Momentum", as in physics. One of the methods for finding the optimal parameters.

AdaGrad  "ada-grad"? That is what you get if you string the readings of the original words together. Apparently from Adaptive Subgradient Methods. One of the methods for finding the optimal parameters.

Adam  A fusion of AdaGrad and Momentum? One of the methods for finding the optimal parameters.
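
A rough summary of the update rules of these methods as I understand them (W is a parameter array, dW its gradient; this is my own condensed sketch, not the book's class-based code):

import numpy as np

def sgd(W, dW, lr=0.01):
    W -= lr * dW                        # step straight down the gradient

def momentum(W, dW, v, lr=0.01, alpha=0.9):
    v[...] = alpha * v - lr * dW        # keep "velocity" from previous updates
    W += v

def adagrad(W, dW, h, lr=0.01):
    h += dW * dW                        # accumulate squared gradients per parameter
    W -= lr * dW / (np.sqrt(h) + 1e-7)  # parameters updated a lot get smaller steps

# Adam roughly combines the two ideas above: a momentum-like moving average
# of the gradient plus an AdaGrad-like per-parameter step-size adjustment.

W = np.array([1.0, 2.0])
dW = np.array([0.1, 0.2])
sgd(W, dW)
print(W)                                # [0.999 1.998]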

Gaussian distribution with a standard deviation of 0.01

The Gaussian distribution is also called the normal distribution. Among normal distributions, the one with μ (mu, the mean) = 0 and $σ^2$ (sigma squared, the variance) = 1 is called the standard normal distribution, and it can be generated with the randn method in the random module of the NumPy library.

np.random.randn(node_num, node_num)

Also, if you use the normal method, unlike randn you can generate an array of random numbers that follows a normal distribution with an arbitrarily specified μ (mean) and σ (sigma, standard deviation).

np.random.normal(loc=0, scale=0.01, size=(node_num, node_num)) 

However, P178 uses the randn method instead of the normal method.

P178: So far we have used small values for the initial weights, such as 0.01 * np.random.randn(10, 100), that is, values generated from a Gaussian distribution and multiplied by 0.01 (a Gaussian distribution with a standard deviation of 0.01).

I'm not very familiar with this area, but when a normally distributed X is standardized and converted to the standard normal distribution Z, it seems that $Z = \frac{X - \mu}{\sigma}$. So with mean μ = 0 and standard deviation σ = 0.01, $Z = \frac{X}{0.01}$, that is, X = 0.01 × Z. Is that what it means?
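
A small check of that reading (my own experiment, not from the book): scaling standard-normal samples by 0.01 should give the same distribution as drawing samples with a standard deviation of 0.01 directly.

import numpy as np

node_num = 100
a = 0.01 * np.random.randn(node_num, node_num)                     # the style used on P178
b = np.random.normal(loc=0, scale=0.01, size=(node_num, node_num)) # the normal method

print(a.mean(), a.std())   # both means are near 0
print(b.mean(), b.std())   # both standard deviations are near 0.01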

Limitation of expressiveness

P182: The activation distribution of each layer is required to have a moderate spread. This is because the neural network can learn efficiently when moderately diverse data flows through each layer. Conversely, when biased data flows through, learning may fail because of problems such as vanishing gradients or the "limitation of expressiveness".

Expressiveness? I didn't understand this word when it suddenly appeared, so I looked for some references that might help.

Expressiveness is a concept that expresses the size of the set of functions a machine learning model can approximate. → Mathematical analysis and design based on the expressive power of deep neural networks with residual skip connections

One of the most striking facts about neural networks is that they can represent arbitrary functions. For example, suppose someone gives you some complicated, wavy function f(x): whatever that function is, there is a neural network whose output is f(x) (or an approximation of it) for every possible input x. → Visual proof that neural networks can represent arbitrary functions

Even a three-layer perceptron can approximate any function with arbitrary precision if the number of units in the intermediate layer is increased without limit. So why is deeper better? Because expressive power grows exponentially with depth. → Machine learning technology and its mathematical foundation

Initial value of Xavier

From the researcher's name Xavier Glorot. Xavier? Like the missionary? (A sketch of this and the He initialization follows after the next entry.)

Initial value of He

From the researcher's name, Kaiming He. What? Who?
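
A minimal sketch of the two initializations as the book describes them (n is the number of nodes in the previous layer; node_num is my own example value):

import numpy as np

node_num = 100   # n: number of nodes in the previous layer

# Xavier initialization: standard deviation 1/sqrt(n), suited to sigmoid and tanh
w_xavier = np.random.randn(node_num, node_num) / np.sqrt(node_num)

# He initialization: standard deviation sqrt(2/n), suited to ReLU
w_he = np.random.randn(node_num, node_num) * np.sqrt(2.0 / node_num)

print(w_xavier.std(), w_he.std())   # roughly 0.1 and 0.14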

tanh function

Hyperbolic tangent (a hyperbolic function)

\tanh x=\frac{\sinh x}{\cosh x}

Batch Normalization  Normalization performed in mini-batch units, for each mini-batch, during learning.
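
A minimal sketch of the normalization step itself, forward pass only (gamma, beta and eps are my own names for the usual scale, shift and small constant):

import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-7):
    mu = x.mean(axis=0)                  # per-feature mean over the mini-batch
    var = x.var(axis=0)                  # per-feature variance over the mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta          # scale and shift

x = np.random.randn(32, 100) * 5 + 3     # a skewed activation distribution
out = batchnorm_forward(x, gamma=1.0, beta=0.0)
print(out.mean(), out.std())             # roughly 0 and 1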

Robust to initial value

robust  (of a person, physique, etc.) sturdy, strong, solid

P259: Robustness here means, for example, that a neural network has the robust property that its output does not change even when a small amount of noise is added to the input image. Thanks to this robustness, the effect on the output can be considered small even if the data flowing through the network is "degraded".

Overfitting

Overfitting  The model has learned to fit only the training data.

Weight decay  A technique that has long been used to suppress overfitting.

L2 norm

A quantity representing the "size" of various things. The p-th order norm of an M-dimensional vector x is defined as follows.

∥x∥_p =\left(∑^M_{i=1}|x_i|^p\right)^{1/p}= \sqrt[p]{|x_1|^p + \cdots + |x_M|^p}

The L2 norm is the second-order (p = 2) norm of the M-dimensional vector x.

∥x∥_2 =\left(∑^M_{i=1}|x_i|^2\right)^{1/2}= \sqrt{|x_1|^2 + \cdots + |x_M|^2}

When this is two-dimensional, it looks like this.

∥x∥_2 = \sqrt{x_1^2 + x_2^2}

Seen through the Pythagorean theorem, this is the length of the hypotenuse of a right triangle. If you square both sides to remove the square root,

r^2 = x^2 + y^2

Then, it becomes an equation of a circle with radius r centered on the origin.

so

If the weight is W, the weight decay for the L2 norm is $\frac{1}{2}\lambda W^2$, and this $\frac{1}{2}\lambda W^2$ is added to the loss function.

So we add $0.5 \times \lambda \times (w_1^2 + \cdots + w_M^2)$ to the loss function. My mental image: a circle is drawn around the point where the loss function would be 0 (the point where it overfits), learning heads toward that point, but cannot get inside the circle? When exceptional data is mixed into the training data and the weights w become large in an attempt to fit the exceptions, the circle also becomes large and learning is hindered; when the exceptional data is ignored and the weights w stay small, the circle also stays small and learning is not hindered. Is that what it means?

λ (lambda)  The strength of the weight decay (L2 norm penalty).
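
A minimal sketch of adding that penalty to the loss (base_loss and the weight list are placeholders of my own):

import numpy as np

def loss_with_weight_decay(base_loss, weights, lam=0.1):
    penalty = 0.0
    for W in weights:
        penalty += 0.5 * lam * np.sum(W ** 2)   # (1/2) * lambda * ||W||^2 for each weight matrix
    return base_loss + penalty

W1 = np.random.randn(10, 5)
W2 = np.random.randn(5, 3)
print(loss_with_weight_decay(base_loss=1.23, weights=[W1, W2]))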

Dropout  A method that suppresses overfitting by randomly erasing (dropping) neurons during learning.
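
A minimal sketch of the idea (the book implements this as a Dropout class; this is only the forward pass during training):

import numpy as np

def dropout_forward(x, dropout_ratio=0.5):
    mask = np.random.rand(*x.shape) > dropout_ratio   # randomly silence neurons
    return x * mask

# at test time the book multiplies the output by (1 - dropout_ratio) instead of applying a mask

x = np.random.randn(2, 5)
print(dropout_forward(x))   # roughly half the values are zeroed out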

Hyper-parameter

The number of neurons in each layer, the batch size, the learning rate used when updating parameters, the weight decay, and so on.

Stanford University class "CS231n"

CS231n: Convolutional Neural Networks for Visual Recognition

Chapter 7

CNN (convolutional neural network)  Another acronym that can't be read as a word, which makes it easier to deal with. "Convolutional" originally means rolled up or coiled.

Convolution layer

Convolution  Rolling or coiling. The image is that the filter sweeps over the input data?

Padding

pad "Filling for shaping" Fill the area around the input data with 0s to shape it.

Stride

stride  To stride; to step over or across.

Pooling layer

Pooling  Pooling, as in sharing or gathering things together.
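
Putting padding and stride together, the output size of a convolution (and of pooling) works out as below; a small sketch using the formula from the book:

def conv_output_size(input_size, filter_size, stride=1, pad=0):
    # OH = (H + 2*P - FH) / S + 1, and the same for the width
    return (input_size + 2 * pad - filter_size) // stride + 1

print(conv_output_size(28, 5, stride=1, pad=0))   # 24
print(conv_output_size(28, 5, stride=1, pad=2))   # 28: padding keeps the size unchanged
print(conv_output_size(28, 2, stride=2, pad=0))   # 14: a typical 2x2 / stride 2 pooling step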

im2col  An abbreviation of "image to column", so "im-two-col"? It converts an image into a matrix (columns).
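
The book provides a fast, vectorized im2col in common/util.py; the following is my own slow, loop-based sketch of what it does, just to see the shapes (the example should match the 1x3x7x7 input with a 5x5 filter used in Chapter 7, if I recall it correctly):

import numpy as np

def im2col_simple(x, filter_h, filter_w, stride=1, pad=0):
    # every filter-sized patch of the input becomes one row of a 2D matrix
    N, C, H, W = x.shape
    x = np.pad(x, [(0, 0), (0, 0), (pad, pad), (pad, pad)], mode='constant')
    out_h = (H + 2 * pad - filter_h) // stride + 1
    out_w = (W + 2 * pad - filter_w) // stride + 1
    rows = []
    for n in range(N):
        for i in range(out_h):
            for j in range(out_w):
                patch = x[n, :, i*stride:i*stride+filter_h, j*stride:j*stride+filter_w]
                rows.append(patch.flatten())
    return np.array(rows)

x = np.random.rand(1, 3, 7, 7)                    # N=1, C=3, 7x7 image
col = im2col_simple(x, 5, 5, stride=1, pad=0)
print(col.shape)                                  # (9, 75): 3x3 output positions, 3*5*5 values each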

Primitive information such as edges (boundaries where the color changes) and blobs (locally clustered regions).

LeNet  "le-net"? If read as French, Le is pronounced "luh". A convolutional neural network created by the Frenchman Yann LeCun.

AlexNet  A convolutional neural network created by Alex Krizhevsky.

Referenced site

How to read Greek letters
Qiita Formula Cheat Sheet
[Machine learning] What is the LP norm?
