While reading "Deep Learning from Scratch" (by Yasuki Saito, published by O'Reilly Japan), I am making a note of the terms I looked up and the sites I referred to.
New fields naturally bring new words and terms, but when they are acronyms built from the first letters of English words, non-native English speakers often have no idea how to read them. A long time ago, when I taught COBOL at a vocational school, the first thing students stumbled over was the pronunciation and meaning of the COBOL statements. You can't remember what you can't pronounce, and you can't use what you can't remember.
With that in mind, here are some of the words I looked up while reading this book, listed in the order they appear.
Python: originally an English word for a python (a large snake), so it is read with that pronunciation.
Why a snake? Well, there is the TV show Monty Python... (the story is well known, so I'll omit it here.)
NumPy ("num-pie"): a Python library for numerical computation, so I figured reading it as "num" from number plus "py" from Python pronounced "pie" would be fine.
SciPy ("sci-pie"): it is for scientific and engineering computation (science), hence "sci-pie". A long time ago there was an argument over whether to call science fiction "SF" or "SciFi", but that is a completely different story.
TensorFlow ("tensor flow"): in Japan some people read "tensor" the Japanese way, but Google people pronounce it the English way. The Japanese reading follows the German pronunciation of tensor and is the mainstream reading in Japanese science and engineering literature; it may be a reading that took hold before the war.
Matplotlib ("mat-plot-lib"): a library that lets you do in Python roughly what the numerical analysis software MATLAB does. "Mat" apparently comes from "matrix", and "plot" is the English word for drawing a graph: a library that plots matrices (arrays).
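As a minimal check of my own (not from the book), plotting an array:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 6, 0.1)    # 0 to 6 in steps of 0.1
y = np.sin(x)               # the array to plot

plt.plot(x, y, label="sin")  # draw the array as a graph
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()
```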
Anaconda: originally the name of a South American snake. Since Python is a snake, the distribution was presumably also named after a snake.
AND, NAND, OR, XOR ("and", "nand", "or", "ex-or"): the basics of logical operations.
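A perceptron-style sketch of these gates, in the spirit of the book's early chapters (the weight and bias values are just one workable choice):

```python
import numpy as np

def AND(x1, x2):
    # single perceptron: fires only for (1, 1)
    x = np.array([x1, x2])
    w = np.array([0.5, 0.5])
    b = -0.7
    return int(np.sum(w * x) + b > 0)

def NAND(x1, x2):
    x = np.array([x1, x2])
    w = np.array([-0.5, -0.5])
    b = 0.7
    return int(np.sum(w * x) + b > 0)

def OR(x1, x2):
    x = np.array([x1, x2])
    w = np.array([0.5, 0.5])
    b = -0.2
    return int(np.sum(w * x) + b > 0)

def XOR(x1, x2):
    # XOR is not linearly separable, so it is built from the other gates
    return AND(NAND(x1, x2), OR(x1, x2))
```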
activation function: later, the word "activation" (the output data after passing through the activation function) also comes up.
sigmoid function: "sigmoid" means S-shaped.
exp(): the exponential function with base e (Napier's number); exp(x) is e raised to the power x. Some people read it out properly as "exponential of x", but "e to the x" gets the meaning across.
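For example, the sigmoid above is written with np.exp; a minimal sketch:

```python
import numpy as np

def sigmoid(x):
    # S-shaped curve: squashes any real number into (0, 1)
    return 1 / (1 + np.exp(-x))
```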
ReLU (Rectified Linear Unit): a unit that uses the rectified linear function; stringing the initial sounds together gives "ReLU". It also goes by the name "ramp function", but reading it that way here would feel unnatural.
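ReLU itself is one line with NumPy:

```python
import numpy as np

def relu(x):
    # rectified linear: 0 for negative inputs, identity for positive inputs
    return np.maximum(0, x)
```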
softmax: a "soft" version of a function that indicates which of its arguments is the maximum. A function called argmax comes up later.
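A minimal sketch of softmax next to argmax (shifting by the max is the usual trick to avoid overflow and does not change the result):

```python
import numpy as np

def softmax(x):
    # subtract the max for numerical stability, then normalize
    x = x - np.max(x)
    return np.exp(x) / np.sum(np.exp(x))

a = np.array([0.3, 2.9, 4.0])
print(softmax(a))       # probabilities summing to 1
print(np.argmax(a))     # index of the hard maximum: 2
```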
MNIST (Modified National Institute of Standards and Technology database) ("em-nist"): NIST is the National Institute of Standards and Technology, a US government agency that sets standards for technology and industry. So it is a modified version of a database created by NIST?
Loss function
The loss function is an indicator of how "bad" the neural network's performance is. It expresses how poorly the current network fits the training (teacher) data, that is, how much it fails to match it.
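One common loss function is the cross-entropy error; a minimal sketch of my own, assuming a one-hot teacher label t:

```python
import numpy as np

def cross_entropy_error(y, t):
    # y: network output (probabilities), t: one-hot teacher label
    # a tiny delta avoids log(0)
    delta = 1e-7
    return -np.sum(t * np.log(y + delta))

t = np.array([0, 0, 1, 0])            # correct class is index 2
y = np.array([0.1, 0.1, 0.7, 0.1])    # a fairly good prediction, so the loss is small
print(cross_entropy_error(y, t))
```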
Gradient
Learning rate $\eta$ (eta) → How to read Greek letters
$\frac{dy}{dx}$: "dee y, dee x"
$\frac{\partial f}{\partial x}$: "del f, del x"
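To see how the learning rate $\eta$ and the gradient fit together, here is a sketch of my own (lr stands in for $\eta$): a numerical gradient by central differences and a plain gradient descent loop.

```python
import numpy as np

def numerical_gradient(f, x):
    # central-difference approximation of the gradient
    h = 1e-4
    grad = np.zeros_like(x)
    for i in range(x.size):
        tmp = x[i]
        x[i] = tmp + h
        fxh1 = f(x)
        x[i] = tmp - h
        fxh2 = f(x)
        grad[i] = (fxh1 - fxh2) / (2 * h)
        x[i] = tmp
    return grad

def gradient_descent(f, init_x, lr=0.01, step_num=100):
    # lr is the learning rate eta: how far to step against the gradient each time
    x = init_x
    for _ in range(step_num):
        x -= lr * numerical_gradient(f, x)
    return x

f = lambda x: x[0] ** 2 + x[1] ** 2
print(gradient_descent(f, np.array([3.0, 4.0]), lr=0.1))   # approaches (0, 0)
```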
layer: a layer, something laid down flat and stacked up.
Affine: there are related terms such as affine transformation and affine space.
SGD (Stochastic Gradient Descent): abbreviations like this that you simply spell out letter by letter are easy to deal with. One of the methods for finding the optimal parameters. Rather than "stochastic" in the probability sense, it feels like randomly poking at the data and looking for a direction where the gradient goes down.
Momentum: "momentum", as in physics. One of the methods for finding the optimal parameters.
AdaGrad ("ada-grad"?): that is what you get from stringing the original words together. Adaptive Subgradient Methods? One of the methods for finding the optimal parameters.
Adam: a fusion of AdaGrad and Momentum? One of the methods for finding the optimal parameters.
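A rough sketch of the update rules for SGD and Momentum, assuming params and grads are dicts of NumPy arrays keyed by parameter name (that layout is my assumption here):

```python
import numpy as np

class SGD:
    def __init__(self, lr=0.01):
        self.lr = lr

    def update(self, params, grads):
        # plain stochastic gradient descent: step against the gradient
        for key in params:
            params[key] -= self.lr * grads[key]

class Momentum:
    def __init__(self, lr=0.01, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.v = None

    def update(self, params, grads):
        # keep a velocity v; like a rolling ball, it accumulates "momentum"
        if self.v is None:
            self.v = {key: np.zeros_like(val) for key, val in params.items()}
        for key in params:
            self.v[key] = self.momentum * self.v[key] - self.lr * grads[key]
            params[key] += self.v[key]
```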
The Gaussian distribution is also called the normal distribution. Among normal distributions, the one with $\mu$ (mu, the mean) $= 0$ and $\sigma^2$ (sigma squared, the variance) $= 1$ is called the standard normal distribution, and samples from it can be generated with the randn method of NumPy's random module.
np.random.randn(node_num, node_num)
Also, the normal method, unlike randn, lets you generate an array of random numbers following a normal distribution with an arbitrarily specified $\mu$ (mean) and $\sigma$ (sigma, standard deviation).
np.random.normal(loc=0, scale=0.01, size=(node_num, node_num))
However, P178 uses the randn method instead of the normal method.
P178: "So far, for the initial values of the weights we have used small values such as 0.01 * np.random.randn(10, 100), that is, values drawn from a Gaussian distribution and scaled by 0.01: a Gaussian distribution with a standard deviation of 0.01."
I'm not familiar with this area, but when a normally distributed $X$ is standardized into the standard normal distribution $Z$,
apparently $Z = \frac{X - \mu}{\sigma}$, so with mean $\mu = 0$ and standard deviation $\sigma = 0.01$,
$Z = \frac{X}{0.01}$, in other words $X = 0.01Z$, which is exactly what 0.01 * np.random.randn(...) computes.
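So the two calls should give statistically equivalent samples; a quick check of my own:

```python
import numpy as np

node_num = 100
a = 0.01 * np.random.randn(node_num, node_num)                      # randn scaled by 0.01
b = np.random.normal(loc=0, scale=0.01, size=(node_num, node_num))  # sigma = 0.01 directly

print(a.std(), b.std())   # both should be close to 0.01
```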
P182: "The distribution of activations in each layer should have a moderate spread. This is because the network can learn efficiently when moderately diverse data flows through each layer. Conversely, when biased data flows, learning may fail due to problems such as vanishing gradients and 'limited expressiveness'."
Expressiveness? I didn't understand the meaning of the word "expressiveness" that suddenly appeared, so I searched for some references that might help.
Expressiveness is a concept describing the size of the set of functions that a machine learning model can approximate. → Mathematical analysis and design based on the expressive power of deep neural networks with residual skip connections
One of the most striking facts about neural networks is that they can represent arbitrary functions. For example, suppose someone hands you some complicated, wavy function f(x): whatever that function is, there exists a neural network whose output is f(x) (or an approximation of it) for every possible input x. → Visual proof that neural networks can represent arbitrary functions
Even a three-layer perceptron can approximate any function to arbitrary precision if the number of units in the intermediate layer is increased without bound. So why is deeper better? Because expressive power grows exponentially with depth. → Machine learning technology and its mathematical foundations
Xavier initial value: from the researcher's name, Xavier Glorot. Xavier? Like the missionary?
He initial value: from the researcher's name, Kaiming He. "He"? Who is that?
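A minimal sketch of the two initial values, assuming node_num is the number of nodes in the previous layer:

```python
import numpy as np

node_num = 100   # number of nodes in the previous layer

# Xavier initial value: standard deviation 1/sqrt(n), suited to sigmoid/tanh
w_xavier = np.random.randn(node_num, node_num) / np.sqrt(node_num)

# He initial value: standard deviation sqrt(2/n), suited to ReLU
w_he = np.random.randn(node_num, node_num) * np.sqrt(2.0 / node_num)
```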
Hyperbolic tangent (tanh): a hyperbolic function.
Batch Normalization: normalization performed per mini-batch, in mini-batch units, during training.
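What "normalize per mini-batch" means, as a bare forward-pass sketch; gamma and beta are the learned scale and shift, and eps is just my small constant to avoid division by zero:

```python
import numpy as np

def batch_norm_forward(x, gamma=1.0, beta=0.0, eps=1e-7):
    # x: a mini-batch of shape (batch_size, features)
    mu = x.mean(axis=0)                      # per-feature mean over the mini-batch
    var = x.var(axis=0)                      # per-feature variance over the mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)    # mean 0, variance 1 within the batch
    return gamma * x_hat + beta              # then scale and shift
```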
robust: (of a person, physique, etc.) sturdy, strong, solid.
P259: "Robustness here means, for example, that a neural network has the robust property that its output does not change even if small noise is added to the input image. Thanks to such robustness, even when the data flowing through the network is 'degraded', the effect on the output result can be considered small."
Overfitting: the state of having learned to fit only the training data.
Weight decay (load decay): a technique that has long been used to suppress overfitting.
Norm: a quantity representing the "size" of various things. The p-th norm of an M-dimensional vector x is defined as follows.
∥x∥_p = \left( \sum^M_{i=1} |x_i|^p \right)^{1/p} = \sqrt[p]{|x_1|^p + \cdots + |x_M|^p}
The L2 norm is the p = 2 case for an M-dimensional vector x:
∥x∥_2 = \left( \sum^M_{i=1} |x_i|^2 \right)^{1/2} = \sqrt{|x_1|^2 + \cdots + |x_M|^2}
In two dimensions this becomes:
∥x∥_2 = \sqrt{x_1^2 + x_2^2}
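Checking these norm formulas with NumPy (my own example):

```python
import numpy as np

x = np.array([3.0, 4.0])
print(np.linalg.norm(x, ord=2))            # L2 norm: sqrt(3^2 + 4^2) = 5.0
print(np.linalg.norm(x, ord=1))            # L1 norm: |3| + |4| = 7.0
print((np.abs(x) ** 3).sum() ** (1 / 3))   # p = 3 norm straight from the definition
```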
Looking at the two-dimensional formula through the Pythagorean theorem, it is the length of the hypotenuse of a right triangle. Squaring both sides to remove the square root gives
r^2 = x^2 + y^2
which is the equation of a circle of radius r centered at the origin.
So, if the weights are $W$, the L2-norm weight decay is $\frac{1}{2}\lambda W^2$, and this $\frac{1}{2}\lambda W^2$ is added to the loss function.
In other words, $0.5 \lambda (w_1^2 + \cdots + w_M^2)$ is added to the loss function. My mental image: a circle is drawn around the origin of the loss function (the point where overfitting happens), and learning heads toward it but cannot get inside the circle? When exceptional data is mixed into the training data and the weights w grow large trying to fit the exceptions, the circle also grows and learning gets stuck; when the exceptional data is ignored and the weights w stay small, the circle also stays small and learning does not get stuck?
$\lambda$ (lambda): the strength of the weight decay (L2 norm) penalty.
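A minimal sketch of adding the penalty to the loss; lambda_ and weight_list are my own names:

```python
import numpy as np

def loss_with_weight_decay(base_loss, weight_list, lambda_=0.1):
    # add 0.5 * lambda * ||W||^2 summed over all weight matrices
    penalty = 0.0
    for W in weight_list:
        penalty += 0.5 * lambda_ * np.sum(W ** 2)
    return base_loss + penalty
```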
Dropout: a method of learning while randomly erasing neurons, in order to suppress overfitting.
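A bare sketch of the idea, with dropout_ratio as the fraction of neurons to erase (same spirit as the book's Dropout class, but simplified):

```python
import numpy as np

class Dropout:
    def __init__(self, dropout_ratio=0.5):
        self.dropout_ratio = dropout_ratio
        self.mask = None

    def forward(self, x, train_flg=True):
        if train_flg:
            # randomly erase neurons during training
            self.mask = np.random.rand(*x.shape) > self.dropout_ratio
            return x * self.mask
        # at test time, scale the outputs instead of dropping
        return x * (1.0 - self.dropout_ratio)
```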
Hyperparameters: the number of neurons in each layer, the batch size, the learning rate used when updating parameters, weight decay, and so on.
CS231n: Convolutional Neural Networks for Visual Recognition
CNN (convolutional neural network): another abbreviation that you just spell out letter by letter, which makes it easy. "Convolutional" means rolled up, swirling.
Convolution: a swirl; the feeling of a filter sweeping around over the input data?
pad "Filling for shaping" Fill the area around the input data with 0s to shape it.
stride: to step over, straddle; the interval by which the filter is moved each time.
Pooling: "shared use" in ordinary English; here, an operation that shrinks the vertical and horizontal size of the data.
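How pad and stride decide the output size; the usual formula is OH = (H + 2P - FH) / S + 1, and here is a small check of my own:

```python
def conv_output_size(input_size, filter_size, stride=1, pad=0):
    # output size of a convolution along one axis
    return (input_size + 2 * pad - filter_size) // stride + 1

print(conv_output_size(28, 5, stride=1, pad=0))   # 24
print(conv_output_size(28, 5, stride=1, pad=2))   # 28 (the size is preserved)
```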
im2col: an abbreviation of "image to column", so "im-two-col"? It converts an image into a matrix.
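A naive, loop-based version of my own, just to see what the "image to column" conversion does (the book's im2col is faster and also handles batches and padding):

```python
import numpy as np

def im2col_simple(x, filter_h, filter_w, stride=1):
    # x: one image of shape (C, H, W); returns a matrix whose rows are
    # the flattened patches the filter sees, so convolution becomes a matmul
    C, H, W = x.shape
    out_h = (H - filter_h) // stride + 1
    out_w = (W - filter_w) // stride + 1
    cols = []
    for i in range(out_h):
        for j in range(out_w):
            patch = x[:, i * stride:i * stride + filter_h,
                         j * stride:j * stride + filter_w]
            cols.append(patch.reshape(-1))
    return np.array(cols)   # shape: (out_h * out_w, C * filter_h * filter_w)
```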
LeNet ("luh-net"?): reading "Le" the French way. A convolutional neural network created by the Frenchman Yann LeCun.
AlexNet ("alex-net"): a convolutional neural network created by Alex Krizhevsky.
How to read Greek letters
Qiita Formula Cheat Sheet
[Machine learning] What is the Lp norm?