Start studying: Saturday, December 7th
Teaching materials, etc.:
・Miyuki Oshige, "Details! Python 3 Introductory Note" (Sotec, 2017): read 12/7 (Sat) - 12/19 (Thu)
・Progate Python course (5 courses in total): finished 12/19 (Thu) - 12/21 (Sat)
・Andreas C. Müller and Sarah Guido, "(Japanese title) Machine Learning Starting with Python" (O'Reilly Japan, 2017): 12/21 (Sat) - 12/23 (Mon)
・Kaggle "Real or Not? NLP with Disaster Tweets": submission and tuning 12/28 (Sat) - 1/3 (Fri)
・Wes McKinney, "(Japanese title) Introduction to Data Analysis with Python" (O'Reilly Japan, 2018): read 1/4 (Sat) - 1/13 (Mon)
・**Yasuki Saito, "Deep Learning from Zero" (O'Reilly Japan, 2016): 1/15 (Wed) -**
I started reading this yesterday because I wanted a better understanding of neural networks and deep learning, which brought about a breakthrough in AI research. Read up to p. 122, the end of Chapter 4 (training of neural networks).
・Chapter 1 is basically a review of what I have already covered (Python overview, environment setup, arithmetic operations, etc.): a chapter that outlines the knowledge needed to keep reading this book. Below I note only the parts I was slightly unsure about.
- bool: a type that takes either True or False. The operators and, or, and not can be used with it.
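A quick check in the interpreter (my own example, not from the book):

```python
hungry = True
sleepy = False
print(type(hungry))        # <class 'bool'>
print(not hungry)          # False
print(hungry and sleepy)   # False
print(hungry or sleepy)    # True
```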
Class definition:

```python
class ClassName:
    def __init__(self, argument, ...):   # constructor
        ...

    def method_name_1(self, argument, ...):   # method 1
        ...

    def method_name_2(self, argument, ...):   # method 2
        ...
```

The constructor is also called an initialization method.
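A concrete (made-up) example following this template:

```python
class Man:
    """Minimal example class: one constructor and two methods."""

    def __init__(self, name):   # constructor: runs when the instance is created
        self.name = name

    def hello(self):            # method 1
        print("Hello " + self.name + "!")

    def goodbye(self):          # method 2
        print("Good-bye " + self.name + "!")

m = Man("David")
m.hello()     # Hello David!
m.goodbye()   # Good-bye David!
```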
・The perceptron is an algorithm with a history of more than 60 years (proposed in 1957). It is the origin of neural networks (deep learning).
- The perceptron receives multiple signals as inputs and outputs a single signal. A perceptron signal is binary: it either "flows" or "does not flow" (1 or 0). Each input signal has its own weight, and the larger the weight, the more important the corresponding signal; the weight is a value that controls the importance of the input signal.
- In addition to "input", "weight", and "output", there is an element called the "bias". The bias is a parameter that adjusts how easily the neuron fires (outputs 1).
- Since a perceptron draws a linear boundary, it cannot represent classifications such as exclusive OR (XOR). (A limitation of the perceptron.) However, perceptrons can be "stacked in layers", and stacking them makes it possible to represent nonlinear boundaries, as in the sketch below. (So, to be precise, this is a limitation of the "single-layer" perceptron.)
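A small sketch of the "stacking layers" point; the specific weight and bias values are just one hand-picked choice that happens to work:

```python
import numpy as np

def AND(x1, x2):
    x, w, b = np.array([x1, x2]), np.array([0.5, 0.5]), -0.7
    return int(np.sum(w * x) + b > 0)

def NAND(x1, x2):
    x, w, b = np.array([x1, x2]), np.array([-0.5, -0.5]), 0.7
    return int(np.sum(w * x) + b > 0)

def OR(x1, x2):
    x, w, b = np.array([x1, x2]), np.array([0.5, 0.5]), -0.2
    return int(np.sum(w * x) + b > 0)

def XOR(x1, x2):
    # XOR cannot be built from a single perceptron, but it can be built
    # by stacking them: XOR(x1, x2) = AND(NAND(x1, x2), OR(x1, x2)).
    return AND(NAND(x1, x2), OR(x1, x2))

for pair in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(pair, XOR(*pair))   # 0, 1, 1, 0
```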
- In theory, a perceptron has the potential to express anything a computer can, but the work of finding the appropriate weights that produce the expected inputs and outputs must be done by hand. Neural networks are one means of solving this problem: they have the property of being able to learn appropriate weight parameters automatically from data.
- In a neural network (perceptron), the inputs are multiplied by their weights and the bias is added; the sum of these input signals is then transformed by an **activation function** and output.
- An activation function that switches its output at a threshold is called a step function (or staircase function). The perceptron uses a step function.
- In neural networks, the **sigmoid function** is used as an activation function. Compared with the step function, the sigmoid is a smooth curve: the output changes continuously with the input. This smoothness is essential for training neural networks. What the two functions share is that they output a large value when the input signal is important and a small value when it is not, and the output is squashed between 0 and 1 no matter how small or large the input becomes.
- Recently, the **ReLU (Rectified Linear Unit)** is also often used in addition to the sigmoid function.
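A sketch of the three activation functions mentioned so far, written so they accept NumPy arrays:

```python
import numpy as np

def step(x):
    # Outputs 1 where x > 0, otherwise 0.
    return (x > 0).astype(np.int64)

def sigmoid(x):
    # Smooth curve squashed between 0 and 1.
    return 1 / (1 + np.exp(-x))

def relu(x):
    # 0 for negative inputs, the input itself for positive inputs.
    return np.maximum(0, x)

x = np.array([-1.0, 0.5, 2.0])
print(step(x))      # [0 1 1]
print(sigmoid(x))   # [0.269 0.622 0.881]
print(relu(x))      # [0.  0.5 2. ]
```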
- The activation function used in the final output layer must be chosen according to the task. In general, the softmax function is used for classification problems (guessing which class an input belongs to) and the identity function for regression problems (predicting a numerical value).
・The identity function outputs the value as it is. The softmax function has the property that its output can be interpreted **probabilistically, and the outputs sum to 1**. (That is, if a = 0.2, b = 0.5, c = 0.3, the probability of a is 20%, b is 50%, and c is 30%.)
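A sketch of both output-layer functions; subtracting the maximum inside softmax is a common overflow countermeasure and does not change the result:

```python
import numpy as np

def identity_function(x):
    # Regression: output the value as it is.
    return x

def softmax(x):
    # Classification: exponentiate and normalize so the outputs sum to 1.
    x = x - np.max(x)   # overflow countermeasure
    return np.exp(x) / np.sum(np.exp(x))

a = np.array([0.3, 2.9, 4.0])
y = softmax(a)
print(y)          # e.g. [0.018 0.245 0.737]
print(y.sum())    # 1.0
```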
- In classification, the number of neurons in the output layer is generally set to the number of classes to be classified. (Set to 10 for the problem of guessing which digit from 0 to 9 an input belongs to.)
- A bundle of input data processed together is called a batch. Performing inference in batch units speeds up the computation.
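A sketch of batch-wise inference; the data and the stand-in predict function here are made up purely for illustration:

```python
import numpy as np

# A toy stand-in for a trained network: one linear layer with random weights.
rng = np.random.default_rng(0)
W = rng.normal(size=(784, 10))

def predict(x):
    return x @ W   # class scores for each row of x

x = rng.normal(size=(10000, 784))      # dummy input data
t = rng.integers(0, 10, size=10000)    # dummy labels

batch_size = 100
correct = 0
for i in range(0, len(x), batch_size):
    y_batch = predict(x[i:i + batch_size])       # score 100 inputs at once
    p = np.argmax(y_batch, axis=1)               # predicted class per row
    correct += np.sum(p == t[i:i + batch_size])

print("accuracy:", correct / len(x))   # about 0.1 for random weights
```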
-"Learning" refers to automatically acquiring the optimum weight parameter value from the training data. To enable this learning, ** introduce an index called "loss function". The purpose of learning is to find the weight parameter that has the smallest value based on the loss function.
- Neural networks (deep learning) simply learn from the given data and try to discover patterns in it. Regardless of the target problem, the data can be learned "end to end", using the raw data as it is.
- **Mean squared error**: the best-known loss function. Square the difference between each element of the network's output and the corresponding element of the teacher (correct) data, then sum them up.
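A minimal sketch of this loss as I understand it (here simply averaging the squared differences):

```python
import numpy as np

def mean_squared_error(y, t):
    # y: network output, t: teacher data (one-hot); average of squared differences.
    return np.mean((y - t) ** 2)

t = np.array([0, 0, 1, 0, 0])                # correct class is index 2
y = np.array([0.1, 0.05, 0.6, 0.2, 0.05])    # network output
print(mean_squared_error(y, t))
```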
- **Cross-entropy error**: the most commonly used loss function after the above. Take the natural logarithm of each output, multiply by the corresponding correct label, sum, and negate. Since the correct label is a one-hot representation (0 or 1), in practice this amounts to computing only the (negative) natural logarithm of the output for the correct class.
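A minimal sketch, with a small constant added inside the log to avoid log(0):

```python
import numpy as np

def cross_entropy_error(y, t):
    # t is one-hot, so only the log of the output at the correct class survives.
    delta = 1e-7
    return -np.sum(t * np.log(y + delta))

t = np.array([0, 0, 1, 0, 0])
y = np.array([0.1, 0.05, 0.6, 0.2, 0.05])
print(cross_entropy_error(y, t))   # -log(0.6) ≈ 0.51
```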
- Since computing the loss function over all the data takes time, the basic approach is to take out a small chunk called a mini-batch and train on each mini-batch. (I think I saw a similar idea, sampling, while studying statistics.)
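A sketch of drawing a mini-batch with np.random.choice (the data here is dummy data):

```python
import numpy as np

train_size = 10000
batch_size = 100
x_train = np.random.rand(train_size, 784)           # dummy training data
t_train = np.random.randint(0, 10, train_size)       # dummy labels

batch_mask = np.random.choice(train_size, batch_size) # 100 random indices
x_batch = x_train[batch_mask]
t_batch = t_train[batch_mask]
print(x_batch.shape, t_batch.shape)   # (100, 784) (100,)
```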
- The goal of training is to calculate how the loss function changes when a weight parameter is changed slightly, and to move toward where the loss becomes smaller. This is where the idea of **differentiation (gradients)** comes in. The fact that the derivative of the sigmoid function never becomes 0 anywhere is an important property for this kind of learning.
・A derivative is the amount of change at a given instant. Computing the derivative from a small difference is called **numerical differentiation**, and this is what is mainly used here. A small difference of h = 1e-4 (10 to the -4th power, 0.0001) is said to work well. Computing the derivative by expanding the expression mathematically is called analytical differentiation.
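A sketch using the central-difference form of numerical differentiation:

```python
def numerical_diff(f, x):
    # Central difference: (f(x + h) - f(x - h)) / (2 * h) with a small h.
    h = 1e-4   # 0.0001
    return (f(x + h) - f(x - h)) / (2 * h)

def f(x):
    return 0.01 * x ** 2 + 0.1 * x

print(numerical_diff(f, 5))   # ≈ 0.2 (analytical derivative is 0.02x + 0.1)
```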
- The partial derivatives of all the variables collected into a vector is called the **gradient**. At each location the gradient indicates the direction in which the function's value increases the most, so moving in the opposite (negative gradient) direction reduces the value the most; the method of finding the minimum of a function by making good use of this is called the gradient (descent) method.
- When the gradient method is written as a formula, an amount η appears that determines how much to learn in one step, i.e. how much to update the parameters: x ← x − η·(∂f/∂x). This η is called the **learning rate**.
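A sketch combining a numerical gradient with the update x ← x − η·∂f/∂x, on a toy function with a hand-picked learning rate:

```python
import numpy as np

def numerical_gradient(f, x):
    # Partial derivative of f with respect to each element of x, via central differences.
    h = 1e-4
    grad = np.zeros_like(x)
    for i in range(x.size):
        tmp = x[i]
        x[i] = tmp + h
        fxh1 = f(x)
        x[i] = tmp - h
        fxh2 = f(x)
        grad[i] = (fxh1 - fxh2) / (2 * h)
        x[i] = tmp
    return grad

def gradient_descent(f, init_x, lr=0.1, step_num=100):
    # Repeatedly move against the gradient: x <- x - lr * grad
    x = init_x
    for _ in range(step_num):
        x -= lr * numerical_gradient(f, x)
    return x

def f(x):
    return x[0] ** 2 + x[1] ** 2   # minimum at (0, 0)

print(gradient_descent(f, np.array([-3.0, 4.0])))   # ≈ [0, 0]
```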
- Parameters such as the learning rate are called hyperparameters. Unlike the weights and biases, which a neural network can learn by itself, hyperparameters must be set by hand.
・The four steps of neural network training:
1. Pick a mini-batch and compute its loss.
2. Compute the gradient of the loss with respect to each weight parameter; the gradient indicates how to reduce the loss.
3. Update the weight parameters slightly in the (negative) gradient direction.
4. Repeat steps 1 to 3.
The above procedure is called **stochastic gradient descent (SGD)**.
- By plotting the loss function against the iteration count (number of repetitions), the transition of the loss (i.e. the learning progress) can be visualized.
・**Epoch**: one epoch is a unit corresponding to the number of iterations needed to use up all the training data once. If you train with mini-batches of 100 on 10,000 data points, you will have seen all the training data after 100 iterations; in other words, 100 iterations = 1 epoch.
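To tie steps 1 to 4 together, a sketch of the whole loop on a toy linear-regression problem rather than a neural network; everything here is made up, just to show the mini-batch / update / epoch structure:

```python
import numpy as np

rng = np.random.default_rng(0)
train_size, batch_size, lr = 10000, 100, 0.1
iters_per_epoch = train_size // batch_size   # 100 iterations = 1 epoch here

x_train = rng.normal(size=(train_size, 2))
t_train = x_train @ np.array([2.0, -3.0]) + 1.0   # true weights/bias to recover
W, b = np.zeros(2), 0.0
loss_list = []                                    # for plotting loss vs. iteration

for i in range(3 * iters_per_epoch):              # train for 3 epochs
    # step 1: pick a mini-batch
    mask = rng.choice(train_size, batch_size)
    x, t = x_train[mask], t_train[mask]
    # step 2: compute the loss (mean squared error) and its gradient
    y = x @ W + b
    loss = np.mean((y - t) ** 2)
    grad_W = 2 * x.T @ (y - t) / batch_size
    grad_b = 2 * np.mean(y - t)
    # step 3: update the parameters against the gradient
    W -= lr * grad_W
    b -= lr * grad_b
    loss_list.append(loss)
    if (i + 1) % iters_per_epoch == 0:
        print(f"epoch {(i + 1) // iters_per_epoch}: loss = {loss:.4f}")

print(W, b)   # should approach [2, -3] and 1
```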