TL;DR
I have explained that deep learning is a multi-layered neural network, but in reality it is now common to use more complicated structures such as convolutions, recurrence, and self-attention rather than simply stacking layers. In this post I would like to briefly explain machine learning in the context of supervised learning, and then explain which component corresponds to deep learning.
Supervised machine learning can be formulated as the problem of estimating the parameters $\boldsymbol{w'}$ that minimize the error function $E$ between the output $\boldsymbol{y}$, obtained from an input $\boldsymbol{x}$ through a mapping function $f$ with parameters $\boldsymbol{w}$, and the true answer $\boldsymbol{t}$.
\boldsymbol{y} = f(\boldsymbol{w},\boldsymbol{x}) \\
E(\boldsymbol{y}, \boldsymbol{t}) ← we want the parameters \boldsymbol{w'} that minimize this
The input values $\boldsymbol{x}$ and the target output values $\boldsymbol{t}$ must be prepared by humans as data [^1]. Assuming the supervised machine learning task of identifying which digit (0-9) appears in a handwritten character image, $\boldsymbol{x}$ is a vector of the brightness values of each pixel in the image, and $\boldsymbol{t}$ is a vector that is 1 only at the index corresponding to the digit in the image and 0 everywhere else.
***Example) Input $\boldsymbol{x}$ and target output $\boldsymbol{t}$ when the digit in the handwritten character image is 5***
\boldsymbol{x} =
\begin{bmatrix}
0.5 \\
0.3 \\
0.2 \\
0.5 \\
...
\end{bmatrix} ← has as many dimensions as there are pixels, \,
\boldsymbol{t} =
\begin{bmatrix}
0 \\
0 \\
0 \\
0 \\
0 \\
1 \\
0 \\
0 \\
0 \\
0 \\
\end{bmatrix} ← only the 6th element is 1
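As a concrete illustration (a sketch of my own, not part of the original post; the image values here are random stand-ins for a real scan), this is how $\boldsymbol{x}$ and $\boldsymbol{t}$ could be built for a 28x28 image of a handwritten 5:

```python
import numpy as np

image = np.random.rand(28, 28).astype(np.float32)  # stand-in for a real 28x28 image
x = image.reshape(-1)        # one brightness value per pixel -> 784 dimensions

digit = 5
t = np.zeros(10, dtype=np.float32)
t[digit] = 1.0               # only the element for the digit (the 6th) is 1

print(x.shape)               # (784,)
print(t)                     # [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
```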
How do we find the parameters $\boldsymbol{w'}$? There are various optimization techniques, but in the context of deep learning, (stochastic) gradient descent (or one of its variants) is commonly used. In gradient descent, the error function $E$ is differentiated with respect to the current parameters $\boldsymbol{w}^{(t)}$, and the parameters $\boldsymbol{w}$ are adjusted a little in the downhill direction given by the resulting gradient. By repeating this step, we find the parameters that minimize the error.
\boldsymbol{w}^{(t+1)} = \boldsymbol{w}^{(t)} - \epsilon \frac{\partial E(\boldsymbol{y}, \boldsymbol{t})}{\partial \boldsymbol{w}^{(t)}}, \quad \epsilon: learning rate, \, t: step number
To get an intuition for what this is doing, suppose the error function $E$ is a quadratic or a cubic function, and let's see what value the parameter $w$ (initial value: 1.0) eventually converges to.
***Example) Quadratic function ($E(w) = w^2$) (vertical axis: $E(w)$, horizontal axis: $w$, $w^{(0)}$: 1.0, number of steps: 25000)***
***Example) Cubic function ($E(w) = w^3 - 0.75w + 1$) (vertical axis: $E(w)$, horizontal axis: $w$, $w^{(0)}$: 1.0, number of steps: 25000)***
In the quadratic example, $w$ converges near the minimum at 0. In the cubic example, on the other hand, moving further to the left would reach even smaller values of $E$, yet $w$ converges in a shallow dip partway down [^2]. As the name gradient descent suggests, the value of $w^{(t+1)}$ is determined from the gradient, so in that dip the slope is 0 and $w^{(t+1)} = w^{(t)} + 0$.
\frac{\partial E(w^{(t)})}{\partial w^{(t)}} ← this becomes 0
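The behavior described above can be reproduced in a few lines of Python (a minimal sketch of my own; $w^{(0)} = 1.0$ and 25000 steps follow the examples above, while the learning rate of 0.01 is an arbitrary choice not stated in the post):

```python
def grad_descent(dE_dw, w=1.0, lr=0.01, steps=25000):
    for _ in range(steps):
        w = w - lr * dE_dw(w)   # w^(t+1) = w^(t) - eps * dE/dw
    return w

# Quadratic: E(w) = w^2, dE/dw = 2w -> converges near the minimum at w = 0
print(grad_descent(lambda w: 2 * w))

# Cubic: E(w) = w^3 - 0.75w + 1, dE/dw = 3w^2 - 0.75
# Starting from w = 1.0 it settles in the dip at w = 0.5 (where the slope is 0),
# even though E keeps decreasing further to the left.
print(grad_descent(lambda w: 3 * w ** 2 - 0.75))
```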
There are various ways to avoid this, such as the momentum method, and methods that speed up learning [^3], such as Adam and AdaGrad (if you are interested in this area, please google). Finding (learning) the parameters $\boldsymbol{w'}$ that minimize the error is synonymous with obtaining an output $\boldsymbol{y} = f(\boldsymbol{w'}, \boldsymbol{x})$ that is close to the target output $\boldsymbol{t}$.
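For reference, here is a minimal sketch of the momentum update (my own illustration; the coefficient 0.9 is a commonly used value, not something specified in this post). Carrying over a fraction of the previous update can help push through regions where the gradient is close to 0:

```python
def momentum_descent(dE_dw, w=1.0, lr=0.01, mu=0.9, steps=25000):
    v = 0.0
    for _ in range(steps):
        v = mu * v - lr * dE_dw(w)  # keep a fraction of the previous update
        w = w + v
    return w
```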
By the way, wasn't this supposed to be a story about deep learning? Yes. Deep learning is the $f$ (although $f$ does not have to be deep learning). In other words, it is a function that maps the input $\boldsymbol{x}$ to the output $\boldsymbol{y}$. "Well, then are quadratic and cubic functions also deep learning?" To all the enthusiastic readers who thought that: no, they are not. What differs is the structure of the function. Deep learning is a composite function [^4] with nesting piled upon nesting, and it has a huge number of parameters $\boldsymbol{W}^{(1)}, ..., \boldsymbol{W}^{(n)}$ that multiply the output of each nested function. In addition, differentiable nonlinear functions such as ReLU, sigmoid, and tanh [^5] are used as the nested functions $a^{(1)}, ..., a^{(n)}$. [^6]
\boldsymbol{y} = f(\boldsymbol{x}) = \boldsymbol{W}^{(n)} a^{(n)}(\boldsymbol{W}^{(n-1)} a^{(n-1)}(\cdots(\boldsymbol{W}^{(1)} a^{(1)}(\boldsymbol{x}))\cdots))
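To make the shape of this composite function concrete, here is a minimal NumPy sketch (my own illustration; the layer sizes are arbitrary and the weights are random rather than learned, and the innermost nonlinearity is applied directly to $\boldsymbol{x}$ as in the formula above):

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def forward(x, Ws, acts):
    h = x
    for W, a in zip(Ws, acts):
        h = W @ a(h)   # multiply the output of each nested function by W^(k)
    return h

rng = np.random.default_rng(0)
x = rng.random(784)                       # e.g. a flattened 28x28 image
Ws = [rng.standard_normal((100, 784)),    # W^(1)
      rng.standard_normal((100, 100)),    # W^(2)
      rng.standard_normal((10, 100))]     # W^(3)
acts = [relu, relu, np.tanh]              # a^(1), a^(2), a^(3)

y = forward(x, Ws, acts)
print(y.shape)                            # (10,)
```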
You may be wondering why deep learning with such a structure achieves higher accuracy than other machine learning methods; in fact, this has not been rigorously proven mathematically [^7]. However, there is a paper showing that, for networks with sufficient depth (three or more layers) and nonlinear activations, local optima are almost as good as the global optimum [^8]. What this means is that no matter what the initial parameter values are, if training converges, the converged parameters achieve accuracy almost as good as the global optimum. Very happy news. Deep learning, great!!!
Now, good news for everyone who is tired of these not-so-rigorous explanations with mathematical formulas. Next, I will explain with (equally non-rigorous) code. The code pasted below uses a three-layer neural network to train on and evaluate the handwritten character dataset MNIST. See the code for the detailed explanation. As a point of interest, I am using Chainer [^9], a DNN framework developed in Japan; a neural network that would make my head spin if I tried to write it from scratch can, strangely enough, be written in less than 100 lines with Chainer (the same can be achieved with TensorFlow and PyTorch). This is commoditization.
train_test_mnist.py
```python
##-------------------------------------------
## Library import
##-------------------------------------------
import numpy as np
import chainer
from chainer import Chain, Variable
import chainer.functions as F
import chainer.links as L


##-------------------------------------------
## Model (3-layer NN) definition
##-------------------------------------------
class NN(chainer.Chain):
    def __init__(self, n_units, n_out):
        super(NN, self).__init__()
        with self.init_scope():
            self.l1 = L.Linear(None, n_units)
            self.l2 = L.Linear(None, n_units)
            self.l3 = L.Linear(None, n_out)

    # Forward computation
    def __call__(self, x):
        h1 = F.relu(self.l1(x))        # activation function: relu
        h2 = F.relu(self.l2(h1))       # activation function: relu
        return F.sigmoid(self.l3(h2))  # activation function: sigmoid


##-------------------------------------------
## Function that returns the accuracy and the loss value
##-------------------------------------------
def compute_accuracy_loss(model, xs, ts):
    ys = model(xs)
    loss = F.softmax_cross_entropy(ys, ts)  # error function: cross entropy
    ys = np.argmax(ys.data, axis=1)
    cors = (ys == ts)
    num_cors = sum(cors)
    accuracy = num_cors / ts.shape[0]
    return accuracy, loss


def main():
    ##-------------------------------------------
    ## Hyperparameters and other settings
    ##-------------------------------------------
    n_units = 100  # number of units in the hidden layers
    n_out = 10     # dimension of the output vector (10 classes: 0-9)
    n_batch = 100  # batch size
    n_epoch = 100  # number of epochs

    ##-------------------------------------------
    ## Model setup
    ##-------------------------------------------
    model = NN(n_units, n_out)
    opt = chainer.optimizers.Adam()  # optimization method: Adam
    opt.setup(model)

    ##-------------------------------------------
    ## Data preparation
    ##-------------------------------------------
    train, test = chainer.datasets.get_mnist()
    xs, ts = train._datasets    # training data
    txs, tts = test._datasets   # test data

    ##-------------------------------------------
    ## Training
    ##-------------------------------------------
    for i in range(n_epoch):
        for j in range(600):  # 600 batches x 100 images = 60,000 training images
            model.zerograds()
            x = xs[(j * n_batch):((j + 1) * n_batch)]
            t = ts[(j * n_batch):((j + 1) * n_batch)]
            t = Variable(np.array(t, "i"))
            y = model(x)
            loss = F.softmax_cross_entropy(y, t)  # error calculation
            loss.backward()                       # gradient calculation
            opt.update()                          # parameter update

        ##-------------------------------------------
        ## Test
        ##-------------------------------------------
        acc_train, loss_train = compute_accuracy_loss(model, xs, ts)
        acc_test, _ = compute_accuracy_loss(model, txs, tts)
        print("Epoch %d loss(train) = %f, accuracy(train) = %f, accuracy(test) = %f"
              % (i + 1, loss_train.data, acc_train, acc_test))


if __name__ == '__main__':
    main()
```
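As a quick sanity check after training (my own addition, not part of the original script), something like the following could be appended to the end of main() to classify a single test image with the trained model:

```python
    # hypothetical addition at the end of main(): classify one test image
    x0 = txs[0:1]   # first test image, shape (1, 784)
    y0 = model(x0)  # forward pass
    print("predicted:", int(np.argmax(y0.data)), "answer:", int(tts[0]))
```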
Running this code in my modest local environment [^10] gives the following training results. By the time the number of epochs reaches 100, the accuracy gets close to 100%.
Thank you for reading this far. As I mentioned above, the deep learning that is talked up in the world is no longer a novel elemental technology; it is becoming commoditized (you could even say it is becoming a mature, well-worn technology). From now on, the key will be how to combine deep learning with other CS / Web technologies to create new services and businesses. Personally, I also like the world of the Web, so I would like to build a Web service that uses deep learning. If you have similar interests, let's build something together! We are recruiting!!
[^1]: A large set of pairs of $\boldsymbol{x}$ and $\boldsymbol{t}$ is called training data.

[^2]: In high-dimensional spaces, the gradient also becomes 0 at points called saddle points.

[^3]: That is, methods that converge in a smaller number of steps.

[^4]: Since the parameters are computed by gradient descent, the minimum requirement is that the functions are differentiable.

[^5]: Called [activation functions](https://ja.wikipedia.org/wiki/%E6%B4%BB%E6%80%A7%E5%8C%96%E9%96%A2%E6%95%B0).

[^6]: Problems in the real world are not so simple as to be [linearly separable](https://ja.wikipedia.org/wiki/%E7%B7%9A%E5%BD%A2%E5%88%86%E9%9B%A2%E5%8F%AF%E8%83%BD).

[^7]: It can be said that, empirically, good accuracy has been achieved in various fields.

[^8]: Citation needed (where was that paper... the one published by someone at MIT).

[^9]: Thank you, Chainer. And goodbye. (See the lead developer's blog post.)

[^10]: Why not use ResNet or another typical deep learning method for image recognition? Because I'm afraid my MacBook Pro would catch fire.