This is an introductory article about Deep Learning, which is popular these days. Deep Learning already has plenty of open source libraries, and this time we will use Chainer, which is well known as a domestically developed (Japanese) library and is reported to be relatively fast for GPU computation at present.
There are many introductory articles on Chainer, but most of them end with running the MNIST handwriting-recognition sample. Looking at the MNIST sample certainly shows how to use Chainer, but that is not quite the same as being able to build something yourself, so the goal of this article is to get you to the point where you can build a Deep Learning model with Chainer on your own.
・OS: Mac OS X El Capitan (10.11.5)
・Python 2.7.12: Anaconda 4.1.1 (x86_64)
・Chainer 1.12.0
If you have not set up a Chainer environment yet, you can install it easily with:
$ pip install chainer
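To check that the installation worked (my addition; the version string printed depends on your environment), you can ask Chainer for its version from Python:
>>> import chainer
>>> chainer.__version__
'1.12.0'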
This section covers forward and backward computation using variables, as a step before moving on to Neural Networks.
First, load chainer and declare variables.
>>> import numpy as np
>>> import chainer
>>> x_data = np.array([5], dtype=np.float32)
>>> x_data
array([ 5.], dtype=float32)
Basically, values are declared as NumPy arrays of float type.
Use `chainer.Variable` as the variable type for use within Chainer.
>>> x = chainer.Variable(x_data)
>>> x
<variable at 0x10b796fd0>
You can check the value of x via `.data`.
>>> x.data
array([ 5.], dtype=float32)
Next, declare the function y of x.
This time, we will use the following function.
>>> y = x ** 2 - 2 * x + 1
>>> y
<variable at 0x10b693dd0>
You can check the value of y in the same way.
>>> y.data
array([ 16.], dtype=float32)
Calling the following method makes it possible to compute the derivative.
>>> y.backward()
The gradient computed by back-propagation is stored in `grad`.
>>> x.grad
array([ 8.], dtype=float32)
It may be a little hard to tell which gradient this refers to: it is the value of the gradient of y differentiated with respect to x.
y'(x) = 2x - 2\\
\rightarrow \ y'(5) = 8
This is where the value 8 in `x.grad` comes from.
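As a quick sanity check (my addition, not in the original article), you can compare this analytic gradient against a simple finite-difference approximation:
>>> h = 1e-3
>>> func = lambda v: v ** 2 - 2 * v + 1
>>> print((func(5.0 + h) - func(5.0 - h)) / (2 * h))  # prints a value very close to 8.0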
Chainer's official reference says that when x is a multidimensional array, you should initialize `y.grad` before computing `x.grad`.
If you do not initialize it, the new gradient is added on top of the values already stored in the array, so remember: "initialize before computing gradients" (a short sketch after the next example illustrates this accumulation).
>>> x = chainer.Variable(np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float32))
>>> y = x**2 - 2*x + 1
>>> y.grad = np.ones((2, 3), dtype=np.float32)
>>> y.backward()
>>> x.grad
array([[ 0., 2., 4.],
[ 6., 8., 10.]], dtype=float32)
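To illustrate the accumulation mentioned above, here is a small sketch (my addition): if `x.grad` already holds values when `backward()` is called, the new gradient is added to them rather than replacing them.
>>> x = chainer.Variable(np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float32))
>>> y = x**2 - 2*x + 1
>>> x.grad = np.ones((2, 3), dtype=np.float32)  # pretend a gradient is already stored
>>> y.grad = np.ones((2, 3), dtype=np.float32)
>>> y.backward()
>>> x.grad  # expected: 2x - 2 plus the pre-existing ones
array([[ 1., 3., 5.],
       [ 7., 9., 11.]], dtype=float32)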
When constructing a Neural Network, you must explicitly declare the structure of the model: specifically, how many nodes and how many layers it has. Deciding on this model still depends on experience and intuition. Neural Networks, including Deep Learning, automatically adjust their internal parameters (the weights), but the model itself has to be decided in advance. As a slight digression, in Bayesian statistics, Markov chain Monte Carlo (MCMC) is designed to estimate posterior distributions based on Bayes' theorem, but even there the prior distributions have to be chosen somewhat arbitrarily. Whether it is Neural Networks or Bayesian statistics, I hope that an epoch-making method will be proposed to solve this issue, so that predictive models can be constructed that respond universally to any problem.
Getting back to the topic, let's import part of Chainer under an abbreviated name so that it is shorter to call from the Python code.
>>> import chainer.links as L
As anyone studying Neural Networks knows, there are parameters called weights between the nodes. For now, let's try the simplest pattern: a linear combination.
>>> f = L.Linear(3, 2)
>>> f
<chainer.links.connection.linear.Linear object at 0x10b7b4290>
This represents a structure with three input nodes and two output nodes.
To briefly explain the "Linear" part: the nodes are connected by the linear combination mentioned earlier, so the relationship is expressed by the following formula.
f(x) = Wx + b\\
f(x) \in \mathcal{R}^{2 \times 1},\ x \in \mathcal{R}^{3 \times 1},\\
W \in \mathcal{R}^{2 \times 3},\ b \in \mathcal{R}^{2 \times 1}
Therefore, although it is not explicitly declared, the `f` declared above holds the weight matrix `W` and the bias vector `b` as parameters.
>>> f.W.data
array([[-0.02878495, 0.75096768, -0.10530342],
[-0.26099312, 0.44820449, -0.06585278]], dtype=float32)
>>> f.b.data
array([ 0., 0.], dtype=float32)
If you implement things without knowing these internal specifications, the behavior can be baffling. By the way, even though we never initialized the weight matrix `W`, it already holds values: by Chainer's specification, random initial values are assigned when the Linear link is declared.
So, as shown in the official Chainer documentation, the following is the format you will use most often.
>>> f = L.Linear(3, 2)
>>> x = chainer.Variable(np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float32))
>>> y = f(x)
>>> y.data
array([[ 1.15724015, 0.43785751],
[ 3.0078783 , 0.80193317]], dtype=float32)
You can see that x, where each row is a three-dimensional vector, is converted into the two-dimensional y by a linear combination. The weight matrix W is automatically given initial values internally, so this calculation runs without throwing an error.
For confirmation, looking at the weights shows that initial values have indeed been assigned.
>>> f.W.data
array([[-0.02878495, 0.75096768, -0.10530342],
[-0.26099312, 0.44820449, -0.06585278]], dtype=float32)
>>> f.b.data
array([ 0., 0.], dtype=float32)
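As a quick check (my addition), the same result can be reproduced with plain NumPy, confirming that the Linear link computes the affine map f(x) = Wx + b from above:
>>> np.allclose(y.data, x.data.dot(f.W.data.T) + f.b.data)
True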
Next, let's compute gradients as we learned in the previous chapter. The official Chainer documentation emphasizes this in its notes: gradient values accumulate with each calculation. Therefore, you usually need to reset the gradients to 0 with the following method before computing them.
>>> f.zerograds()
Make sure the gradient values are initialized correctly.
>>> f.W.grad
array([[ 0., 0., 0.],
[ 0., 0., 0.]], dtype=float32)
>>> f.b.grad
array([ 0., 0.], dtype=float32)
Now let's calculate the value for each gradient.
>>> y.grad = np.ones((2, 2), dtype=np.float32)
>>> y.backward()
>>> f.W.grad
array([[ 5., 7., 9.],
[ 5., 7., 9.]], dtype=float32)
>>> f.b.grad
array([ 2., 2.], dtype=float32)
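To see why these numbers come out this way (my explanation, not in the original): with y = xWᵀ + b and an upstream gradient of all ones, W.grad becomes the column-wise sum of the batch inputs and b.grad becomes the number of rows in the batch. A quick NumPy check:
>>> np.ones((2, 2), dtype=np.float32).T.dot(x.data)  # equals f.W.grad: column sums of the inputs
array([[ 5., 7., 9.],
       [ 5., 7., 9.]], dtype=float32)
>>> np.ones((2, 2), dtype=np.float32).sum(axis=0)  # equals f.b.grad: one count per output node
array([ 2., 2.], dtype=float32)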
The gradients are calculated correctly.
We will now extend the explicitly defined model from the previous chapter to multiple layers.
>>> l1 = L.Linear(4, 3)
>>> l2 = L.Linear(3, 2)
For the time being, let's check the weight of each model.
>>> l1.W.data
array([[-0.2187428 , 0.51174778, 0.30037731, -1.08665013],
[ 0.65367842, 0.23128517, 0.25591806, -1.0708735 ],
[-0.85425782, 0.25255874, 0.23436508, 0.3276397 ]], dtype=float32)
>>> l1.b.data
array([ 0., 0., 0.], dtype=float32)
>>> l2.W.data
array([[-0.18273738, -0.64931035, -0.20702939],
[ 0.26091203, 0.88469893, -0.76247424]], dtype=float32)
>>> l2.b.data
array([ 0., 0.], dtype=float32)
The structure of each layer is now defined. Next, we make the overall structure explicit, that is, how these layers are connected.
>>> x = chainer.Variable(np.array([[1, 2, 3, 4], [5, 6, 7, 8]], dtype=np.float32))
>>> x.data
array([[ 1., 2., 3., 4.],
[ 5., 6., 7., 8.]], dtype=float32)
>>> h = l1(x)
>>> y = l2(h)
>>> y.data
array([[ 1.69596863, -4.08097076],
[ 1.90756595, -4.22696018]], dtype=float32)
To make them reusable, the official documentation recommends creating classes as follows:
MyChain.py
# -*- coding: utf-8 -*-
from chainer import Chain
import chainer.links as L

class MyChain(Chain):

    def __init__(self):
        super(MyChain, self).__init__(
            l1=L.Linear(4, 3),
            l2=L.Linear(3, 2),
        )

    def __call__(self, x):
        h = self.l1(x)
        return self.l2(h)
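A minimal usage sketch (my addition, not from the original article): once the class is defined, the two-layer forward pass becomes a single call.
>>> from MyChain import MyChain
>>> model = MyChain()
>>> x = chainer.Variable(np.array([[1, 2, 3, 4], [5, 6, 7, 8]], dtype=np.float32))
>>> y = model(x)
>>> y.data.shape
(2, 2)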
Next, we optimize the weights of the Neural Network model. There are several methods for optimizing the weights, but honestly there seems to be no clear-cut criterion for which one to use, so here we use Stochastic Gradient Descent (SGD). The difference in performance between optimization methods is explained in "Which optimization method shows the best performance for learning CNN".
>>> from chainer import optimizers
>>> model = MyChain()
>>> optimizer = optimizers.SGD()  # designate SGD as the optimization method
>>> optimizer.setup(model)
>>> optimizer
<chainer.optimizers.sgd.SGD object at 0x10b7b40d0>
At this point, `optimizer.setup(model)` passes the model's parameter information to the optimizer.
The official documentation states that there are two ways to run the optimization. In the first, you compute the gradients manually, which is quite laborious, so except in special cases you should use the other method, in which the gradients are calculated automatically. To have them calculated automatically, you need to define a loss function in advance.
Details will be introduced next time in "Introduction to Deep Learning (2): Let's try non-linear regression with Chainer", but in general you define the loss function yourself. When dealing with real-valued outputs, it is often posed as minimizing the squared L2 norm (least squares); for classification it is often posed as minimizing the cross entropy. Various types of loss function are explained in "Notes on the Backpropagation Method".
Loss function
def forward(x, y, model):
    loss = ...  # define your own loss function here
    return loss
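As one concrete sketch (my assumption; the article itself leaves the definition to the reader), a mean squared error loss for a regression problem could look like this, using chainer.functions:

import chainer.functions as F

def forward(x, y, model):
    y_pred = model(x)                       # prediction from the network
    loss = F.mean_squared_error(y_pred, y)  # squared L2 distance to the target
    return loss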
It is assumed here that the `forward` function, which computes the loss, takes the arguments `x`, `y`, and `model`.
If you define such a loss function, the parameters will be optimized as follows.
optimizer.update(forward, x, y, model)
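Putting the pieces together, a minimal training loop might look like the following sketch (my addition; `x_train` and `y_train` are hypothetical arrays whose shapes match MyChain's 4 inputs and 2 outputs):

import numpy as np
from chainer import optimizers

model = MyChain()
optimizer = optimizers.SGD()
optimizer.setup(model)

# hypothetical training data: 10 samples, 4 features in, 2 targets out
x_train = np.random.rand(10, 4).astype(np.float32)
y_train = np.random.rand(10, 2).astype(np.float32)

for epoch in range(100):
    # forward is the loss function defined above; update computes gradients and applies SGD
    optimizer.update(forward, x_train, y_train, model)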
We are waiting for you to follow us!
・Qiita: Carat Yoshizaki
・Twitter: @carat_yoshizaki
・Hatena Blog: Carat COO Blog
・Home page: Carat
Tutor service "Kikagaku", where you can learn machine learning one-on-one. Please feel free to contact us if you are interested in "Kikagaku", where you can learn "Mathematics → Programming → Web Applications" all at once.