The previous article is here. In this article we will look at the implementation of learning rules. That said, the main body has already been implemented here, so please have a look at that as well.
The next article is here.
- [Learning Rule](#learning-rule)
- [Learning Rule Theory](#learning-rule-theory)
- [Implementation of learning rules](#implementation-of-learning-rules)
- [Implementation of the `__init__` method](#implementation-of-the-__init__-method)
- [Conclusion](#conclusion)
First of all, let's think in terms of scalars, as usual. A neuron object has the variables $w$ and $b$:
y = \sigma(xw + b)
Here, if we regard the input $x$ as a constant, this can be written as

y = f(w, b) = \sigma(xw + b)

In other words, the goal of the learning rule is to change the values of $w$ and $b$ appropriately so that $y$ approaches the target value $y^{\star}$.
Let's look at this theoretically. In

y = f(w, b) = \sigma(xw + b)

suppose that
\begin{align}
y &= y^{\star} = 0.5 \\
x &= x_0 = 1
\end{align}
Then, if the activation function is the sigmoid function, plotting the loss over the parameter space of $w$ and $b$ gives the following.
```python:show_loss_space.py
%matplotlib nbagg
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Fixed input and target value
x_0 = 1
y_star = 0.5

# Sigmoid activation function
sigma = lambda x: 1/(1 + np.exp(-x))

# Parameter grid over w and b
w = np.arange(-2, 2, 1e-2)
b = np.arange(-2, 2, 1e-2)
W, B = np.meshgrid(w, b)

# Squared-error loss at each point of the parameter space
y = 0.5*(sigma(x_0*W + B) - y_star)**2
elevation = np.arange(np.min(y), np.max(y), 1/2**8)

fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")
ax.set_xlabel("w")
ax.set_ylabel("b")
ax.set_zlabel("loss")
ax.view_init(60)
ax.grid()
ax.plot_surface(W, B, y, cmap="autumn", alpha=0.8)
ax.contour(W, B, y, cmap="autumn", levels=elevation, alpha=0.8)
fig.suptitle("Loss space")
fig.show()
fig.savefig("Loss_space.png")
```
The figure shows the loss space when the squared error is used as the loss function. The purpose of the learning rule is to make random initial values $w_0, b_0$ gradually approach $w^{\star}, b^{\star}$, which give the optimal value $y^{\star}$. The learning rule used here is **gradient descent**, which developed out of the **steepest descent method**. Gradient descent is a method of going down a slope using the **gradient** (partial derivative) at the current point of each parameter. Here you will find formulas and code for the various methods, and here you can see the descent through several search spaces. <img src=https://qiita-user-contents.imgix.net/https%3A%2F%2Fqiita-image-store.s3.ap-northeast-1.amazonaws.com%2F0%2F640911%2F6f55cfbc-4a9f-d4fe-c70e-49dca7cbf683.gif?ixlib=rb-1.2.2&auto=format&gif-q=60&q=75&s=02b37020417dbead312cc8c82f5eac7e>
This article deals with SGD, the simplest of these methods. The formula for SGD is as follows.
\begin{align}
g_t &= \nabla_{w_t}\mathcal{L}(w_t, b_t) \\
\Delta w_t &= - \eta g_t \\
w_{t+1} &= w_t + \Delta w_t
\end{align}
This formula is written only for $w$, but you can see that replacing it with $b$ gives the same learning rule for the bias. Think of $\mathcal{L}(w_t, b_t)$ as the loss function and $\nabla_{w_t}$ as the partial derivative with respect to $w_t$ (the formula above is written so that it also holds for matrices). Expressed in words, the above formula says:
- Find the gradient by partial differentiation
- Calculate the amount of movement
- Move

It's that simple. Let's take a closer look.
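The procedure can be sketched directly in Python. This is a minimal scalar sketch using a toy loss $L(w) = w^2$ for illustration, not the article's actual classes:

```python
def sgd_step(w, grad_w, eta=1e-2):
    # The amount of movement: Delta w_t = -eta * g_t
    dw = -eta * grad_w
    # Move: w_{t+1} = w_t + Delta w_t
    return w + dw

# For the toy loss L(w) = w**2, the gradient is g = 2*w.
w = 1.0
for _ in range(100):
    w = sgd_step(w, grad_w=2*w, eta=0.1)
print(w)  # very close to the minimum at w = 0
```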
The first step, "finding the gradient by partial differentiation", uses the error backpropagation method introduced in Backpropagation, so that part is already done. "Move" is also literal: just add the computed amount of movement to the parameter. Regarding "calculating the amount of movement", I would like to discuss two points:

1. Why the sign is negative
2. Why we multiply by the learning rate $\eta$
First of all, regarding 1., this is easy to understand if you think about a concrete case such as $f(x) = x^2$. The slope at the point $(x, y) = (1, 1)$ is $2$, but the direction you want to move in is the negative direction, isn't it? The reverse also holds, of course. The direction you want to move and the gradient always have opposite signs, hence the minus sign. As for 2., if you used the slope $2$ as-is and set $\Delta x = -2$, you would land at $x = -1$ and overshoot the optimal value. Therefore we multiply by a coefficient $\eta \ll 1$ called the learning rate, limiting the amount of movement so that we descend gradually toward the optimum. The learning rate is what is called a **hyperparameter**, and many learning rules leave this part for humans to design. In most cases the default values given in papers work well, but depending on the problem you want to solve, you may have to experiment.
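Both points can be checked numerically. Here is a small sketch, again on the illustrative function $f(x) = x^2$:

```python
def descend(eta, steps=20):
    """Gradient descent on f(x) = x**2 (gradient 2*x), starting from x = 1."""
    x = 1.0
    for _ in range(steps):
        x += -eta * 2*x  # Delta x = -eta * gradient
    return x

x_small = descend(eta=0.1)  # |x| shrinks by a factor of 0.8 each step
x_large = descend(eta=1.0)  # x flips sign every step: 1 -> -1 -> 1 -> ...
print(x_small, x_large)
```

With $\eta = 0.1$ the point creeps down toward the optimum, while with $\eta = 1$ it jumps straight over the minimum on every step and never settles, which is exactly why $\eta \ll 1$ is needed.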
Well, setting the details aside, let's implement it. As usual, the implementation goes into [baselayer.py](https://qiita.com/kuroitu/items/884c62c48c2daa3def08#%E3%83%AC%E3%82%A4%E3%83%A4%E3%83%BC%E3%83%A2%E3%82%B8%E3%83%A5%E3%83%BC%E3%83%AB%E3%81%AE%E3%82%B3%E3%83%BC%E3%83%89%E6%BA%96%E5%82%99).
```python:baselayer.py
def update(self, **kwds):
    """
    Implementation of parameter learning
    """
    dw, db = self.opt.update(self.grad_w, self.grad_b, **kwds)
    self.w += dw
    self.b += db
```
The `self.opt.update(self.grad_w, self.grad_b, **kwds)` part is thrown to the optimizer here. As an example, here is the SGD code.
```python:optimizers.py
import numpy as np


class Optimizer():
    """
    A superclass inherited by the optimization methods.
    """
    def __init__(self, *args, **kwds):
        pass

    def update(self, *args, **kwds):
        pass


class SGD(Optimizer):
    def __init__(self, eta=1e-2, *args, **kwds):
        super().__init__(*args, **kwds)
        self.eta = eta

    def update(self, grad_w, grad_b, *args, **kwds):
        dw = -self.eta*grad_w
        db = -self.eta*grad_b
        return dw, db
```
The content of the code follows the formula introduced above exactly: it receives the gradients with respect to $w$ and $b$ from outside and, following the learning rule, multiplies them by $-\eta$ to determine the amount of movement, which it returns. The layer object receives this amount of movement and updates its parameters.
Well, that's all for this time. You might wonder, "What? What about matrices?" In fact, the code is exactly the same for matrices. Even if you look at [optimizers.py](https://qiita.com/kuroitu/items/36a58b37690d570dc618#%E5%AE%9F%E8%A3%85%E3%82%B3%E3%83%BC%E3%83%89%E4%BE%8B), you will not find any matrix product. The reason is natural once you think about it: even when learning with a mini-batch, the gradient that flows to each parameter is unique to that parameter, and there is no need to involve the gradients of other parameters in the calculation. So if you think it through with scalars and implement it that way, the same computation works for matrices.
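As a sanity check on this claim, here is a small self-contained sketch (not the article's actual classes) that trains a $(3, 2)$ weight matrix and a length-2 bias toward $y^{\star} = 0.5$ with the same SGD rule. Note that the update itself is purely elementwise:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = lambda x: 1/(1 + np.exp(-x))

# Random (prev, n) weights and length-n bias, as in the layer code
prev, n = 3, 2
w = rng.standard_normal((prev, n))
b = rng.standard_normal(n)
eta = 1.0

x = np.ones((1, prev))         # a single input sample
y_star = np.full((1, n), 0.5)  # target outputs

for _ in range(2000):
    y = sigma(x @ w + b)
    # Gradient of the squared error through the sigmoid (backpropagation)
    delta = (y - y_star) * y * (1 - y)
    grad_w = x.T @ delta        # same shape as w
    grad_b = delta.sum(axis=0)  # same shape as b
    # The SGD update itself is elementwise: no matrix products here
    w += -eta * grad_w
    b += -eta * grad_b

print(np.round(sigma(x @ w + b), 3))  # both outputs close to 0.5
```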
## Implementation of the `__init__` method

Finally, let's have the layer object hold the optimizer `opt` in its `__init__` method.
```python:__init__.py
def __init__(self, *, prev=1, n=1,
             name="", wb_width=1,
             act="ReLU", err_func="square", opt="Adam",
             act_dict={}, opt_dict={}, **kwds):
    self.prev = prev  # Number of outputs of the previous layer = number of inputs to this layer
    self.n = n  # Number of outputs of this layer = number of inputs to the next layer
    self.name = name  # The name of this layer

    # Set weight and bias
    self.w = wb_width*np.random.randn(prev, n)
    self.b = wb_width*np.random.randn(n)

    # Get the activation function (class)
    self.act = get_act(act, **act_dict)

    # Get the loss function (class)
    self.errfunc = get_errfunc(err_func)

    # Get the optimizer (class)
    self.opt = get_opt(opt, **opt_dict)
```
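To see this `__init__` method in context, here is a self-contained sketch. The `get_act`, `get_errfunc`, and `get_opt` helpers below are simplified stand-ins invented for illustration (the real implementations live in the linked articles), only `"ReLU"`, `"square"`, and `"SGD"` are wired up, and the default optimizer is switched to `"SGD"` because the stub does not implement Adam:

```python
import numpy as np

# Hypothetical stand-ins for the helpers used by __init__
def get_act(name, **kwds):
    return {"ReLU": lambda x: np.maximum(0, x)}[name]

def get_errfunc(name):
    return {"square": lambda y, t: 0.5*np.sum((y - t)**2)}[name]

class SGD:
    def __init__(self, eta=1e-2, **kwds):
        self.eta = eta
    def update(self, grad_w, grad_b, **kwds):
        return -self.eta*grad_w, -self.eta*grad_b

def get_opt(name, **kwds):
    return {"SGD": SGD}[name](**kwds)

class Layer:
    def __init__(self, *, prev=1, n=1, name="", wb_width=1,
                 act="ReLU", err_func="square", opt="SGD",
                 act_dict={}, opt_dict={}, **kwds):
        self.prev = prev
        self.n = n
        self.name = name
        self.w = wb_width*np.random.randn(prev, n)
        self.b = wb_width*np.random.randn(n)
        self.act = get_act(act, **act_dict)
        self.errfunc = get_errfunc(err_func)
        self.opt = get_opt(opt, **opt_dict)

layer = Layer(prev=3, n=2, name="hidden", opt_dict={"eta": 0.1})
print(layer.w.shape, layer.b.shape)  # (3, 2) (2,)
```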
Next time, I will introduce the activation function, the localization of the optimizer, and the loss function.
- Introduction to Deep Learning ~ Basics ~
- Introduction to Deep Learning ~ Coding Preparation ~
- Introduction to Deep Learning ~ Forward Propagation ~
- Introduction to Deep Learning ~ Backpropagation ~
- Introduction to Deep Learning ~ Learning Rules ~
- Introduction to Deep Learning ~ Localization and Loss Functions ~
- List of activation functions (2020)
- Gradient descent method list (2020)
- See and understand! Comparison of optimization methods (2020)
- Thorough understanding of im2col
- Complete understanding of numpy.pad function