This is a continuation of these articles:
Othello: from the tic-tac-toe of "Implementation Deep Learning" (1) http://qiita.com/Kumapapa2012/items/cb89d73782ddda618c99
Othello: from the tic-tac-toe of "Implementation Deep Learning" (2) http://qiita.com/Kumapapa2012/items/f6c654d7c789a074c69b
The follow-up article is here:
Othello: from the tic-tac-toe of "Implementation Deep Learning" (4) [End] http://qiita.com/Kumapapa2012/items/9cec4e6d2c935d11f108
I touched on the activation function in the first article. Given the possibility of dying ReLU, this time I tried running the Othello game with what is probably the easiest and quickest workaround: Leaky ReLU. The code is here. https://github.com/Kumapapa2012/Learning-Machine-Learning/tree/master/Reversi
**Leaky ReLU**
ReLU is an activation function that maps every negative input to 0:
f(x) = \max(0, x)
The NN used here is fully connected, and as described previously, ReLU can cause a problem called dying ReLU. One solution is Leaky ReLU, which gives negative inputs a small slope (0.2 by default in Chainer):
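As a quick illustration (not code from the repository), both activations can be written in a few lines of NumPy. The `slope=0.2` default mirrors Chainer's `F.leaky_relu`:

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x): negative inputs are clamped to zero
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.2):
    # Negative inputs keep a small slope instead of being zeroed
    return np.where(x > 0, x, slope * x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))        # negatives become 0
print(leaky_relu(x))  # negatives are scaled by 0.2
```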
{f(x) = \begin{cases}
x & (x > 0)
\\
0.2x & (x \leq 0)
\end{cases}
}
This eliminates the zero slope. This is my personal interpretation, but since dying ReLU is essentially caused by the gradient being 0 for negative inputs, giving that region any slope should work around it. The slope is kept small, though, because we want to preserve what makes ReLU attractive: a slope of 1 on the positive side and (nearly) 0 on the negative side makes differentiation trivial and keeps computation and learning (backpropagation) fast. I think that is the point of Leaky ReLU.
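The gradients make the "dying" mechanism concrete. This is a small sketch (my own illustration, not the article's code): with plain ReLU a unit whose pre-activation stays negative receives exactly zero gradient and its weights stop updating, while Leaky ReLU still passes a 0.2-scaled gradient through:

```python
import numpy as np

def relu_grad(x):
    # Gradient is 1 for positive inputs and exactly 0 otherwise:
    # a unit stuck in the negative region gets no updates ("dying ReLU")
    return (x > 0).astype(float)

def leaky_relu_grad(x, slope=0.2):
    # The negative side keeps slope 0.2, so some gradient always flows
    return np.where(x > 0, 1.0, slope)

x = np.array([-3.0, -0.1, 0.5, 4.0])
print(relu_grad(x))        # zeros for the negative inputs
print(leaky_relu_grad(x))  # 0.2 instead of 0 for the negative inputs
```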
The activation function is changed from ReLU to Leaky ReLU by editing just nine lines of agent.py:
```diff
$ diff ~/git/Learning-Machine-Learning/Reversi/agent.py agent.py
47,55c47,55
< h = F.relu(self.l1(x))
< h = F.relu(self.l20(h))
< h = F.relu(self.l21(h))
< h = F.relu(self.l22(h))
< h = F.relu(self.l23(h))
< h = F.relu(self.l24(h))
< h = F.relu(self.l25(h))
< h = F.relu(self.l26(h))
< h = F.relu(self.l27(h))
---
> h = F.leaky_relu(self.l1(x)) #slope=0.2(default)
> h = F.leaky_relu(self.l20(h))
> h = F.leaky_relu(self.l21(h))
> h = F.leaky_relu(self.l22(h))
> h = F.leaky_relu(self.l23(h))
> h = F.leaky_relu(self.l24(h))
> h = F.leaky_relu(self.l25(h))
> h = F.leaky_relu(self.l26(h))
> h = F.leaky_relu(self.l27(h))
```
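Framework aside, the structure of that forward pass is just a chain of fully connected layers sharing one activation. Here is a NumPy sketch of the same shape of computation; the layer sizes are made up for illustration (the repository's actual `l1`, `l20`..`l27` layers have their own dimensions), only the pattern matches:

```python
import numpy as np

rng = np.random.default_rng(0)

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

# Toy fully connected stack mimicking the l1 -> l20..l27 chain in agent.py.
# Sizes are illustrative: e.g. a 6x6 board flattened to 36 inputs.
sizes = [36, 64, 64, 64]
weights = [rng.normal(0, 0.1, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]

def forward(x):
    h = x
    for W in weights:
        h = leaky_relu(h @ W)  # same activation after every layer
    return h

out = forward(rng.normal(size=(1, 36)))
print(out.shape)  # one row of 64 hidden activations
```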
As a result, on the 6x6 board the winning percentage rose steadily. **When using Leaky ReLU (slope = 0.2)**
This is quite different from the previous result. Was dying ReLU occurring after all? **When using ReLU**
Next, the 8x8 board... The winning percentage did not stabilise /(^o^)
**When using Leaky ReLU (slope = 0.2)**
In the earlier result, the winning percentage did seem to converge in the end. **When using ReLU**
Thinking very simply: if the winning percentage converges with ReLU, i.e. Leaky ReLU with slope = 0, but does not converge with Leaky ReLU with slope = 0.2, there may be an optimal value somewhere in between. I would like to try slope = 0.1 later.

The bigger problem, though, is the oscillation in the winning percentage. Oscillation suggests that learning does not settle at the right place, and this seems related to the learning rate. According to Chapter 6 of the book "Deep Learning from Zero", the learning rate is essentially a coefficient controlling how much the weights W are updated per step. The larger it is, the bigger each update to W and the faster learning progresses, but learning may diverge [^1]. If it is too small, on the other hand, learning becomes too slow.

The lr (learning rate) argument of the RMSpropGraves optimizer used this time is 0.00025. Chainer's default lr for RMSpropGraves is 0.0001, so this sample uses a slightly larger value. Presumably 0.00025 was tuned for the learning speed of the tic-tac-toe sample; on this Othello 8x8 board the weights W do not stabilise, and as a result the winning rate becomes unstable as shown in the graph above. For this reason, I would like to try a lower learning rate next [^2].
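The divergence effect of a too-large learning rate can be seen even on a one-dimensional toy problem. This is my own minimal sketch, unrelated to RMSpropGraves itself: plain gradient descent on f(w) = w², whose gradient is 2w, so each update multiplies w by (1 - 2·lr). Once that factor exceeds 1 in magnitude, |w| grows every step:

```python
def descend(lr, steps=20, w0=1.0):
    # Gradient descent on f(w) = w^2; each step does w -= lr * f'(w)
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w  # update factor is (1 - 2*lr)
    return w

print(abs(descend(lr=0.1)))   # shrinks toward 0: converges
print(abs(descend(lr=1.1)))   # grows as 1.2**steps: diverges
```

The same trade-off motivates trying a smaller lr for the 8x8 board: a lower rate converges more slowly but is less likely to overshoot and oscillate.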
**References**
Computer Othello https://ja.m.wikipedia.org/wiki/%E3%82%B3%E3%83%B3%E3%83%94%E3%83%A5%E3%83%BC%E3%82%BF%E3%82%AA%E3%82%BB%E3%83%AD
Talk about failure experiences and anti-patterns in neural networks http://nonbiri-tereka.hatenablog.com/entry/2016/03/10/073633
(Others will be added at a later date)
[^1]: Large weight swings caused by a high learning rate can themselves be a factor that triggers dying ReLU.
[^2]: Also, should the activation function of the output layer even be the same as that of the hidden layers, or should it be chosen separately? That worries me too.