This is a continuation of the previous articles. This will be the last entry on Othello.
Othello, from the tic-tac-toe of "Implementation Deep Learning":
(1) http://qiita.com/Kumapapa2012/items/cb89d73782ddda618c99
(2) http://qiita.com/Kumapapa2012/items/f6c654d7c789a074c69b
(3) http://qiita.com/Kumapapa2012/items/3cc20a75c745dc91e826
In the previous article, the winning percentage was stable on the 6x6 Othello board when Leaky ReLU was used, but on the 8x8 board it fluctuated widely and trended downward. This time I changed the slope of Leaky ReLU again and reran the experiment.
The code is here. ~~However, the code currently uploaded is not the Leaky ReLU version; I will update it at a later date (^^;~~ Now added. https://github.com/Kumapapa2012/Learning-Machine-Learning/tree/master/Reversi
For the 8x8 Othello board, the result with Leaky ReLU at slope = 0.2, shown last time, is as follows.
This time, the result with slope = 0.1 is as follows.
Setting slope = 0.1 seems to have led to convergence, but the sharp drop in winning percentage around episode 20,000 is still not resolved. This drop is [not seen in the 6x6 board results](http://qiita.com/Kumapapa2012/items/3cc20a75c745dc91e826#leaky-relu-%E3%82%92%E4%BD%BF%E3%81%A3%E3%81%A6%E3%81%BF%E3%81%9F), which makes it an interesting situation.
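For reference, here is a minimal sketch of where this slope parameter plugs in, assuming a Chainer network along the lines of the linked repository (the layer names and sizes below are placeholders for illustration, not the actual model):

```python
import chainer
import chainer.functions as F
import chainer.links as L


class QNet(chainer.Chain):
    """Sketch of a Q-network whose hidden activations use Leaky ReLU."""

    def __init__(self, n_in, n_hidden, n_out, slope=0.1):
        super(QNet, self).__init__()
        with self.init_scope():
            self.l1 = L.Linear(n_in, n_hidden)
            self.l2 = L.Linear(n_hidden, n_hidden)
            self.l3 = L.Linear(n_hidden, n_out)
        self.slope = slope  # 0.2 in the previous run, 0.1 this time

    def __call__(self, x):
        h = F.leaky_relu(self.l1(x), slope=self.slope)
        h = F.leaky_relu(self.l2(h), slope=self.slope)
        return self.l3(h)
```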
To be honest, you cannot tell much just by looking at the fluctuations in the winning percentage, and I am not even sure whether this is a learning-rate issue. For clues, let's plot the loss (the squared error between the teacher data and the computed output) recorded from step 5000 onward.
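As an aside, a sketch of how such a plot could be produced, assuming the loss values were dumped as "step loss" pairs, one per line, to a text file (the file name and logging format are assumptions, not the actual code):

```python
import matplotlib.pyplot as plt

steps, losses = [], []
# Assumed format: "step loss" per line, e.g. "5000 0.0123".
with open("loss_log.txt") as f:
    for line in f:
        s, l = line.split()
        steps.append(int(s))
        losses.append(float(l))

plt.plot(steps, losses)
plt.xlabel("Steps")
plt.ylabel("Loss (squared error vs. teacher data)")
plt.show()
```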
This seems to be related to the winning percentage, so the winning-percentage graph is shown again below for comparison.
Looking at the graphs, there appears to be a connection between increases in loss and the sharp drop in winning percentage. Attempting an amateur interpretation and classification of the phases:
a) Around episode 16,000 (480k steps): Loss is very low, but the winning percentage is also low (about 50%). From here the winning percentage starts to rise. At this point ε of ε-greedy has already reached its minimum value of 0.001, so piece placement is essentially determined by the Q values.
b) Around 16,000-22,000 (660k steps): Loss rose slightly as the winning percentage rose, and from the middle of this range the winning percentage dropped sharply. With the model at this point, the more it learned, the more it lost; the model seems to have been collapsing.
c) Around 22,000-27,000 (720k steps): Loss stays at a relatively low value and the low win rate continues. Since there is no reward without a win, almost no reward was obtained during this period.
d) Around 27,000-30,000 (900k steps): Loss grows again. This time learning seems to go well, and the winning percentage rises.
e) Around 30,000-35,000 (1050k steps): Once loss drops, the winning percentage keeps rising. Learning appears to be going well.
f) Around 35,000-45,000 (1350k steps): Loss grows again. Last time this range was the second valley in the winning percentage, but this time the winning percentage does not drop. Perhaps the loss is working in a positive direction here, that is, correcting the model.
g) Around 45,000-48,000 (1440k steps): Loss decreases and the winning percentage is stable.
h) After 48,000: Loss grows again, but there are signs that the winning percentage is converging.
Growth in loss is a sign that the model is changing, in other words that the agent is growing. If this interpretation is correct, it can be said that around (b) the agent was growing in the wrong direction. This time I saved every board position as text, so let's go back and check this assumption. Let's look at the following early-game position, which is likely to make the difference between winning and losing.
[[ 0 0 0 0 0 0 0 0]
[ 0 0 0 (0) 0 (0)(0) 0]
[ 0 (0)(0)-1 (0)-1 0 0]
[(0)-1 -1 -1 -1 1 0 0]
[ 0 0 0 1 1 0 0 0]
[ 0 0 0 0 1 0 0 0]
[ 0 0 0 0 0 0 0 0]
[ 0 0 0 0 0 0 0 0]]
This is a position where it is the agent's turn to place a piece. Here the agent can play on any of the squares marked with parentheses (admittedly hard to see). Searching the saved text for this position with pcregrep found it in 622 episodes out of 50,000.[^1]
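For illustration, here is a rough sketch of how that count could be reproduced in Python if the saved positions were loaded back as numpy arrays (the author actually searched the saved text directly with pcregrep; the storage format assumed here is hypothetical):

```python
import numpy as np

# The position in question, as seen by the agent
# (1 = agent's piece, -1 = environment's piece, 0 = empty;
#  the parenthesized squares above are only legal-move markers).
target = np.array([
    [0,  0,  0,  0,  0,  0, 0, 0],
    [0,  0,  0,  0,  0,  0, 0, 0],
    [0,  0,  0, -1,  0, -1, 0, 0],
    [0, -1, -1, -1, -1,  1, 0, 0],
    [0,  0,  0,  1,  1,  0, 0, 0],
    [0,  0,  0,  0,  1,  0, 0, 0],
    [0,  0,  0,  0,  0,  0, 0, 0],
    [0,  0,  0,  0,  0,  0, 0, 0],
])


def episodes_containing(episodes, target):
    """episodes: list of episodes, each a list of 8x8 board arrays."""
    return [i for i, boards in enumerate(episodes)
            if any(np.array_equal(b, target) for b in boards)]
```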
Between episodes 20,000 and 30,000, where the winning percentage hit its valley, this position appeared five times; the labels in parentheses show whether the agent won or lost that episode:
21974(win) 22078(lose) 22415(lose) 29418(lose) 29955(win)
In the position above, the four episodes other than 29955 played the following move. Let's call this move A.
[[ 0 0 0 0 0 0 0 0]
[ 0 0 0 0 0 0 0 0]
[ 0 0 0 -1 0 -1 0 0]
[(1) 1 1 1 1 1 0 0]
[ 0 0 0 1 1 0 0 0]
[ 0 0 0 0 1 0 0 0]
[ 0 0 0 0 0 0 0 0]
[ 0 0 0 0 0 0 0 0]]
Episode 29955 played the following move instead. Let's call this move B.
[[ 0 0 0 0 0 0 0 0]
[ 0 0 0 0 0 0 0 0]
[ 0 0 0 -1 (1)-1 0 0]
[ 0 -1 -1 -1 1 1 0 0]
[ 0 0 0 1 1 0 0 0]
[ 0 0 0 0 1 0 0 0]
[ 0 0 0 0 0 0 0 0]
[ 0 0 0 0 0 0 0 0]]
I have not checked every episode after 29955, but in every occurrence of this position that I did check, up to episode 50,000 where the winning percentage is high and stable, the agent played move B.
Move A is the move that captures the most pieces in this position. When the winning percentage was at its lowest, this move was chosen in 4 out of 5 occurrences, so the agent at that time most likely leaned toward always capturing as many pieces as possible. However, capturing many pieces in the early game is said to be bad strategy in Othello: in the opening, moves like B, which capture as few pieces as possible and stay as far from the "edges" as possible, pay off later. The behavior of "capturing as many pieces as possible" is exactly what I implemented in my self-made Othello environment, which plays such a move 80% of the time.[^2] The agent appears to have learned this move from the environment.
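As a point of reference, here is a minimal sketch of what such an 80% greedy-capture policy for the environment might look like (the function name and the `count_flips` helper are assumptions for illustration, not the actual environment code in the repository):

```python
import random


def greedy_capture_policy(board, legal_moves, count_flips, greedy_prob=0.8):
    """With probability 0.8, play the legal move that flips the most
    opponent pieces; otherwise play a random legal move.

    count_flips(board, move) is assumed to return the number of pieces
    that the move would flip.
    """
    if random.random() < greedy_prob:
        return max(legal_moves, key=lambda m: count_flips(board, m))
    return random.choice(legal_moves)
```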
Since Othello starts from the center of the board, always capturing many pieces is thought to lead to actively taking "edge" squares far from the center in the early game. Actively taking the edges likely gave the environment chances to take the "corners", and once the environment held the corners, the agent kept losing. Indeed, while the winning percentage hovered around 50%, the agent and the environment took corners with roughly equal probability; but as soon as the agent began favoring the edges, the environment became somewhat more likely to take the corners, and therefore more likely to win. The larger the board, the more likely taking an edge is to give away a corner, which is probably why the dip appeared only on the 8x8 board and not on the 6x6 board.
Therefore, I would like to conclude that this drop in winning percentage occurred because the agent was pulled along by a specific behavior of the environment, fell into a state resembling "overfitting", and learned the wrong strategy. How the slope of Leaky ReLU factors into this, however, I cannot yet explain; I would like to keep studying and thinking about it.
In any case, to avoid this situation one could, like AlphaGo, do supervised learning to train the agent well before it plays against the environment; or, if the agent grows in the wrong direction and the reward keeps falling, it might help to temporarily raise the ε of ε-greedy, for example, to speed up the "metabolism" of the model.
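A minimal sketch of the latter idea, assuming a sliding-window win rate is available to the training loop (the threshold and boost values below are arbitrary placeholders, not tested settings):

```python
def adjust_epsilon(epsilon, recent_win_rate,
                   threshold=0.4, boosted=0.1, minimum=0.001):
    """If the recent win rate collapses, temporarily raise epsilon so the
    agent explores again instead of reinforcing a failing policy."""
    if recent_win_rate < threshold:
        return max(epsilon, boosted)
    return max(epsilon, minimum)
```

Something like this could be called every few hundred episodes, so that a collapsing policy goes back to exploring rather than reinforcing itself.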
I would like to move on to other things, so this is where I wrap up Othello.
This project started from curiosity: what happens if I set up an Othello problem myself and throw it at reinforcement learning? Since the goal was never to build a strong Othello agent, no supervised learning was used; the agent only ever interacted with my roughly built, self-made environment. Starting from this "blank slate" and learning only through that environment, the model is strongly influenced by the environment's behavior; or rather, the environment is all it ever interacts with. As a result, we saw that when the environment's implementation is somewhat flawed, as it was this time, the agent's learning can temporarily head in an unintended direction. But we also saw that the agent corrects itself and eventually raises its winning percentage. Perhaps this self-correction is the real value of reinforcement learning.[^3] It is a somewhat sentimental interpretation, but it makes me feel that even with strange parents (environments), children (agents) can grow up properly (^^;.[^4]
For this Othello project, running 50,000 episodes took more than 10 hours on the 6x6 board and more than 24 hours on the 8x8 board. On top of that, the 8x8 board does not fit in memory on my home Pascal GeForce GTX 1050 (2GB), so I had to run it on a Maxwell Tesla M60 (8GB) on an Azure NV6 instance, which is a bit slower than my home machine; thanks to that, this month's Azure bill has already passed 10,000 yen. It is hard to experiment any further, which is another reason to stop Othello here.
Oh, how I want an 8GB GTX 1070 or 1080...[^5]
● How to win at Othello / Reversi ○ http://mezasou.com/reversi/top27.html
[^1]: There are about 1.6 million saved positions in total. Accounting for rotations, transpositions, etc. would probably turn up more occurrences of the same position, but I leave that aside for now.
[^2]: See the second article in this series.
[^3]: The main premise being that the reward is reasonable.
[^4]: From a professional perspective this way of thinking may seem laughable, but please forgive my lack of study (sweat).
[^5]: This would be great!