A free-study project from my winter vacation. There is also a demo.
- I read this article and wished I had done the same thing back when I was a student.
- This article made it look fairly easy to do.
Rather than having it learn the rules of Reversi, the aim is a purely strong AI. Repository
The board is 6x6. I originally tried 8x8, but 6x6 keeps both the model and the training data at a reasonable size, so it was just right for a trial. ~~Or rather, it just happened to work out with 6x6, so I went with that.~~
I want to generate a large number of game records for training, so I first made some simple AIs: a completely random player and a Monte Carlo tree search player. The result is here. In this article, the Monte Carlo tree search AI with n playouts is written as MTS(n).
The plan is to train it with something like the Neural Fitted Q Iteration introduced [here](http://qiita.com/Ugo-Nama/items/08c6a5f6a571335972d5#%E9%96%A2%E6%95%B0%E8%BF%91%E4%BC%BC%E3%81%AB%E3%83%8B%E3%83%A5%E3%83%BC%E3%83%A9%E3%83%AB%E3%83%8D%E3%83%83%E3%83%88%E3%83%AF%E3%83%BC%E3%82%AF%E3%82%92%E4%BD%BF%E3%81%86), so I made game records for it. Each record stores the boards and whose turn it was as the game progressed, and I prepared about 200,000 games for a start. With DQN you don't need high-quality game records, which makes this very easy: you can produce them just by running completely random play for a few tens of minutes.
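As a rough illustration of what that record generation looks like, here is a minimal sketch, not the repository code: my own tiny 6x6 Reversi with random play, where the board encoding (+1 black, -1 white, 0 empty) and all function names are my assumptions.

```python
import random
import numpy as np

N = 6
DIRS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def initial_board():
    b = np.zeros((N, N), dtype=int)
    b[2, 2] = b[3, 3] = -1          # white
    b[2, 3] = b[3, 2] = +1          # black
    return b

def flips(board, color, r, c):
    """Stones flipped if `color` plays at (r, c); an empty list means the move is illegal."""
    if board[r, c] != 0:
        return []
    flipped = []
    for dr, dc in DIRS:
        line, rr, cc = [], r + dr, c + dc
        while 0 <= rr < N and 0 <= cc < N and board[rr, cc] == -color:
            line.append((rr, cc))
            rr, cc = rr + dr, cc + dc
        if line and 0 <= rr < N and 0 <= cc < N and board[rr, cc] == color:
            flipped += line
    return flipped

def random_game():
    """Play one random game; return the move history and the final result for black."""
    board, color, history, passes = initial_board(), +1, [], 0
    while passes < 2:
        moves = [(r, c) for r in range(N) for c in range(N) if flips(board, color, r, c)]
        if not moves:
            passes += 1
        else:
            passes = 0
            r, c = random.choice(moves)
            history.append((board.copy(), color, (r, c)))
            for rr, cc in flips(board, color, r, c) + [(r, c)]:
                board[rr, cc] = color
        color = -color
    return history, int(np.sign(board.sum()))   # +1 black wins, -1 white wins, 0 draw

# records = [random_game() for _ in range(200000)]  # the kind of data used for training
```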
See this article for the basics.
DQN takes a state `s` and an action `a` as inputs.
This time, the state is the board plus the turn, and the action is represented by the board after the move is played.
In the Atari games DQN is usually applied to, factors other than the agent's action (such as enemy movement) affect the next state, so the action can only be expressed as a key input. In Reversi, on the other hand, my move completely determines the next state, so I tried representing the action this way.
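So one (state, action) pair becomes three arrays: the current board, the side to move, and the board after the move. A small sketch of how I imagine they get packed for the three-input model built later (the encoding and the helper name are my assumptions):

```python
import numpy as np

def q_inputs(board, color, board_after):
    """Pack one (state, action) pair for a model with inputs
    [board_input, color_input, action_input], as in create_model6() below."""
    return [board[np.newaxis].astype("float32"),        # state: current 6x6 board
            np.array([[color]], dtype="float32"),       # state: side to move (+1 or -1)
            board_after[np.newaxis].astype("float32")]  # action: 6x6 board after the move

# q_value = model.predict(q_inputs(board, color, board_after))
```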
DQN assigns a reward `r` to each action. This time I used the final result of the game: +1 for a win, -1 for a loss, and 0 for draws and for every move before the end. This keeps the intermediate value within [-1, 1], so the output can be a `tanh`.
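In code, that reward assignment is just this (a minimal sketch; the names are mine):

```python
def reward(final_winner, color, is_last_move):
    """+1 for a win, -1 for a loss, 0 for a draw and for every non-final move.

    final_winner: +1 if black won, -1 if white won, 0 for a draw; color: +1 or -1.
    """
    if not is_last_move:
        return 0.0
    return float(final_winner * color)
```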
Also, assuming a zero-sum game, black's score = -(white's score) holds, which can be used when computing the estimated value of `R` described below. It also makes it easy to see how the Q value propagates.
The reward `r` is used to compute the total return `R`, which serves as the teacher signal ([reference](http://qiita.com/Ugo-Nama/items/08c6a5f6a571335972d5#deep%E3%81%98%E3%82%83%E3%81%AA%E3%81%84q-networkq%E5%AD%A6%E7%BF%92--%E9%96%A2%E6%95%B0%E8%BF%91%E4%BC%BC)).
Q_\theta(s, a) = r + \gamma*\max_{a'}Q_\theta(s',a')
This is the usual teacher-signal formula, but in Reversi black and white alternate, so the player to move in `s` and the player to move in `s'` differ, and simply applying the formula as it is does not give the value we want.
Here, using the zero-sum assumption above, I rewrite it as follows.
Q_\theta(s, a) = \left\{
\begin{array}{ll}
r + \gamma*\max_{a'}Q_\theta(s',a') & (\text{the turn is the same in } s \text{ and } s')\\
r - \gamma*\max_{a'}Q_\theta(s',a') & (\text{the turn differs between } s \text{ and } s')
\end{array}
\right.
When the turn changes, the zero-sum assumption says that the opponent's score with its sign flipped is your own score, which gives this formula. The minimax method does essentially the same thing, so it came to me naturally, though I'm not sure it is strictly correct.
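Put as code, the teacher signal with this sign flip might look like the following sketch (the value of γ here is my placeholder; the discount rate actually used is not stated):

```python
import numpy as np

GAMMA = 0.99  # placeholder; the discount rate actually used is not stated here

def nfq_target(r, same_turn, next_q_values):
    """Teacher signal for Q(s, a) under the zero-sum sign flip.

    r: reward of (s, a); same_turn: True if the same player moves in s and s'
    (e.g. the opponent had to pass); next_q_values: Q(s', a') for all legal a',
    empty at a terminal state.
    """
    if len(next_q_values) == 0:          # terminal: nothing to bootstrap from
        return r
    bootstrap = GAMMA * float(np.max(next_q_values))
    return r + bootstrap if same_turn else r - bootstrap
```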
Furthermore, a Reversi position can be regarded as equivalent to its reflections across the two diagonals and its 180-degree rotation, so I also introduced the idea of taking the `max` over these four variants to speed up value propagation.
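For reference, a sketch of the four equivalent views of a board (the function is mine; the same transform has to be applied to the state board and the action board together):

```python
import numpy as np

def symmetric_variants(board):
    """Identity, the two diagonal reflections, and the 180-degree rotation."""
    return [board,
            board.T,               # reflection across the main diagonal
            np.rot90(board, 2).T,  # reflection across the anti-diagonal
            np.rot90(board, 2)]    # 180-degree rotation

# The max of Q over these four variants is used as the propagated value.
```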
The code is here, but for speed the `predict` calls needed to compute the `max` are batched together, and the ε-greedy code is mixed in, so it is terribly hard to read.
(2017/1/1 postscript: while thinking about the 8x8 board I noticed that NFQ does not seem to need ε-greedy at all.)
Taking advantage of Keras, I made the following model.
# Assumes the Keras 1.x API (merge(), Model(input=..., output=...), Conv2D(filters, rows, cols)).
from keras.layers import (Input, InputLayer, Reshape, Flatten, Dense,
                          Conv2D, LocallyConnected2D, ZeroPadding1D,
                          BatchNormalization, merge)
from keras.layers.advanced_activations import ELU
from keras.models import Model, Sequential
from keras.optimizers import Adam


def create_model6():
    # Inputs: the current board, the board after the action, and the side to move.
    board_input = Input(shape=[6, 6])
    action_input = Input(shape=[6, 6])
    color_input = Input(shape=[1])
    # Vertical columns: 6x1 convolution followed by 1x1.
    model_v = Sequential([
        InputLayer([6, 6]),
        Reshape([6, 6, 1]),
        Conv2D(64, 6, 1), # 1x6
        ELU(),
        Conv2D(64, 1, 1),
        ELU(),
        Flatten()
    ])
    # Horizontal rows: 1x6 convolution followed by 1x1.
    model_h = Sequential([
        InputLayer([6, 6]),
        Reshape([6, 6, 1]),
        Conv2D(64, 1, 6), # 6x1
        ELU(),
        Conv2D(64, 1, 1),
        ELU(),
        Flatten()
    ])
    # "\" diagonals: pad and reshape so each diagonal becomes a column, then locally connected.
    model_dr = Sequential([
        InputLayer([6, 6]),
        Reshape([6*6, 1]),
        ZeroPadding1D(3),
        Reshape([6, 7, 1]),
        LocallyConnected2D(64, 6, 1),
        ELU(),
        LocallyConnected2D(64, 1, 1),
        ELU(),
        Flatten()
    ])
    # "/" diagonals: same trick with different padding.
    model_dl = Sequential([
        InputLayer([6, 6]),
        Reshape([6*6, 1]),
        ZeroPadding1D(2),
        Reshape([8, 5, 1]),
        LocallyConnected2D(64, 8, 1),
        ELU(),
        LocallyConnected2D(64, 1, 1),
        ELU(),
        Flatten()
    ])
    # Expand the side-to-move scalar.
    color_model = Sequential([
        InputLayer([1]),
        Dense(256),
        ELU(),
        Dense(1024),
        ELU()
    ])
    # The four board towers share weights between the state board and the action board.
    merge_layer = merge([
        model_v(board_input),
        model_h(board_input),
        model_dl(board_input),
        model_dr(board_input),
        color_model(color_input),
        model_v(action_input),
        model_h(action_input),
        model_dl(action_input),
        model_dr(action_input),
    ], mode="concat", concat_axis=-1)
    x = Dense(2048)(merge_layer)
    x = BatchNormalization()(x)
    x = ELU()(x)
    x = Dense(512)(x)
    x = BatchNormalization()(x)
    x = ELU()(x)
    x = Dense(128)(x)
    x = BatchNormalization()(x)
    x = ELU()(x)
    output = Dense(1, activation="tanh")(x)
    model = Model(input=[board_input, color_input, action_input], output=[output])
    adam = Adam(lr=1e-4)
    model.compile(optimizer=adam, loss="mse")
    return model
First, there are two boards, the current one and the one after the action, and each is fed into four kinds of networks that share weights.
`model_v` and `model_h` apply a 1x1 convolution after the 6x1 and 1x6 convolutions shown above. This means each feature is computed from a single vertical column or a single horizontal row.
The remaining `model_dl` and `model_dr` handle the diagonals. `ZeroPadding1D` and `Reshape` are used to line the diagonals up as vertical columns, which are then passed through a column-wise `LocallyConnected2D` (6x1 in `model_dr`). As an example, in `model_dr` the diagonals are aligned as in the figure below.
I use `LocallyConnected2D` rather than a plain convolution because each column corresponds to a different diagonal. Ideally I would treat each of them separately, but I skipped that.
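As a quick way to see what the padding-and-reshape trick does (this is just a check of mine, not the training code):

```python
import numpy as np

# model_dr-style alignment: pad the flattened 6x6 board with 3 zeros on each
# side and view it as 6x7; each "\" diagonal then sits in its own column.
board = np.arange(36).reshape(6, 6)          # cell value = its flat index
flat = board.reshape(36, 1)
padded = np.vstack([np.zeros((3, 1), int), flat, np.zeros((3, 1), int)])
print(padded.reshape(6, 7))
# Column 3 is the main diagonal (0, 7, 14, 21, 28, 35); the other columns hold
# the shorter "\" diagonals plus zero padding, so the tall LocallyConnected2D
# kernel looks at exactly one diagonal per column.
```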
After that, the turn input is also expanded through a couple of dense layers, and all of the outputs are concatenated and fed into the final fully connected layers.
Keras gives a lot of freedom in model design, but I was quite at a loss in the places where `Model`, `Sequential`, and `Layer` cannot simply be used interchangeably. The part around the last fully connected layers is the product of that struggle.
Also, I learned about `InputLayer` from this article and it moved things forward, so I hope that article gets more likes.
I used two methods to check the progress.
Every so many iterations I played a few hundred games against the random AI and looked at the win rate. In fact the implementation was buggy at first, and by the time I noticed, fixed it, and resumed training, the win rate was already above 99%, so this was not very informative. I used it to keep the model with the best win rate seen during training.
The other check is to feed in records of games played between two MTS(5000) players and see what values come out. DQN estimates the return obtained when the best move is always played, so the idea is to score games that keep playing near-best moves thanks to the large number of playouts. The output at the end of training was as follows.
0: B: [-0.02196666 -0.02230018 -0.0303785 -0.03168077]
1: W: [-0.09994178 -0.10080269 -0.09473281 -0.09523395]
2: B: [-0.05049421 -0.0613905 -0.05865757 -0.0515893 ]
3: W: [-0.10307723 -0.11034401 -0.1078812 -0.11545543]
4: B: [-0.02195507 -0.01722838 -0.00687013 -0.00869585]
5: W: [-0.128447 -0.12213746 -0.14925914 -0.14207964]
6: B: [-0.22107203 -0.21318831 -0.1865633 -0.18185341]
7: W: [ 0.06178221 0.08316899 0.05964577 0.03900897]
8: B: [-0.22808655 -0.20577276 -0.16608836 -0.18835764]
9: W: [ 0.0713096 0.08927093 0.05861793 0.0417437 ]
10: B: [-0.25330806 -0.2539109 -0.22571792 -0.20664638]
11: W: [ 0.05460038 0.07116394 0.07360842 0.01370821]
12: B: [-0.24553944 -0.22967289 -0.22448763 -0.2255006 ]
13: W: [ 0.08806669 0.11078909 0.11152182 0.0635582 ]
14: B: [-0.45493153 -0.45988095 -0.46441916 -0.41323128]
15: W: [ 0.37723741 0.37333494 0.36738792 0.32914931]
16: B: [-0.46461144 -0.44231483 -0.42101949 -0.38848421]
17: W: [ 0.17253573 0.21936594 0.20173906 0.16213408]
18: B: [-0.50654161 -0.52144158 -0.50978303 -0.51221204]
19: W: [ 0.42853072 0.47864962 0.42829457 0.39107552]
20: B: [-0.68370593 -0.69842124 -0.69973147 -0.70182347]
21: W: [ 0.39964256 0.50497115 0.51084852 0.49810672]
22: B: [-0.71973562 -0.70337516 -0.62852156 -0.67204589]
23: W: [ 0.37252051 0.49400631 0.34360072 0.35528785]
24: B: [-0.81641883 -0.84098679 -0.79452062 -0.82713711]
25: W: [ 0.80729294 0.81642532 0.79480326 0.78571904]
26: B: [-0.916417 -0.9247427 -0.90268892 -0.89786631]
27: W: [ 0.93264407 0.93837488 0.93335259 0.9382773 ]
28: W: [ 0.93243909 0.93243909 0.9183923 0.9183923 ]
Each row is one move, from the first move at the top; the four values in a row are the original action, its reflections across the two diagonals, and its 180-degree rotation. You can read off that the magnitude of the score grows toward the end of the game and that the sign flips as the turn changes. The four values in a row being roughly equal suggests the symmetry is being captured. In the early moves, either the Q value has not propagated back yet, or the position is simply still undecided. The discount rate γ does not seem to be doing much...
With that setup I let it run for about three days.
When I came back from buying the new volume of Made in Abyss, every output had turned into `nan` and I broke into a cold sweat.
Fortunately a model had been saved along the way, so I evaluated its win rate:
- 995/1000 against the random AI
- 95/100 against MTS(30)
- 6/10 against MTS(100)
- 2/10 against MTS(1000)
Hmm, a bit underwhelming? Against a random opponent you can win just by playing well in the endgame, so I expected DQN, which propagates values backward from the endgame, to rack up wins there easily; what surprised me a little is that it could also beat MTS opponents, which should be stronger in the endgame. Maybe it builds an advantage by the midgame? Without the `nan` issue, I feel it could have gone further and become stronger.
I've always wanted to try GCP, so I used the free tier. Unless it goes down, it should stay up over the year-end and New Year holidays: http://104.155.201.107:8888/ Select DQN in the select box at the bottom to play against the DQN.
Since this is Reversi, I designed the model so it can look along rows, columns, and diagonals. I don't know whether it learned the flanking rule itself, but perhaps it picked up something like the openness of a position. In hindsight, the value of an action seems to be determined by the board after the action alone, so including the pre-action board in the state was probably unnecessary: if the resulting board is the same, the score should be the same regardless of the state it came from. The model I threw together happened to work, so I haven't explored this properly. It is hard to, when each training run takes several days.
Toward the end of the game the sign of the score flipped with each turn, so learning did seem to be progressing.
On the other hand, the absolute value of the score should fall off roughly like γ^n going back from the final move, but it looks far from that.
And what is the true value of `R` in the first place... The second player seems to have the advantage, so perhaps black should be negative and white positive right from the first move?
TODO
- I trained with float32, so switch to float64 to avoid the `nan` and push further.
- Try a simpler model.
- Do it with 8x8.
That's it folks!