- I tried to make an Othello AI with TensorFlow without understanding the theory of machine learning - Introduction -
- I tried to make an Othello AI with TensorFlow without understanding the theory of machine learning - Implementation -
- I tried to make an Othello AI with TensorFlow without understanding the theory of machine learning - Let's battle -
- I tried to make an Othello AI after trying to understand the theory of machine learning ~ Restart! ~
- I tried to make an Othello AI after trying to understand the theory of machine learning - What is this AlphaZero edition -
- I tried to make a neural network with Excel to understand the theory of machine learning ~ Image recognition MNIST edition ~
Continuing from last time: I am a complete outsider in this field and have not studied the theory of machine learning at all, but I want to make an Othello AI. Here are the sites I referenced:
- Implement DQN with Keras, TensorFlow and OpenAI Gym
- Training a TensorFlow neural network to play Tic-Tac-Toe game using one-step Q-learning algorithm
I made an Othello AI without studying machine learning theory at all. Here is a summary of the minimum knowledge needed to implement it.
The file structure and the role of each file look like this:
- train.py --- AI training
- Reversi.py --- Management of the Othello game
- dqn_agent.py --- Management of AI training
- FightWithAI.py --- Battle against the user
The DQN algorithm implemented this time looks like this.
If you keep this flow in mind, it will be easier to follow the explanation below.
The board used for the Othello game and AI training is represented as a two-dimensional array, numbered as in the figure below.
Reversi.py
self.screen[0~7][0~7]
The actions the AI can select are the numbers 0 to 63 shown in the figure above.
Reversi.py
self.enable_actions[0~63]
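As a minimal sketch of how the 8x8 board indices relate to these 0-63 action numbers (assuming the squares are numbered row by row from the top-left, as in the figure; `position_to_action` and `action_to_position` are hypothetical helper names, not functions from Reversi.py):

```python
# Minimal sketch (assumption): squares are numbered row by row, left to right,
# so (row, col) on screen[0~7][0~7] maps to an action number 0-63.
COLS = 8

def position_to_action(row, col):
    # e.g. (0, 0) -> 0, (0, 7) -> 7, (7, 7) -> 63
    return row * COLS + col

def action_to_position(action):
    # e.g. 19 -> (2, 3)
    return divmod(action, COLS)

print(position_to_action(2, 3))  # 19
print(action_to_position(19))    # (2, 3)
```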
For AI training, players[0] and players[1] play Othello against each other n_epochs = 1000 times, and at the end the AI of the second player, players[1], is saved.
- If you win the game, reward = 1
- Otherwise, reward = 0
Two AIs play against each other, but if each one acts only on its own turn, the sequence of states up to the end of the game is not connected (the Q value does not propagate back). So both agents act on every turn. This time, independently of how the actual game proceeds, I decided to "store the transition in D" for every position that can be played on that turn.
train.py
# targets contains all the positions that can be played this turn
for tr in targets:
    # Copy the current game state
    tmp = copy.deepcopy(env)
    # Take the action
    tmp.update(tr, playerID[i])
    # Check whether the game has ended
    win = tmp.winner()
    end = tmp.isEnd()
    # Board after the action
    state_X = tmp.screen
    # Positions that can be played after the action
    target_X = tmp.get_enables(playerID[i+1])
    # Both players store the experience
    for j in range(0, len(players)):
        reward = 0
        if end == True:
            if win == playerID[j]:
                # Reward 1 for winning
                reward = 1
        # Both players "store the transition in D"
        players[j].store_experience(state, targets, tr, reward, state_X, target_X, end)
        players[j].experience_replay()
The following part of the DQN algorithm is done by dqn_agent.py.
- Store the transition (si, ai, ri, si+1, terminal) in D
- Sample a random minibatch (si, ai, ri, si+1, terminal) from D
- Teacher signal yi = ri + γ max_a Q(si+1, a; θ)
- For the Q-Network parameters θ, run a gradient step on (yi - Q(si, ai; θ))^2
- Periodically reset the Target Network: Q^ = Q
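As a concrete illustration of the teacher signal (the numbers below are made up; `gamma` stands for the discount factor γ):

```python
# Illustration only: computing the teacher signal y_i with made-up numbers.
gamma = 0.95        # discount factor (assumed value)
reward = 0          # non-terminal step: no reward yet
max_q_next = 0.4    # hypothetical max Q over the playable moves in s_{i+1}

y_non_terminal = reward + gamma * max_q_next  # 0.38
y_terminal_win = 1                            # terminal transition after a win: y_i = r_i = 1
print(y_non_terminal, y_terminal_win)
```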
To be honest, this part is just copied from the referenced site without my fully understanding it.
dqn_agent.py
def store_experience(self, state, targets, action, reward, state_1, targets_1, terminal):
    self.D.append((state, targets, action, reward, state_1, targets_1, terminal))

def experience_replay(self):
    state_minibatch = []
    y_minibatch = []

    # sample random minibatch
    minibatch_size = min(len(self.D), self.minibatch_size)
    minibatch_indexes = np.random.randint(0, len(self.D), minibatch_size)

    for j in minibatch_indexes:
        state_j, targets_j, action_j, reward_j, state_j_1, targets_j_1, terminal = self.D[j]
        action_j_index = self.enable_actions.index(action_j)

        y_j = self.Q_values(state_j)

        if terminal:
            # For a terminal transition the teacher signal is just the reward
            y_j[action_j_index] = reward_j
        else:
            # reward_j + gamma * max_action' Q(state', action')
            qvalue, action = self.select_enable_action(state_j_1, targets_j_1)
            y_j[action_j_index] = reward_j + self.discount_factor * qvalue

        state_minibatch.append(state_j)
        y_minibatch.append(y_j)

    # training
    self.sess.run(self.training, feed_dict={self.x: state_minibatch, self.y_: y_minibatch})

    # for log
    self.current_loss = self.sess.run(self.loss, feed_dict={self.x: state_minibatch, self.y_: y_minibatch})
Variable name | Contents |
---|---|
state | Board state ( = Reversi.screen[0~7][0~7] ) |
targets | Positions where a piece can be placed |
action | The selected action |
reward | Reward for the action: 0 or 1 |
state_1 | Board after the action |
targets_1 | Positions where a piece can be placed after the action |
terminal | True when the game has ended |
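For illustration only, a single transition stored in D might look like the following (the board arrays and move numbers below are made up; the real boards come from Reversi.screen):

```python
import numpy as np

# Hypothetical example of one tuple appended to D (values made up for illustration).
state   = np.zeros((8, 8))   # board before the action
state_1 = np.zeros((8, 8))   # board after the action
transition = (
    state,
    [19, 26, 37, 44],   # targets: playable positions this turn
    19,                 # action: the chosen position
    0,                  # reward: 0 unless the game was won
    state_1,
    [20, 29, 43],       # targets_1: playable positions after the action
    False,              # terminal: True only when the game has ended
)
```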
As described above, players[0] and players[1] play Othello against each other n_epochs = 1000 times during training, and the second player, players[1], is saved at the end.
train.py
# parameters
n_epochs = 1000
# environment, agent
env = Reversi()
# playerID
playerID = [env.Black, env.White, env.Black]
# player agent
players = []
# player[0]= env.Black
players.append(DQNAgent(env.enable_actions, env.name, env.screen_n_rows, env.screen_n_cols))
# player[1]= env.White
players.append(DQNAgent(env.enable_actions, env.name, env.screen_n_rows, env.screen_n_cols))
This part,

DQNAgent(env.enable_actions, env.name, env.screen_n_rows, env.screen_n_cols)

corresponds to the following steps and is handled in dqn_agent.py:
- Initialize Replay Memory D
- Initialize Q-Network Q with random weights θ
- Initialize Target Network Q^ with θ^ = θ
dqn_agent.py
class DQNAgent:

    def __init__(self, enable_actions, environment_name, rows, cols):
        # ...abridged...
        # Initialize Replay Memory D
        self.D = deque(maxlen=self.replay_memory_size)
        # ...abridged...

    def init_model(self):
        # input layer (rows x cols)
        self.x = tf.placeholder(tf.float32, [None, self.rows, self.cols])

        # flatten (rows x cols)
        size = self.rows * self.cols
        x_flat = tf.reshape(self.x, [-1, size])

        # Initialize Q-Network Q with random weights θ
        W_fc1 = tf.Variable(tf.truncated_normal([size, size], stddev=0.01))
        b_fc1 = tf.Variable(tf.zeros([size]))
        h_fc1 = tf.nn.relu(tf.matmul(x_flat, W_fc1) + b_fc1)

        # Initialize Target Network Q (θ^ = θ)
        W_out = tf.Variable(tf.truncated_normal([size, self.n_actions], stddev=0.01))
        b_out = tf.Variable(tf.zeros([self.n_actions]))
        self.y = tf.matmul(h_fc1, W_out) + b_out

        # loss function
        self.y_ = tf.placeholder(tf.float32, [None, self.n_actions])
        self.loss = tf.reduce_mean(tf.square(self.y_ - self.y))

        # train operation
        optimizer = tf.train.RMSPropOptimizer(self.learning_rate)
        self.training = optimizer.minimize(self.loss)

        # saver
        self.saver = tf.train.Saver()

        # session
        self.sess = tf.Session()
        self.sess.run(tf.initialize_all_variables())
python
for e in range(n_epochs):
    # reset
    env.reset()
    terminal = False
- for episode = 1, M do
  - Get the initial screen x1 and preprocess it to create the initial state s1
python
while terminal == False:  # Loop until the end of one episode
    for i in range(0, len(players)):
        state = env.screen
        targets = env.get_enables(playerID[i])
        if len(targets) > 0:
            # There is at least one position where a piece can be placed
            # ← At this point, every playable move is "stored in D" as described above
            # Choose an action
            action = players[i].select_action(state, targets, players[i].exploration)
            # Take the action
            env.update(action, playerID[i])
- while not terminal
  - Action selection
The action selection, `agent.select_action(state_t, targets, agent.exploration)`, is handled in dqn_agent.py.
- Action selection
  - Random action ai
  - Or ai = argmax_a Q(si, a; θ)
dqn_agent.py
def Q_values(self, state):
    # Q(state, action) of all actions
    return self.sess.run(self.y, feed_dict={self.x: [state]})[0]

def select_action(self, state, targets, epsilon):
    if np.random.rand() <= epsilon:
        # random
        return np.random.choice(targets)
    else:
        # max_action Q(state, action)
        qvalue, action = self.select_enable_action(state, targets)
        return action

# From the playable positions (targets) on the board (state),
# return the position that maximizes the Q value, and that Q value
def select_enable_action(self, state, targets):
    Qs = self.Q_values(state)
    # Walk the actions from highest to lowest Q value and
    # take the first one that is actually playable
    index = np.argsort(Qs)
    for action in reversed(index):
        if action in targets:
            break
    # max_action Q(state, action)
    qvalue = Qs[action]

    return qvalue, action
- Execute action ai and observe the reward ri, the next screen xi+1, and the end flag terminal
  - Preprocess to create the next state si+1
Finally, save the AI of the second player.
# Result of performing the action
terminal = env.isEnd()
w = env.winner()
print("EPOCH: {:03d}/{:03d} | WIN: player{:1d}".format(
    e, n_epochs, w))

# Save the second player, players[1]
players[1].save_model()
The source is here.
$ git clone https://github.com/sasaco/tf-dqn-reversi.git
Next time, I will cover the battle edition.