A beta version of ChainerRL has been released, so I tried it out right away. Here I adapt the code from the Quick Start Guide to tic-tac-toe (the ○× game).
First, install ChainerRL.
pip install chainerrl
You will need cmake, so if you haven't installed it, please install it in advance.
brew install cmake
My environment is as follows.
Whatever kind of player is involved (DQN, random, human, and so on), we need a game board to play the ○× game on, so let's create that first. This time I will keep all the code in a single file rather than splitting it up, so the necessary libraries are imported at the top.
dqn.py
import chainer
import chainer.functions as F
import chainer.links as L
import chainerrl
import numpy as np

# Game board
class Board():
    def reset(self):
        self.board = np.array([0] * 9, dtype=np.float32)
        self.winner = None
        self.missed = False
        self.done = False

    def move(self, act, turn):
        if self.board[act] == 0:
            self.board[act] = turn
            self.check_winner()
        else:
            self.winner = turn * -1
            self.missed = True
            self.done = True

    def check_winner(self):
        win_conditions = ((0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6), (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6))
        for cond in win_conditions:
            if self.board[cond[0]] == self.board[cond[1]] == self.board[cond[2]]:
                if self.board[cond[0]] != 0:
                    self.winner = self.board[cond[0]]
                    self.done = True
                    return
        if np.count_nonzero(self.board) == 9:
            self.winner = 0
            self.done = True

    def get_empty_pos(self):
        empties = np.where(self.board == 0)[0]
        if len(empties) > 0:
            return np.random.choice(empties)
        else:
            return 0

    def show(self):
        row = " {} | {} | {} "
        hr = "\n-----------\n"
        tempboard = []
        for i in self.board:
            if i == 1:
                tempboard.append("○")
            elif i == -1:
                tempboard.append("×")
            else:
                tempboard.append(" ")
        print((row + hr + row + hr + row).format(*tempboard))
The class has the following five methods. Apologies to Python beginners for the somewhat rough code, but it should be clear what each one does.
- reset: Initializes the game board. Called before the start of each episode.
- move: Places a mark. After placing, it checks for a win, for a miss (placing on a square that cannot be played), and for the end of the game.
- check_winner: Checks for a winner.
- get_empty_pos: Returns the index of one randomly chosen playable square. As described later, it is used for random moves.
- show: Prints the board state. Used when playing against a human.
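For a quick sense of how these fit together, here is a throwaway sketch of driving the Board by hand (the moves are my own arbitrary example; it only uses the methods defined above):

board = Board()
board.reset()
board.show()               # prints an empty 3x3 grid
board.move(4, 1)           # player 1 (○) takes the center square
board.move(0, -1)          # player -1 (×) takes the top-left corner
board.move(4, -1)          # already occupied: missed/done are set and the win goes to player 1
print(board.done, board.missed, board.winner)   # True True 1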
To avoid getting stuck in a local optimum, it is apparently good to let the agent explore (take a random action) from time to time, and the Quick Start Guide does exactly that, so I follow it here as well. In the Quick Start the random action comes from gym, but here I have to provide it myself, so I add the following code.
dqn.py
# Random-action object for the explorer
class RandomActor:
    def __init__(self, board):
        self.board = board
        self.random_count = 0

    def random_action_func(self):
        self.random_count += 1
        return self.board.get_empty_pos()
random_action_func is the heart of this class. It calls get_empty_pos on the Board created earlier to pick a square that can still be played and returns it to the caller. It also increments a counter so that we can later see in the stats how often this function was used (i.e. how often the agent moved randomly instead of letting DQN decide). Why bother wrapping this in a separate object? That is explained below.
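As a quick, throwaway illustration of its behavior (not part of the script itself):

board = Board()
board.reset()
ra = RandomActor(board)
print(ra.random_action_func())   # index of a randomly chosen empty square, 0-8
print(ra.random_count)           # 1: the call above was counted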
Next comes the core of the DQN part, where ChainerRL finally comes into play: the Q-function.
dqn.py
# Q-function
class QFunction(chainer.Chain):
    def __init__(self, obs_size, n_actions, n_hidden_channels=81):
        super().__init__(
            l0=L.Linear(obs_size, n_hidden_channels),
            l1=L.Linear(n_hidden_channels, n_hidden_channels),
            l2=L.Linear(n_hidden_channels, n_hidden_channels),
            l3=L.Linear(n_hidden_channels, n_actions))

    def __call__(self, x, test=False):
        # leaky_relu, because the board values include -1
        h = F.leaky_relu(self.l0(x))
        h = F.leaky_relu(self.l1(h))
        h = F.leaky_relu(self.l2(h))
        return chainerrl.action_value.DiscreteActionValue(self.l3(h))
...and that's all. It is almost anticlimactically simple: it is basically the same as defining an ordinary neural network.
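To get a feel for what the Q-function returns, here is a throwaway forward pass on an empty board. The q_values and greedy_actions attributes are what I understand chainerrl.action_value.DiscreteActionValue to expose; treat those names as an assumption rather than gospel.

qf = QFunction(obs_size=9, n_actions=9)
x = np.zeros((1, 9), dtype=np.float32)   # a batch containing one empty board
av = qf(x)                               # a DiscreteActionValue
print(av.q_values.shape)                 # (1, 9): one Q-value per square (assumed attribute)
print(av.greedy_actions)                 # index of the highest-valued square (assumed attribute)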
Now that the building blocks are in place, all that is left is to set up the environment and the agents and write the game loop. First, the environment and the agents.
dqn.py
# Prepare the board
b = Board()
# Prepare the random-action object for the explorer
ra = RandomActor(b)
# Dimensions of the observation and action spaces
obs_size = 9
n_actions = 9
# Q-function and optimizer setup
q_func = QFunction(obs_size, n_actions)
optimizer = chainer.optimizers.Adam(eps=1e-2)
optimizer.setup(q_func)
# Reward discount rate
gamma = 0.95
# Epsilon-greedy exploration: epsilon decays linearly to end_epsilon over 50000 steps
explorer = chainerrl.explorers.LinearDecayEpsilonGreedy(
    start_epsilon=1.0, end_epsilon=0.3, decay_steps=50000, random_action_func=ra.random_action_func)
# Replay buffer for Experience Replay, the training technique used by DQN
replay_buffer = chainerrl.replay_buffer.ReplayBuffer(capacity=10 ** 6)
# Create the agents (two agents sharing the replay_buffer and so on)
agent_p1 = chainerrl.agents.DoubleDQN(
    q_func, optimizer, replay_buffer, gamma, explorer,
    replay_start_size=500, update_frequency=1,
    target_update_frequency=100)
agent_p2 = chainerrl.agents.DoubleDQN(
    q_func, optimizer, replay_buffer, gamma, explorer,
    replay_start_size=500, update_frequency=1,
    target_update_frequency=100)
This is where the RandomActor from earlier comes in, via epsilon-greedy. You have to hand the explorer, in advance, a reference to the function it should call when it decides to explore, and as far as I can tell you cannot pass arguments to that function. So I stored a reference to the game board in a member variable of a RandomActor instance created beforehand, and let the explorer call random_action_func with no arguments. I suspect there is a smarter way, so please let me know.
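For what it is worth, one common way around the no-arguments restriction is to bind the board into a plain function with functools.partial (or a lambda) instead of a dedicated class. This is a sketch of that idea, not what this article's code does; the downside is that you lose the random_count statistic unless you track it some other way:

import functools

def random_action_for(board):
    # pick one of the currently empty squares on the given board
    return board.get_empty_pos()

explorer = chainerrl.explorers.LinearDecayEpsilonGreedy(
    start_epsilon=1.0, end_epsilon=0.3, decay_steps=50000,
    random_action_func=functools.partial(random_action_for, b))
# or, equivalently: random_action_func=lambda: b.get_empty_pos()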
For ε-greedy itself, instead of a constant ε I use LinearDecayEpsilonGreedy, which reduces the value gradually: it starts at 1.0 (always random) and decays linearly over 50,000 steps down to 0.3, so at step t it should be roughly 1.0 - 0.7 * min(1, t / 50000). I have no idea whether these numbers are any good, so it may be worth experimenting with them.
For the agents, I create p1 and p2, which share the optimizer, replay_buffer and so on, and have them play against each other.
The next part is a bit long just when you want to get to the fun part, but once it is added everything runs, so bear with me: the training loop.
dqn.py
# Number of training games
n_episodes = 20000
# Counters
miss = 0
win = 0
draw = 0
# Episode loop
for i in range(1, n_episodes + 1):
    b.reset()
    reward = 0
    agents = [agent_p1, agent_p2]
    turn = np.random.choice([0, 1])
    last_state = None
    while not b.done:
        # Choose a square to play
        action = agents[turn].act_and_train(b.board.copy(), reward)
        # Play the move
        b.move(action, 1)
        # If the game ended as a result, set the reward and the counters, then learn
        if b.done == True:
            if b.winner == 1:
                reward = 1
                win += 1
            elif b.winner == 0:
                draw += 1
            else:
                reward = -1
            if b.missed is True:
                miss += 1
            # End the episode and learn
            agents[turn].stop_episode_and_train(b.board.copy(), reward, True)
            # The opponent also ends its episode and learns, but it should not
            # count the other side's miss as its own win
            if agents[1 if turn == 0 else 0].last_state is not None and b.missed is False:
                # Pass the last_state saved on the previous turn as the state after the action
                agents[1 if turn == 0 else 0].stop_episode_and_train(last_state, reward * -1, True)
        else:
            # Save this turn's final state for learning later
            last_state = b.board.copy()
            # Invert the board values when the game continues
            b.board = b.board * -1
            # Switch turns
            turn = 1 if turn == 0 else 0
    # Progress report on the console
    if i % 100 == 0:
        print("episode:", i, " / rnd:", ra.random_count, " / miss:", miss, " / win:", win, " / draw:", draw, " / statistics:", agent_p1.get_statistics(), " / epsilon:", agent_p1.explorer.epsilon)
        # Reset the counters
        miss = 0
        win = 0
        draw = 0
        ra.random_count = 0
    if i % 10000 == 0:
        # Save the model every 10000 episodes
        agent_p1.save("result_" + str(i))
print("Training finished.")
This consists of a for loop that repeats the game 20,000 times and a nested while loop that iterates over the turns within a game. The key point is that both the first player and the second player are the agent itself. Instead of ○ and ×, each side sees its own marks as 1 and its opponent's marks as -1 on the board, and since we want the agents to learn from both the first-move and second-move perspectives, the game loop always places a 1 rather than branching the code per player.
# Play the move
b.move(action, 1)
Of course, if we left it at that the board would fill up with 1s, so the sign of every value on the board is flipped whenever the turn changes.
else:
    # Invert the board values when the game continues
    b.board = b.board * -1
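As a concrete (made-up) example of what the flip does:

b.board = np.array([0, 0, 0, 0, 1, 0, -1, 0, 0], dtype=np.float32)
# Current player's view: its own mark (1) in the center, the opponent's mark (-1) at index 6.
b.board = b.board * -1
# Next player's view: its own mark is now the 1 at index 6, the opponent's center mark is -1.
# Empty squares still compare equal to 0, so move() and get_empty_pos() behave the same.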
Finally, the trained model is saved. ChainerRL appears to create the target directory even if it does not exist, so every 10,000 episodes I save a snapshot into a directory whose name ends with the episode count. Since both agents train on the same shared experience, only agent_p1 is saved.
Now let's run it! Since epsilon is large at the beginning, most moves are random (rnd is the count of random moves), so there are few misses at first. As the number of random moves gradually decreases, the agent plays more of the moves DQN itself chooses, so misses rise temporarily, but as learning progresses they converge again, and beyond about 15,000 episodes the miss count mostly stays in the low single digits.
episode: 100 / rnd: 761 / miss: 1 / win: 85 / draw: 14 / statistics: [('average_q', 0.11951273068342624), ('average_loss', 0.09235552993858538)] / epsilon: 0.994778
episode: 200 / rnd: 722 / miss: 3 / win: 85 / draw: 12 / statistics: [('average_q', 0.35500590929140996), ('average_loss', 0.12790488153218765)] / epsilon: 0.9895
episode: 300 / rnd: 756 / miss: 6 / win: 82 / draw: 12 / statistics: [('average_q', 0.6269444783473722), ('average_loss', 0.12164947750267516)] / epsilon: 0.984278
: (Omitted)
episode: 19800 / rnd: 212 / miss: 1 / win: 69 / draw: 30 / statistics: [('average_q', 0.49387913595157096), ('average_loss', 0.07891365175610675)] / epsilon: 0.3
episode: 19900 / rnd: 229 / miss: 1 / win: 61 / draw: 38 / statistics: [('average_q', 0.49195677296191365), ('average_loss', 0.07796313042393459)] / epsilon: 0.3
episode: 20000 / rnd: 216 / miss: 0 / win: 70 / draw: 30 / statistics: [('average_q', 0.509864846571749), ('average_loss', 0.07866546801090374)] / epsilon: 0.3
Training finished.
The agent seems to play without making illegal moves, and even with the occasional random move a fair number of games end in a draw, so let's play against it ourselves to gauge its strength.
First, create a HumanPlayer class as the interface through which a human enters moves.
dqn.py
# Human player
class HumanPlayer:
    def act(self, board):
        valid = False
        while not valid:
            try:
                act = input("Please enter 1-9: ")
                act = int(act)
                if act >= 1 and act <= 9 and board[act - 1] == 0:
                    valid = True
                    return act - 1
                else:
                    print("Invalid move")
            except Exception as e:
                print(act + " is invalid")
This is the game-progress part for the match. The DQN agent is fixed to 1 and the human to -1; who moves first is decided before each episode by randomly choosing whether the DQN agent goes first, and that flag controls whether the agent's first turn is skipped. As a consequence, the agent is always ○ and the human is always ×, regardless of who goes first.
dqn.py
# Verification: play against a human
human_player = HumanPlayer()
for i in range(10):
    b.reset()
    dqn_first = np.random.choice([True, False])
    while not b.done:
        # DQN's turn
        if dqn_first or np.count_nonzero(b.board) > 0:
            b.show()
            action = agent_p1.act(b.board.copy())
            b.move(action, 1)
            if b.done == True:
                if b.winner == 1:
                    print("DQN Win")
                elif b.winner == 0:
                    print("Draw")
                else:
                    print("DQN Missed")
                agent_p1.stop_episode()
                continue
        # Human's turn
        b.show()
        action = human_player.act(b.board.copy())
        b.move(action, -1)
        if b.done == True:
            if b.winner == -1:
                print("HUMAN Win")
            elif b.winner == 0:
                print("Draw")
            agent_p1.stop_episode()
print("Test finished.")
The point is that the agent is not learning here, so we use act() and stop_episode() instead of their _and_train counterparts. This, too, follows the Quick Start.
Now that the match code is ready, it would be a waste to train for 20,000 episodes all over again, so we load the saved agent instead. The clean solution would be a command-line option for dqn.py that switches between training and loading an existing model, but since I want to play as soon as possible, I simply skip training by setting the number of training episodes to 0, as follows.
dqn.py
#Number of learning games
n_episodes = 0
Then, right after the point where training finishes, add the following line to load the model.
dqn.py
print("Training finished.")
agent_p1.load("result_20000") #← Add this
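As noted above, a tidier approach would be to switch between training and loading with a command-line argument. A minimal sketch of that idea (the --load flag and its handling are my own addition, not part of the article's code):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--load", default=None,
                    help="directory of a saved agent, e.g. result_20000; if given, training is skipped")
args = parser.parse_args()

if args.load:
    n_episodes = 0            # skip the training loop entirely
    agent_p1.load(args.load)  # restore the saved weights instead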
When you're ready, it's time to play!
Training finished.
| |
-----------
| |
-----------
| |
| |
-----------
| |
-----------
○ | |
Please enter 1-9: 1
× | |
-----------
| |
-----------
○ | |
× | |
-----------
| |
-----------
○ | | ○
Please enter 1-9: 8
It plays! Yay!
I was new to both DQN and Python, so thank you for bearing with me. I'm very happy that, without ever being taught the rules, the agent has learned to play the standard opening moves almost every time. Moreover, this is much cleaner than implementing DQN directly in Chainer. ChainerRL is amazing! With code this easy to follow, it should also be easier to avoid introducing bugs while experimenting with improvements.
I'm sure there is plenty that could be improved, along the lines of "you should do it this way", "learning would progress better like this", or "it can't really learn with this setup", so I would appreciate any feedback. Thank you.
What bothers me most is that the agent seems to play almost the same moves every game. Should I make it explore more? After training for 350,000 episodes it plays conventional moves as usual and is quite strong, and with ε set to 0 nearly every game ends in a draw, so perhaps that is fine as it is. The results and the loss became essentially constant somewhere between 150,000 and 200,000 episodes.
Finally, here is the entire source. If your environment is set up, you should be able to copy, paste, and run it as is.
dqn.py
import chainer
import chainer.functions as F
import chainer.links as L
import chainerrl
import numpy as np

# Game board
class Board():
    def reset(self):
        self.board = np.array([0] * 9, dtype=np.float32)
        self.winner = None
        self.missed = False
        self.done = False

    def move(self, act, turn):
        if self.board[act] == 0:
            self.board[act] = turn
            self.check_winner()
        else:
            self.winner = turn * -1
            self.missed = True
            self.done = True

    def check_winner(self):
        win_conditions = ((0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6), (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6))
        for cond in win_conditions:
            if self.board[cond[0]] == self.board[cond[1]] == self.board[cond[2]]:
                if self.board[cond[0]] != 0:
                    self.winner = self.board[cond[0]]
                    self.done = True
                    return
        if np.count_nonzero(self.board) == 9:
            self.winner = 0
            self.done = True

    def get_empty_pos(self):
        empties = np.where(self.board == 0)[0]
        if len(empties) > 0:
            return np.random.choice(empties)
        else:
            return 0

    def show(self):
        row = " {} | {} | {} "
        hr = "\n-----------\n"
        tempboard = []
        for i in self.board:
            if i == 1:
                tempboard.append("○")
            elif i == -1:
                tempboard.append("×")
            else:
                tempboard.append(" ")
        print((row + hr + row + hr + row).format(*tempboard))

# Random-action object for the explorer
class RandomActor:
    def __init__(self, board):
        self.board = board
        self.random_count = 0

    def random_action_func(self):
        self.random_count += 1
        return self.board.get_empty_pos()

# Q-function
class QFunction(chainer.Chain):
    def __init__(self, obs_size, n_actions, n_hidden_channels=81):
        super().__init__(
            l0=L.Linear(obs_size, n_hidden_channels),
            l1=L.Linear(n_hidden_channels, n_hidden_channels),
            l2=L.Linear(n_hidden_channels, n_hidden_channels),
            l3=L.Linear(n_hidden_channels, n_actions))

    def __call__(self, x, test=False):
        # leaky_relu, because the board values include -1
        h = F.leaky_relu(self.l0(x))
        h = F.leaky_relu(self.l1(h))
        h = F.leaky_relu(self.l2(h))
        return chainerrl.action_value.DiscreteActionValue(self.l3(h))

# Prepare the board
b = Board()
# Prepare the random-action object for the explorer
ra = RandomActor(b)
# Dimensions of the observation and action spaces
obs_size = 9
n_actions = 9
# Q-function and optimizer setup
q_func = QFunction(obs_size, n_actions)
optimizer = chainer.optimizers.Adam(eps=1e-2)
optimizer.setup(q_func)
# Reward discount rate
gamma = 0.95
# Epsilon-greedy exploration: epsilon decays linearly to end_epsilon over 50000 steps
explorer = chainerrl.explorers.LinearDecayEpsilonGreedy(
    start_epsilon=1.0, end_epsilon=0.3, decay_steps=50000, random_action_func=ra.random_action_func)
# Replay buffer for Experience Replay, the training technique used by DQN
replay_buffer = chainerrl.replay_buffer.ReplayBuffer(capacity=10 ** 6)
# Create the agents (two agents sharing the replay_buffer and so on)
agent_p1 = chainerrl.agents.DoubleDQN(
    q_func, optimizer, replay_buffer, gamma, explorer,
    replay_start_size=500, update_frequency=1,
    target_update_frequency=100)
agent_p2 = chainerrl.agents.DoubleDQN(
    q_func, optimizer, replay_buffer, gamma, explorer,
    replay_start_size=500, update_frequency=1,
    target_update_frequency=100)

# Number of training games
n_episodes = 20000
# Counters
miss = 0
win = 0
draw = 0
# Episode loop
for i in range(1, n_episodes + 1):
    b.reset()
    reward = 0
    agents = [agent_p1, agent_p2]
    turn = np.random.choice([0, 1])
    last_state = None
    while not b.done:
        # Choose a square to play
        action = agents[turn].act_and_train(b.board.copy(), reward)
        # Play the move
        b.move(action, 1)
        # If the game ended as a result, set the reward and the counters, then learn
        if b.done == True:
            if b.winner == 1:
                reward = 1
                win += 1
            elif b.winner == 0:
                draw += 1
            else:
                reward = -1
            if b.missed is True:
                miss += 1
            # End the episode and learn
            agents[turn].stop_episode_and_train(b.board.copy(), reward, True)
            # The opponent also ends its episode and learns, but it should not
            # count the other side's miss as its own win
            if agents[1 if turn == 0 else 0].last_state is not None and b.missed is False:
                # Pass the last_state saved on the previous turn as the state after the action
                agents[1 if turn == 0 else 0].stop_episode_and_train(last_state, reward * -1, True)
        else:
            # Save this turn's final state for learning later
            last_state = b.board.copy()
            # Invert the board values when the game continues
            b.board = b.board * -1
            # Switch turns
            turn = 1 if turn == 0 else 0
    # Progress report on the console
    if i % 100 == 0:
        print("episode:", i, " / rnd:", ra.random_count, " / miss:", miss, " / win:", win, " / draw:", draw, " / statistics:", agent_p1.get_statistics(), " / epsilon:", agent_p1.explorer.epsilon)
        # Reset the counters
        miss = 0
        win = 0
        draw = 0
        ra.random_count = 0
    if i % 10000 == 0:
        # Save the model every 10000 episodes
        agent_p1.save("result_" + str(i))
print("Training finished.")

# Human player
class HumanPlayer:
    def act(self, board):
        valid = False
        while not valid:
            try:
                act = input("Please enter 1-9: ")
                act = int(act)
                if act >= 1 and act <= 9 and board[act - 1] == 0:
                    valid = True
                    return act - 1
                else:
                    print("Invalid move")
            except Exception as e:
                print(act + " is invalid")

# Verification: play against a human
human_player = HumanPlayer()
for i in range(10):
    b.reset()
    dqn_first = np.random.choice([True, False])
    while not b.done:
        # DQN's turn
        if dqn_first or np.count_nonzero(b.board) > 0:
            b.show()
            action = agent_p1.act(b.board.copy())
            b.move(action, 1)
            if b.done == True:
                if b.winner == 1:
                    print("DQN Win")
                elif b.winner == 0:
                    print("Draw")
                else:
                    print("DQN Missed")
                agent_p1.stop_episode()
                continue
        # Human's turn
        b.show()
        action = human_player.act(b.board.copy())
        b.move(action, -1)
        if b.done == True:
            if b.winner == -1:
                print("HUMAN Win")
            elif b.winner == 0:
                print("Draw")
            agent_p1.stop_episode()
print("Test finished.")