Continuing from the previous article, I would like to explain the code.
Previous article: Othello from the tic-tac-toe of "Implementation Deep Learning" (1) http://qiita.com/Kumapapa2012/items/cb89d73782ddda618c99
Subsequent articles:
Othello from the tic-tac-toe of "Implementation Deep Learning" (3) http://qiita.com/Kumapapa2012/items/3cc20a75c745dc91e826
Othello from the tic-tac-toe of "Implementation Deep Learning" (4) [End] http://qiita.com/Kumapapa2012/items/9cec4e6d2c935d11f108
The source code is here. https://github.com/Kumapapa2012/Learning-Machine-Learning/tree/master/Reversi
That said, this is simply the tic-tac-toe sample from the book "Implementation Deep Learning" turned into an Othello game. For this reason, this article explains only the changes from that sample and the code of the Othello game I created. For the meaning and role of each script file, the flow of operation, and a detailed explanation of Deep Q-Learning, please refer to the book "Implementation Deep Learning".
It has only been about three months since I started studying this seriously, so there may be mistakes. If you find any, or have comments or questions, I would appreciate it if you could leave a comment.
Before running any of the scripts below, the environment must be set up so that the samples of "Implementation Deep Learning" can be executed. As described in the book, each script must be run after entering the Anaconda environment with a command such as the following.
. activate main (Or source activate main)
Also, RL_Glue must be started before running each script (except "Game_Reversi_Test.py").
agent.py
It is the "agent" that performs machine learning.
Usage:
python agent.py [--gpu <gpu id>] [--size <board size>]
--gpu: GPU ID (the CPU is used if omitted or negative)
--size: Othello board size (default 6 if omitted)
Description:
It is an agent that performs reinforcement learning using DQN.
The board size must match the size specified when environment.py was started. [^1]
Through RL_Glue, it receives the following content from environment.py as a one-dimensional array, determines the best move with DQN, and returns the move to RL_Glue.
Layer | Contents |
---|---|
0 | Positions of your own pieces (Agent) |
1 | Positions of the opponent's pieces (Environment) |
2 | Positions where you can place a piece |
3 | Positions where the opponent can place a piece |
The agent keeps placing pieces as long as layer 2 indicates somewhere a piece can be placed. If layer 2 shows no place to put a piece, the agent passes.
Example) In the following situations on a 6x6 board:
- | 0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 1 | -1 | 0 | 0 | 0 |
3 | 0 | 1 | 1 | 0 | 0 | 0 |
4 | 0 | 1 | 0 | 0 | 0 | 0 |
5 | 0 | 0 | 0 | 0 | 0 | 0 |
The contents passed as input are the following four layers.
Layer 0 (positions of the agent's pieces):
- | 0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 1 | 0 | 0 | 0 | 0 |
3 | 0 | 1 | 1 | 0 | 0 | 0 |
4 | 0 | 1 | 0 | 0 | 0 | 0 |
5 | 0 | 0 | 0 | 0 | 0 | 0 |
Layer 1 (positions of the environment's pieces):
- | 0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 1 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 0 | 0 | 0 | 0 |
5 | 0 | 0 | 0 | 0 | 0 | 0 |
Layer 2 (positions where the agent can place a piece):
- | 0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 1 | 1 | 0 | 0 |
2 | 0 | 0 | 0 | 1 | 0 | 0 |
3 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 0 | 0 | 0 | 0 |
5 | 0 | 0 | 0 | 0 | 0 | 0 |
Layer 3 (positions where the environment can place a piece):
- | 0 | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 1 | 0 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 1 | 0 | 1 | 0 | 0 | 0 |
5 | 0 | 0 | 0 | 0 | 0 | 0 |
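For reference, the flattened contents of layers 0 and 1 in this example can be reproduced with a couple of NumPy comparisons. This is only a sketch of the encoding; layers 2 and 3 additionally require the move-validity check described later.

```python
import numpy as np

# The example 6x6 board above: 1 = agent piece, -1 = environment piece, 0 = empty
board = np.zeros((6, 6), dtype=int)
board[2, 1] = board[3, 1] = board[3, 2] = board[4, 1] = 1
board[2, 2] = -1

layer0 = (board.reshape(-1) == 1).astype(int)    # 36 elements: positions of the agent's pieces
layer1 = (board.reshape(-1) == -1).astype(int)   # 36 elements: positions of the environment's pieces
# Layers 2 and 3 (legal moves for each side) come from the move generator explained below.
```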
The input layer of the neural network has as many nodes as (the number of elements in the above information from environment.py) × (the number of past actions kept as history during training).
This time, I changed `__init__` of the QNet class so that it has 8 fully connected hidden layers.
The output layer has a number of nodes corresponding to the board size.
Among the nodes of this output layer, the one with the highest value indicates the square where the agent places its piece.
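As a rough illustration, a network of this shape could be written in Chainer (the framework used by the book's sample) roughly as follows. This is only a sketch under my own assumptions: the class name, the hidden-layer width of 256, and the exact wiring are illustrative and not the repository's actual code; only the overall shape (input of 4 layers × board squares × n_frames, 8 fully connected hidden layers, one output node per square) follows the description above and the parameter table below.

```python
import chainer
import chainer.functions as F
import chainer.links as L

class QNetSketch(chainer.ChainList):
    """Fully connected Q-network with 8 hidden layers (illustrative only)."""
    def __init__(self, n_in, n_hidden, n_out, n_hidden_layers=8):
        layers = [L.Linear(n_in, n_hidden)]
        layers += [L.Linear(n_hidden, n_hidden) for _ in range(n_hidden_layers - 1)]
        layers += [L.Linear(n_hidden, n_out)]  # one output node per board square
        super(QNetSketch, self).__init__(*layers)

    def __call__(self, x):
        h = x
        for i in range(len(self) - 1):
            h = F.relu(self[i](h))
        return self[-1](h)  # Q-values; the agent plays the square with the highest value

# Sizes for a 6x6 board with n_frames = 9 (see the table below):
size, n_frames = 6, 9
n_in = (size * size) * 4 * n_frames   # 36 squares x 4 layers x 9 history frames = 1296
net = QNetSketch(n_in, n_hidden=256, n_out=size * size)
```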
In addition, there are the following differences from the book.
Value | Book | This code | Description |
---|---|---|---|
self.n_rows | 3 | size (an even number of 6 or more) | Board size |
self.bdim | self.dim * 2 | self.dim * 4 | Size of the training data |
self.capacity | 1 * 10**4 | 2 * 10**4 | Replay memory capacity |
self.n_frames | 3 | 9 | Number of past actions kept as history for learning |
self.batch_size | 32 | 128 | Batch size during learning |
As for the implementation, apart from some changes that follow from the parameter changes above, the flow of operation is the same as explained in the book.
environment.py
It is the "environment" that plays against the agent.
Usage:
python environment.py [--size <board size>]
--size: Othello board size (default 6 if omitted)
Description:
Compared with the sample "environment.py" of the book "Implementation Deep Learning", the tic-tac-toe logic has been removed, and the game logic is instead implemented in "Game_Reversi.py".
Therefore, an instance of Game_Reversi is created and initialized as follows.
environment.py
import Game_Reversi as game_base
# (Omission)
def __init__(self, size):
    self.game = game_base.Game_Reversi(size, size)
# (The following is omitted)
The game board is a square of the specified size and is held in g_board of the Game_Reversi class as an N x N int array. Each array element contains one of the following values:
Value | Meaning |
---|---|
0 | Empty square |
1 | Agent |
-1 | Environment |
By representing the players as +1 and -1, actions and judgments for either side can be carried out simply by flipping the sign, which keeps the amount of code to a minimum. The game state described in "agent.py" is created by the function build_map_from_game.
environment.py
def build_map_from_game(self):
    map_data = []
    # 0: current board (positions of the agent's (= 1) pieces)
    board_data = (self.game.g_board.reshape(-1) == self.game.turn).astype(int)
    map_data.extend(board_data)
    # 1: current board (positions of the environment's (= -1) pieces)
    board_data = (self.game.g_board.reshape(-1) == -self.game.turn).astype(int)
    map_data.extend(board_data)
    # 2: places where the agent can put a piece. Agent and environment are
    #    distinguished by the sign of "turn"; positive means the agent.
    pos_available = np.zeros_like(self.game.g_board)
    l_available = self.game.getPositionAvail(self.game.turn)
    for avail in l_available:
        pos_available[tuple(avail)] = 1
    map_data.extend(pos_available.reshape(-1).tolist())
    # 3: places where the environment can put a piece
    pos_available = np.zeros_like(self.game.g_board)
    l_available = self.game.getPositionAvail(-self.game.turn)
    for avail in l_available:
        pos_available[tuple(avail)] = 1
    map_data.extend(pos_available.reshape(-1).tolist())
    return map_data
Basically, the board state and the placeable positions for each side are produced by flipping the sign of turn, and everything is concatenated into a single one-dimensional array. At the start of a game, the board is simply reset and the contents created by this function are sent.
environment.py
def env_start(self):
    # Reversi board initialization
    self.game.resetBoard()
    # Create the map data
    self.map = self.build_map_from_game()
    # (Omission)
    # Pass the board state to the agent through RL_Glue
    observation = Observation()
    observation.intArray = self.map
    return observation
After the game starts, the agent specifies an action (where to place a piece), and the environment carries out the following processing.
The agent's action is an integer. For a 6x6 board it is an integer with -1 ≤ a ≤ 35, where -1 means a pass. The board size is used to convert this into a (row, column) tuple, which is passed to the step method of Game_Reversi to execute the action.
environment.py
if int_action_agent == -1:
    step_raw_col = (-1, -1)
else:
    step_raw_col = (int_action_agent // self.n_cols, int_action_agent % self.n_cols)
# step execution
step_o, step_r, step_done = self.game.step(step_raw_col)
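As a quick check of this index conversion (plain arithmetic, assuming a 6x6 board so n_cols = 6):

```python
n_cols = 6
action = 15
print(action // n_cols, action % n_cols)  # -> 2 3, i.e. square (2, 3), a legal agent move in the earlier example
```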
When the action is executed, the board state, the reward, and a flag indicating whether the game has ended are returned. At this point, Game_Reversi's g_board has already been updated with both the agent's move and the environment's move. In this state, build_map_from_game is used again to create the board state.
Finally, the board state, the reward, and whether or not the game has ended are stored in rot, an instance of the RL_Glue Reward_observation_terminal class, and returned to RL_Glue.
environment.py
# (Omission)
rot = Reward_observation_terminal()
# Create the map with build_map_from_game()
self.map = self.build_map_from_game()
observation = Observation()
observation.intArray = self.map
rot.o = observation
# step_r is the reward, step_done indicates whether the game has ended
rot.r = step_r
rot.terminal = step_done
# (Omission)
# If the game has ended, the agent's agent_end is called next;
# if not, the agent continues with agent_step.
return rot
experiment.py
It is the "experiment" that manages the game.
Description: This script is unchanged from the content of the book "Implementation Deep Learning", so I will omit the explanation.
Game_Reversi.py
An implementation of the Othello game itself.
Description: It implements the rules of Othello. Under the official Othello rules, a piece may only be placed where it flips at least one of the opponent's pieces, and a player may pass only when there is no square where they can place a piece. This game follows these rules as well.
The game board is represented as a NumPy int array, so the main logic is implemented with array operations. [^2]
The flow of operation is as follows:
Initialization code.
Game_Reversi.py
def __init__(self, n_rows, n_cols):
    # Board reset
    self.n_rows = n_rows
    self.n_cols = n_cols
    self.g_board = np.zeros([self.n_rows, self.n_cols], dtype=np.int16)
    # In Othello, the first four pieces are placed in the center
    self.g_board[self.n_rows//2-1, self.n_cols//2-1] = 1
    self.g_board[self.n_rows//2-1, self.n_cols//2]   = -1
    self.g_board[self.n_rows//2,   self.n_cols//2-1] = -1
    self.g_board[self.n_rows//2,   self.n_cols//2]   = 1
As you can see, the code itself can handle non-square boards, but at present environment.py only starts games on square boards.
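For example, creating a 6x6 game as environment.py does and printing the centre of the board should show the four initial pieces (a small check implied by the __init__ code above; the commented output is what that code produces, not captured program output).

```python
import Game_Reversi as game_base

g = game_base.Game_Reversi(6, 6)
print(g.g_board[2:4, 2:4])
# Per __init__ above, the centre 2x2 block is:
# [[ 1 -1]
#  [-1  1]]
```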
To place a piece and flip the opponent's pieces, the squares where a piece can legally be placed must first be identified. This is done by isValidMove(), which uses array operations to decide whether a piece can be placed at a given square. For example, on an 8x8 board, suppose ● tries to place a piece at the X position (2,1).
The following steps decide whether a piece can be placed there. Phase 1: first, the 8 squares adjacent to the specified position are examined (the squares highlighted in green) to see whether any of them holds a ○. If there is no ○, it is decided at that point that ● cannot be placed there, and the process ends. In this case there is a ○ at the square shown in red, so we proceed to Phase 2.
Phase 2: search onward in the direction of each ○ that was found (the search range is highlighted in yellow). The search continues in that direction until a ●, an empty square, or the edge of the board is reached. For the ○ at (3,2), an empty square is found at (5,4), so that ○ cannot be flipped. Next, for the ○ at (2,2): searching to the right, a ● is found at (2,5). At this point it is known that ● can be placed at the X position (2,1).
The following code implements these.
Game_Reversi.py
# Check all 8 directions from the specified position.
# The search ends as soon as at least one opponent piece that can be flipped is found.
for direction in ([-1,-1],[-1,0],[-1,1],[0,-1],[0,1],[1,-1],[1,0],[1,1]):
    if not (0 <= (pos+direction)[0] < self.n_rows and 0 <= (pos+direction)[1] < self.n_cols):
        # Skip directions that go off the board
        continue
    #
    # Phase 1: is the adjacent square occupied by a piece of the opposite colour?
    #
    cpos = pos + direction
    if (self.g_board[tuple(cpos)] == -c):
        #
        # Phase 2: is there a piece of our own colour further along this direction?
        #
        while (0 <= cpos[0] < self.n_rows and 0 <= cpos[1] < self.n_cols):
            if (self.g_board[tuple(cpos)] == 0):
                # An empty square was reached before our own piece, so give up on this direction
                break
            elif (self.g_board[tuple(cpos)] == -c):
                # Still an opponent piece; it may be flippable,
                # so keep searching for one of our own pieces
                cpos = cpos + direction
                continue
            elif (self.g_board[tuple(cpos)] == c):
                # At least one piece can be flipped, so the search ends here
                result = True
                break
            else:
                print("catastorophic failure!!! @ isValidMove")
                exit()
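Since the excerpt above is only a fragment of isValidMove(), here is a compact, self-contained sketch of the same two-phase check, written purely for illustration; the function name, argument order, and details are my own and differ from the repository's actual implementation.

```python
import numpy as np

DIRECTIONS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def is_valid_move_sketch(board, row, col, c):
    """Return True if colour c may place a piece at (row, col). Illustrative only."""
    n_rows, n_cols = board.shape
    if board[row, col] != 0:
        return False
    for dr, dc in DIRECTIONS:
        r, k = row + dr, col + dc
        # Phase 1: the adjacent square must hold an opponent piece (-c)
        if not (0 <= r < n_rows and 0 <= k < n_cols) or board[r, k] != -c:
            continue
        # Phase 2: walk onward; finding our own piece before a blank or the edge makes the move legal
        while 0 <= r < n_rows and 0 <= k < n_cols:
            if board[r, k] == 0:
                break
            if board[r, k] == c:
                return True
            r, k = r + dr, k + dc
    return False

# On the 6x6 example board from earlier, (2, 3) is legal for the agent and (0, 0) is not:
board = np.zeros((6, 6), dtype=int)
board[2, 1] = board[3, 1] = board[3, 2] = board[4, 1] = 1
board[2, 2] = -1
print(is_valid_move_sketch(board, 2, 3, 1), is_valid_move_sketch(board, 0, 0, 1))  # True False
```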
Running isValidMove() on every empty square gives the list of squares where a piece can be placed. This is implemented in the getPositionAvail() function.
def getPositionAvail(self, c):
    temp = np.vstack(np.where(self.g_board == 0))
    nullTiles = np.hstack((temp[0].reshape(-1, 1), temp[1].reshape(-1, 1)))
    # Run isValidMove() on every empty square
    can_put = []
    for p_pos in nullTiles:
        if self.isValidMove(p_pos[0], p_pos[1], c):
            can_put.append(p_pos)
    return can_put
Both the agent and the environment choose where to place a piece from this list of placeable squares. Moreover, the list of squares where the opponent can place a piece can be obtained from the same function simply by flipping the sign of c.
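For example, usage might look like this (a small sketch; it assumes only the Game_Reversi(n_rows, n_cols) constructor and getPositionAvail(c) shown in this article):

```python
import Game_Reversi as game_base

g = game_base.Game_Reversi(6, 6)
agent_moves = g.getPositionAvail(1)    # squares where the agent (1) can place a piece
env_moves = g.getPositionAvail(-1)     # the same list for the environment, obtained by flipping the sign
print(agent_moves)
print(env_moves)
```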
Pieces are flipped by putStone(). Like isValidMove, this function searches the 8 directions, performing Phase 1 and Phase 2, and when a piece of its own colour is found it walks back toward the square where the piece was placed, flipping pieces along the way. In the earlier example, the piece is placed at (2,1) and a piece of the same colour is found at (2,5), so the pieces at (2,4), (2,3), and (2,2) are flipped in that order. The code is shown below.
Game_Reversi.py
for dir in ([-1,-1],[-1,0],[-1,1],[0,-1],[0,1],[1,-1],[1,0],[1,1]):
    f_canFlip = False
    cpos = pos + dir
    # (Omission)
    # Flip pieces from the current cpos position back toward the placement position
    if f_canFlip:
        cpos = cpos - dir  # the first piece to flip
        while np.array_equal(cpos, pos) == False:
            board[tuple(cpos)] = c
            numFlip = numFlip + 1
            cpos = cpos - dir
This function returns the number of flipped pieces. If its 4th argument is True, it runs in simulation mode and the flips are not reflected in the game; this is used to check in advance, for each candidate square, how many pieces could be flipped. As before, the colour of the pieces being flipped can be changed simply by inverting the sign.
While the agent uses DQN to decide where to place its pieces, the environment decides with getPosition(). The logic of getPosition() determines how strong the Othello opponent is. In this code, the square to place a piece on is chosen by the following logic.
Probability | Place |
---|---|
90% | One of the four corners (when a corner is available) |
80% | The square that flips the most pieces |
10% or 20% | Random (10% when a corner can be taken, 20% when it cannot) |
Below is the code.
Game_Reversi.py
# Decide whether to play randomly
t_rnd = np.random.random()
# 1. If a corner can be taken, take it with 90% probability
if cornerPos != []:
    if t_rnd < 0.9:
        returnPos = cornerPos[np.random.randint(0, len(cornerPos))]
# 2. Otherwise, with 80% probability, take the square that flips the most pieces
if returnPos == []:
    if maxPos != []:
        if t_rnd < 0.8:
            returnPos = maxPos
# 3. If still undecided, choose randomly (this may end up being the same square as 1 or 2)
if returnPos == []:
    returnPos = can_put[np.random.randint(0, len(can_put))]
return returnPos
When both the agent's and the environment's moves are complete, the end-of-game judgment and the reward calculation are performed. The end-of-game judgment looks for squares where either side can place a piece; the game ends when neither side has anywhere to play. The reward is 0 while the game is in progress. When the game ends, the pieces of both sides are counted: the reward is 1.0 if the agent has more ("win"), -1.0 if the agent has fewer ("lose"), and -0.5 if the counts are equal ("draw"). The code is as follows.
Game_Reversi.py
stonePos_agent = self.getPosition(self.turn)
stonePos_environment = self.getPosition(-self.turn)
if stonePos_agent == [] and stonePos_environment == []:
    done = True
    if self.render: print("****************Finish****************")
if done:
    # 2. Reward calculation when the game has ended
    num_agent = len(np.where(self.g_board == self.turn)[1])
    num_envionment = len(np.where(self.g_board == -self.turn)[1])
    if self.render: print("you:%i/environment:%i" % (num_agent, num_envionment))
    # Judgment
    if num_agent > num_envionment:
        reward = 1.0
        if self.render: print("you win!")
    elif num_agent < num_envionment:
        reward = -1.0
        if self.render: print("you lose!")
    else:
        reward = -0.5
        if self.render: print("Draw!")
Game_Reversi_Test.py
A script for testing the above "Game_Reversi.py". It lets a human play against the environment in place of the agent.
Usage:
python Game_Reversi_Test.py
Description: The board size is hard-coded to 8x8 as shown below, so change it as needed.
Game_Reversi_Test.py
g=game.Game_Reversi(8,8)
When started, it waits for user input. Specify the position to place a piece as comma-separated integers, such as "2,4". Just as when responding to the agent, the environment places its own piece after yours, and then the script waits for input again. When neither player can place a piece, the end-of-game judgment and the score are displayed and the program exits.
That's all for the explanation. I hope it is helpful. If you find any mistakes or have comments or questions, I would appreciate it if you could leave a comment.
(Book) Implementation Deep learning
http://shop.ohmsha.co.jp/shopdetail/000000004775/
Othello rules
http://www.othello.org/lesson/lesson/rule.html
(Other references will be posted at a later date)
[^1]: However, since agent_init receives this information from RL_Glue, this argument might be unnecessary if that is used.
[^2]: For this reason, it should also be possible to use the CUDA-backed NumPy (CuPy).
[^3]: Incidentally, if the first player, i.e. the agent (= 1), plays black as in the official rules, then the initial piece arrangement in this class has black and white reversed relative to the rules set by the Japan Othello Federation. However, since this is just the board rotated 90 degrees and makes no difference in principle, I have left it as it is.