I created an Othello (Reversi) environment for reinforcement learning with OpenAI Gym. I hope it will be helpful for anyone who wants to build their own reinforcement learning environment in the future. The learning algorithm itself is not implemented yet; that is my next step. The code is here: https://github.com/pigooosuke/gym_reversi
By default, gym/envs already ships with various learning environments, including board games such as Go and Hex. I created this environment by referring to those implementations.
The created Env can be called as follows.
```python
import gym
env = gym.make('Reversi8x8-v0')
```
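For reference, here is a minimal sketch of the usual interaction loop, assuming the classic gym API of that era (reset() returning the observation, step() returning observation, reward, done, info) and a standard Discrete action space; the randomly sampled actions are only there to exercise the API and will often be illegal moves:

```python
import gym

env = gym.make('Reversi8x8-v0')
obs = env.reset()                         # board state as a numpy array
done = False
while not done:
    action = env.action_space.sample()    # random (often illegal) action, for illustration only
    obs, reward, done, info = env.step(action)
env.render()
print('final reward:', reward)
```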
The class I created is ReversiEnv. Basically, an Env revolves around five methods:

- _step: advances the game by one step (applies the player's move and the opponent's reply, and checks whether the game is over)
- _reset: restores the Env to its defaults (loads the board, decides who moves first, etc.)
- _render: shows the state of the Env (image, RGB, or text can be set; here it displays the stones on the board)
- _close: discards all Env information (unused this time)
- _seed: sets the random seed used when actions are chosen at random
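As a rough illustration (a skeleton only, not the actual implementation in the repository), the class shape looks something like this:

```python
import numpy as np
from gym import Env
from gym.utils import seeding

class ReversiEnv(Env):
    """Skeleton for illustration; the real ReversiEnv fills these methods in."""

    def _reset(self):
        # load the initial board: 3 planes (player stones, opponent stones, free squares)
        self.state = np.zeros((3, 8, 8), dtype=int)
        self.state[2, :, :] = 1
        return self.state

    def _step(self, action):
        # apply the player's move, play the opponent's reply,
        # then check whether the game has finished
        reward, done = 0.0, False
        return self.state, reward, done, {}

    def _render(self, mode='human', close=False):
        # show the stones on the board as text
        print(self.state)

    def _close(self):
        # discard environment resources (unused in this env)
        pass

    def _seed(self, seed=None):
        # fix the seed of the RNG that drives random moves
        self.np_random, seed = seeding.np_random(seed)
        return [seed]
```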
The initial settings are:

- player_color: the color of the player's stones (black, which moves first)
- opponent: the opponent's policy (random this time)
- observation_type: how the state is encoded (possibly an unnecessary setting that could be removed; it declares that the state is managed as a numpy 3-channel array. I have left it in for now)
- illegal_place_mode: the penalty for an illegal move (losing the game, etc.)
- board_size: the board size (8 this time)
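For reference, this is roughly how such defaults get wired up when the environment is registered with gym (a sketch; the entry_point path is my assumption, not necessarily what the repository uses, and the kwarg values mirror the defaults listed above):

```python
from gym.envs.registration import register

register(
    id='Reversi8x8-v0',
    entry_point='gym.envs.board_game:ReversiEnv',  # assumed location of ReversiEnv
    kwargs={
        'player_color': 'black',        # the agent plays black and moves first
        'opponent': 'random',           # opponent policy: random moves
        'observation_type': 'numpy3c',  # board encoded as a 3-channel numpy array
        'illegal_place_mode': 'lose',   # an illegal move loses the game
        'board_size': 8,
    },
)
```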
action: decides which move to take against the Env. Since the board is 8x8, actions 0-63 are the squares where a stone is placed, 64 ends the game, and 65 is a pass. The idea is that the output of the reinforcement learning agent is plugged in here as the action.
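As a small illustration (a hypothetical helper for this post, not part of the env), the action index can be decoded like this:

```python
def action_to_coords(action, d=8):
    """Map an action index to a board coordinate, or to 'end'/'pass'."""
    if action == d ** 2:        # 64: ends the game
        return 'end'
    if action == d ** 2 + 1:    # 65: pass
        return 'pass'
    return divmod(action, d)    # (row, column) of the square to play

print(action_to_coords(20))     # (2, 4): third row, fifth column
```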
done: toward the end of the step processing, we have to check whether the game has finished as a result of this step. The game ends when

- there is no square left where a stone can be placed, or
- one of the players has no stones left.

If either condition is met, the reward for the finished game is returned.
reward: the evaluation is simply win or lose, +1 or -1.
```python
import numpy as np

def game_finished(board):
    # Returns 1 if player 1 wins, -1 if player 2 wins and 0 otherwise
    d = board.shape[-1]
    player_score_x, player_score_y = np.where(board[0, :, :] == 1)
    player_score = len(player_score_x)
    opponent_score_x, opponent_score_y = np.where(board[1, :, :] == 1)
    opponent_score = len(opponent_score_x)
    if player_score == 0:
        # the player has no stones left
        return -1
    elif opponent_score == 0:
        # the opponent has no stones left
        return 1
    else:
        free_x, free_y = np.where(board[2, :, :] == 1)
        if free_x.size == 0:
            # no free squares left: compare stone counts
            if player_score > (d**2)/2:
                return 1
            elif player_score == (d**2)/2:
                return 1
            else:
                return -1
        else:
            return 0
    return 0
```
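As a quick check, here is the standard opening position in the 3-plane encoding the function expects (plane 0 = the player's stones, plane 1 = the opponent's stones, plane 2 = free squares); the exact stone placement is my reading of that encoding, not code from the repository:

```python
import numpy as np

board = np.zeros((3, 8, 8), dtype=int)
board[2, :, :] = 1                        # every square starts free
for x, y in [(3, 4), (4, 3)]:             # the player's two starting stones
    board[0, x, y], board[2, x, y] = 1, 0
for x, y in [(3, 3), (4, 4)]:             # the opponent's two starting stones
    board[1, x, y], board[2, x, y] = 1, 0

print(game_finished(board))               # 0 -> the game is still in progress
```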
At first I did not encode any rules at all: any action 0-63 could be chosen in any state (stones could be placed anywhere), and I hoped the agent would learn the rules themselves. However, learning converged on just the first and second moves and did not progress well, so I ended up restricting the action values to the legal moves.
```python
def get_enable_to_actions(board, player_color):
    # Returns the legal actions for player_color; if there is no legal
    # square, the only available action is the pass (d**2 + 1).
    actions = []
    d = board.shape[-1]
    opponent_color = 1 - player_color
    for pos_x in range(d):
        for pos_y in range(d):
            # skip squares that are not free
            if (board[2, pos_x, pos_y] == 0):
                continue
            # look in all eight directions for a run of opponent stones
            # terminated by one of the player's own stones
            for dx in [-1, 0, 1]:
                for dy in [-1, 0, 1]:
                    if (dx == 0 and dy == 0):
                        continue
                    nx = pos_x + dx
                    ny = pos_y + dy
                    n = 0
                    if (nx not in range(d) or ny not in range(d)):
                        continue
                    while (board[opponent_color, nx, ny] == 1):
                        tmp_nx = nx + dx
                        tmp_ny = ny + dy
                        if (tmp_nx not in range(d) or tmp_ny not in range(d)):
                            break
                        n += 1
                        nx += dx
                        ny += dy
                    if (n > 0 and board[player_color, nx, ny] == 1):
                        actions.append(pos_x * d + pos_y)
    if len(actions) == 0:
        actions = [d**2 + 1]  # pass
    return actions
```
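Using the starting-position board built in the game_finished example above, the player (index 0) gets the four usual opening moves:

```python
print(get_enable_to_actions(board, 0))   # [19, 26, 37, 44]: the four legal opening squares
```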