I created an Othello (Reversi) environment for reinforcement learning with OpenAI Gym. I hope it will be helpful for anyone who wants to build their own reinforcement learning environment in the future. The learning algorithm itself is not implemented yet; that is my next step. The code is here: https://github.com/pigooosuke/gym_reversi
By default, gym/envs already ships with various learning environments, including board games such as Go and Hex. I created this environment by referring to those implementations.
The created Env can be called as follows.
```python
import gym
env = gym.make('Reversi8x8-v0')
```
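For reference, here is a minimal sketch of the usual interaction loop, assuming the classic gym API of that era (reset() returning the observation, step() returning observation, reward, done, info) and a standard Discrete action space; the randomly sampled actions are only there to exercise the API and will often be illegal moves:

```python
import gym

env = gym.make('Reversi8x8-v0')
obs = env.reset()                         # board state as a numpy array
done = False
while not done:
    action = env.action_space.sample()    # random (often illegal) action, for illustration only
    obs, reward, done, info = env.step(action)
env.render()
print('final reward:', reward)
```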
The class I created is ReversiEnv. Basically, an Env revolves around five methods:

- _step: advances the game by one step (applies the player's move and the opponent's reply, and checks whether the game is over)
- _reset: restores the Env to its defaults (loads the board, decides who moves first, etc.)
- _render: shows the state of the Env (image, RGB, or text can be set; here it displays the stones on the board)
- _close: discards all Env information (unused this time)
- _seed: sets the random seed used when actions are chosen at random
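As a rough illustration (a skeleton only, not the actual implementation in the repository), the class shape looks something like this:

```python
import numpy as np
from gym import Env
from gym.utils import seeding

class ReversiEnv(Env):
    """Skeleton for illustration; the real ReversiEnv fills these methods in."""

    def _reset(self):
        # load the initial board: 3 planes (player stones, opponent stones, free squares)
        self.state = np.zeros((3, 8, 8), dtype=int)
        self.state[2, :, :] = 1
        return self.state

    def _step(self, action):
        # apply the player's move, play the opponent's reply,
        # then check whether the game has finished
        reward, done = 0.0, False
        return self.state, reward, done, {}

    def _render(self, mode='human', close=False):
        # show the stones on the board as text
        print(self.state)

    def _close(self):
        # discard environment resources (unused in this env)
        pass

    def _seed(self, seed=None):
        # fix the seed of the RNG that drives random moves
        self.np_random, seed = seeding.np_random(seed)
        return [seed]
```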
The initial settings are:

- player_color: the color of the player's stones (black, which moves first)
- opponent: the opponent's policy (random this time)
- observation_type: how the state is encoded (possibly an unnecessary setting that could be removed; it declares that the state is managed as a numpy 3-channel array. I have left it in for now)
- illegal_place_mode: the penalty for an illegal move (losing the game, etc.)
- board_size: the board size (8 this time)
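For reference, this is roughly how such defaults get wired up when the environment is registered with gym (a sketch; the entry_point path is my assumption, not necessarily what the repository uses, and the kwarg values mirror the defaults listed above):

```python
from gym.envs.registration import register

register(
    id='Reversi8x8-v0',
    entry_point='gym.envs.board_game:ReversiEnv',  # assumed location of ReversiEnv
    kwargs={
        'player_color': 'black',        # the agent plays black and moves first
        'opponent': 'random',           # opponent policy: random moves
        'observation_type': 'numpy3c',  # board encoded as a 3-channel numpy array
        'illegal_place_mode': 'lose',   # an illegal move loses the game
        'board_size': 8,
    },
)
```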
action: decides which move to take against the Env. Since the board is 8x8, actions 0-63 are the squares where a stone is placed, 64 ends the game, and 65 is a pass. The idea is that the output of the reinforcement learning agent is plugged in here as the action.
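As a small illustration (a hypothetical helper for this post, not part of the env), the action index can be decoded like this:

```python
def action_to_coords(action, d=8):
    """Map an action index to a board coordinate, or to 'end'/'pass'."""
    if action == d ** 2:        # 64: ends the game
        return 'end'
    if action == d ** 2 + 1:    # 65: pass
        return 'pass'
    return divmod(action, d)    # (row, column) of the square to play

print(action_to_coords(20))     # (2, 4): third row, fifth column
```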
done: toward the end of the step processing, we have to check whether the game has finished as a result of this step. The game ends when

- there is no square left where a stone can be placed, or
- one of the players has no stones left.

If either condition is met, the reward for the finished game is returned.
reward: the evaluation is simply win or lose, +1 or -1.
```python
import numpy as np

def game_finished(board):
    # Returns 1 if player 1 wins, -1 if player 2 wins and 0 otherwise
    d = board.shape[-1]
    player_score_x, player_score_y = np.where(board[0, :, :] == 1)
    player_score = len(player_score_x)
    opponent_score_x, opponent_score_y = np.where(board[1, :, :] == 1)
    opponent_score = len(opponent_score_x)
    if player_score == 0:
        # the player has no stones left
        return -1
    elif opponent_score == 0:
        # the opponent has no stones left
        return 1
    else:
        free_x, free_y = np.where(board[2, :, :] == 1)
        if free_x.size == 0:
            # no free squares left: compare stone counts
            if player_score > (d**2)/2:
                return 1
            elif player_score == (d**2)/2:
                return 1
            else:
                return -1
        else:
            return 0
    return 0
```
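As a quick check, here is the standard opening position in the 3-plane encoding the function expects (plane 0 = the player's stones, plane 1 = the opponent's stones, plane 2 = free squares); the exact stone placement is my reading of that encoding, not code from the repository:

```python
import numpy as np

board = np.zeros((3, 8, 8), dtype=int)
board[2, :, :] = 1                        # every square starts free
for x, y in [(3, 4), (4, 3)]:             # the player's two starting stones
    board[0, x, y], board[2, x, y] = 1, 0
for x, y in [(3, 3), (4, 4)]:             # the opponent's two starting stones
    board[1, x, y], board[2, x, y] = 1, 0

print(game_finished(board))               # 0 -> the game is still in progress
```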
At first I did not encode any rules at all: any action 0-63 could be chosen in any state (stones could be placed anywhere), and I hoped the agent would learn the rules themselves. However, learning converged on just the first and second moves and did not progress well, so I ended up restricting the action values to the legal moves.
```python
def get_enable_to_actions(board, player_color):
    # Returns the legal actions for player_color; if there is no legal
    # square, the only available action is the pass (d**2 + 1).
    actions = []
    d = board.shape[-1]
    opponent_color = 1 - player_color
    for pos_x in range(d):
        for pos_y in range(d):
            # skip squares that are not free
            if (board[2, pos_x, pos_y] == 0):
                continue
            # look in all eight directions for a run of opponent stones
            # terminated by one of the player's own stones
            for dx in [-1, 0, 1]:
                for dy in [-1, 0, 1]:
                    if (dx == 0 and dy == 0):
                        continue
                    nx = pos_x + dx
                    ny = pos_y + dy
                    n = 0
                    if (nx not in range(d) or ny not in range(d)):
                        continue
                    while (board[opponent_color, nx, ny] == 1):
                        tmp_nx = nx + dx
                        tmp_ny = ny + dy
                        if (tmp_nx not in range(d) or tmp_ny not in range(d)):
                            break
                        n += 1
                        nx += dx
                        ny += dy
                    if (n > 0 and board[player_color, nx, ny] == 1):
                        actions.append(pos_x * d + pos_y)
    if len(actions) == 0:
        actions = [d**2 + 1]  # pass
    return actions
```
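Using the starting-position board built in the game_finished example above, the player (index 0) gets the four usual opening moves:

```python
print(get_enable_to_actions(board, 0))   # [19, 26, 37, 44]: the four legal opening squares
```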