- I tried to make an Othello AI with TensorFlow without understanding the theory of machine learning - Introduction -
- I tried to make an Othello AI with TensorFlow without understanding the theory of machine learning - Implementation -
- I tried to make an Othello AI with TensorFlow without understanding the theory of machine learning - Let's battle -
- I tried to make an Othello AI after trying to understand the theory of machine learning ~ Restart! ~
- I tried to make an Othello AI after trying to understand the theory of machine learning - What is this AlphaZero edition -
- I tried to make a neural network with Excel to understand the theory of machine learning ~ Image recognition MNIST edition ~
Continuing from last time: I am a complete outsider in this field and have not studied the theory of machine learning at all, but I want to make an Othello AI. Here are the sites I referenced:
- Implement DQN with Keras, TensorFlow and OpenAI Gym
- Training a TensorFlow neural network to play Tic-Tac-Toe game using one-step Q-learning algorithm
I made an Othello AI without studying machine learning theory at all. Here is a summary of the minimum knowledge needed to implement it.
The file structure and the role of each file look like this:
- train.py --- AI training
- Reversi.py --- Management of the Othello game
- dqn_agent.py --- Management of AI training
- FightWithAI.py --- Battle against the user
The DQN algorithm implemented this time looks like this.
If you keep this flow in mind, it will be easier to follow the explanation below.
The board used for the Othello game and AI training is represented as a two-dimensional array, numbered as in the figure below.
Reversi.py
self.screen[0~7][0~7]
The actions the AI can select are the numbers 0 to 63 shown in the figure above.
Reversi.py
self.enable_actions[0~63]
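As a minimal sketch of how the 8x8 board indices relate to these 0-63 action numbers (assuming the squares are numbered row by row from the top-left, as in the figure; `position_to_action` and `action_to_position` are hypothetical helper names, not functions from Reversi.py):

```python
# Minimal sketch (assumption): squares are numbered row by row, left to right,
# so (row, col) on screen[0~7][0~7] maps to an action number 0-63.
COLS = 8

def position_to_action(row, col):
    # e.g. (0, 0) -> 0, (0, 7) -> 7, (7, 7) -> 63
    return row * COLS + col

def action_to_position(action):
    # e.g. 19 -> (2, 3)
    return divmod(action, COLS)

print(position_to_action(2, 3))  # 19
print(action_to_position(19))    # (2, 3)
```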
For AI training, players[0] and players[1] play Othello against each other n_epochs = 1000 times, and at the end the AI of the second player, players[1], is saved.
- If you win the game, reward = 1
- Otherwise, reward = 0
Two AIs play against each other, but if each one acts only on its own turn, the sequence of states up to the end of the game is not connected (the Q value does not propagate back). So both agents act on every turn. This time, independently of how the actual game proceeds, I decided to "store the transition in D" for every position that can be played on that turn.
train.py
# targets contains all the positions that can be played this turn
for tr in targets:
    # Copy the current game state
    tmp = copy.deepcopy(env)
    # Take the action
    tmp.update(tr, playerID[i])
    # Check whether the game has ended
    win = tmp.winner()
    end = tmp.isEnd()
    # Board after the action
    state_X = tmp.screen
    # Positions that can be played after the action
    target_X = tmp.get_enables(playerID[i+1])
    # Both players store the experience
    for j in range(0, len(players)):
        reward = 0
        if end == True:
            if win == playerID[j]:
                # Reward 1 for winning
                reward = 1
        # Both players "store the transition in D"
        players[j].store_experience(state, targets, tr, reward, state_X, target_X, end)
        players[j].experience_replay()
The following part of the DQN algorithm is done by dqn_agent.py.
- Store the transition (si, ai, ri, si+1, terminal) in D
- Sample a random minibatch (si, ai, ri, si+1, terminal) from D
- Teacher signal yi = ri + γ max_a Q(si+1, a; θ)
- For the Q-Network parameters θ, run a gradient step on (yi - Q(si, ai; θ))^2
- Periodically reset the Target Network: Q^ = Q
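As a concrete illustration of the teacher signal (the numbers below are made up; `gamma` stands for the discount factor γ):

```python
# Illustration only: computing the teacher signal y_i with made-up numbers.
gamma = 0.95        # discount factor (assumed value)
reward = 0          # non-terminal step: no reward yet
max_q_next = 0.4    # hypothetical max Q over the playable moves in s_{i+1}

y_non_terminal = reward + gamma * max_q_next  # 0.38
y_terminal_win = 1                            # terminal transition after a win: y_i = r_i = 1
print(y_non_terminal, y_terminal_win)
```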
To be honest, this part is just copied from the referenced site without my fully understanding it.
dqn_agent.py
def store_experience(self, state, targets, action, reward, state_1, targets_1, terminal):
    self.D.append((state, targets, action, reward, state_1, targets_1, terminal))

def experience_replay(self):
    state_minibatch = []
    y_minibatch = []

    # sample random minibatch
    minibatch_size = min(len(self.D), self.minibatch_size)
    minibatch_indexes = np.random.randint(0, len(self.D), minibatch_size)

    for j in minibatch_indexes:
        state_j, targets_j, action_j, reward_j, state_j_1, targets_j_1, terminal = self.D[j]
        action_j_index = self.enable_actions.index(action_j)

        y_j = self.Q_values(state_j)

        if terminal:
            # For a terminal transition the teacher signal is just the reward
            y_j[action_j_index] = reward_j
        else:
            # reward_j + gamma * max_action' Q(state', action')
            qvalue, action = self.select_enable_action(state_j_1, targets_j_1)
            y_j[action_j_index] = reward_j + self.discount_factor * qvalue

        state_minibatch.append(state_j)
        y_minibatch.append(y_j)

    # training
    self.sess.run(self.training, feed_dict={self.x: state_minibatch, self.y_: y_minibatch})

    # for log
    self.current_loss = self.sess.run(self.loss, feed_dict={self.x: state_minibatch, self.y_: y_minibatch})
Variable name | Contents |
---|---|
state | Board state ( = Reversi.screen[0~7][0~7] ) |
targets | Positions where a piece can be placed |
action | The selected action |
reward | Reward for the action: 0 or 1 |
state_1 | Board after the action |
targets_1 | Positions where a piece can be placed after the action |
terminal | True when the game has ended |
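For illustration only, a single transition stored in D might look like the following (the board arrays and move numbers below are made up; the real boards come from Reversi.screen):

```python
import numpy as np

# Hypothetical example of one tuple appended to D (values made up for illustration).
state   = np.zeros((8, 8))   # board before the action
state_1 = np.zeros((8, 8))   # board after the action
transition = (
    state,
    [19, 26, 37, 44],   # targets: playable positions this turn
    19,                 # action: the chosen position
    0,                  # reward: 0 unless the game was won
    state_1,
    [20, 29, 43],       # targets_1: playable positions after the action
    False,              # terminal: True only when the game has ended
)
```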
As described above, players[0] and players[1] play Othello against each other n_epochs = 1000 times during training, and the second player, players[1], is saved at the end.
train.py
# parameters
n_epochs = 1000
# environment, agent
env = Reversi()
# playerID
playerID = [env.Black, env.White, env.Black]
# player agent
players = []
# player[0]= env.Black
players.append(DQNAgent(env.enable_actions, env.name, env.screen_n_rows, env.screen_n_cols))
# player[1]= env.White
players.append(DQNAgent(env.enable_actions, env.name, env.screen_n_rows, env.screen_n_cols))
This part,

DQNAgent(env.enable_actions, env.name, env.screen_n_rows, env.screen_n_cols)

corresponds to the following steps and is handled in dqn_agent.py:
- Initialize Replay Memory D
- Initialize Q-Network Q with random weights θ
- Initialize Target Network Q^ with θ^ = θ
dqn_agent.py
class DQNAgent:

    def __init__(self, enable_actions, environment_name, rows, cols):
        # ...abridged...
        # Initialize Replay Memory D
        self.D = deque(maxlen=self.replay_memory_size)
        # ...abridged...

    def init_model(self):
        # input layer (rows x cols)
        self.x = tf.placeholder(tf.float32, [None, self.rows, self.cols])

        # flatten (rows x cols)
        size = self.rows * self.cols
        x_flat = tf.reshape(self.x, [-1, size])

        # Initialize Q-Network Q with random weights θ
        W_fc1 = tf.Variable(tf.truncated_normal([size, size], stddev=0.01))
        b_fc1 = tf.Variable(tf.zeros([size]))
        h_fc1 = tf.nn.relu(tf.matmul(x_flat, W_fc1) + b_fc1)

        # Initialize Target Network Q (θ^ = θ)
        W_out = tf.Variable(tf.truncated_normal([size, self.n_actions], stddev=0.01))
        b_out = tf.Variable(tf.zeros([self.n_actions]))
        self.y = tf.matmul(h_fc1, W_out) + b_out

        # loss function
        self.y_ = tf.placeholder(tf.float32, [None, self.n_actions])
        self.loss = tf.reduce_mean(tf.square(self.y_ - self.y))

        # train operation
        optimizer = tf.train.RMSPropOptimizer(self.learning_rate)
        self.training = optimizer.minimize(self.loss)

        # saver
        self.saver = tf.train.Saver()

        # session
        self.sess = tf.Session()
        self.sess.run(tf.initialize_all_variables())
python
for e in range(n_epochs):
    # reset
    env.reset()
    terminal = False
- for episode = 1, M do
  - Get the initial screen x1 and preprocess it to create the initial state s1
python
while terminal == False:  # Loop until the end of one episode
    for i in range(0, len(players)):
        state = env.screen
        targets = env.get_enables(playerID[i])
        if len(targets) > 0:
            # There is at least one position where a piece can be placed
            # ← At this point, every playable move is "stored in D" as described above
            # Choose an action
            action = players[i].select_action(state, targets, players[i].exploration)
            # Take the action
            env.update(action, playerID[i])
- while not terminal
  - Action selection
The action selection, `agent.select_action(state_t, targets, agent.exploration)`, is handled in dqn_agent.py.
- Action selection
  - Random action ai
  - Or ai = argmax_a Q(si, a; θ)
dqn_agent.py
def Q_values(self, state):
    # Q(state, action) of all actions
    return self.sess.run(self.y, feed_dict={self.x: [state]})[0]

def select_action(self, state, targets, epsilon):
    if np.random.rand() <= epsilon:
        # random
        return np.random.choice(targets)
    else:
        # max_action Q(state, action)
        qvalue, action = self.select_enable_action(state, targets)
        return action

# From the playable positions (targets) on the board (state),
# return the position that maximizes the Q value, and that Q value
def select_enable_action(self, state, targets):
    Qs = self.Q_values(state)
    # Walk the actions from highest to lowest Q value and
    # take the first one that is actually playable
    index = np.argsort(Qs)
    for action in reversed(index):
        if action in targets:
            break
    # max_action Q(state, action)
    qvalue = Qs[action]

    return qvalue, action
- Execute action ai and observe the reward ri, the next screen xi+1, and the end flag terminal
  - Preprocess to create the next state si+1
Finally, save the AI of the second player.
# Result of performing the action
terminal = env.isEnd()
w = env.winner()
print("EPOCH: {:03d}/{:03d} | WIN: player{:1d}".format(
    e, n_epochs, w))

# Save the second player, players[1]
players[1].save_model()
The source is here.
$ git clone https://github.com/sasaco/tf-dqn-reversi.git
Next time, I will cover the battle edition.