Let's build a super-simple Dragon Quest-style turn-based battle and train it with Q-learning. The goal is to take a hero who can only save the world a few percent of the time and make him smarter through Q-learning.
I will walk through the implementation of both the game and the Q-learning agent, but I won't explain the theory of Q-learning itself. If you want the theory in detail, I recommend reading good articles such as the one below.
Reinforcement learning you can no longer ask about (1): State value functions and the Bellman equation
This article is aimed at:

- Those who want to build and play with their own game rather than use an existing simulation environment such as OpenAI Gym
- Those who roughly understand the theory of Q-learning but don't know how to implement it
The rules are kept deliberately simple:

- Hero vs. Demon King, one on one
- The Demon King's only action is "attack"
- The hero chooses between two actions: "attack" and "recovery"
- Turn order is decided by multiplying each character's agility by a random factor and sorting
Now let's implement the game itself. First is the character class.
dq_battle.py
class Character(object):
    """Character class"""

    ACTIONS = {0: "attack", 1: "recovery"}

    def __init__(self, hp, max_hp, attack, defence, agillity, intelligence, name):
        self.hp = hp                      # Current HP
        self.max_hp = max_hp              # Maximum HP
        self.attack = attack              # Attack power
        self.defence = defence            # Defense power
        self.agillity = agillity          # Agility
        self.intelligence = intelligence  # Intelligence
        self.name = name                  # Character name

    # Returns a status string
    def get_status_s(self):
        return "[{}] HP:{}/{} ATK:{} DEF:{} AGI:{} INT:{}".format(
            self.name, self.hp, self.max_hp, self.attack,
            self.defence, self.agillity, self.intelligence)

    def action(self, target, action):
        # Attack
        if action == 0:
            # Damage = attack power - defense power
            damage = self.attack - target.defence
            draw_damage = damage  # For the log
            # If the target's remaining HP is less than the damage, cap the damage at the remaining HP
            if target.hp < damage:
                damage = target.hp
            # Deal the damage
            target.hp -= damage
            # Return the battle log
            return "{} attacked {} for {} damage".format(
                self.name, target.name, draw_damage)
        # Recovery
        elif action == 1:
            # The recovery amount is the INT value
            heal_points = self.intelligence
            draw_heal_points = heal_points  # For the log
            # If this would exceed maximum HP, recover only up to max HP
            if self.hp + heal_points > self.max_hp:
                heal_points = self.max_hp - self.hp
            # Recover
            self.hp += heal_points
            # Return the battle log
            return "{} recovered {} HP".format(
                self.name, draw_heal_points)
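As a quick illustration of how the class behaves, here is a standalone sketch. It uses the same stats that the Game class below assigns to the two characters; the exact log strings come from the format strings above.

hero = Character(20, 20, 4, 1, 5, 7, "Hero")
maou = Character(50, 50, 5, 2, 6, 3, "Demon King")

print(hero.action(maou, 0))  # Hero attacked Demon King for 2 damage (ATK 4 - DEF 2)
print(maou.action(hero, 0))  # Demon King attacked Hero for 4 damage (ATK 5 - DEF 1)
print(hero.action(maou, 1))  # Hero recovered 4 HP (INT is 7, but healing is capped at max HP)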
Since the battle design is simple, one class covers both the player and the enemy, with no distinction between them.
Each character (the hero and the Demon King) has the following stats:

- HP (hit points)
- ATTACK (attack power)
- DEFENCE (defense power)
- AGILITY
- INTELLIGENCE
Damage from "attack" is calculated with the simple formula

    damage = attacker's ATTACK - target's DEFENCE

and the amount healed by the "recovery" command is simply the character's INTELLIGENCE value.
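Plugging in the stats that the Game class below assigns: the hero (ATK 4, DEF 1, INT 7) deals 4 - 2 = 2 damage per attack to the Demon King (ATK 5, DEF 2), takes 5 - 1 = 4 damage per hit, and restores up to 7 HP with "recovery".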
Next, let's implement the battle itself. The first thing to understand is the big picture of a battle: its state transitions.
dq_battle.py
from enum import Enum, auto


class GameState(Enum):
    """Game state management class"""
    TURN_START = auto()      # Turn start
    COMMAND_SELECT = auto()  # Command selection
    TURN_NOW = auto()        # Turn in progress (each character acts)
    TURN_END = auto()        # Turn end
    GAME_END = auto()        # Game end
As shown above, a battle has five states: **turn start**, **command selection**, **turn in progress**, **turn end**, and **game end**.
The state transitions work as follows.
The basic battle loop keeps cycling from the "turn start" state through to the "turn end" state until the "game end" state is reached, that is, until the HP of either the hero or the Demon King drops to 0.
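In text form, the flow looks roughly like this (this sketch simply mirrors the step() method shown later):

TURN_START -> COMMAND_SELECT -> TURN_NOW -> ... -> TURN_NOW -> TURN_END -> (back to TURN_START)
                                    |
                                    +--> GAME_END (when either side's HP reaches 0)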
Now, let's implement the battle itself. Let's look at the entire code first.
dq_battle.py
import random
from collections import deque


class Game():
    """Game body"""

    HERO_MAX_HP = 20
    MAOU_MAX_HP = 50

    def __init__(self):
        # Create the characters
        self.hero = Character(
            Game.HERO_MAX_HP, Game.HERO_MAX_HP, 4, 1, 5, 7, "Hero")
        self.maou = Character(
            Game.MAOU_MAX_HP, Game.MAOU_MAX_HP, 5, 2, 6, 3, "Demon King")
        # Add them to the character list
        self.characters = []
        self.characters.append(self.hero)
        self.characters.append(self.maou)
        # Variable for state transitions
        self.game_state = GameState.TURN_START
        # Turn counter
        self.turn = 1
        # String holding the battle log
        self.log = ""

    # Advance the game one turn
    def step(self, action):
        # Main loop
        while (True):
            if self.game_state == GameState.TURN_START:
                self.__turn_start()
            elif self.game_state == GameState.COMMAND_SELECT:
                self.__command_select(action)  # Pass the chosen action
            elif self.game_state == GameState.TURN_NOW:
                self.__turn_now()
            elif self.game_state == GameState.TURN_END:
                self.__turn_end()
                break  # Leave the loop at the end of the turn
            elif self.game_state == GameState.GAME_END:
                self.__game_end()
                break
        # Whether the game is over
        done = False
        if self.game_state == GameState.GAME_END:
            done = True
        # Return "state s, reward r, game-over flag"
        return (self.hero.hp, self.maou.hp), self.reward, done

    # Reset the game to the state of the first turn
    def reset(self):
        self.__init__()
        return (self.hero.hp, self.maou.hp)

    # Print the battle log
    def draw(self):
        print(self.log, end="")

    def __turn_start(self):
        # State transition
        self.game_state = GameState.COMMAND_SELECT
        # Reset the log
        self.log = ""
        # Drawing
        s = " *** turn " + str(self.turn) + " ***"
        self.__save_log("\033[36m{}\033[0m".format(s))
        self.__save_log(self.hero.get_status_s())
        self.__save_log(self.maou.get_status_s())

    def __command_select(self, action):
        # Action selection
        self.action = action
        # Sort the characters by agility multiplied by a random factor in [0.5, 1.5] and queue them
        self.character_que = deque(sorted(self.characters,
                                          key=lambda c: c.agillity*random.uniform(0.5, 1.5)))
        # State transition
        self.game_state = GameState.TURN_NOW
        # Save the log
        self.__save_log("Command selected -> " + Character.ACTIONS[self.action])

    def __turn_now(self):
        # Characters act in queue order
        if len(self.character_que) > 0:
            now_character = self.character_que.popleft()
            if now_character is self.hero:
                s = now_character.action(self.maou, self.action)
            elif now_character is self.maou:
                s = now_character.action(self.hero, action=0)  # The Demon King always attacks
            # Save the log
            self.__save_log(s)
        # Game over if anyone's HP is 0 or less
        for c in self.characters:
            if c.hp <= 0:
                self.game_state = GameState.GAME_END
                return
        # End the turn once everyone has acted
        if len(self.character_que) == 0:
            self.game_state = GameState.TURN_END
        return

    def __turn_end(self):
        # Set the reward
        self.reward = 0
        # Reset the character queue
        self.character_que = deque()
        # Advance the turn counter
        self.turn += 1
        # State transition
        self.game_state = GameState.TURN_START

    def __game_end(self):
        if self.hero.hp <= 0:
            self.__save_log("\033[31m{}\033[0m".format("The hero has fallen"))
            self.reward = -1  # Set the reward
        elif self.maou.hp <= 0:
            self.__save_log("\033[32m{}\033[0m".format("The Demon King has been defeated"))
            self.reward = 1  # Set the reward
        self.__save_log("----- Game end -----")

    def __save_log(self, s):
        self.log += s + "\n"
The code is a bit long, but only two parts really matter for Q-learning.
The first is the step() method, the main body of the battle.
dq_battle.py
    # Advance the game one turn
    def step(self, action):
        # Main loop
        while (True):
            if self.game_state == GameState.TURN_START:
                self.__turn_start()
            elif self.game_state == GameState.COMMAND_SELECT:
                self.__command_select(action)  # Pass the chosen action
            elif self.game_state == GameState.TURN_NOW:
                self.__turn_now()
            elif self.game_state == GameState.TURN_END:
                self.__turn_end()
                break  # Leave the loop at the end of the turn
            elif self.game_state == GameState.GAME_END:
                self.__game_end()
                break
        # Whether the game is over
        done = False
        if self.game_state == GameState.GAME_END:
            done = True
        # Return "state s, reward r, game-over flag"
        return (self.hero.hp, self.maou.hp), self.reward, done
Basically, the process flow is the same as the state transition diagram described above.
However, Q-learning has to evaluate the current state **every turn**, so the main loop must exit not only in the "game end" state but also in the "turn end" state.
At the end of each turn, three values have to be handed back for Q-learning:

- the state s
- the reward r
- whether the game has ended (done)
Whether the game has ended is determined simply by whether the hero's or the Demon King's HP has reached 0.
The state s needs a little more thought. The characters have several stats, such as attack power and defense power, but only two need to be fed to Q-learning: the hero's HP and the Demon King's HP.
In this battle design, values like attack power and defense power never change, so no stat other than HP needs to be part of the state. Conversely, if stats could change through buffs, debuffs, and so on, that information would also have to be included, as sketched below.
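For example, if an "attack up" buff existed (it is not part of the rules in this article), the state returned by step() would have to carry the hero's current attack power as well. A purely hypothetical sketch:

        # Hypothetical: if buffs could change the hero's attack power mid-battle,
        # step() would need to return it as part of the state
        return (self.hero.hp, self.maou.hp, self.hero.attack), self.reward, done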
The reward r is evaluated in each of the "end of turn" and "end of game" states.
dq_battle.py
    def __turn_end(self):
        # Set the reward
        self.reward = 0
        # (omitted)

    def __game_end(self):
        if self.hero.hp <= 0:
            self.__save_log("\033[31m{}\033[0m".format("The hero has fallen"))
            self.reward = -1  # Set the reward
        elif self.maou.hp <= 0:
            self.__save_log("\033[32m{}\033[0m".format("The Demon King has been defeated"))
            self.reward = 1  # Set the reward
The reward for simply letting a turn pass is 0. If you want to emphasize defeating the Demon King as quickly as possible, you could make the per-turn reward slightly negative, although finding an appropriate value is not easy. A sketch of that variant follows.
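This is only a sketch of the idea; the value -0.01 is an arbitrary placeholder, not something tuned in this article.

    def __turn_end(self):
        # Hypothetical variant: a small per-turn penalty to encourage faster wins
        self.reward = -0.01
        self.character_que = deque()
        self.turn += 1
        self.game_state = GameState.TURN_START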
At the end of the game, the reward is -1 if the hero falls and +1 if the Demon King is defeated.
The second important part is the reset() method.
dq_battle.py
    # Reset the game to the state of the first turn
    def reset(self):
        self.__init__()
        return (self.hero.hp, self.maou.hp)
It simply re-initializes the game; for Q-learning it also has to return the initial state.
Together with the step() method above, learning proceeds by repeating:
**initialize the game (reset) → advance turns until the battle ends (step) → initialize the game (reset) → advance turns until the battle ends (step) → ...**
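Before wiring up the agent, here is a minimal sketch of that loop driving the environment on its own (it always picks action 0, "attack", purely to exercise the interface):

import dq_battle

game = dq_battle.Game()

for episode in range(3):
    state = game.reset()                    # (hero HP, Demon King HP)
    done = False
    while not done:
        state, reward, done = game.step(0)  # action 0 = "attack"
    game.draw()                             # print the final turn's battle log
    print("reward:", reward)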
The above is the basic part of the game for Q-learning.
Q-learning is implemented inside the agent class. The agent is the "player" that actually plays the game.
Since the agent is the player, it can choose actions (attack or recovery) and observe the state (the hero's and the Demon King's HP), but it cannot see the game's internal information, such as the random numbers that decide turn order.
Learning proceeds only from actions and the states and rewards those actions produce. This is the basic premise of reinforcement learning in general, Q-learning included.
First, I will post the entire agent class.
q-learning.py
from collections import defaultdict

import numpy as np

import dq_battle

DIV_N = 10


class Agent:
    """Agent class"""

    def __init__(self, epsilon=0.2):
        self.epsilon = epsilon
        self.Q = []

    # Policy defined by the ε-greedy method
    def policy(self, s, actions):
        if np.random.random() < self.epsilon:
            # Act randomly with probability ε
            return np.random.randint(len(actions))
        else:
            # (If Q contains state s and its Q values are not all 0)
            if s in self.Q and sum(self.Q[s]) != 0:
                # Choose the action with the largest Q value
                return np.argmax(self.Q[s])
            else:
                return np.random.randint(len(actions))

    # Convert a state into a single number
    def digitize_state(self, s):
        hero_hp, maou_hp = s
        # Split the hero's and the Demon King's HP into DIV_N buckets each
        s_digitize = [np.digitize(hero_hp, np.linspace(0, dq_battle.Game.HERO_MAX_HP, DIV_N + 1)[1:-1]),
                      np.digitize(maou_hp, np.linspace(0, dq_battle.Game.MAOU_MAX_HP, DIV_N + 1)[1:-1])]
        # Return a state number in the range 0 .. DIV_N^2 - 1
        return s_digitize[0] + s_digitize[1]*DIV_N

    # Q-learning
    def learn(self, env, actions, episode_count=1000, gamma=0.9, learning_rate=0.1):
        self.Q = defaultdict(lambda: [0] * len(actions))
        # Battle for episode_count episodes
        for e in range(episode_count):
            # Reset the game environment
            tmp_s = env.reset()
            # Convert the current state into a number
            s = self.digitize_state(tmp_s)
            done = False
            # Keep acting until the game ends
            while not done:
                # Choose an action according to the ε-greedy policy
                a = self.policy(s, actions)
                # Advance the game one turn and get back "state, reward, done"
                tmp_s, reward, done = env.step(a)
                # Convert the state into a number
                n_state = self.digitize_state(tmp_s)
                # Value gained by action a (gain) = immediate reward + discount rate * max Q value in the next state
                gain = reward + gamma * max(self.Q[n_state])
                # Current estimate of the Q value (before this update)
                estimated = self.Q[s][a]
                # Update the Q value using the estimate and the value actually obtained by action a
                self.Q[s][a] += learning_rate * (gain - estimated)
                # Move on to the next state
                s = n_state
What's a little confusing in the agent class is the method that converts the state to a number.
q-learning.py
    # Convert a state into a single number
    def digitize_state(self, s):
        hero_hp, maou_hp = s
        # Split the hero's and the Demon King's HP into DIV_N buckets each
        s_digitize = [np.digitize(hero_hp, np.linspace(0, dq_battle.Game.HERO_MAX_HP, DIV_N + 1)[1:-1]),
                      np.digitize(maou_hp, np.linspace(0, dq_battle.Game.MAOU_MAX_HP, DIV_N + 1)[1:-1])]
        # Return a state number in the range 0 .. DIV_N^2 - 1
        return s_digitize[0] + s_digitize[1]*DIV_N
As mentioned briefly earlier, there are two state variables to evaluate during Q-learning: the hero's HP and the Demon King's HP. In this tabular implementation, however, a state has to be represented as a single number, something like this:

- State 1: (hero HP, Demon King HP) = (0, 0)
- State 2: (hero HP, Demon King HP) = (0, 1)
- State 3: (hero HP, Demon King HP) = (0, 2)

Enumerating every combination like this works, but the number of states grows as (hero HP values) x (Demon King HP values). In a certain well-known RPG that is not Dragon Quest, HP runs to four digits, so the state count would exceed a million, which is unmanageable. Instead, let's bucket the state by the ratio of current HP to maximum HP.
np.digitize(hero_hp, np.linspace(0, dq_battle.Game.HERO_MAX_HP, DIV_N + 1)[1:-1])

To explain this line briefly: np.linspace() splits the range from 0 to the maximum HP into DIV_N equal buckets, and np.digitize() returns the index of the bucket that the current HP falls into.
Since DIV_N = 10 here, HP is converted as follows:

- HP below 10% → 0
- HP 10% or more and below 20% → 1
- HP 20% or more and below 30% → 2
- ...

Then, by computing "hero bucket (0-9) + Demon King bucket (0-9) × 10", the number of states is kept down to 100, numbered 0 through 99.
If the state is, say, 15, you can read it off directly: the tens digit 1 means the Demon King's HP is in the 10% bucket, and the ones digit 5 means the hero's HP is in the 50% bucket.
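As a concrete check, this just applies the bucketing above to one example state (it assumes the Agent class defined earlier):

# Hero HP 10/20 (50%) -> bucket 5; Demon King HP 25/50 (50%) -> bucket 5
agent = Agent()
print(agent.digitize_state((10, 25)))  # 5 + 5*10 = 55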
The policy is ε-greedy.
q-learning.py
    # Policy defined by the ε-greedy method
    def policy(self, s, actions):
        if np.random.random() < self.epsilon:
            # Act randomly with probability ε
            return np.random.randint(len(actions))
        else:
            # (If Q contains state s and its Q values are not all 0)
            if s in self.Q and sum(self.Q[s]) != 0:
                # Choose the action with the largest Q value
                return np.argmax(self.Q[s])
            else:
                return np.random.randint(len(actions))
To explain briefly for beginners: the policy normally picks the action with the highest action value, but with probability ε it picks a random action instead. (The check on sum(self.Q[s]) means that states whose Q values are still all zero, in particular states that have never been visited, also fall back to a random action.)
Giving the behavior some randomness makes the agent explore a variety of actions, so learning works properly without depending on the initial Q values.
With this, we have all the variables and methods needed for Q-learning.
The Q-learning update rule is as follows.
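In the notation used by the code below (this is the standard tabular Q-learning update, which is exactly what learn() implements):

Q(s, a) ← Q(s, a) + α * ( r + γ * max_a' Q(s', a') - Q(s, a) )

where α is the learning rate (learning_rate = 0.1) and γ is the discount rate (gamma = 0.9).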
As mentioned at the beginning of the article, I won't go into the theory, so let's just implement this update rule faithfully.
q-learning.py
    # Q-learning
    def learn(self, env, actions, episode_count=1000, gamma=0.9, learning_rate=0.1):
        self.Q = defaultdict(lambda: [0] * len(actions))
        # Battle for episode_count episodes
        for e in range(episode_count):
            # Reset the game environment
            tmp_s = env.reset()
            # Convert the current state into a number
            s = self.digitize_state(tmp_s)
            done = False
            # Keep acting until the game ends
            while not done:
                # Choose an action according to the ε-greedy policy
                a = self.policy(s, actions)
                # Advance the game one turn and get back "state, reward, done"
                tmp_s, reward, done = env.step(a)
                # Convert the state into a number
                n_state = self.digitize_state(tmp_s)
                # Value gained by action a (gain) = immediate reward + discount rate * max Q value in the next state
                gain = reward + gamma * max(self.Q[n_state])
                # Current estimate of the Q value (before this update)
                estimated = self.Q[s][a]
                # Update the Q value using the estimate and the value actually obtained by action a
                self.Q[s][a] += learning_rate * (gain - estimated)
                # Move on to the next state
                s = n_state
This completes the implementation of the game and Q-learning.
Before running Q-learning, let's see what happens when the hero acts completely at random.
Add the following code.
q-learning.py
class Agent:
    # (omitted)

    # Test battles
    def test_run(self, env, actions, draw=True, episode_count=1000):
        turn_num = 0  # Total number of turns taken in winning battles
        win_num = 0   # Number of wins
        # Battle for episode_count episodes
        for e in range(episode_count):
            tmp_s = env.reset()
            s = self.digitize_state(tmp_s)
            done = False
            while not done:
                a = self.policy(s, actions)
                n_state, _, done = env.step(a)
                s = self.digitize_state(n_state)
            if draw:
                env.draw()  # Print the battle log
            if env.maou.hp <= 0:
                win_num += 1
                turn_num += env.turn
        # Print the average win rate and the average number of turns needed to win
        if not win_num == 0:
            print("Average win rate {:.2f}%".format(win_num*100/episode_count))
            print("Average turns to defeat the Demon King: {:.2f}".format(turn_num / win_num))
        else:
            print("Average win rate 0%")


if __name__ == "__main__":
    game = dq_battle.Game()
    agent = Agent()
    actions = dq_battle.Character.ACTIONS

    """Completely random battles"""
    agent.epsilon = 1.0
    agent.test_run(game, actions, episode_count=1000)
Setting ε = 1.0 makes every action completely random. The average win rate and the average number of turns needed to win are computed over 1,000 battles.
Below are the execution results.
$ python q-learning.py
Average win rate 0.90%
Average turns to defeat the Demon King: 64.89
The win rate is quite low...
As the turn count shows, battles tend to drag on. The longer a battle lasts, the more often the hero is left at death's door, so it is no surprise that winning is hard.
Add the following code.
q-learning.py
if __name__ == "__main__":
    # (omitted)

    """Q-learning"""
    agent.epsilon = 0.2
    agent.learn(game, actions, episode_count=1000)

    """Test battles"""
    agent.epsilon = 0
    agent.test_run(game, actions, episode_count=1000)
We run Q-learning with ε = 0.2.
After that, 1,000 test battles are run with ε = 0 (0% random), so the hero always acts according to the learned action values.
The results below show how the outcome changes with the number of training battles.
**Execution result (50 training battles, 1,000 test battles)**

$ python q-learning.py
Average win rate 42.60%
Average turns to defeat the Demon King: 56.19

**Execution result (500 training battles, 1,000 test battles)**

$ python q-learning.py
Average win rate 100.00%
Average turns to defeat the Demon King: 55.00

**Execution result (5,000 training battles, 1,000 test battles)**

$ python q-learning.py
Average win rate 100.00%
Average turns to defeat the Demon King: 54.00
The win rate reaches 100%!
Let's dig in a little and look at the learned Q values.
Below are the Q values for a few states after training with 1,000 battles.
State 50:[-0.19, -0.1]
State 51:[-0.6623164987957537, -0.34788781183605283]
State 52:[-0.2711479211007827, 0.04936802595531123]
State 53:[-0.36097806076138395, 0.11066249745943924]
State 54:[-0.04065992616558749, 0.12416469852733954]
State 55:[0.17619052640036173, 0.09475948937059306]
State 56:[0.10659739434775867, 0.05112985778828942]
State 57:[0.1583472103200607, 0.016092008419030468]
State 58:[0.04964633744625512, 0.0020759614034820224]
State 59:[0.008345513895442138, 0.0]
To read these state numbers: the tens digit is the Demon King's remaining HP bucket and the ones digit is the hero's. So the list above shows how the action values change with the hero's remaining HP while the Demon King's HP stays around 50%.
You can see that when the hero's remaining HP (the ones digit) is low, the "recovery" command is chosen, and when it is high, the "attack" command is chosen; a concrete readout follows.
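For example (index 0 is "attack" and index 1 is "recovery", as defined in Character.ACTIONS): in state 51, where the hero's HP is around 10%, the Q values are roughly [-0.66, -0.35], so argmax picks index 1, "recovery"; in state 55, where the hero's HP is around 50%, they are roughly [0.18, 0.09], so argmax picks index 0, "attack".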
Let's also look at the Q value when the remaining HP of the hero is fixed.
State 07:[2.023809062133135, 0.009000000000000001]
State 17:[1.8092946131557912, 0.8310497919226313]
State 27:[0.8223927076749513, 0.5279685031058523]
State 37:[0.5565475393122992, 0.29257906153106145]
State 47:[0.25272081107828437, 0.26657637207739293]
State 57:[0.14094053800308323, 0.1533527340827757]
State 67:[0.0709128688771915, 0.07570873469406877]
State 77:[0.039059851207044236, 0.04408123679644829]
State 87:[0.023028972190011696, 0.02386492692407677]
State 97:[0.016992303227705185, 0.0075795064515745995]
This list shows how the action values change with the Demon King's remaining HP while the hero's HP stays around 70%. The lower the Demon King's HP, the more strongly "attack" is favored.
Since this article focuses on implementation, I will leave further analysis at that. If you have the time, it is interesting to experiment with the hyperparameters or to make the battle rules more complex.
I am still a beginner at reinforcement learning, so please feel free to point out any mistakes; I would be glad to learn from them.
The source is on GitHub: https://github.com/nanoseeing/DQ_Q-learning