PPO (Proximal Policy Optimization) is a deep reinforcement learning algorithm of the policy-based family. Policy-based algorithms directly optimize a policy function that outputs action probabilities given a state from the environment; other examples include A3C and TRPO. In contrast, value-based algorithms such as DQN take a different approach.
First, I will explain the related methods.
A3C relies on three key techniques: Actor-Critic, Advantage, and Asynchronous.
Actor-Critic
Actor-Critic refers to the network structure. In A3C, the objective function of the policy is
L_{policy}=A(t)\log\pi_{\theta}(a_{t}|s_{t})
where A(t) is called the advantage function. It is defined as
A(t)=(R(t)-V(s_{t}))
Since the value function V(s) is needed to compute A(t), the model is built so that the network outputs the state value at the same time as the policy (the action probability distribution). Sharing the network in this way speeds up learning.
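As a concrete (and purely illustrative) sketch of such a two-headed network, the NumPy snippet below uses arbitrary random weights and a CartPole-like 4-dimensional state; none of these names appear in the implementation later.

import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, HIDDEN, N_ACTIONS = 4, 30, 2

# Randomly initialized weights, used only for this illustration
W1 = rng.normal(size=(STATE_DIM, HIDDEN))
W_pi = rng.normal(size=(HIDDEN, N_ACTIONS))
W_v = rng.normal(size=(HIDDEN, 1))

def actor_critic_forward(state):
    h = np.maximum(0.0, state @ W1)       # shared hidden layer (ReLU)
    logits = h @ W_pi
    prob = np.exp(logits - logits.max())  # softmax -> policy head pi(a|s)
    prob /= prob.sum()
    v = (h @ W_v).item()                  # value head V(s)
    return prob, v

state = np.array([0.02, -0.01, 0.03, 0.04])  # a CartPole-like observation
prob, v = actor_critic_forward(state)
advantage = 1.0 - v                          # A(t) = R(t) - V(s_t) with a dummy return R(t) = 1.0
print(prob, v, advantage)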
Advantage
Normally, the state value function V(s) is updated using the following error:
loss=r(s_{t})+\gamma V(s_{t+1})-V(s_{t})
V(s) is trained to make this error small. However, with this method, when an episode has many steps it takes many updates before the influence of later rewards reaches the early steps, so learning becomes slow. Advantage therefore uses the following n-step error instead.
loss=\sum_{k=1}^n \gamma^{k-1} r(s_{t+k})+\gamma^n V(s_{t+n})-V(s_{t})
By increasing $n$ in this equation, the influence of rewards propagates back to earlier steps more quickly, but if $n$ is too large, learning slows down again. For a task like CartPole the benefit is small, and in some cases it may even be faster not to use Advantage.
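To make the n-step target above concrete, here is a small NumPy sketch with made-up rewards and values (n = 2 and γ = 0.99 are chosen arbitrarily):

import numpy as np

gamma, n = 0.99, 2
rewards = np.array([0.0, 1.0])   # dummy r(s_{t+1}), ..., r(s_{t+n})
v_t, v_tn = 0.5, 0.7             # dummy V(s_t) and V(s_{t+n})

# n-step error: sum_{k=1}^{n} gamma^{k-1} r(s_{t+k}) + gamma^n V(s_{t+n}) - V(s_t)
discounts = gamma ** np.arange(n)    # [1, gamma, ..., gamma^(n-1)]
target = np.sum(discounts * rewards) + gamma ** n * v_tn
error = target - v_t                 # the quantity V(s_t) is trained to reduce
print(target, error)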
Asynchronous
Asynchronous concerns how learning is organized. When a single agent explores on its own, the direction of learning tends to become biased. As a countermeasure, one shared neural network is trained by multiple agents exploring in parallel: after a fixed number of steps, or at the end of an episode, each agent computes the gradient of the objective function with respect to the parameters and applies it to the shared network.
Before each episode, every agent also copies the parameters of the shared network into its own local network and explores with that copy. This reduces the bias in learning, playing a role similar to DQN's replay buffer.
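The asynchronous part can be sketched independently of the reinforcement-learning details. The toy example below is my own illustration (it is not the article's code): several worker threads repeatedly copy a shared parameter vector, compute a stand-in gradient locally, and push the update back under a lock.

import threading
import numpy as np

shared_params = np.full(4, 5.0)   # parameters of the shared network (toy values)
lock = threading.Lock()

def worker(worker_id, steps=200, lr=0.05):
    rng = np.random.default_rng(worker_id)
    for _ in range(steps):
        with lock:
            local_params = shared_params.copy()   # copy shared -> local before exploring
        # stand-in for a locally computed gradient (here: gradient of ||params||^2 plus noise)
        grad = 2.0 * local_params + rng.normal(scale=0.1, size=local_params.shape)
        with lock:
            shared_params[:] = shared_params - lr * grad   # apply the update to the shared network

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared_params)   # every worker's updates have been applied asynchronously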
The objective function of A3C combines three terms: the policy term, the state-value term, and an entropy term for regularization.
\begin{align}
L_{policy} &= A(t)\log\pi_{\theta}(a_{t}|s_{t})\\
L_{value} &= (R(t)-V(s_{t}))^2\\
L_{entropy} &= \pi_{\theta}(a_{t}|s_{t})\log\pi_{\theta}(a_{t}|s_{t})
\end{align}
These three terms are combined into the final loss:
L_{all} = -L_{policy}+C_{value}L_{value}-C_{entropy}L_{entropy}
Here $C_{value}$ and $C_{entropy}$ are constants, and learning proceeds by minimizing this loss.
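For a single (state, action) sample these terms can be evaluated directly. The NumPy sketch below uses made-up numbers; the two constants match the LOSS_V and LOSS_ENTROPY values used in the implementation later.

import numpy as np

prob = np.array([0.7, 0.3])      # pi_theta(a|s) for each action (made-up)
action = 0                        # index of the action actually taken
R, V = 1.0, 0.6                   # dummy discounted return R(t) and state value V(s_t)
C_VALUE, C_ENTROPY = 0.5, 0.02    # correspond to LOSS_V and LOSS_ENTROPY below

advantage = R - V
L_policy = advantage * np.log(prob[action])
L_value = (R - V) ** 2
L_entropy = np.sum(prob * np.log(prob))

L_all = -L_policy + C_VALUE * L_value - C_ENTROPY * L_entropy
print(L_policy, L_value, L_entropy, L_all)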
In A3C, the policy gradient is
\nabla_{\theta} L_{policy} = A(t)\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})
Because of the $\log$ term, a single update can change the policy drastically. PPO therefore restricts how far each update may move the policy, which prevents these over-large updates. Its objective function also differs considerably from A3C's.
r_{t}(\theta)=\frac{\pi_{\theta_{new}}(a_{t}|s_{t})}{\pi_{\theta_{old}}(a_{t}|s_{t})}
L^{CPI}=\mathbb E \big[\,r_{t}(\theta)A(t)\, \big]
PPO uses this as a surrogate objective and applies the clip function when updating to form the policy objective.
clip function
clip(x,a,b)=\left\{
\begin{array}{ll}
b & (x > b) \\
x & (a \leq x \leq b) \\
a & (x < a)
\end{array}
\right.
Defined this way, the result always stays between $a$ and $b$ no matter how $x$ changes. Using this function, the policy objective becomes
L_{policy}=\mathbb E \big[\, \min \big(\, r_{t}(\theta)A(t),\; clip(r_{t}(\theta),1-\epsilon,1+\epsilon)A(t)\, \big) \big]
Here $\epsilon$ is a small constant (0.2 in the implementation below). The loss for the state value function is essentially the same as in A3C.
In addition, PPO also learns using Advantage in the same way as A3C.
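To make the effect of the clipping concrete, here is a quick NumPy check with made-up values of the ratio and a positive advantage:

import numpy as np

EPSILON = 0.2
A = 1.0                                        # a positive advantage (made-up)
ratios = np.array([0.5, 0.9, 1.0, 1.1, 1.5])   # candidate values of r_t(theta)

clipped = np.clip(ratios, 1 - EPSILON, 1 + EPSILON)
L_policy = np.minimum(ratios * A, clipped * A)
print(L_policy)   # [0.5 0.9 1.  1.1 1.2]: the gain beyond 1 + epsilon is cut off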
When implementing this, I referred to the following article: [Reinforcement learning] PPO to learn while implementing [Stick with CartPole: Complete with 1 file]. The implementation is structured as follows:
- main(): creates the worker threads and runs the overall process
- Worker(thread_type, thread_name, ppo_brain)
  - run_thread(): branches the processing according to the thread type
  - env_run(): lets the agent explore the environment
- ppo_agent(ppo_brain)
  - action(state): receives a state and outputs an action
  - greedy_action(state): outputs an action using the $\epsilon$-greedy method
  - update(memory): processes the data saved during exploration, taking the advantage into account, and sends it on for learning
- ppo_brain()
  - build_model(): defines the shape of the computation graph
  - update(): updates the network parameters
main
def main(args):
#Process to create a thread
with tf.device("/cpu:0"):
brain = ppo_brain()
thread=[]
for i in range(WORKER_NUM):
thread_name = "local_thread"+str(i)
thread.append(Worker(thread_type = "train",thread_name = thread_name,brain = brain))
COORD = tf.train.Coordinator()
SESS.run(tf.global_variables_initializer())
saver = tf.train.Saver()
#Load the previous training process, basically do this after defining the model
if args.load:
ckpt = tf.train.get_checkpoint_state(MODEL_DIR)
if ckpt:
saver.restore(SESS,MODEL_SAVE_PATH)
runnning_thread=[]
for worker in thread:
job = lambda: worker.run_thread()
t = threading.Thread(target=job)
t.start()
runnning_thread.append(t)
COORD.join(runnning_thread)
#Do a test when learning is over
test = Worker(thread_type = "test",thread_name = "test_thread",brain=brain)
test.run_thread()
if args.save:
saver.save(SESS,MODEL_SAVE_PATH)
print("saved")
Worker
class Worker:
def __init__(self,thread_type,thread_name,brain):
self.thread_type = thread_type
self.name = thread_name
self.agent = ppo_agent(brain)
self.env = gym.make(ENV_NAME)
#Save the video at the time of test
if self.thread_type == "test" and args.video:
self.env = wrappers.Monitor(self.env, VIDEO_DIR, force = True)
self.leaning_memory = np.zeros(10)
self.memory = []
self.total_trial = 0
def run_thread(self):
while True:
if self.thread_type == "train" and not isLearned:
self.env_run()
elif self.thread_type == "train" and isLearned:
sleep(3)
break
elif self.thread_type == "test" and not isLearned:
sleep(3)
elif self.thread_type == "test" and isLearned:
self.env_run()
break
def env_run(self):
global isLearned
global frame
self.total_trial += 1
step = 0
observation = self.env.reset()
while True:
step += 1
frame += 1
#Action choice
if self.thread_type == "train":
action=self.agent.greedy_action(observation)
elif self.thread_type == "test":
self.env.render()
sleep(0.01)
action=self.agent.action(observation)
next_observation,_,done,_ = self.env.step(action)
reward = 0
if done:
if step >= 199:
reward = 1 #At the time of success
else:
reward = -1 #At the time of failure
else:
#When it's not over
reward+=0
#Save the result in memory
self.memory.append([observation,action,reward,done,next_observation])
observation = next_observation
if done:
break
#Calculate the average score of 10 times
self.leaning_memory = np.hstack((self.leaning_memory[1:],step))
print("Thread:",self.name," Thread_trials:",self.total_trial," score:",step," mean_score:",self.leaning_memory.mean()," total_step:",frame)
#At the end of learning
if self.leaning_memory.mean() >= 199:
isLearned = True
sleep(3)
else:
#Parameter update
self.agent.update(self.memory)
self.memory = []
ppo_agent
class ppo_agent:
def __init__(self,brain):
self.brain=brain
self.memory=[]
#Act without random elements
def action(self,state):
prob,v = self.brain.predict(state)
return np.random.choice(ACTION_LIST,p = prob)
#Randomly act with a certain probability
def greedy_action(self,state):
if frame >= EPS_STEPS:
eps = EPS_END
else:
eps = EPS_START + frame* (EPS_END - EPS_START) / EPS_STEPS
if np.random.random() <= eps:
return np.random.choice(ACTION_LIST)
else:
return self.action(state)
#Process the exploration results and send them to the ppo_brain class
def update(self,memory):
R = sum([memory[j][2] * (GAMMA ** j) for j in range(ADVANTAGE + 1)])
self.memory.append([memory[0][0], memory[0][1], R,memory[0][3], memory[0][4], GAMMA ** ADVANTAGE])
#Consider the advantage
for i in range(1, len(memory) - ADVANTAGE):
R = ((R - memory[i-1][2]) / GAMMA) + memory[i + ADVANTAGE][2] * (GAMMA ** (ADVANTAGE - 1))
self.memory.append([memory[i][0], memory[i][1], R,memory[i + ADVANTAGE][3], memory[i][4],GAMMA ** ADVANTAGE])
for i in range(ADVANTAGE - 1):
R = ((R - memory[len(memory) - ADVANTAGE + i][2]) / GAMMA)
self.memory.append([memory[i][0], memory[i][1], R, True, memory[i][4], GAMMA ** (ADVANTAGE - i)])
#Send the data to the ppo_brain class to perform the update
self.brain.update(self.memory)
self.memory = []
ppo_brain
class ppo_brain:
def __init__(self):
self.build_model()
self.name="brain"
self.prob_old=1.0
def build_model(self):
self.input=tf.placeholder(dtype=tf.float32,shape=[None,STATE_NUM])
#Here, the model of the old parameter and the model of the new parameter are defined, and the action probability and the state value are output for the same input.
#New network
with tf.variable_scope("current_brain"):
hidden1=tf.layers.dense(self.input,HIDDEN_LAYERE,activation=tf.nn.leaky_relu)
self.prob=tf.layers.dense(hidden1,ACTION_NUM,activation=tf.nn.softmax)
self.v=tf.layers.dense(hidden1,1)
#Old network
with tf.variable_scope("old_brain"):
old_hidden1=tf.layers.dense(self.input,HIDDEN_LAYERE,activation=tf.nn.leaky_relu)
self.old_prob=tf.layers.dense(old_hidden1,ACTION_NUM,activation=tf.nn.softmax)
self.old_v=tf.layers.dense(old_hidden1,1)
self.reward=tf.placeholder(dtype=tf.float32,shape=(None,1))
self.action=tf.placeholder(dtype=tf.float32,shape=(None,ACTION_NUM))
###########Below is the definition of the loss function############
#Definition of advantage function
advantage = self.reward-self.v
#Definition part of the loss function of the policy
r_theta = tf.div(self.prob + 1e-10, tf.stop_gradient(self.old_prob) + 1e-10)
action_theta = tf.reduce_sum(tf.multiply(r_theta, self.action), axis = 1, keepdims = True)
#Calculate clip of r
r_clip = tf.clip_by_value(action_theta, 1 - EPSIRON, 1 + EPSIRON)
#The advantage is treated as a constant in the policy objective, so stop_gradient is used to block its gradient
advantage_cpi = tf.multiply(action_theta, tf.stop_gradient(advantage))
advantage_clip = tf.multiply(r_clip , tf.stop_gradient(advantage))
self.policy_loss = tf.minimum(advantage_clip , advantage_cpi)
#State value loss function
self.value_loss = tf.square(advantage)
#Definition of entropy
self.entropy = tf.reduce_sum(self.prob*tf.log(self.prob+1e-10),axis = 1,keepdims = True)
#Definition of the final loss function
self.loss = tf.reduce_sum(-self.policy_loss + LOSS_V * self.value_loss - LOSS_ENTROPY * self.entropy)
##############The following defines the actions required for updating##############
#Parameter update (minimized using Adam)
self.opt = tf.train.AdamOptimizer(learning_rate = LEARNING_RATE)
self.minimize = self.opt.minimize(self.loss)
#Get new and old parameters from their respective networks
self.weight_param = tf.get_collection(key = tf.GraphKeys.TRAINABLE_VARIABLES, scope = "current_brain")
self.old_weight_param = tf.get_collection(key = tf.GraphKeys.TRAINABLE_VARIABLES, scope = "old_brain")
#Substitute new network parameters for old network parameters
self.insert = [g_p.assign(l_p) for l_p,g_p in zip(self.weight_param,self.old_weight_param)]
#Output action probability and state value from state
def predict(self,state):
state=np.array(state).reshape(-1,STATE_NUM)
feed_dict={self.input:state}
p,v=SESS.run([self.prob,self.v],feed_dict)
return p.reshape(-1),v.reshape(-1)
#Create a batch by preprocessing before entering data
#Update
def update(self,memory):
length=len(memory)
s_=np.array([memory[j][0] for j in range(length)]).reshape(-1,STATE_NUM)
a_=np.eye(ACTION_NUM)[[memory[j][1] for j in range(length)]].reshape(-1,ACTION_NUM)
R_=np.array([memory[j][2] for j in range(length)]).reshape(-1,1)
d_=np.array([memory[j][3] for j in range(length)]).reshape(-1,1)
s_mask=np.array([memory[j][5] for j in range(length)]).reshape(-1,1)
_s=np.array([memory[j][4] for j in range(length)]).reshape(-1,STATE_NUM)
#Infer the later state value
_, v=self.predict(_s)
#Calculate rewards considering advantage
R=(np.where(d_,0,1)*v.reshape(-1,1))*s_mask+R_
#Parameter update
feed_dict={self.input:s_, self.action:a_, self.reward:R}
SESS.run(self.minimize,feed_dict)
#Network update
SESS.run(self.insert)
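To see what the target R computed in update() represents, here is a small NumPy check with made-up values; GAMMA and ADVANTAGE correspond to the constants defined in the full code below, and the mask is the one stored for the typical (non-tail) samples.

import numpy as np

GAMMA, ADVANTAGE = 0.99, 2
R_ = np.array([[0.0], [1.0]])      # n-step discounted reward sums built in ppo_agent.update (made-up)
d_ = np.array([[False], [True]])   # done flag stored for each sample
v = np.array([[0.8], [0.5]])       # predicted state value of the bootstrap state
s_mask = np.full((2, 1), GAMMA ** ADVANTAGE)

# Non-terminal samples get the discounted bootstrap term gamma^n * V(s_{t+n});
# terminal samples keep only the reward sum.
R = (np.where(d_, 0, 1) * v) * s_mask + R_
print(R)   # [[about 0.784], [1.0]]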
The whole code looks like this:
import argparse
import tensorflow as tf
import numpy as np
import random
import threading
import gym
from time import sleep
from gym import wrappers
from os import path
parser=argparse.ArgumentParser(description="Reinforcement training with PPO",add_help=True)
parser.add_argument("--model",type=str,required=True,help="model base name. required")
parser.add_argument("--env_name",default="CartPole-v0",help="environment name. default is CartPole-v0")
parser.add_argument("--save",action="store_true",default=False,help="save command")
parser.add_argument("--load",action="store_true",default=False,help="load command")
parser.add_argument("--thread_num",type=int,default=5)
parser.add_argument("--video",action="store_true",default=False, help="write this if you want to save as video")
args=parser.parse_args()
ENV_NAME=args.env_name
WORKER_NUM=args.thread_num
#define constants
VIDEO_DIR="./train_info/video"
MODEL_DIR="./train_info/models"
MODEL_SAVE_PATH=path.join(MODEL_DIR,args.model)
ADVANTAGE=2
STATE_NUM=4
ACTION_LIST=[0,1]
ACTION_NUM=2
#epsiron parameter
EPS_START = 0.5
EPS_END = 0.1
EPS_STEPS = 200 * WORKER_NUM**2
#learning parameter
GAMMA=0.99
LEARNING_RATE=0.002
#loss constants
LOSS_V=0.5
LOSS_ENTROPY=0.02
HIDDEN_LAYERE=30
EPSIRON = 0.2
class ppo_brain:
def __init__(self):
self.build_model()
self.name="brain"
self.prob_old=1.0
def build_model(self):
self.input=tf.placeholder(dtype=tf.float32,shape=[None,STATE_NUM])
#Here, the model of the old parameter and the model of the new parameter are defined, and the action probability and the state value are output for the same input.
with tf.variable_scope("current_brain"):
hidden1=tf.layers.dense(self.input,HIDDEN_LAYERE,activation=tf.nn.leaky_relu)
self.prob=tf.layers.dense(hidden1,ACTION_NUM,activation=tf.nn.softmax)
self.v=tf.layers.dense(hidden1,1)
with tf.variable_scope("old_brain"):
old_hidden1=tf.layers.dense(self.input,HIDDEN_LAYERE,activation=tf.nn.leaky_relu)
self.old_prob=tf.layers.dense(old_hidden1,ACTION_NUM,activation=tf.nn.softmax)
self.old_v=tf.layers.dense(old_hidden1,1)
self.reward=tf.placeholder(dtype=tf.float32,shape=(None,1))
self.action=tf.placeholder(dtype=tf.float32,shape=(None,ACTION_NUM))
###########Below is the definition of the loss function############
#Definition of advantage function
advantage = self.reward-self.v
#Definition part of the loss function of the policy
r_theta = tf.div(self.prob + 1e-10, tf.stop_gradient(self.old_prob) + 1e-10)
action_theta = tf.reduce_sum(tf.multiply(r_theta, self.action), axis = 1, keepdims = True)
r_clip = tf.clip_by_value(action_theta, 1 - EPSIRON, 1 + EPSIRON)
advantage_cpi = tf.multiply(action_theta, tf.stop_gradient(advantage))
advantage_clip = tf.multiply(r_clip , tf.stop_gradient(advantage))
self.policy_loss = tf.minimum(advantage_clip , advantage_cpi)
#State value loss function
self.value_loss = tf.square(advantage)
#Definition of entropy
self.entropy = tf.reduce_sum(self.prob*tf.log(self.prob+1e-10),axis = 1,keepdims = True)
#Definition of the final loss function
self.loss = tf.reduce_sum(-self.policy_loss + LOSS_V * self.value_loss - LOSS_ENTROPY * self.entropy)
##############The following defines the actions required for updating##############
#Parameter update (minimized using Adam)
self.opt = tf.train.AdamOptimizer(learning_rate = LEARNING_RATE)
self.minimize = self.opt.minimize(self.loss)
#Get new and old parameters from their respective networks
self.weight_param = tf.get_collection(key = tf.GraphKeys.TRAINABLE_VARIABLES, scope = "current_brain")
self.old_weight_param = tf.get_collection(key = tf.GraphKeys.TRAINABLE_VARIABLES, scope = "old_brain")
#Substitute new network parameters for old network parameters
self.insert = [g_p.assign(l_p) for l_p,g_p in zip(self.weight_param,self.old_weight_param)]
#Output action probability and state value from state
def predict(self,state):
state=np.array(state).reshape(-1,STATE_NUM)
feed_dict={self.input:state}
p,v=SESS.run([self.prob,self.v],feed_dict)
return p.reshape(-1),v.reshape(-1)
#Create a batch by preprocessing before entering data
#Update
def update(self,memory):
length=len(memory)
s_=np.array([memory[j][0] for j in range(length)]).reshape(-1,STATE_NUM)
a_=np.eye(ACTION_NUM)[[memory[j][1] for j in range(length)]].reshape(-1,ACTION_NUM)
R_=np.array([memory[j][2] for j in range(length)]).reshape(-1,1)
d_=np.array([memory[j][3] for j in range(length)]).reshape(-1,1)
s_mask=np.array([memory[j][5] for j in range(length)]).reshape(-1,1)
_s=np.array([memory[j][4] for j in range(length)]).reshape(-1,STATE_NUM)
#Infer the later state value
_, v=self.predict(_s)
#Calculate rewards considering advantage
R=(np.where(d_,0,1)*v.reshape(-1,1))*s_mask+R_
#Parameter update
feed_dict={self.input:s_, self.action:a_, self.reward:R}
SESS.run(self.minimize,feed_dict)
#Network update
SESS.run(self.insert)
class ppo_agent:
def __init__(self,brain):
self.brain=brain
self.memory=[]
#Act without random elements
def action(self,state):
prob,v = self.brain.predict(state)
return np.random.choice(ACTION_LIST,p = prob)
#Randomly act with a certain probability
def greedy_action(self,state):
if frame >= EPS_STEPS:
eps = EPS_END
else:
eps = EPS_START + frame* (EPS_END - EPS_START) / EPS_STEPS
if np.random.random() <= eps:
return np.random.choice(ACTION_LIST)
else:
return self.action(state)
#Process the exploration results and send them to the ppo_brain class
def update(self,memory):
R = sum([memory[j][2] * (GAMMA ** j) for j in range(ADVANTAGE + 1)])
self.memory.append([memory[0][0], memory[0][1], R,memory[0][3], memory[0][4], GAMMA ** ADVANTAGE])
#Consider the advantage
for i in range(1, len(memory) - ADVANTAGE):
R = ((R - memory[i-1][2]) / GAMMA) + memory[i + ADVANTAGE][2] * (GAMMA ** (ADVANTAGE - 1))
self.memory.append([memory[i][0], memory[i][1], R,memory[i + ADVANTAGE][3], memory[i][4],GAMMA ** ADVANTAGE])
for i in range(ADVANTAGE - 1):
R = ((R - memory[len(memory) - ADVANTAGE + i][2]) / GAMMA)
self.memory.append([memory[i][0], memory[i][1], R, True, memory[i][4], GAMMA ** (ADVANTAGE - i)])
#Send the data to the ppo_brain class to perform the update
self.brain.update(self.memory)
self.memory = []
class Worker:
def __init__(self,thread_type,thread_name,brain):
self.thread_type = thread_type
self.name = thread_name
self.agent = ppo_agent(brain)
self.env = gym.make(ENV_NAME)
#Save the video at the time of test
if self.thread_type == "test" and args.video:
self.env = wrappers.Monitor(self.env, VIDEO_DIR, force = True)
self.leaning_memory = np.zeros(10)
self.memory = []
self.total_trial = 0
def run_thread(self):
while True:
if self.thread_type == "train" and not isLearned:
self.env_run()
elif self.thread_type == "train" and isLearned:
sleep(3)
break
elif self.thread_type == "test" and not isLearned:
sleep(3)
elif self.thread_type == "test" and isLearned:
self.env_run()
break
def env_run(self):
global isLearned
global frame
self.total_trial += 1
step = 0
observation = self.env.reset()
while True:
step += 1
frame += 1
if self.thread_type == "train":
action=self.agent.greedy_action(observation)
elif self.thread_type == "test":
self.env.render()
sleep(0.01)
action=self.agent.action(observation)
next_observation,_,done,_ = self.env.step(action)
reward = 0
if done:
if step >= 199:
reward = 1 #At the time of success
else:
reward = -1 #At the time of failure
else:
#When it's not over
reward+=0
#Save the result in memory
self.memory.append([observation,action,reward,done,next_observation])
observation = next_observation
if done:
break
#Calculate the average score of 10 times
self.leaning_memory = np.hstack((self.leaning_memory[1:],step))
print("Thread:",self.name," Thread_trials:",self.total_trial," score:",step," mean_score:",self.leaning_memory.mean()," total_step:",frame)
#At the end of learning
if self.leaning_memory.mean() >= 199:
isLearned = True
sleep(3)
else:
#Parameter update
self.agent.update(self.memory)
self.memory = []
def main(args):
#Process to create a thread
with tf.device("/cpu:0"):
brain = ppo_brain()
thread=[]
for i in range(WORKER_NUM):
thread_name = "local_thread"+str(i)
thread.append(Worker(thread_type = "train",thread_name = thread_name,brain = brain))
COORD = tf.train.Coordinator()
SESS.run(tf.global_variables_initializer())
saver = tf.train.Saver()
#Load the previous training process, basically do this after defining the model
if args.load:
ckpt = tf.train.get_checkpoint_state(MODEL_DIR)
if ckpt:
saver.restore(SESS,MODEL_SAVE_PATH)
runnning_thread=[]
for worker in thread:
job = lambda: worker.run_thread()
t = threading.Thread(target=job)
t.start()
runnning_thread.append(t)
COORD.join(runnning_thread)
#Do a test when learning is over
test = Worker(thread_type = "test",thread_name = "test_thread",brain=brain)
test.run_thread()
if args.save:
saver.save(SESS,MODEL_SAVE_PATH)
print("saved")
if __name__=="__main__":
SESS=tf.Session()
frame=0
isLearned=False
main(args)
print("end")
That is how I implemented PPO this time. A major appeal of PPO seems to be that it achieves strong results despite its relatively simple mechanism. I also looked into TRPO a little, but its machinery seemed quite involved, so I will omit a detailed explanation here. Next, I would like to summarize PPO for continuous action spaces, or other methods.