PPO (Proximal Policy Optimization) is a deep reinforcement learning algorithm of the policy-based family. Policy-based algorithms directly optimize a policy function that outputs action probabilities given a state from the environment; other examples include A3C and TRPO. In contrast, value-based algorithms such as DQN take a different approach.
First, I will explain the related methods.
A3C relies on three key techniques: Actor-Critic, Advantage, and Asynchronous.
Actor-Critic
Actor-Critic refers to the network structure. In A3C, the objective function of the policy is
L_{policy}=A(t)\log\pi_{\theta}(a_{t}|s_{t})
where A(t) is called the advantage function. It is defined as
A(t)=(R(t)-V(s_{t}))
Since the value function V(s) is needed to compute A(t), the model is built so that the network outputs the state value at the same time as the policy (the action probability distribution). Sharing the network in this way speeds up learning.
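As a concrete (and purely illustrative) sketch of such a two-headed network, the NumPy snippet below uses arbitrary random weights and a CartPole-like 4-dimensional state; none of these names appear in the implementation later.

import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, HIDDEN, N_ACTIONS = 4, 30, 2

# Randomly initialized weights, used only for this illustration
W1 = rng.normal(size=(STATE_DIM, HIDDEN))
W_pi = rng.normal(size=(HIDDEN, N_ACTIONS))
W_v = rng.normal(size=(HIDDEN, 1))

def actor_critic_forward(state):
    h = np.maximum(0.0, state @ W1)       # shared hidden layer (ReLU)
    logits = h @ W_pi
    prob = np.exp(logits - logits.max())  # softmax -> policy head pi(a|s)
    prob /= prob.sum()
    v = (h @ W_v).item()                  # value head V(s)
    return prob, v

state = np.array([0.02, -0.01, 0.03, 0.04])  # a CartPole-like observation
prob, v = actor_critic_forward(state)
advantage = 1.0 - v                          # A(t) = R(t) - V(s_t) with a dummy return R(t) = 1.0
print(prob, v, advantage)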
Advantage
Normally, the state value function V(s) is updated using the following error:
loss=r(s_{t})+\gamma V(s_{t+1})-V(s_{t})
V(s) is trained to make this error small. However, with this method, when an episode has many steps it takes many updates before the influence of later rewards reaches the early steps, so learning becomes slow. Advantage therefore uses the following n-step error instead.
loss=\sum_{k=1}^n \gamma^{k-1} r(s_{t+k})+\gamma^n V(s_{t+n})-V(s_{t})
By increasing $n$ in this equation, the influence of rewards propagates back to earlier steps more quickly, but if $n$ is too large, learning slows down again. For a task like CartPole the benefit is small, and in some cases it may even be faster not to use Advantage.
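To make the n-step target above concrete, here is a small NumPy sketch with made-up rewards and values (n = 2 and γ = 0.99 are chosen arbitrarily):

import numpy as np

gamma, n = 0.99, 2
rewards = np.array([0.0, 1.0])   # dummy r(s_{t+1}), ..., r(s_{t+n})
v_t, v_tn = 0.5, 0.7             # dummy V(s_t) and V(s_{t+n})

# n-step error: sum_{k=1}^{n} gamma^{k-1} r(s_{t+k}) + gamma^n V(s_{t+n}) - V(s_t)
discounts = gamma ** np.arange(n)    # [1, gamma, ..., gamma^(n-1)]
target = np.sum(discounts * rewards) + gamma ** n * v_tn
error = target - v_t                 # the quantity V(s_t) is trained to reduce
print(target, error)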
Asynchronous
Asynchronous concerns how learning is organized. When a single agent explores on its own, the direction of learning tends to become biased. As a countermeasure, one shared neural network is trained by multiple agents exploring in parallel: after a fixed number of steps, or at the end of an episode, each agent computes the gradient of the objective function with respect to the parameters and applies it to the shared network.
Before each episode, every agent also copies the parameters of the shared network into its own local network and explores with that copy. This reduces the bias in learning, playing a role similar to DQN's replay buffer.
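The asynchronous part can be sketched independently of the reinforcement-learning details. The toy example below is my own illustration (it is not the article's code): several worker threads repeatedly copy a shared parameter vector, compute a stand-in gradient locally, and push the update back under a lock.

import threading
import numpy as np

shared_params = np.full(4, 5.0)   # parameters of the shared network (toy values)
lock = threading.Lock()

def worker(worker_id, steps=200, lr=0.05):
    rng = np.random.default_rng(worker_id)
    for _ in range(steps):
        with lock:
            local_params = shared_params.copy()   # copy shared -> local before exploring
        # stand-in for a locally computed gradient (here: gradient of ||params||^2 plus noise)
        grad = 2.0 * local_params + rng.normal(scale=0.1, size=local_params.shape)
        with lock:
            shared_params[:] = shared_params - lr * grad   # apply the update to the shared network

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared_params)   # every worker's updates have been applied asynchronously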
The objective function of A3C combines three terms: the policy term, the state-value term, and an entropy term for regularization.
\begin{align}
L_{policy} &= A(t)\log\pi_{\theta}(a_{t}|s_{t})\\
L_{value} &= (R(t)-V(s_{t}))^2\\
L_{entropy} &= \pi_{\theta}(a_{t}|s_{t})\log\pi_{\theta}(a_{t}|s_{t})
\end{align}
These three terms are combined into the final loss:
L_{all} = -L_{policy}+C_{value}L_{value}-C_{entropy}L_{entropy}
Here $C_{value}$ and $C_{entropy}$ are constants, and learning proceeds by minimizing this loss.
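For a single (state, action) sample these terms can be evaluated directly. The NumPy sketch below uses made-up numbers; the two constants match the LOSS_V and LOSS_ENTROPY values used in the implementation later.

import numpy as np

prob = np.array([0.7, 0.3])      # pi_theta(a|s) for each action (made-up)
action = 0                        # index of the action actually taken
R, V = 1.0, 0.6                   # dummy discounted return R(t) and state value V(s_t)
C_VALUE, C_ENTROPY = 0.5, 0.02    # correspond to LOSS_V and LOSS_ENTROPY below

advantage = R - V
L_policy = advantage * np.log(prob[action])
L_value = (R - V) ** 2
L_entropy = np.sum(prob * np.log(prob))

L_all = -L_policy + C_VALUE * L_value - C_ENTROPY * L_entropy
print(L_policy, L_value, L_entropy, L_all)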
In A3C, the policy gradient is
\nabla_{\theta} L_{policy} = A(t)\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})
Because of the $\log$ term, a single update can change the policy drastically. PPO therefore restricts how far each update may move the policy, which prevents these over-large updates. Its objective function also differs considerably from A3C's.
r_{t}(\theta)=\frac{\pi_{\theta_{new}}(a_{t}|s_{t})}{\pi_{\theta_{old}}(a_{t}|s_{t})}
L^{CPI}=\mathbb E \big[\,r_{t}(\theta)A(t)\, \big]
PPO uses this as a surrogate objective and applies the clip function when updating to form the policy objective.
clip function
clip(x,a,b)=\left\{
\begin{array}{ll}
b & (x > b) \\
x & (a \leq x \leq b) \\
a & (x < a)
\end{array}
\right.
Defined this way, the result always stays between $a$ and $b$ no matter how $x$ changes. Using this function, the policy objective becomes
L_{policy}=\mathbb E \big[\, \min \big(\, r_{t}(\theta)A(t),\; clip(r_{t}(\theta),1-\epsilon,1+\epsilon)A(t)\, \big) \big]
Here $\epsilon$ is a small constant (0.2 in the implementation below). The loss for the state value function is essentially the same as in A3C.
In addition, PPO also learns using Advantage in the same way as A3C.
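To make the effect of the clipping concrete, here is a quick NumPy check with made-up values of the ratio and a positive advantage:

import numpy as np

EPSILON = 0.2
A = 1.0                                        # a positive advantage (made-up)
ratios = np.array([0.5, 0.9, 1.0, 1.1, 1.5])   # candidate values of r_t(theta)

clipped = np.clip(ratios, 1 - EPSILON, 1 + EPSILON)
L_policy = np.minimum(ratios * A, clipped * A)
print(L_policy)   # [0.5 0.9 1.  1.1 1.2]: the gain beyond 1 + epsilon is cut off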
When implementing this, I referred to the following article: [Reinforcement learning] PPO to learn while implementing [Stick with CartPole: Complete with 1 file]. The implementation is structured as follows:
- main(): creates the worker threads and runs the overall process
- Worker(thread_type, thread_name, ppo_brain)
  - run_thread(): branches the processing according to the thread type
  - env_run(): lets the agent explore the environment
- ppo_agent(ppo_brain)
  - action(state): receives a state and outputs an action
  - greedy_action(state): outputs an action using the $\epsilon$-greedy method
  - update(memory): processes the data saved during exploration, taking the advantage into account, and sends it on for learning
- ppo_brain()
  - build_model(): defines the shape of the computation graph
  - update(): updates the network parameters
main
def main(args):
#Process to create a thread
with tf.device("/cpu:0"):
brain = ppo_brain()
thread=[]
for i in range(WORKER_NUM):
thread_name = "local_thread"+str(i)
thread.append(Worker(thread_type = "train",thread_name = thread_name,brain = brain))
COORD = tf.train.Coordinator()
SESS.run(tf.global_variables_initializer())
saver = tf.train.Saver()
#Load the previous training process, basically do this after defining the model
if args.load:
ckpt = tf.train.get_checkpoint_state(MODEL_DIR)
if ckpt:
saver.restore(SESS,MODEL_SAVE_PATH)
runnning_thread=[]
for worker in thread:
job = lambda: worker.run_thread()
t = threading.Thread(target=job)
t.start()
runnning_thread.append(t)
COORD.join(runnning_thread)
#Do a test when learning is over
test = Worker(thread_type = "test",thread_name = "test_thread",brain=brain)
test.run_thread()
if args.save:
saver.save(SESS,MODEL_SAVE_PATH)
print("saved")
Worker
class Worker:
def __init__(self,thread_type,thread_name,brain):
self.thread_type = thread_type
self.name = thread_name
self.agent = ppo_agent(brain)
self.env = gym.make(ENV_NAME)
#Save the video at the time of test
if self.thread_type == "test" and args.video:
self.env = wrappers.Monitor(self.env, VIDEO_DIR, force = True)
self.leaning_memory = np.zeros(10)
self.memory = []
self.total_trial = 0
def run_thread(self):
while True:
if self.thread_type == "train" and not isLearned:
self.env_run()
elif self.thread_type == "train" and isLearned:
sleep(3)
break
elif self.thread_type == "test" and not isLearned:
sleep(3)
elif self.thread_type == "test" and isLearned:
self.env_run()
break
def env_run(self):
global isLearned
global frame
self.total_trial += 1
step = 0
observation = self.env.reset()
while True:
step += 1
frame += 1
#Action choice
if self.thread_type == "train":
action=self.agent.greedy_action(observation)
elif self.thread_type == "test":
self.env.render()
sleep(0.01)
action=self.agent.action(observation)
next_observation,_,done,_ = self.env.step(action)
reward = 0
if done:
if step >= 199:
reward = 1 #At the time of success
else:
reward = -1 #At the time of failure
else:
#When it's not over
reward+=0
#Save the result in memory
self.memory.append([observation,action,reward,done,next_observation])
observation = next_observation
if done:
break
#Calculate the average score of 10 times
self.leaning_memory = np.hstack((self.leaning_memory[1:],step))
print("Thread:",self.name," Thread_trials:",self.total_trial," score:",step," mean_score:",self.leaning_memory.mean()," total_step:",frame)
#At the end of learning
if self.leaning_memory.mean() >= 199:
isLearned = True
sleep(3)
else:
#Parameter update
self.agent.update(self.memory)
self.memory = []
ppo_agent
class ppo_agent:
def __init__(self,brain):
self.brain=brain
self.memory=[]
#Act without random elements
def action(self,state):
prob,v = self.brain.predict(state)
return np.random.choice(ACTION_LIST,p = prob)
#Randomly act with a certain probability
def greedy_action(self,state):
if frame >= EPS_STEPS:
eps = EPS_END
else:
eps = EPS_START + frame* (EPS_END - EPS_START) / EPS_STEPS
if np.random.random() <= eps:
return np.random.choice(ACTION_LIST)
else:
return self.action(state)
#Process the exploration results and send them to the ppo_brain class
def update(self,memory):
R = sum([memory[j][2] * (GAMMA ** j) for j in range(ADVANTAGE + 1)])
self.memory.append([memory[0][0], memory[0][1], R,memory[0][3], memory[0][4], GAMMA ** ADVANTAGE])
#Consider the advantage
for i in range(1, len(memory) - ADVANTAGE):
R = ((R - memory[i-1][2]) / GAMMA) + memory[i + ADVANTAGE][2] * (GAMMA ** (ADVANTAGE - 1))
self.memory.append([memory[i][0], memory[i][1], R,memory[i + ADVANTAGE][3], memory[i][4],GAMMA ** ADVANTAGE])
for i in range(ADVANTAGE - 1):
R = ((R - memory[len(memory) - ADVANTAGE + i][2]) / GAMMA)
self.memory.append([memory[i][0], memory[i][1], R, True, memory[i][4], GAMMA ** (ADVANTAGE - i)])
#Send the data to the ppo_brain class to perform the update
self.brain.update(self.memory)
self.memory = []
ppo_brain
class ppo_brain:
def __init__(self):
self.build_model()
self.name="brain"
self.prob_old=1.0
def build_model(self):
self.input=tf.placeholder(dtype=tf.float32,shape=[None,STATE_NUM])
#Here, the model of the old parameter and the model of the new parameter are defined, and the action probability and the state value are output for the same input.
#New network
with tf.variable_scope("current_brain"):
hidden1=tf.layers.dense(self.input,HIDDEN_LAYERE,activation=tf.nn.leaky_relu)
self.prob=tf.layers.dense(hidden1,ACTION_NUM,activation=tf.nn.softmax)
self.v=tf.layers.dense(hidden1,1)
#Old network
with tf.variable_scope("old_brain"):
old_hidden1=tf.layers.dense(self.input,HIDDEN_LAYERE,activation=tf.nn.leaky_relu)
self.old_prob=tf.layers.dense(old_hidden1,ACTION_NUM,activation=tf.nn.softmax)
self.old_v=tf.layers.dense(old_hidden1,1)
self.reward=tf.placeholder(dtype=tf.float32,shape=(None,1))
self.action=tf.placeholder(dtype=tf.float32,shape=(None,ACTION_NUM))
###########Below is the definition of the loss function############
#Definition of advantage function
advantage = self.reward-self.v
#Definition part of the loss function of the policy
r_theta = tf.div(self.prob + 1e-10, tf.stop_gradient(self.old_prob) + 1e-10)
action_theta = tf.reduce_sum(tf.multiply(r_theta, self.action), axis = 1, keepdims = True)
#Calculate clip of r
r_clip = tf.clip_by_value(action_theta, 1 - EPSIRON, 1 + EPSIRON)
#The advantage is treated as a constant in the policy objective, so stop_gradient is used to block its gradient
advantage_cpi = tf.multiply(action_theta, tf.stop_gradient(advantage))
advantage_clip = tf.multiply(r_clip , tf.stop_gradient(advantage))
self.policy_loss = tf.minimum(advantage_clip , advantage_cpi)
#State value loss function
self.value_loss = tf.square(advantage)
#Definition of entropy
self.entropy = tf.reduce_sum(self.prob*tf.log(self.prob+1e-10),axis = 1,keepdims = True)
#Definition of the final loss function
self.loss = tf.reduce_sum(-self.policy_loss + LOSS_V * self.value_loss - LOSS_ENTROPY * self.entropy)
##############The following defines the actions required for updating##############
#Parameter update (minimized using Adam)
self.opt = tf.train.AdamOptimizer(learning_rate = LEARNING_RATE)
self.minimize = self.opt.minimize(self.loss)
#Get new and old parameters from their respective networks
self.weight_param = tf.get_collection(key = tf.GraphKeys.TRAINABLE_VARIABLES, scope = "current_brain")
self.old_weight_param = tf.get_collection(key = tf.GraphKeys.TRAINABLE_VARIABLES, scope = "old_brain")
#Substitute new network parameters for old network parameters
self.insert = [g_p.assign(l_p) for l_p,g_p in zip(self.weight_param,self.old_weight_param)]
#Output action probability and state value from state
def predict(self,state):
state=np.array(state).reshape(-1,STATE_NUM)
feed_dict={self.input:state}
p,v=SESS.run([self.prob,self.v],feed_dict)
return p.reshape(-1),v.reshape(-1)
#Create a batch by preprocessing before entering data
#Update
def update(self,memory):
length=len(memory)
s_=np.array([memory[j][0] for j in range(length)]).reshape(-1,STATE_NUM)
a_=np.eye(ACTION_NUM)[[memory[j][1] for j in range(length)]].reshape(-1,ACTION_NUM)
R_=np.array([memory[j][2] for j in range(length)]).reshape(-1,1)
d_=np.array([memory[j][3] for j in range(length)]).reshape(-1,1)
s_mask=np.array([memory[j][5] for j in range(length)]).reshape(-1,1)
_s=np.array([memory[j][4] for j in range(length)]).reshape(-1,STATE_NUM)
#Infer the later state value
_, v=self.predict(_s)
#Calculate rewards considering advantage
R=(np.where(d_,0,1)*v.reshape(-1,1))*s_mask+R_
#Parameter update
feed_dict={self.input:s_, self.action:a_, self.reward:R}
SESS.run(self.minimize,feed_dict)
#Network update
SESS.run(self.insert)
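To see what the target R computed in update() represents, here is a small NumPy check with made-up values; GAMMA and ADVANTAGE correspond to the constants defined in the full code below, and the mask is the one stored for the typical (non-tail) samples.

import numpy as np

GAMMA, ADVANTAGE = 0.99, 2
R_ = np.array([[0.0], [1.0]])      # n-step discounted reward sums built in ppo_agent.update (made-up)
d_ = np.array([[False], [True]])   # done flag stored for each sample
v = np.array([[0.8], [0.5]])       # predicted state value of the bootstrap state
s_mask = np.full((2, 1), GAMMA ** ADVANTAGE)

# Non-terminal samples get the discounted bootstrap term gamma^n * V(s_{t+n});
# terminal samples keep only the reward sum.
R = (np.where(d_, 0, 1) * v) * s_mask + R_
print(R)   # [[about 0.784], [1.0]]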
The whole code looks like this:
import argparse
import tensorflow as tf
import numpy as np
import random
import threading
import gym
from time import sleep
from gym import wrappers
from os import path
parser=argparse.ArgumentParser(description="Reinforcement training with PPO",add_help=True)
parser.add_argument("--model",type=str,required=True,help="model base name. required")
parser.add_argument("--env_name",default="CartPole-v0",help="environment name. default is CartPole-v0")
parser.add_argument("--save",action="store_true",default=False,help="save command")
parser.add_argument("--load",action="store_true",default=False,help="load command")
parser.add_argument("--thread_num",type=int,default=5)
parser.add_argument("--video",action="store_true",default=False, help="write this if you want to save as video")
args=parser.parse_args()
ENV_NAME=args.env_name
WORKER_NUM=args.thread_num
#define constants
VIDEO_DIR="./train_info/video"
MODEL_DIR="./train_info/models"
MODEL_SAVE_PATH=path.join(MODEL_DIR,args.model)
ADVANTAGE=2
STATE_NUM=4
ACTION_LIST=[0,1]
ACTION_NUM=2
#epsiron parameter
EPS_START = 0.5
EPS_END = 0.1
EPS_STEPS = 200 * WORKER_NUM**2
#learning parameter
GAMMA=0.99
LEARNING_RATE=0.002
#loss constants
LOSS_V=0.5
LOSS_ENTROPY=0.02
HIDDEN_LAYERE=30
EPSIRON = 0.2
class ppo_brain:
def __init__(self):
self.build_model()
self.name="brain"
self.prob_old=1.0
def build_model(self):
self.input=tf.placeholder(dtype=tf.float32,shape=[None,STATE_NUM])
#Here, the model of the old parameter and the model of the new parameter are defined, and the action probability and the state value are output for the same input.
with tf.variable_scope("current_brain"):
hidden1=tf.layers.dense(self.input,HIDDEN_LAYERE,activation=tf.nn.leaky_relu)
self.prob=tf.layers.dense(hidden1,ACTION_NUM,activation=tf.nn.softmax)
self.v=tf.layers.dense(hidden1,1)
with tf.variable_scope("old_brain"):
old_hidden1=tf.layers.dense(self.input,HIDDEN_LAYERE,activation=tf.nn.leaky_relu)
self.old_prob=tf.layers.dense(old_hidden1,ACTION_NUM,activation=tf.nn.softmax)
self.old_v=tf.layers.dense(old_hidden1,1)
self.reward=tf.placeholder(dtype=tf.float32,shape=(None,1))
self.action=tf.placeholder(dtype=tf.float32,shape=(None,ACTION_NUM))
###########Below is the definition of the loss function############
#Definition of advantage function
advantage = self.reward-self.v
#Definition part of the loss function of the policy
r_theta = tf.div(self.prob + 1e-10, tf.stop_gradient(self.old_prob) + 1e-10)
action_theta = tf.reduce_sum(tf.multiply(r_theta, self.action), axis = 1, keepdims = True)
r_clip = tf.clip_by_value(action_theta, 1 - EPSIRON, 1 + EPSIRON)
advantage_cpi = tf.multiply(action_theta, tf.stop_gradient(advantage))
advantage_clip = tf.multiply(r_clip , tf.stop_gradient(advantage))
self.policy_loss = tf.minimum(advantage_clip , advantage_cpi)
#State value loss function
self.value_loss = tf.square(advantage)
#Definition of entropy
self.entropy = tf.reduce_sum(self.prob*tf.log(self.prob+1e-10),axis = 1,keepdims = True)
#Definition of the final loss function
self.loss = tf.reduce_sum(-self.policy_loss + LOSS_V * self.value_loss - LOSS_ENTROPY * self.entropy)
##############The following defines the actions required for updating##############
#Parameter update (minimized using Adam)
self.opt = tf.train.AdamOptimizer(learning_rate = LEARNING_RATE)
self.minimize = self.opt.minimize(self.loss)
#Get new and old parameters from their respective networks
self.weight_param = tf.get_collection(key = tf.GraphKeys.TRAINABLE_VARIABLES, scope = "current_brain")
self.old_weight_param = tf.get_collection(key = tf.GraphKeys.TRAINABLE_VARIABLES, scope = "old_brain")
#Substitute new network parameters for old network parameters
self.insert = [g_p.assign(l_p) for l_p,g_p in zip(self.weight_param,self.old_weight_param)]
#Output action probability and state value from state
def predict(self,state):
state=np.array(state).reshape(-1,STATE_NUM)
feed_dict={self.input:state}
p,v=SESS.run([self.prob,self.v],feed_dict)
return p.reshape(-1),v.reshape(-1)
#Create a batch by preprocessing before entering data
#Update
def update(self,memory):
length=len(memory)
s_=np.array([memory[j][0] for j in range(length)]).reshape(-1,STATE_NUM)
a_=np.eye(ACTION_NUM)[[memory[j][1] for j in range(length)]].reshape(-1,ACTION_NUM)
R_=np.array([memory[j][2] for j in range(length)]).reshape(-1,1)
d_=np.array([memory[j][3] for j in range(length)]).reshape(-1,1)
s_mask=np.array([memory[j][5] for j in range(length)]).reshape(-1,1)
_s=np.array([memory[j][4] for j in range(length)]).reshape(-1,STATE_NUM)
#Infer the later state value
_, v=self.predict(_s)
#Calculate rewards considering advantage
R=(np.where(d_,0,1)*v.reshape(-1,1))*s_mask+R_
#Parameter update
feed_dict={self.input:s_, self.action:a_, self.reward:R}
SESS.run(self.minimize,feed_dict)
#Network update
SESS.run(self.insert)
class ppo_agent:
def __init__(self,brain):
self.brain=brain
self.memory=[]
#Act without random elements
def action(self,state):
prob,v = self.brain.predict(state)
return np.random.choice(ACTION_LIST,p = prob)
#Randomly act with a certain probability
def greedy_action(self,state):
if frame >= EPS_STEPS:
eps = EPS_END
else:
eps = EPS_START + frame* (EPS_END - EPS_START) / EPS_STEPS
if np.random.random() <= eps:
return np.random.choice(ACTION_LIST)
else:
return self.action(state)
#Process the exploration results and send them to the ppo_brain class
def update(self,memory):
R = sum([memory[j][2] * (GAMMA ** j) for j in range(ADVANTAGE + 1)])
self.memory.append([memory[0][0], memory[0][1], R,memory[0][3], memory[0][4], GAMMA ** ADVANTAGE])
#Consider the advantage
for i in range(1, len(memory) - ADVANTAGE):
R = ((R - memory[i-1][2]) / GAMMA) + memory[i + ADVANTAGE][2] * (GAMMA ** (ADVANTAGE - 1))
self.memory.append([memory[i][0], memory[i][1], R,memory[i + ADVANTAGE][3], memory[i][4],GAMMA ** ADVANTAGE])
for i in range(ADVANTAGE - 1):
R = ((R - memory[len(memory) - ADVANTAGE + i][2]) / GAMMA)
self.memory.append([memory[i][0], memory[i][1], R, True, memory[i][4], GAMMA ** (ADVANTAGE - i)])
#Send the data to the ppo_brain class to perform the update
self.brain.update(self.memory)
self.memory = []
class Worker:
def __init__(self,thread_type,thread_name,brain):
self.thread_type = thread_type
self.name = thread_name
self.agent = ppo_agent(brain)
self.env = gym.make(ENV_NAME)
#Save the video at the time of test
if self.thread_type == "test" and args.video:
self.env = wrappers.Monitor(self.env, VIDEO_DIR, force = True)
self.leaning_memory = np.zeros(10)
self.memory = []
self.total_trial = 0
def run_thread(self):
while True:
if self.thread_type == "train" and not isLearned:
self.env_run()
elif self.thread_type == "train" and isLearned:
sleep(3)
break
elif self.thread_type == "test" and not isLearned:
sleep(3)
elif self.thread_type == "test" and isLearned:
self.env_run()
break
def env_run(self):
global isLearned
global frame
self.total_trial += 1
step = 0
observation = self.env.reset()
while True:
step += 1
frame += 1
if self.thread_type == "train":
action=self.agent.greedy_action(observation)
elif self.thread_type == "test":
self.env.render()
sleep(0.01)
action=self.agent.action(observation)
next_observation,_,done,_ = self.env.step(action)
reward = 0
if done:
if step >= 199:
reward = 1 #At the time of success
else:
reward = -1 #At the time of failure
else:
#When it's not over
reward+=0
#Save the result in memory
self.memory.append([observation,action,reward,done,next_observation])
observation = next_observation
if done:
break
#Calculate the average score of 10 times
self.leaning_memory = np.hstack((self.leaning_memory[1:],step))
print("Thread:",self.name," Thread_trials:",self.total_trial," score:",step," mean_score:",self.leaning_memory.mean()," total_step:",frame)
#At the end of learning
if self.leaning_memory.mean() >= 199:
isLearned = True
sleep(3)
else:
#Parameter update
self.agent.update(self.memory)
self.memory = []
def main(args):
#Process to create a thread
with tf.device("/cpu:0"):
brain = ppo_brain()
thread=[]
for i in range(WORKER_NUM):
thread_name = "local_thread"+str(i)
thread.append(Worker(thread_type = "train",thread_name = thread_name,brain = brain))
COORD = tf.train.Coordinator()
SESS.run(tf.global_variables_initializer())
saver = tf.train.Saver()
#Load the previous training process, basically do this after defining the model
if args.load:
ckpt = tf.train.get_checkpoint_state(MODEL_DIR)
if ckpt:
saver.restore(SESS,MODEL_SAVE_PATH)
runnning_thread=[]
for worker in thread:
job = lambda: worker.run_thread()
t = threading.Thread(target=job)
t.start()
runnning_thread.append(t)
COORD.join(runnning_thread)
#Do a test when learning is over
test = Worker(thread_type = "test",thread_name = "test_thread",brain=brain)
test.run_thread()
if args.save:
saver.save(SESS,MODEL_SAVE_PATH)
print("saved")
if __name__=="__main__":
SESS=tf.Session()
frame=0
isLearned=False
main(args)
print("end")
That is how I implemented PPO this time. A major appeal of PPO seems to be that it achieves strong results despite its relatively simple mechanism. I also looked into TRPO a little, but its machinery seemed quite involved, so I will omit a detailed explanation here. Next, I would like to summarize PPO for continuous action spaces, or other methods.