[Reinforcement learning] I implemented / explained R2D3 (Keras-RL)

R2D3, the successor to R2D2, appears to have been announced. I was curious, so I implemented it.

Whole code

The code created in this article is below.

Introduction

R2D3 belongs to the so-called DQN family of reinforcement learning methods. The techniques leading up to it are explained in the following series, so please refer to it as needed.

About R2D3

R2D3 is a reinforcement learning method announced by Google DeepMind in September 2019. Roughly speaking, it is a combination of R2D2 and DQfD.

DQfD (Deep Q-learning from Demonstrations) is a DQN-based method that learns more efficiently by referring to the play (demonstrations) of a skilled player, with the ultimate aim of outperforming the demonstrations themselves.

Incidentally, the unabbreviated name is Recurrent Replay Distributed DQN from Demonstrations (R2D3).

Overall flow of learning

20191012205515.png

The above is the overall view of R2D3 (quoted from the paper). The right side of the figure (the purple and blue parts) is the same as R2D2; the difference is the red part on the left side.

As a premise, the demo replay is filled in advance with reference play data (demonstrations).

Up to R2D2, a training batch of batch-size samples was drawn entirely from the agent replay. R2D3 instead builds each batch from both the demo replay and the agent replay, mixed according to the demo-ratio.

About demo-ratio

The paper compares fixed demo-ratio values of 1/16, 1/32, 1/64, 1/128, and 1/256. It reports that 1/256 gave the best results on some of the tasks.
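
To get a feel for how small these ratios are, the sketch below computes how many demo samples a batch contains on average. The batch size of 64 is an assumption chosen for illustration; it is not a value taken from the paper or from this article.

```python
from fractions import Fraction

BATCH_SIZE = 64  # assumed for illustration only

demo_ratios = [Fraction(1, d) for d in (16, 32, 64, 128, 256)]


def expected_demo_samples(batch_size, ratio):
    """Expected demo samples when each batch slot is a demo with probability `ratio`."""
    return batch_size * ratio


for ratio in demo_ratios:
    average = float(expected_demo_samples(BATCH_SIZE, ratio))
    print(f"demo-ratio {ratio}: {average:.2f} demo samples per batch on average")
```

With this assumed batch size, a demo-ratio of 1/256 means a demo sample appears in only about one batch out of four.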

What follows is my own opinion. From a human point of view, demo play is helpful at first, but you stop looking at it once you get used to the task. So, in my implementation, the demo-ratio can be annealed. (Depending on the settings, it can also be kept fixed.)
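
As a rough illustration of this idea, here is a minimal sketch of a linear annealing schedule. The function name and signature are hypothetical, and the numbers simply mirror the MountainCar settings used later in this article.

```python
def annealed_demo_ratio(step, initial, final=None, anneal_steps=1):
    """Linearly interpolate the demo ratio from `initial` to `final`.

    If `final` is None, the ratio stays fixed at `initial`.
    """
    if final is None:
        final = initial
    # Fraction of the annealing period that has elapsed, clipped to [0, 1].
    progress = min(max(step / anneal_steps, 0.0), 1.0)
    return initial + (final - initial) * progress


# Illustrative values: start fully on demos, end at 1/512 after 100,000 steps.
for step in (0, 50_000, 100_000, 150_000):
    print(step, annealed_demo_ratio(step, 1.0, 1.0 / 512.0, 100_000))
```

Leaving `final` unset gives the fixed-ratio behaviour, which corresponds to the setting used in the paper.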

Implementation of demonstration environment

First, the demonstration data has to be prepared. I built the recording tool with reference to the manual-play code provided by OpenAI.

Demo play data structure

The saved data is split into two parts, one for learning and one for playback, with the following structure.

・ For learning (saved for each frame)

name          Contents
action        Action
observation   Observation (state)
reward        Reward
done          Whether the episode has ended

・ For playback (overall information)

name          Contents
episode       Episode number
rgb_size      Image size
states        Array of per-frame information (contains the items below)

・ For playback (saved for each frame)

name           Contents
step           Frame number
reward_total   Total reward so far
info           The info value returned by gym for the frame
rgb            Image data
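
To make the layout above concrete, here is a minimal sketch of what one saved frame and one saved episode might look like as Python dictionaries. Only the field names come from the tables; every value, the nesting of the containers, and the `(width, height)` reading of `rgb_size` are assumptions for illustration, and the on-disk format is not shown here.

```python
# Hypothetical values for a MountainCar-like task.

# For learning: one record per frame.
learning_frame = {
    "action": 2,
    "observation": [-0.52, 0.0],
    "reward": -1.0,
    "done": False,
}
learning_episode = [learning_frame]

# For playback: one record per frame ...
playback_frame = {
    "step": 0,
    "reward_total": -1.0,
    "info": {},
    "rgb": None,  # the rendered image would go here; its concrete type is not specified
}

# ... wrapped in overall episode information.
playback_episode = {
    "episode": 0,
    "rgb_size": (600, 400),  # assumed (width, height) layout
    "states": [playback_frame],
}

print(sorted(learning_frame))
print(sorted(playback_episode))
```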

Adding demo play data to memory

(The code is the add_memory function in env_play.py.)

When adding demo play data to memory, it has to be stored by following the same steps as the actual agent. (Reference: [DQN (Rainbow) implementation explanation](https://qiita.com/pocokhc/items/408f0f818140924ad4c4#dqnrainbow%E3%81%AE%E5%AE%9F%E8%A3%85%E8%A7%A3%E8%AA%AC))

It is a bit redundant, but the same mechanism is built separately and used to add the data to memory. The flow is shown below in pseudo code. (The stateful LSTM case is omitted because it is complicated.)

add_memory


def add_memory(episode_file, memory, agent):
  # Load the demo play information from episode_file
  episode = (frame information loaded from episode_file)

  # Buffers used to build experience data
  recent_actions = (array holding the most recent actions)
  recent_rewards = (array holding the most recent rewards)
  recent_rewards_multistep = (value used for the multi-step calculation)
  recent_observations = (array holding the most recent observations)

  for frame in episode:
    observation = frame["observation"]
    action      = frame["action"]
    reward      = frame["reward"]

    # Add the observation
    recent_observations.pop(0)
    recent_observations.append(observation)

    # Create an experience
    exp = (
      recent_observations[:agent.input_sequence],   # previous state
      recent_actions[0],                            # action taken in the previous state
      recent_rewards_multistep,                     # reward
      recent_observations[-agent.input_sequence:],  # next state
    )

    # Add the experience to memory
    memory.add(exp)

    # Add the action and reward
    recent_actions.pop(0)
    recent_actions.append(action)
    recent_rewards.pop(0)
    recent_rewards.append(reward)

    recent_rewards_multistep = (multi-step learning calculation)
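
The multi-step calculation is left abstract in the pseudo code. As a sketch only, assuming it is the usual n-step discounted sum of rewards, it could look like the following. The function name, the window length, and the discount factor are all assumptions for illustration.

```python
from collections import deque


def multistep_reward(rewards, gamma):
    """Discounted sum r_0 + gamma * r_1 + gamma^2 * r_2 + ... (oldest reward first)."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


N_STEP = 3   # assumed window length
GAMMA = 0.5  # assumed discount factor, chosen so the arithmetic is easy to check

# A fixed-length window plays the role of recent_rewards in the pseudo code.
recent_rewards = deque([0.0] * N_STEP, maxlen=N_STEP)
for reward in (-1.0, -1.0, -1.0):
    recent_rewards.append(reward)  # the oldest entry is dropped automatically

print(multistep_reward(recent_rewards, GAMMA))  # -1 - 0.5 - 0.25 = -1.75
```

A `deque` with `maxlen` replaces the `pop(0)` / `append` pair used in the pseudo code; both keep a sliding window of the most recent values.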

Implementation of play environment

(The code is the EpisodeSave class in env_play.py.)

This is the class used to actually play the game. It has the following functions.

The screen is as follows.

play1.PNG

Execution code example

The code to execute looks like the following.

import gym
from src.env_play import EpisodeSave

def run_play():
  env = gym.make("MountainCar-v0")
  processor = None  # Specify a Processor here if you have one

  es = EpisodeSave(
    env,
    episode_save_dir="tmp",
    processor=processor
  )
  es.play()
  env.close()

run_play()

Game key bindings

The key bindings of the game can be specified through the Processor. If the Processor has a get_keys_to_action method, it is used.

get_keys_to_action


import rl.core

class MyProcessor(rl.core.Processor):
  def get_keys_to_action(self):
    return {
      (): 0,           # action 0 when no key is pressed
      (ord('d'),): 1,  # the d key is action 1
      (ord('a'),): 2,  # the a key is action 2
    }
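
How EpisodeSave consumes this dictionary is not shown here. A common convention for mappings of this shape is to look up the sorted tuple of currently pressed key codes, so the sketch below assumes that convention. The helper name and the fallback to action 0 are hypothetical.

```python
def action_from_pressed_keys(keys_to_action, pressed_keys, default_action=0):
    """Look up the action for the currently pressed keys.

    Sorting makes the lookup independent of the order in which keys were pressed.
    """
    return keys_to_action.get(tuple(sorted(pressed_keys)), default_action)


keys_to_action = {
    (): 0,
    (ord('d'),): 1,
    (ord('a'),): 2,
}

print(action_from_pressed_keys(keys_to_action, set()))                 # 0
print(action_from_pressed_keys(keys_to_action, {ord('d')}))            # 1
print(action_from_pressed_keys(keys_to_action, {ord('a'), ord('d')}))  # 0 (unmapped combination)
```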

Playback of saved play data

(The code is the EpisodeReplay class in env_play.py.)

I also created a mechanism to replay the episodes saved by EpisodeSave, mainly for checking them.

Code execution example

from src.env_play import EpisodeReplay

def replay():
    r = EpisodeReplay(episode_save_dir="tmp")
    r.play()

replay()

DemoReplay memory implementation

Implementation on Rainbow

As usual, I start with the Rainbow version, which is easier to follow because it has no parallel processing.

  1. Add the following new parameters.

name                 Contents
demo_memory          Memory type (same choices as the replay memory)
demo_episode_dir     Directory path where EpisodeSave (above) saved the episodes
demo_ratio_initial   Initial demo ratio
demo_ratio_final     Final demo ratio
demo_ratio_steps     Number of steps to reach the final ratio

As with the replay memory, demo_memory can be chosen from ReplayMemory, PERGreedyMemory, PERProportionalMemory, and PERRankBaseMemory.

  2. Add the demo play experiences to the DemoReplay memory during the initialization phase.

rainbow


def __init__(self):
  # (omitted)

  # Add the demo play to demo_memory with the add_memory function
  add_memory(demo_episode_dir, self.demo_memory, self)

  # Set the variables used for demo_ratio annealing
  self.demo_ratio_initial = demo_ratio_initial
  if demo_ratio_final is None:
    self.demo_ratio_final = self.demo_ratio_initial
  else:
    self.demo_ratio_final = demo_ratio_final
  self.demo_ratio_step = (self.demo_ratio_initial - self.demo_ratio_final) / demo_ratio_steps

  # (omitted)

  3. Get batch data from replay_memory and demo_memory, then update the priorities.

rainbow


import random

def forward(self, observation):
  # This part runs at training time
  # (omitted)

  # Calculate the current demo ratio
  ratio_demo = self.demo_ratio_initial - self.local_step * self.demo_ratio_step
  if ratio_demo < self.demo_ratio_final:
    ratio_demo = self.demo_ratio_final

  # Decide how many samples to draw from each memory according to the ratio
  batch_replay = 0
  batch_demo = 0
  for _ in range(self.batch_size):
    r = random.random()
    if r < ratio_demo:
      batch_demo += 1
      continue
    batch_replay += 1

  # Create the batch based on those counts
  indexes = []
  batchs = []
  weights = []
  memory_types = []  # remember which memory each sample came from
  if batch_replay > 0:
    (i, b, w) = self.memory.sample(batch_replay, self.local_step)
    indexes.extend(i)
    batchs.extend(b)
    weights.extend(w)
    # 0 is replay_memory
    memory_types.extend([0 for _ in range(batch_replay)])
  if batch_demo > 0:
    (i, b, w) = self.demo_memory.sample(batch_demo, self.local_step)
    indexes.extend(i)
    batchs.extend(b)
    weights.extend(w)
    # 1 is demo_memory
    memory_types.extend([1 for _ in range(batch_demo)])

  # (omitted)

  for i in range(self.batch_size):
    # (learning)

    # Update the priority
    if memory_types[i] == 0:
      # Update replay_memory
      self.memory.update(indexes[i], batchs[i], priority)
    elif memory_types[i] == 1:
      # Update demo_memory
      self.demo_memory.update(indexes[i], batchs[i], priority)
    else:
      assert False

  # (omitted)
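
The important detail in the code above is that each priority update has to go back to the memory the sample was drawn from. As an alternative sketch of the same routing, a lookup table can replace the if/elif chain. `RecordingMemory` is a hypothetical stand-in that only records the calls, and the priority values are placeholders.

```python
class RecordingMemory:
    """Stand-in for a replay memory that only records priority updates."""

    def __init__(self):
        self.updates = []

    def update(self, index, batch, priority):
        self.updates.append((index, priority))


REPLAY, DEMO = 0, 1  # same meaning as the memory_types values above

replay_memory = RecordingMemory()
demo_memory = RecordingMemory()
memories = {REPLAY: replay_memory, DEMO: demo_memory}

# A mixed batch: which memory each sample came from, and its index inside that memory.
memory_types = [REPLAY, REPLAY, DEMO]
indexes = [10, 11, 3]
batchs = ["exp_a", "exp_b", "exp_c"]
priorities = [0.5, 0.1, 0.9]  # placeholder values; real ones come out of the learning step

for mtype, index, batch, priority in zip(memory_types, indexes, batchs, priorities):
    memories[mtype].update(index, batch, priority)

print(replay_memory.updates)  # [(10, 0.5), (11, 0.1)]
print(demo_memory.updates)    # [(3, 0.9)]
```

An unknown memory type raises a `KeyError` here, which plays the same role as the `assert False` branch.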

Implementation on R2D3

It can be implemented in exactly the same way as on Rainbow; it only needs to be implemented on the Learner side.

EpisodeMemory

This is my own idea. I came up with it after seeing the demo play: if demo play is useful, why not collect such data during learning as well?

Specifically, the episode with the highest total reward is saved in a separate memory and mixed into the batch in the same way as the demo play above.

Intuitively, it is like remembering a play that happened to go well and reviewing it many times.

Below is the implementation.

Definition of EpisodeMemory

Create an EpisodeMemory that includes ReplayMemory. This is a wrapper class that adds experience to ReplayMemory on a per-episode basis.

EpisodeMemory


class EpisodeMemory():
  def __init__(self, memory):
    self.max_reward = None
    self.memory = memory

  def add_episode(self, episode, total_reward):
    # Add the episode to memory only when max_reward is matched or beaten
    if self.max_reward is None:
      self.max_reward = total_reward
    elif self.max_reward <= total_reward:  # a tie is also added to memory
      self.max_reward = total_reward
    else:
      return

    # Actually add the experiences to memory
    for e in episode:
      if len(e) == 5:  # the experience carries a priority
        self.memory.add(e, e[4])
      else:
        self.memory.add(e)
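
To see the selection rule in action, here is a small self-contained sketch. `ListMemory` is a hypothetical minimal memory, and the `__len__` method and the boolean return value are additions made only for this example. The rule itself is the same as in the class above.

```python
class ListMemory:
    """Hypothetical minimal memory: stores (experience, priority) pairs in a list."""

    def __init__(self):
        self.items = []

    def add(self, exp, priority=None):
        self.items.append((exp, priority))

    def __len__(self):
        return len(self.items)


class EpisodeMemory:
    """Keep an episode only if it ties or beats the best total reward so far."""

    def __init__(self, memory):
        self.max_reward = None
        self.memory = memory

    def add_episode(self, episode, total_reward):
        if self.max_reward is not None and total_reward < self.max_reward:
            return False  # worse than the best episode so far: ignore it
        self.max_reward = total_reward
        for e in episode:
            if len(e) == 5:  # (state, action, reward, next_state, priority)
                self.memory.add(e, e[4])
            else:
                self.memory.add(e)
        return True

    def __len__(self):
        # Assumed helper so that len(episode_memory) works.
        return len(self.memory)


def make_episode(length):
    # Dummy 4-element experiences: (state, action, reward, next_state)
    return [(t, 0, -1.0, t + 1) for t in range(length)]


em = EpisodeMemory(ListMemory())
print(em.add_episode(make_episode(200), -200.0))  # True: the first episode is always kept
print(em.add_episode(make_episode(150), -150.0))  # True: new best
print(em.add_episode(make_episode(180), -180.0))  # False: worse than -150
print(len(em))  # 350
```

Note that the earlier, worse episode is not removed when a better one arrives; it stays in the wrapped memory until that memory discards it by its own rules.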

Implementation of EpisodeMemory

Implementation on Rainbow

  1. The additional parameters are as follows. Unlike the demo ratio, episode_ratio is not annealed.

name             Contents
episode_memory   Memory type (same choices as the replay memory)
episode_ratio    Ratio of the batch drawn from EpisodeMemory

  2. Initialize EpisodeMemory.

rainbow


from src.memory.EpisodeMemory import EpisodeMemory

def __init__(self):
  # (omitted)

  # Wrap the given memory in the EpisodeMemory class
  self.episode_memory = EpisodeMemory(episode_memory)
  self.episode_ratio = episode_ratio

  # (omitted)

  3. Save the experiences of each episode.

reset_states


# Called at the beginning of each episode
def reset_states(self):
  # (omitted)

  # For saving the episode's experiences
  self.episode_exp = []
  self.total_reward = 0

  # For checking whether the previous step was terminal
  self.recent_terminal = False

forward


# Called before the action is executed in each step
def forward(self, observation):
  # (omitted)

  # If the episode has ended, add it to episode_memory
  if self.recent_terminal:
    self.episode_memory.add_episode(self.episode_exp, self.total_reward)

  # (omitted)

  exp = (create the experience data)
  self.memory.add(exp)          # add to replay_memory
  self.episode_exp.append(exp)  # keep the experience for episode_memory

  # (omitted)

backward


# Called after the action is executed in each step
def backward(self, reward, terminal):
  # (omitted)

  # Accumulate the total reward
  self.total_reward += reward

  # Save the terminal status
  self.recent_terminal = terminal

  # (omitted)

  4. Mix episode_memory into the batch data creation.

rainbow


import random

def forward(self, observation):
  # This part runs at training time
  # (omitted)

  ratio_demo = (calculation of the demo ratio)

  # Mix episode_memory into the batch only once it holds at least one batch of experiences
  if len(self.episode_memory) < self.batch_size:
    ratio_epi = 0
  else:
    ratio_epi = self.episode_ratio

  # Decide how many samples to draw from each memory according to the ratios
  batch_replay = 0
  batch_demo = 0
  batch_episode = 0
  for _ in range(self.batch_size):
    r = random.random()
    if r < ratio_demo:
      batch_demo += 1
      continue
    r -= ratio_demo
    if r < ratio_epi:
      batch_episode += 1
      continue
    batch_replay += 1

  # Create the batch based on those counts
  indexes = []
  batchs = []
  weights = []
  memory_types = []  # remember which memory each sample came from
  if batch_replay > 0:
    (create the batch from replay_memory)
  if batch_demo > 0:
    (create the batch from demo_memory)
  if batch_episode > 0:
    (i, b, w) = self.episode_memory.sample(batch_episode, self.local_step)
    indexes.extend(i)
    batchs.extend(b)
    weights.extend(w)
    # 2 is episode_memory
    memory_types.extend([2 for _ in range(batch_episode)])

  # (omitted)

  for i in range(self.batch_size):
    # (learning)

    # Update the priority
    if memory_types[i] == 0:
      (update replay_memory)
    elif memory_types[i] == 1:
      (update demo_memory)
    elif memory_types[i] == 2:
      # Update episode_memory
      self.episode_memory.update(indexes[i], batchs[i], priority)
    else:
      assert False

  # (omitted)
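
The three-way split in the code above can be isolated into a small function and tried on its own. This is a sketch under the same assumptions as the loop above; the function name and the dictionary return value are hypothetical.

```python
import random


def split_batch(batch_size, ratio_demo, ratio_epi, rng=random):
    """Assign each batch slot to the demo, episode, or replay memory.

    Each slot goes to demo with probability ratio_demo, to episode with
    probability ratio_epi, and to replay otherwise (this assumes the two
    ratios sum to at most 1).
    """
    counts = {"replay": 0, "demo": 0, "episode": 0}
    for _ in range(batch_size):
        r = rng.random()  # uniform in [0, 1)
        if r < ratio_demo:
            counts["demo"] += 1
        elif r < ratio_demo + ratio_epi:
            counts["episode"] += 1
        else:
            counts["replay"] += 1
    return counts


rng = random.Random(0)  # seeded only to make the example repeatable
counts = split_batch(32, ratio_demo=1.0 / 512.0, ratio_epi=1.0 / 8.0, rng=rng)
print(counts, sum(counts.values()))
```

While ratio_demo is still 1.0, every slot goes to the demo memory and the episode memory is never sampled, which matches the behaviour of the original loop.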

Implementation on R2D3

It is almost the same as the implementation on Rainbow. However, although episode data is generated by each Actor, it is managed on the Learner side to reduce the amount of interprocess communication.

Learner


class Learner():
  def __init__(self):
    # (omitted)

    # When the Learner is initialized, create per-Actor variables for episode management
    self.episode_exp = [[] for _ in range(self.actors_num)]
    self.total_reward = [0 for _ in range(self.actors_num)]

  def train(self):
    # (omitted)

    # Actor -> Learner: receive experiences
    for _ in range(self.exp_q.qsize()):
      exp = self.exp_q.get(timeout=1)

      # Add to replay memory (exp[0] is the experience, exp[0][4] its priority)
      self.memory.add(exp[0], exp[0][4])

      # Add to the episode buffer of the sending Actor (exp[1] is the actor index)
      self.total_reward[exp[1]] += exp[0][2]
      self.episode_exp[exp[1]].append(exp[0])

      if exp[2]:  # terminal
        self.episode_memory.add_episode(
          self.episode_exp[exp[1]],
          self.total_reward[exp[1]]
        )
        self.episode_exp[exp[1]] = []
        self.total_reward[exp[1]] = 0

    # (omitted)

Actor


class Actor():
  def forward(self, observation):
    # (omitted)

    # Send the experience to the Learner,
    # together with actor_index and the terminal flag
    self.exp_q.put((exp, self.actor_index, self.recent_terminal))

    # (omitted)
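
The Actor-to-Learner bookkeeping above can be tried in a single process. The sketch below uses the standard library's queue.Queue as a stand-in for the interprocess queue; the function name, the `finished` list, and the dummy experiences are hypothetical. Messages have the same `(exp, actor_index, terminal)` shape as above, with the reward at index 2 of the experience.

```python
import queue


def drain_experiences(exp_q, episode_exp, total_reward, finished):
    """Move everything currently in exp_q into per-actor episode buffers."""
    while True:
        try:
            exp, actor_index, terminal = exp_q.get_nowait()
        except queue.Empty:
            break
        total_reward[actor_index] += exp[2]  # index 2 holds the reward
        episode_exp[actor_index].append(exp)
        if terminal:
            # Hand over the finished episode and reset that actor's buffers.
            finished.append((actor_index, episode_exp[actor_index], total_reward[actor_index]))
            episode_exp[actor_index] = []
            total_reward[actor_index] = 0


ACTORS_NUM = 2
exp_q = queue.Queue()  # single-process stand-in for the Actor -> Learner queue
episode_exp = [[] for _ in range(ACTORS_NUM)]
total_reward = [0 for _ in range(ACTORS_NUM)]
finished = []

# Interleaved messages from two actors: (experience, actor_index, terminal)
exp_q.put((("s0", 0, -1.0, "s1", 1.0), 0, False))
exp_q.put((("t0", 1, -1.0, "t1", 1.0), 1, False))
exp_q.put((("s1", 0, -1.0, "s2", 1.0), 0, True))

drain_experiences(exp_q, episode_exp, total_reward, finished)
print(finished[0][0], len(finished[0][1]), finished[0][2])  # 0 2 -2.0
print(len(episode_exp[1]), total_reward[1])                 # 1 -1.0
```

Draining with `get_nowait()` until the queue is empty avoids relying on `qsize()`, which is only approximate for multiprocessing queues and is not implemented on some platforms.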

Comparison result with the conventional method

MountainCar

This time I will try it with MountainCar.

mountaincar.gif

MountainCar is a game where you drive the car left and right to reach the flag at the upper right.

The reward is -1 at every step. In short, the sooner you reach the flag, the higher your score.

From a Q-learning point of view, no distinguishing reward is received until the goal is reached (the agent cannot tell whether an action was good or bad), which makes it a rather difficult task.
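
As a small worked example of this reward structure: the total reward of an episode is simply minus the number of steps taken, and gym's MountainCar-v0 cuts an episode off after 200 steps. The 120-step episode below is a made-up number for illustration.

```python
MAX_EPISODE_STEPS = 200  # step limit of gym's MountainCar-v0


def episode_return(steps_taken):
    """Total reward of an episode in which every step gives -1."""
    return -1.0 * steps_taken


print(episode_return(120))                # -120.0: reached the flag in 120 steps
print(episode_return(MAX_EPISODE_STEPS))  # -200.0: never reached the flag
```

Every episode that fails to reach the flag ends with the same total of -200, so failed episodes give the agent no hint about which of them came closer to the goal.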

Traditional results

This is plain training on the MountainCar environment provided by gym, with no Processor defined. The log is recorded every 2,000 steps. (These runs use Rainbow.)

The first 50,000 steps are warm-up, followed by 100,000 steps of learning.

·result

Figure_1_replay2.png

Learning is difficult until plays that reach the goal have accumulated in memory to some extent. Results start to appear at around 130,000 steps.

Results when DemoReplay memory is enabled

The parameters other than the DemoReplay memory are the same as in the previous result. Only the single demo episode below was prepared.

MountainCar-v0.gif

The parameters of DemoReplay memory are as follows.

demo_memory = PERProportionalMemory(100_000, alpha=0.8)
demo_episode_dir = episode_save_dir
demo_ratio_initial = 1.0
demo_ratio_final = 1.0/512.0
demo_ratio_steps = warmup + 50_000

·result

Figure_2_demo2.png

Results already start to appear at around 70,000 steps.

Results when Episode Memory is enabled

The parameters other than EpisodeMemory are the same as in the previous result. The parameters of EpisodeMemory are as follows.

episode_memory = PERProportionalMemory(2_000, alpha=0.8)
episode_ratio = 1.0/8.0

·result

Figure_3_episode3.png

Results start to appear at around 80,000 steps.

Bonus (both DemoReplay memory and EpisodeMemory enabled)

Figure_4_mix2.png

Bonus 2 (R2D3)

Both DemoReplay memory and EpisodeMemory are enabled. (The x-axis does not include the warm-up steps.)

Figure_5_2.png

Afterword

It was easier to implement than I expected. There are not many tasks for which demo play cannot be prepared, so I think it is a very effective method. The evolution of reinforcement learning has not stopped yet, and I am looking forward to seeing what comes next.
