It seems that R2D3, the successor to R2D2, has been announced. I was curious, so I implemented it.
The code created in this article is below.
R2D3 belongs to the DQN family of reinforcement learning methods. The techniques it builds on are explained in the series below, so please refer to it if you are interested.
R2D3 is a reinforcement learning method announced by Google DeepMind in September 2019. Roughly speaking, it is a combination of R2D2 and DQfD.
DQfD, roughly speaking, improves learning (on a DQN base) by referring to the play (demonstration) of a skilled player, with the ultimate goal of learning a policy that performs better than the demonstration itself.
Incidentally, the unabbreviated name appears to be Recurrent Replay Distributed DQN from Demonstrations (R2D3).
・Reference
The figure above shows the overall architecture of R2D3 (quoted from the paper). The right side of the figure (the purple and blue parts) is the same as R2D2; the difference is the red part on the left.
As a premise, the demo replay buffer is filled with reference play (demonstration) data in advance.
Up to R2D2, the training batch was filled entirely from the agent replay buffer. R2D3 instead builds the batch from both demo replay and agent replay according to the demo ratio.
The paper compares fixed values of 1/16, 1/32, 1/64, 1/128, and 1/256, and reports that 1/256 gave the best results on most tasks.
From here on this is my own opinion, but thinking of it as a human learner, the demo play is helpful at first, and once you get used to the task you stop watching it. So in my implementation the demo ratio can be annealed. (With initial and final values set equal, annealing behaves the same as a fixed ratio.)
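As an illustration of the annealing idea, here is a minimal sketch (my own, not code from the repository; the actual implementation appears in the Rainbow `forward()` excerpt later) in which the ratio decays linearly from an initial value to a final value:

```python
# Minimal sketch of a linearly annealed demo ratio.
# initial / final / anneal_steps are illustrative values, not the repository's defaults.
def demo_ratio(step, initial=1.0, final=1.0 / 256.0, anneal_steps=50_000):
    ratio = initial - step * (initial - final) / anneal_steps
    return max(ratio, final)  # never drop below the final ratio

print(demo_ratio(0))       # 1.0
print(demo_ratio(50_000))  # 0.00390625 (= 1/256)
print(demo_ratio(80_000))  # still 1/256; setting initial == final gives a fixed ratio
```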
First, we have to prepare the demonstration data. I created it by referring to OpenAI's code for manual play, linked below.
The saved data is split into two parts, one for learning and one for playback, with the following structure (a rough sketch of the layout follows after the tables).
・For learning (saved per frame)
name | Contents |
---|---|
action | The action taken |
observation | The observation (state) |
reward | The reward |
done | Whether the episode has ended |
・For playback (overall information)
name | Contents |
---|---|
episode | Episode number |
rgb_size | Image size |
states | Array of per-frame information (contains the fields below) |
・For playback (saved per frame)
name | Contents |
---|---|
step | Frame number |
reward_total | Total reward so far |
info | The info returned by gym for the frame |
rgb | Rendered image |
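To make the tables above concrete, here is a rough sketch of the two structures with dummy values (the field names follow the tables; the exact container layout inside env_play.py may differ):

```python
# For learning: one dict per frame (dummy values).
learning_frames = [
    {"action": 2, "observation": [-0.50, 0.00], "reward": -1.0, "done": False},
    {"action": 2, "observation": [-0.49, 0.01], "reward": -1.0, "done": False},
]

# For playback: one dict per episode, with the per-frame entries under "states".
playback_episode = {
    "episode": 0,
    "rgb_size": (400, 600),
    "states": [
        {"step": 0, "reward_total": -1.0, "info": {}, "rgb": None},  # rgb: rendered image array
        {"step": 1, "reward_total": -2.0, "info": {}, "rgb": None},
    ],
}
```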
(The code corresponds to the add_memory function in env_play.py.)
When adding demo play data to memory, it has to be stored through the same steps as for the actual agent. (Reference: [DQN (Rainbow) implementation explanation](https://qiita.com/pocokhc/items/408f0f818140924ad4c4#dqnrainbow%E3%81%AE%E5%AE%9F%E8%A3%85%E8%A7%A3%E8%AA%AC))
It is a bit redundant, but we recreate the same mechanism separately and add the data to memory. Below is the flow as pseudocode. (To keep it simple, the stateful LSTM case is not described.)
add_memory
def add_memory(episode_file, memory, agent):
    # Load the demo play information from episode_file
    episode = (load the demo play information from episode_file)

    # Variables for building the experience data
    recent_actions = (array holding the last N actions)
    recent_rewards = (array holding the last N rewards)
    recent_rewards_multistep = (value for the multi-step reward calculation)
    recent_observations = (array holding the last N observations)

    for step in episode:
        observation = (frame information)[step]["observation"]
        action = (frame information)[step]["action"]
        reward = (frame information)[step]["reward"]

        # Update the observation history
        recent_observations.pop(0)
        recent_observations.append(observation)

        # Create an experience
        exp = (
            recent_observations[:agent.input_sequence],   # previous state
            recent_actions[0],                            # action taken in the previous state
            recent_rewards_multistep,                     # (multi-step) reward
            recent_observations[-agent.input_sequence:],  # next state
        )

        # Add the experience to memory
        memory.add(exp)

        # Update the action and reward histories
        recent_actions.pop(0)
        recent_actions.append(action)
        recent_rewards.pop(0)
        recent_rewards.append(reward)
        recent_rewards_multistep = (multi-step learning calculation)
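The "multi-step learning calculation" at the end of the loop is just the discounted sum of the next few rewards. A minimal sketch (gamma and n_steps here are assumed hyperparameters, not names taken from the repository):

```python
# Discounted sum of the next n_steps rewards (the multi-step return without the bootstrap term).
def multistep_reward(recent_rewards, gamma=0.99, n_steps=3):
    return sum(r * (gamma ** i) for i, r in enumerate(recent_rewards[:n_steps]))

print(multistep_reward([-1.0, -1.0, -1.0]))  # -2.9701
```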
(The code corresponds to the EpisodeSave class in env_play.py.)
This is the class you actually play with; it has the following features.
The screen looks like this.
The code to run it looks like the following.
import gym
from src.env_play import EpisodeSave

def run_play():
    env = gym.make("MountainCar-v0")
    processor = None  # specify a Processor here if you have one

    es = EpisodeSave(
        env,
        episode_save_dir="tmp",
        processor=processor
    )
    es.play()
    env.close()

run_play()
The key bindings of the game can be specified via the Processor: if the Processor has a get_keys_to_action method, it will be used.
get_keys_to_action
import rl

class MyProcessor(rl.core.Processor):
    def get_keys_to_action(self):
        return {
            (): 0,           # 0 when nothing is pressed
            (ord('d'),): 1,  # the d key maps to action 1
            (ord('a'),): 2,  # the a key maps to action 2
        }
(The code corresponds to the EpisodeReplay class in env_play.py.)
I also created a mechanism to replay the episodes saved by EpisodeSave, mainly for checking them.
from src.env_play import EpisodeReplay

def replay():
    r = EpisodeReplay(episode_save_dir="tmp")
    r.play()

replay()
As usual, we implement it first in the Rainbow version, which is easier to follow because it has no parallel processing.
name | Contents |
---|---|
demo_memory | Memory type (same choices as the replay memory) |
demo_episode_dir | Directory path saved by EpisodeSave above |
demo_ratio_initial | Initial demo ratio |
demo_ratio_final | Final demo ratio |
demo_ratio_steps | Number of steps until the final ratio is reached |
Like the replay memory, demo_memory can be chosen from ReplayMemory, PERGreedyMemory, PERProportionalMemory, and PERRankBaseMemory.
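For example, the demo memory could be constructed like this (the import path is my assumption, based on the src.memory.EpisodeMemory import that appears later; adjust it to the actual repository layout):

```python
# Assumed import path; the constructor call matches the MountainCar settings shown later.
from src.memory.PERProportionalMemory import PERProportionalMemory

demo_memory = PERProportionalMemory(100_000, alpha=0.8)
```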
rainbow
def __init__(self):
    (abridgement)

    # Add the demo play to demo_memory with the add_memory function
    add_memory(demo_episode_dir, self.demo_memory, self)

    # Set the variables for demo_ratio annealing
    self.demo_ratio_initial = demo_ratio_initial
    if demo_ratio_final is None:
        self.demo_ratio_final = self.demo_ratio_initial
    else:
        self.demo_ratio_final = demo_ratio_final
    self.demo_ratio_step = (self.demo_ratio_initial - self.demo_ratio_final) / demo_ratio_steps

    (abridgement)
rainbow
import random

def forward(self, observation):
    # at the timing where learning happens
    (abridgement)

    # Compute the current demo ratio
    ratio_demo = self.demo_ratio_initial - self.local_step * self.demo_ratio_step
    if ratio_demo < self.demo_ratio_final:
        ratio_demo = self.demo_ratio_final

    # Decide how many batch entries come from each memory, according to the ratio
    batch_replay = 0
    batch_demo = 0
    for _ in range(self.batch_size):
        r = random.random()
        if r < ratio_demo:
            batch_demo += 1
            continue
        batch_replay += 1

    # Build the batch according to the counts
    indexes = []
    batchs = []
    weights = []
    memory_types = []  # remember which memory each entry came from
    if batch_replay > 0:
        (i, b, w) = self.memory.sample(batch_replay, self.local_step)
        indexes.extend(i)
        batchs.extend(b)
        weights.extend(w)
        # 0 means replay_memory
        memory_types.extend([0 for _ in range(batch_replay)])
    if batch_demo > 0:
        (i, b, w) = self.demo_memory.sample(batch_demo, self.local_step)
        indexes.extend(i)
        batchs.extend(b)
        weights.extend(w)
        # 1 means demo_memory
        memory_types.extend([1 for _ in range(batch_demo)])

    (abridgement)

    for i in range(self.batch_size):
        (learning)

        # Update the priority
        if memory_types[i] == 0:
            # update replay_memory
            self.memory.update(indexes[i], batchs[i], priority)
        elif memory_types[i] == 1:
            # update demo_memory
            self.demo_memory.update(indexes[i], batchs[i], priority)
        else:
            assert False

    (abridgement)
The distributed version can be implemented in exactly the same way as the Rainbow version; you just implement it on the Learner side.
EpisodeMemory
This is my own idea. It came to me after watching the demo play: if demo play is that useful, why not collect something similar while training?
Specifically, the episode with the highest total reward so far is stored in a separate memory, and the idea is to mix it into the batch in the same way as the demo memory.
Intuitively, it is like remembering a play that happened to go well and reviewing it over and over.
Below is the implementation.
First, create an EpisodeMemory that wraps a ReplayMemory. It is a wrapper class that adds experiences to the underlying memory on a per-episode basis.
EpisodeMemory
class EpisodeMemory():
    def __init__(self, memory):
        self.max_reward = None
        self.memory = memory

    def add_episode(self, episode, total_reward):
        # Add the episode to memory only when max_reward is updated
        if self.max_reward is None:
            self.max_reward = total_reward
        elif self.max_reward <= total_reward:  # also add when the reward ties the maximum
            self.max_reward = total_reward
        else:
            return

        # The actual additions to memory
        for e in episode:
            if len(e) == 5:  # when a priority is included
                self.memory.add(e, e[4])
            else:
                self.memory.add(e)
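A hypothetical usage sketch of the class above (the backing memory here is a trivial stand-in just to show the behavior; in practice one of the memory classes mentioned earlier would be used):

```python
# Trivial stand-in for a replay memory, only for demonstrating add_episode's behavior.
class DummyMemory:
    def __init__(self):
        self.buffer = []

    def add(self, exp, priority=None):
        self.buffer.append(exp)

episode_memory = EpisodeMemory(DummyMemory())
episode_memory.add_episode([("exp1",), ("exp2",)], total_reward=-150)  # first episode: stored
episode_memory.add_episode([("exp3",)], total_reward=-200)             # worse than the best: ignored
episode_memory.add_episode([("exp4",)], total_reward=-120)             # new best: stored
print(len(episode_memory.memory.buffer))  # 3 (exp1, exp2 and exp4)
```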
name | Contents |
---|---|
episode_memory | Memory type (same choices as the replay memory) |
episode_ratio | Rate at which EpisodeMemory is mixed into the batch |
rainbow
from src.memory.EpisodeMemory import EpisodeMemory

def __init__(self):
    (abridgement)

    # Wrap the memory in the EpisodeMemory class
    self.episode_memory = EpisodeMemory(episode_memory)
    self.episode_ratio = episode_ratio

    (abridgement)
reset_states
# Called at the beginning of an episode
def reset_states(self):
    (abridgement)

    # For storing the experiences of the episode
    self.episode_exp = []
    self.total_reward = 0

    # For checking the terminal state
    self.recent_terminal = False
forward
# Called before executing the action in each step
def forward(self, observation):
    (abridgement)

    # If the episode has ended, add it to episode_memory
    if self.recent_terminal:
        self.episode_memory.add_episode(self.episode_exp, self.total_reward)

    (abridgement)

    exp = (create the experience data)
    self.memory.add(exp)          # add to the replay memory
    self.episode_exp.append(exp)  # keep the experience for episode_memory

    (abridgement)
backward
# Called after the action is executed in each step
def backward(self, reward, terminal):
    (abridgement)

    # Accumulate the total reward
    self.total_reward += reward

    # Save the terminal state
    self.recent_terminal = terminal

    (abridgement)
rainbow
import random

def forward(self, observation):
    # at the timing where learning happens
    (abridgement)

    ratio_demo = (calculation of the demo ratio)

    # If episode_memory has enough entries, mix it into the batch
    if len(self.episode_memory) < self.batch_size:
        ratio_epi = 0
    else:
        ratio_epi = self.episode_ratio

    # Decide how many batch entries come from each memory, according to the ratios
    batch_replay = 0
    batch_demo = 0
    batch_episode = 0
    for _ in range(self.batch_size):
        r = random.random()
        if r < ratio_demo:
            batch_demo += 1
            continue
        r -= ratio_demo
        if r < ratio_epi:
            batch_episode += 1
            continue
        batch_replay += 1

    # Build the batch according to the counts
    indexes = []
    batchs = []
    weights = []
    memory_types = []  # remember which memory each entry came from
    if batch_replay > 0:
        (batch creation from replay_memory)
    if batch_demo > 0:
        (batch creation from demo_memory)
    if batch_episode > 0:
        (i, b, w) = self.episode_memory.sample(batch_episode, self.local_step)
        indexes.extend(i)
        batchs.extend(b)
        weights.extend(w)
        # 2 means episode_memory
        memory_types.extend([2 for _ in range(batch_episode)])

    (abridgement)

    for i in range(self.batch_size):
        (learning)

        # Update the priority
        if memory_types[i] == 0:
            (update replay_memory)
        elif memory_types[i] == 1:
            (update demo_memory)
        elif memory_types[i] == 2:
            # update episode_memory
            self.episode_memory.update(indexes[i], batchs[i], priority)
        else:
            assert False

    (abridgement)
It is almost the same as the Rainbow implementation. However, although episode data is produced by each Actor, it is managed on the Learner side in order to reduce the amount of interprocess communication.
Learner
class Learner():
    def __init__(self):
        (abridgement)

        # When the Learner is initialized, create per-Actor variables for episode management
        self.episode_exp = [[] for _ in range(self.actors_num)]
        self.total_reward = [0 for _ in range(self.actors_num)]

    def train(self):
        (abridgement)

        # Add the experiences sent from the Actors to the Learner
        for _ in range(self.exp_q.qsize()):
            exp = self.exp_q.get(timeout=1)

            # add to memory
            self.memory.add(exp[0], exp[0][4])

            # add to episode_exp
            self.total_reward[exp[1]] += exp[0][2]
            self.episode_exp[exp[1]].append(exp[0])

            if exp[2]:  # terminal
                self.episode_memory.add_episode(
                    self.episode_exp[exp[1]],
                    self.total_reward[exp[1]]
                )
                self.episode_exp[exp[1]] = []
                self.total_reward[exp[1]] = 0

        (abridgement)
Actor
class Actor():
    def forward(self, observation):
        (abridgement)

        # Send to the Learner
        # (also pass the actor_index and the terminal flag)
        self.exp_q.put((exp, self.actor_index, self.recent_terminal))

        (abridgement)
MountainCar
This time I will try it with MountainCar.
MountainCar is a game where you move the car left and right to reach the flag at the top right.
The reward is always -1, so in short, the sooner you reach the flag, the higher your score.
Viewed as Q-learning, it is a fairly difficult task: no reward signal distinguishes good actions from bad ones until you actually reach the goal.
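As a quick sanity check of this reward structure, here is a minimal sketch that runs random actions, using the classic gym API (the 4-tuple step() return of the gym versions from around the time of this article):

```python
import gym

env = gym.make("MountainCar-v0")
obs = env.reset()
done = False
total_reward = 0
while not done:
    # The reward is -1 on every step regardless of the action taken.
    obs, reward, done, info = env.step(env.action_space.sample())
    total_reward += reward
print(total_reward)  # -200 if the 200-step time limit runs out before reaching the flag
env.close()
```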
No Processor is defined; we train on the plain MountainCar environment provided by gym, and logs are recorded every 2,000 steps. (This run uses the Rainbow implementation.)
The first 50,000 steps are a warm-up, followed by 100,000 training steps.
・Result
Learning is slow until plays that actually reach the goal have accumulated in memory to some extent; results start to appear around 130,000 steps.
The parameters other than the demo replay memory are the same as in the run above. Only the single demo episode shown below was prepared.
The parameters of the demo replay memory are as follows.
demo_memory = PERProportionalMemory(100_000, alpha=0.8)
demo_episode_dir = episode_save_dir
demo_ratio_initial = 1.0
demo_ratio_final = 1.0/512.0
demo_ratio_steps = warmup + 50_000
・Result
Results already start to appear around 70,000 steps.
The parameters other than EpisodeMemory are the same as in the run above. The parameters of EpisodeMemory are as follows.
episode_memory = PERProportionalMemory(2_000, alpha=0.8)
episode_ratio = 1.0/8.0
・Result
Results start to appear around 80,000 steps.
Both the demo replay memory and EpisodeMemory are effective. (The x-axis does not include the warm-up steps.)
It was easier to implement than I expected. Since there are few tasks for which no demo play can be prepared, I think this is a very broadly applicable and effective method. The evolution of reinforcement learning has not stopped yet; I am looking forward to whatever comes next.