It seems that R2D3, the successor to R2D2, has been announced. I was curious, so I implemented it.
The code created in this article is below.
R2D3 belongs to the DQN family of reinforcement learning methods. The techniques it builds on are explained in the series below, so please refer to it if you are interested.
R2D3 is a reinforcement learning method announced by Google DeepMind in September 2019. Roughly speaking, it is a combination of R2D2 and DQfD.
DQfD, roughly speaking, improves learning (on a DQN base) by referring to the play (demonstration) of a skilled player, with the ultimate goal of learning a policy that performs better than the demonstration itself.
Incidentally, the unabbreviated name appears to be Recurrent Replay Distributed DQN from Demonstrations (R2D3).
・Reference
The figure above shows the overall architecture of R2D3 (quoted from the paper). The right side of the figure (the purple and blue parts) is the same as R2D2; the difference is the red part on the left.
As a premise, the demo replay buffer is filled with reference play (demonstration) data in advance.
Up to R2D2, the training batch was filled entirely from the agent replay buffer. R2D3 instead builds the batch from both demo replay and agent replay according to the demo ratio.
The paper compares fixed values of 1/16, 1/32, 1/64, 1/128, and 1/256, and reports that 1/256 gave the best results on most tasks.
From here on this is my own opinion, but thinking of it as a human learner, the demo play is helpful at first, and once you get used to the task you stop watching it. So in my implementation the demo ratio can be annealed. (With initial and final values set equal, annealing behaves the same as a fixed ratio.)
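As an illustration of the annealing idea, here is a minimal sketch (my own, not code from the repository; the actual implementation appears in the Rainbow `forward()` excerpt later) in which the ratio decays linearly from an initial value to a final value:

```python
# Minimal sketch of a linearly annealed demo ratio.
# initial / final / anneal_steps are illustrative values, not the repository's defaults.
def demo_ratio(step, initial=1.0, final=1.0 / 256.0, anneal_steps=50_000):
    ratio = initial - step * (initial - final) / anneal_steps
    return max(ratio, final)  # never drop below the final ratio

print(demo_ratio(0))       # 1.0
print(demo_ratio(50_000))  # 0.00390625 (= 1/256)
print(demo_ratio(80_000))  # still 1/256; setting initial == final gives a fixed ratio
```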
First, we have to prepare the demonstration data. I created it by referring to OpenAI's code for manual play, linked below.
The saved data is split into two parts, one for learning and one for playback, with the following structure (a rough sketch of the layout follows after the tables).
・For learning (saved per frame)
name | Contents |
---|---|
action | The action taken |
observation | The observation (state) |
reward | The reward |
done | Whether the episode has ended |
・For playback (overall information)
name | Contents |
---|---|
episode | Episode number |
rgb_size | Image size |
states | Array of per-frame information (contains the fields below) |
・For playback (saved per frame)
name | Contents |
---|---|
step | Frame number |
reward_total | Total reward so far |
info | The info returned by gym for the frame |
rgb | Rendered image |
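To make the tables above concrete, here is a rough sketch of the two structures with dummy values (the field names follow the tables; the exact container layout inside env_play.py may differ):

```python
# For learning: one dict per frame (dummy values).
learning_frames = [
    {"action": 2, "observation": [-0.50, 0.00], "reward": -1.0, "done": False},
    {"action": 2, "observation": [-0.49, 0.01], "reward": -1.0, "done": False},
]

# For playback: one dict per episode, with the per-frame entries under "states".
playback_episode = {
    "episode": 0,
    "rgb_size": (400, 600),
    "states": [
        {"step": 0, "reward_total": -1.0, "info": {}, "rgb": None},  # rgb: rendered image array
        {"step": 1, "reward_total": -2.0, "info": {}, "rgb": None},
    ],
}
```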
(The code corresponds to the add_memory function in env_play.py.)
When adding demo play data to memory, it has to be stored through the same steps as for the actual agent. (Reference: [DQN (Rainbow) implementation explanation](https://qiita.com/pocokhc/items/408f0f818140924ad4c4#dqnrainbow%E3%81%AE%E5%AE%9F%E8%A3%85%E8%A7%A3%E8%AA%AC))
It is a bit redundant, but we recreate the same mechanism separately and add the data to memory. Below is the flow as pseudocode. (To keep it simple, the stateful LSTM case is not described.)
add_memory
def add_memory(episode_file, memory, agent):
    # Load the demo play information from episode_file
    episode = (load the demo play information from episode_file)

    # Variables for building the experience data
    recent_actions = (array holding the last N actions)
    recent_rewards = (array holding the last N rewards)
    recent_rewards_multistep = (value for the multi-step reward calculation)
    recent_observations = (array holding the last N observations)

    for step in episode:
        observation = (frame information)[step]["observation"]
        action = (frame information)[step]["action"]
        reward = (frame information)[step]["reward"]

        # Update the observation history
        recent_observations.pop(0)
        recent_observations.append(observation)

        # Create an experience
        exp = (
            recent_observations[:agent.input_sequence],   # previous state
            recent_actions[0],                            # action taken in the previous state
            recent_rewards_multistep,                     # (multi-step) reward
            recent_observations[-agent.input_sequence:],  # next state
        )

        # Add the experience to memory
        memory.add(exp)

        # Update the action and reward histories
        recent_actions.pop(0)
        recent_actions.append(action)
        recent_rewards.pop(0)
        recent_rewards.append(reward)
        recent_rewards_multistep = (multi-step learning calculation)
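The "multi-step learning calculation" at the end of the loop is just the discounted sum of the next few rewards. A minimal sketch (gamma and n_steps here are assumed hyperparameters, not names taken from the repository):

```python
# Discounted sum of the next n_steps rewards (the multi-step return without the bootstrap term).
def multistep_reward(recent_rewards, gamma=0.99, n_steps=3):
    return sum(r * (gamma ** i) for i, r in enumerate(recent_rewards[:n_steps]))

print(multistep_reward([-1.0, -1.0, -1.0]))  # -2.9701
```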
(The code corresponds to the EpisodeSave class in env_play.py.)
This is the class you actually play with; it has the following features.
The screen looks like this.
The code to run it looks like the following.
import gym
from src.env_play import EpisodeSave

def run_play():
    env = gym.make("MountainCar-v0")
    processor = None  # specify a Processor here if you have one

    es = EpisodeSave(
        env,
        episode_save_dir="tmp",
        processor=processor
    )
    es.play()
    env.close()

run_play()
The key bindings of the game can be specified via the Processor: if the Processor has a get_keys_to_action method, it will be used.
get_keys_to_action
import rl

class MyProcessor(rl.core.Processor):
    def get_keys_to_action(self):
        return {
            (): 0,           # 0 when nothing is pressed
            (ord('d'),): 1,  # the d key maps to action 1
            (ord('a'),): 2,  # the a key maps to action 2
        }
(The code corresponds to the EpisodeReplay class in env_play.py.)
I also created a mechanism to replay the episodes saved by EpisodeSave, mainly for checking them.
from src.env_play import EpisodeReplay

def replay():
    r = EpisodeReplay(episode_save_dir="tmp")
    r.play()

replay()
As usual, we implement it first in the Rainbow version, which is easier to follow because it has no parallel processing.
name | Contents |
---|---|
demo_memory | Memory type (same choices as the replay memory) |
demo_episode_dir | Directory path saved by EpisodeSave above |
demo_ratio_initial | Initial demo ratio |
demo_ratio_final | Final demo ratio |
demo_ratio_steps | Number of steps until the final ratio is reached |
Like the replay memory, demo_memory can be chosen from ReplayMemory, PERGreedyMemory, PERProportionalMemory, and PERRankBaseMemory.
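For example, the demo memory could be constructed like this (the import path is my assumption, based on the src.memory.EpisodeMemory import that appears later; adjust it to the actual repository layout):

```python
# Assumed import path; the constructor call matches the MountainCar settings shown later.
from src.memory.PERProportionalMemory import PERProportionalMemory

demo_memory = PERProportionalMemory(100_000, alpha=0.8)
```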
rainbow
def __init__(self):
    (abridgement)

    # Add the demo play to demo_memory with the add_memory function
    add_memory(demo_episode_dir, self.demo_memory, self)

    # Set the variables for demo_ratio annealing
    self.demo_ratio_initial = demo_ratio_initial
    if demo_ratio_final is None:
        self.demo_ratio_final = self.demo_ratio_initial
    else:
        self.demo_ratio_final = demo_ratio_final
    self.demo_ratio_step = (self.demo_ratio_initial - self.demo_ratio_final) / demo_ratio_steps

    (abridgement)
rainbow
import random

def forward(self, observation):
    # at the timing where learning happens
    (abridgement)

    # Compute the current demo ratio
    ratio_demo = self.demo_ratio_initial - self.local_step * self.demo_ratio_step
    if ratio_demo < self.demo_ratio_final:
        ratio_demo = self.demo_ratio_final

    # Decide how many batch entries come from each memory, according to the ratio
    batch_replay = 0
    batch_demo = 0
    for _ in range(self.batch_size):
        r = random.random()
        if r < ratio_demo:
            batch_demo += 1
            continue
        batch_replay += 1

    # Build the batch according to the counts
    indexes = []
    batchs = []
    weights = []
    memory_types = []  # remember which memory each entry came from
    if batch_replay > 0:
        (i, b, w) = self.memory.sample(batch_replay, self.local_step)
        indexes.extend(i)
        batchs.extend(b)
        weights.extend(w)
        # 0 means replay_memory
        memory_types.extend([0 for _ in range(batch_replay)])
    if batch_demo > 0:
        (i, b, w) = self.demo_memory.sample(batch_demo, self.local_step)
        indexes.extend(i)
        batchs.extend(b)
        weights.extend(w)
        # 1 means demo_memory
        memory_types.extend([1 for _ in range(batch_demo)])

    (abridgement)

    for i in range(self.batch_size):
        (learning)

        # Update the priority
        if memory_types[i] == 0:
            # update replay_memory
            self.memory.update(indexes[i], batchs[i], priority)
        elif memory_types[i] == 1:
            # update demo_memory
            self.demo_memory.update(indexes[i], batchs[i], priority)
        else:
            assert False

    (abridgement)
The distributed version can be implemented in exactly the same way as the Rainbow version; you just implement it on the Learner side.
EpisodeMemory
This is my own idea. It came to me after watching the demo play: if demo play is that useful, why not collect something similar while training?
Specifically, the episode with the highest total reward so far is stored in a separate memory, and the idea is to mix it into the batch in the same way as the demo memory.
Intuitively, it is like remembering a play that happened to go well and reviewing it over and over.
Below is the implementation.
First, create an EpisodeMemory that wraps a ReplayMemory. It is a wrapper class that adds experiences to the underlying memory on a per-episode basis.
EpisodeMemory
class EpisodeMemory():
    def __init__(self, memory):
        self.max_reward = None
        self.memory = memory

    def add_episode(self, episode, total_reward):
        # Add the episode to memory only when max_reward is updated
        if self.max_reward is None:
            self.max_reward = total_reward
        elif self.max_reward <= total_reward:  # also add when the reward ties the maximum
            self.max_reward = total_reward
        else:
            return

        # The actual additions to memory
        for e in episode:
            if len(e) == 5:  # when a priority is included
                self.memory.add(e, e[4])
            else:
                self.memory.add(e)
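A hypothetical usage sketch of the class above (the backing memory here is a trivial stand-in just to show the behavior; in practice one of the memory classes mentioned earlier would be used):

```python
# Trivial stand-in for a replay memory, only for demonstrating add_episode's behavior.
class DummyMemory:
    def __init__(self):
        self.buffer = []

    def add(self, exp, priority=None):
        self.buffer.append(exp)

episode_memory = EpisodeMemory(DummyMemory())
episode_memory.add_episode([("exp1",), ("exp2",)], total_reward=-150)  # first episode: stored
episode_memory.add_episode([("exp3",)], total_reward=-200)             # worse than the best: ignored
episode_memory.add_episode([("exp4",)], total_reward=-120)             # new best: stored
print(len(episode_memory.memory.buffer))  # 3 (exp1, exp2 and exp4)
```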
name | Contents |
---|---|
episode_memory | Memory type (same choices as the replay memory) |
episode_ratio | Rate at which EpisodeMemory is mixed into the batch |
rainbow
from src.memory.EpisodeMemory import EpisodeMemory

def __init__(self):
    (abridgement)

    # Wrap the memory in the EpisodeMemory class
    self.episode_memory = EpisodeMemory(episode_memory)
    self.episode_ratio = episode_ratio

    (abridgement)
reset_states
# Called at the beginning of an episode
def reset_states(self):
    (abridgement)

    # For storing the experiences of the episode
    self.episode_exp = []
    self.total_reward = 0

    # For checking the terminal state
    self.recent_terminal = False
forward
# Called before executing the action in each step
def forward(self, observation):
    (abridgement)

    # If the episode has ended, add it to episode_memory
    if self.recent_terminal:
        self.episode_memory.add_episode(self.episode_exp, self.total_reward)

    (abridgement)

    exp = (create the experience data)
    self.memory.add(exp)          # add to the replay memory
    self.episode_exp.append(exp)  # keep the experience for episode_memory

    (abridgement)
backward
# Called after the action is executed in each step
def backward(self, reward, terminal):
    (abridgement)

    # Accumulate the total reward
    self.total_reward += reward

    # Save the terminal state
    self.recent_terminal = terminal

    (abridgement)
rainbow
import random

def forward(self, observation):
    # at the timing where learning happens
    (abridgement)

    ratio_demo = (calculation of the demo ratio)

    # If episode_memory has enough entries, mix it into the batch
    if len(self.episode_memory) < self.batch_size:
        ratio_epi = 0
    else:
        ratio_epi = self.episode_ratio

    # Decide how many batch entries come from each memory, according to the ratios
    batch_replay = 0
    batch_demo = 0
    batch_episode = 0
    for _ in range(self.batch_size):
        r = random.random()
        if r < ratio_demo:
            batch_demo += 1
            continue
        r -= ratio_demo
        if r < ratio_epi:
            batch_episode += 1
            continue
        batch_replay += 1

    # Build the batch according to the counts
    indexes = []
    batchs = []
    weights = []
    memory_types = []  # remember which memory each entry came from
    if batch_replay > 0:
        (batch creation from replay_memory)
    if batch_demo > 0:
        (batch creation from demo_memory)
    if batch_episode > 0:
        (i, b, w) = self.episode_memory.sample(batch_episode, self.local_step)
        indexes.extend(i)
        batchs.extend(b)
        weights.extend(w)
        # 2 means episode_memory
        memory_types.extend([2 for _ in range(batch_episode)])

    (abridgement)

    for i in range(self.batch_size):
        (learning)

        # Update the priority
        if memory_types[i] == 0:
            (update replay_memory)
        elif memory_types[i] == 1:
            (update demo_memory)
        elif memory_types[i] == 2:
            # update episode_memory
            self.episode_memory.update(indexes[i], batchs[i], priority)
        else:
            assert False

    (abridgement)
It is almost the same as the Rainbow implementation. However, although episode data is produced by each Actor, it is managed on the Learner side in order to reduce the amount of interprocess communication.
Learner
class Learner():
    def __init__(self):
        (abridgement)

        # When the Learner is initialized, create per-Actor variables for episode management
        self.episode_exp = [[] for _ in range(self.actors_num)]
        self.total_reward = [0 for _ in range(self.actors_num)]

    def train(self):
        (abridgement)

        # Add the experiences sent from the Actors to the Learner
        for _ in range(self.exp_q.qsize()):
            exp = self.exp_q.get(timeout=1)

            # add to memory
            self.memory.add(exp[0], exp[0][4])

            # add to episode_exp
            self.total_reward[exp[1]] += exp[0][2]
            self.episode_exp[exp[1]].append(exp[0])

            if exp[2]:  # terminal
                self.episode_memory.add_episode(
                    self.episode_exp[exp[1]],
                    self.total_reward[exp[1]]
                )
                self.episode_exp[exp[1]] = []
                self.total_reward[exp[1]] = 0

        (abridgement)
Actor
class Actor():
    def forward(self, observation):
        (abridgement)

        # Send to the Learner
        # (also pass the actor_index and the terminal flag)
        self.exp_q.put((exp, self.actor_index, self.recent_terminal))

        (abridgement)
MountainCar
This time I will try it with MountainCar.
MountainCar is a game where you move the car left and right to reach the flag at the top right.
The reward is always -1, so in short, the sooner you reach the flag, the higher your score.
Viewed as Q-learning, it is a fairly difficult task: no reward signal distinguishes good actions from bad ones until you actually reach the goal.
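As a quick sanity check of this reward structure, here is a minimal sketch that runs random actions, using the classic gym API (the 4-tuple step() return of the gym versions from around the time of this article):

```python
import gym

env = gym.make("MountainCar-v0")
obs = env.reset()
done = False
total_reward = 0
while not done:
    # The reward is -1 on every step regardless of the action taken.
    obs, reward, done, info = env.step(env.action_space.sample())
    total_reward += reward
print(total_reward)  # -200 if the 200-step time limit runs out before reaching the flag
env.close()
```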
No Processor is defined; we train on the plain MountainCar environment provided by gym, and logs are recorded every 2,000 steps. (This run uses the Rainbow implementation.)
The first 50,000 steps are a warm-up, followed by 100,000 training steps.
・Result
Learning is slow until plays that actually reach the goal have accumulated in memory to some extent; results start to appear around 130,000 steps.
The parameters other than the demo replay memory are the same as in the run above. Only the single demo episode shown below was prepared.
The parameters of the demo replay memory are as follows.
demo_memory = PERProportionalMemory(100_000, alpha=0.8)
demo_episode_dir = episode_save_dir
demo_ratio_initial = 1.0
demo_ratio_final = 1.0/512.0
demo_ratio_steps = warmup + 50_000
・Result
Results already start to appear around 70,000 steps.
The parameters other than EpisodeMemory are the same as in the run above. The parameters of EpisodeMemory are as follows.
episode_memory = PERProportionalMemory(2_000, alpha=0.8)
episode_ratio = 1.0/8.0
・Result
Results start to appear around 80,000 steps.
Both the demo replay memory and EpisodeMemory are effective. (The x-axis does not include the warm-up steps.)
It was easier to implement than I expected. Since there are few tasks for which no demo play can be prepared, I think this is a very broadly applicable and effective method. The evolution of reinforcement learning has not stopped yet; I am looking forward to whatever comes next.