I implemented R2D2 previously, but I could not get mini-batch learning to work. After some trial and error, I have now managed to implement it.
It has been a while since the previous article, so I will go over the overall flow again roughly, and also fix the mistakes in the previous implementation.
This article consists of two parts: this commentary and a hyperparameter section. For the hyperparameters, see [Reinforcement learning] R2D2 implementation / explanation revenge: hyperparameter explanation (Keras-RL).
Postscript: R2D3 has also been implemented: [Reinforcement learning] I implemented / explained R2D3 (Keras-RL).
The code created in this article is below (this time it is only on GitHub).
As a review, here is the image of the DQN (Rainbow) implementation again. See the previously posted article for a detailed explanation.
The flow of learning with DQN (Rainbow) is summarized below.
DQN stores experience data (experiences) in memory as follows.
If the Multi-Step learning step is 1, the next state is $t+1$; if it is 3 steps, it is $t+3$.
| | Formula |
|---|---|
| Previous state | observation: t(n-6) ~ t(n-3) |
| Next state | observation: t(n-3) ~ t(n) |
| Action | action: t(n-3) |
| Reward | reward: t(n) |
In addition, the size held inside each variable is as follows.
| | Length to hold | Length to save in memory |
|---|---|---|
| rewards | multisteps | 0 (used for calculation only) |
| calculated reward | 1 | 1 (current state) |
| actions | multisteps + 1 | 1 (previous state) |
| observations | input_sequence + multisteps | input_sequence + multisteps |
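For illustration, the single calculated reward above is just the discounted sum of the rewards buffer. A minimal sketch (the constant names here are mine, not the repository's):

```python
GAMMA = 0.99    # assumed discount rate
multisteps = 3  # assumed multi-step length

def calc_multistep_reward(rewards):
    # rewards holds the last `multisteps` rewards, oldest first
    return sum((GAMMA ** i) * r for i, r in enumerate(rewards))

print(calc_multistep_reward([1.0, 0.0, 1.0]))  # 1.0 + 0.99**2 * 1.0
```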
In the previous article, for Multi-Step learning I referred to the action at $t_n$, which was incorrect. Oops... $t_{n-multisteps}$ is correct, since it must refer to the action taken in the previous state.
The previous article is below.
To put it simply, with Prioritized Experience Replay, experiences are sampled according to their priority, so the sampled experiences are biased, and that bias in turn biases the learning. Importance sampling corrects for this.
Specifically, experiences selected with a high probability are given a low weight in the Q-value update, and experiences selected with a low probability are given a high weight.
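As a rough sketch of that weighting (the usual formulation from the Prioritized Experience Replay paper; the value of beta and the normalization by the maximum weight are common conventions, not taken from this repository):

```python
import numpy as np

def is_weights(probs, memory_size, beta=0.4):
    # w_i = (N * P(i)) ** (-beta), normalized so the largest weight is 1
    w = (memory_size * np.asarray(probs)) ** (-beta)
    return w / w.max()

# experiences sampled with high probability get a small weight
print(is_weights([0.1, 0.01, 0.001], memory_size=1000))
```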
Previously my implementation was subtly wrong and it did not learn well. I had been applying the weight to the updated Q value itself, but it should be applied to the TD error (the variable naming was also poor). Also, since the weight is meant to correct the Q-value update, it is not applied to the priority.
-Previous implementation (pseudo code)
IS
```
def train():
    # Sample experiences from PER according to their priority
    batchs, batch_weight = memory.sample(batch_size)

    # Get the Q values of the previous state from model
    # state0_qvals holds one Q value per action
    state0_qvals = model.predict(state0_batch)

    for batch_i in range(batch_size):
        reward = (the reward of batchs[batch_i])
        action = (the action of batchs[batch_i])
        q0 = state0_qvals[batch_i][action]  # Q value before the update

        # Get the maximum Q value of the current state from model and target_model
        # (the way it is obtained differs between DQN and DDQN)
        maxq = (obtained from model and target_model)

        td_error = reward + (gamma ** reward_multisteps) * maxq
        td_error *= batch_weight
        priority = abs(td_error - q0)

        # Train by changing only the Q value of the chosen action
        state0_qvals[batch_i][action] = td_error

    # train
    model.train_on_batch(state0_batch, state0_qvals)
```
-Implementation after change (pseudo code)
IS
```
def train():
    # Sample experiences from PER according to their priority
    batchs, batch_weight = memory.sample(batch_size)

    # Get the Q values of the previous state from model
    # state0_qvals holds one Q value per action
    state0_qvals = model.predict(state0_batch)

    for batch_i in range(batch_size):
        reward = (the reward of batchs[batch_i])
        action = (the action of batchs[batch_i])
        q0 = state0_qvals[batch_i][action]  # Q value before the update

        # Get the maximum Q value of the current state from model and target_model
        # (the way it is obtained differs between DQN and DDQN)
        maxq = (obtained from model and target_model)

        # * Subtract q0 so that a proper td_error is produced
        # * batch_weight is applied to the Q value update below, not here
        td_error = reward + (gamma ** reward_multisteps) * maxq - q0

        # * The absolute value of td_error becomes the priority as is
        priority = abs(td_error)

        # Train by changing only the Q value of the chosen action
        # * Since td_error is now a difference, apply the weight to it and
        #   update the Q value by that difference
        state0_qvals[batch_i][action] += td_error * batch_weight

    # train
    model.train_on_batch(state0_batch, state0_qvals)
```
Previously I could not implement mini-batch learning because I did not understand Keras's stateful LSTM well. The previous survey article is below.
It turns out that hidden_states holds batch_size worth of states and that they can be specified directly, so learning can now proceed on multiple sequences at the same time.
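For reference, here is a minimal sketch of reading and overwriting the per-batch hidden states of a Keras stateful LSTM (the layer sizes are arbitrary and not the ones used here):

```python
import tensorflow as tf
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

batch_size = 4  # one hidden state slot per batch index

inp = Input(batch_shape=(batch_size, 1, 8))
lstm_layer = LSTM(16, stateful=True)
out = Dense(2)(lstm_layer(inp))
model = Model(inp, out)

# Read the current hidden states: [h, c], each with shape (batch_size, 16)
h, c = [tf.keras.backend.get_value(s) for s in lstm_layer.states]

# Overwrite them, e.g. with states restored from replay memory
lstm_layer.reset_states(states=[h, c])
```

Since each of the batch_size slots keeps its own state, several sequences can be advanced in parallel within a single predict or train call.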
DRQN (R2D2)
For clarity, I will explain using R2D2 with the parallel-processing part removed. The previous article is below.
It is an image diagram in the same style as the DQN one.
It has become quite complicated... I drew this figure because I kept getting confused while implementing it.
The way the Q value is updated and the priority is computed is the same as in DQN, so it is omitted from the figure.
The key points are input_sequence and input_length. Last time I was not aware of this distinction (I effectively assumed input_sequence = 1 and called what is really input_length the input sequence).
input_sequence is the length of the state fed to the model at once, and input_length is the number of such inputs. The Q value is updated, and a priority computed, for each of the input_length positions. (I am a little unsure about this interpretation, but Section 2.3 of the R2D2 paper proposes a new way of computing the priority, and it makes sense if one experience yields multiple TD errors as described above; see the sketch below.)
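For reference, the priority proposed in Section 2.3 of the R2D2 paper mixes the maximum and the mean of the absolute TD errors obtained from one sequence ($\eta = 0.9$ in the paper); a small sketch:

```python
import numpy as np

def sequence_priority(td_errors, eta=0.9):
    # one TD error per input_length position -> one priority per experience
    abs_td = np.abs(td_errors)
    return eta * np.max(abs_td) + (1.0 - eta) * np.mean(abs_td)

print(sequence_priority([0.5, -2.0, 0.1]))
```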
The size held inside each variable is as follows.
| | Length to hold | Length to save in memory |
|---|---|---|
| rewards | multisteps + input_length - 1 | 0 (used for calculation only) |
| calculated rewards | input_length | input_length |
| actions | multisteps + input_length | input_length (from the previous state) |
| hidden states | burnin + multisteps + input_length + 1 | 1 (the oldest state) |
| observations | burnin + input_sequence + multisteps + input_length - 1 | 0 (used to build the summary below) |
| summarized observations | burnin + multisteps + input_length | same length |
The rescaling function was introduced in R2D2 to be used instead of reward clipping (-1 to 1). I used to be bothered by the inverse function, but I forced things so that it is no longer needed.
The formula for deriving the TD error using the rescaling function is as follows ($y_t$ is the TD error).
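For reference, written in this article's multi-step notation, the target from the R2D2 paper has roughly the following form:

$$y_t = h\left( r_t + \gamma^{\text{multisteps}} \, h^{-1}\!\left( \max_a Q(s_{t+\text{multisteps}}, a) \right) \right)$$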
Expand $h()$ in the above formula.
Applying a function and then its inverse returns the original value: $h(h^{-1}(x)) = x$, so that part of the right-hand side cancels out (the $\gamma$ inside is ignored as an acceptable error...).
Then it becomes as follows.
The rescaling function is then applied only to the reward ($r_t$). Looking at the graph of the function, you can see that rewards are compressed nicely (a reward of 100 becomes roughly 10), which makes it a good alternative to clipping.
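For reference, the rescaling function itself, as defined in the R2D2 paper (with $\epsilon = 10^{-3}$), is easy to write down; the quick check below reproduces the "100 becomes roughly 10" behaviour.

```python
import numpy as np

def rescaling(x, epsilon=0.001):
    # h(x) = sign(x) * (sqrt(|x| + 1) - 1) + epsilon * x
    return np.sign(x) * (np.sqrt(np.abs(x) + 1.0) - 1.0) + epsilon * x

print(rescaling(100.0))  # ≈ 9.15
```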
The previous article is below.
Reference: Complete understanding of Python threading and multiprocessing
At first I used a Queue, but the weights data is large and looked like it would become a bottleneck, so I investigated communication between the processes. The results of that survey are in the following article.
Based on that, communication is organized as follows (in the end, the Queue is still used as well).
Exchange of information between processes is implemented with shared memory. I do not lock it, because for each value the writer and the reader are clearly separated.
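As a minimal sketch of that pattern (the variable names are illustrative, not the ones in the repository): small control values sit in shared memory created with lock=False, and each value has exactly one writer.

```python
import multiprocessing as mp
import time

def learner(train_count, is_learning):
    # the learner process is the only writer of train_count
    while is_learning.value:
        train_count.value += 1  # ... one training step ...
        time.sleep(0.1)

if __name__ == "__main__":
    train_count = mp.Value("i", 0, lock=False)     # written by the learner only
    is_learning = mp.Value("b", True, lock=False)  # written by the main process only
    p = mp.Process(target=learner, args=(train_count, is_learning))
    p.start()
    time.sleep(1)
    is_learning.value = False
    p.join()
    print(train_count.value)
```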
Callbacks
Inter-process communication turned out to be quite costly, and there is processing that spans both the Actor and the Learner, so I implemented callbacks for it.
I mainly use them for save/load and logging. The base class of the implemented callback is as follows.
R2D2Callback
```python
import rl.callbacks


class R2D2Callback(rl.callbacks.Callback):
    def __init__(self):
        pass

    #--- train ---
    def on_r2d2_train_begin(self):
        pass

    def on_r2d2_train_end(self):
        pass

    #--- learner ---
    def on_r2d2_learner_begin(self, learner):
        pass

    def on_r2d2_learner_end(self, learner):
        pass

    def on_r2d2_learner_train_begin(self, learner):
        pass

    def on_r2d2_learner_train_end(self, learner):
        pass

    #--- actor ---
    # The following, plus the methods inherited from rl.callbacks.Callback
    def on_r2d2_actor_begin(self, actor_index, runner):
        pass

    def on_r2d2_actor_end(self, actor_index, runner):
        pass
```
As you can see, it inherits from Keras-RL's Callback, which is used as-is by the Agent.
Note that the train, learner, and actor methods are each called from a different process, so if you write processing that spans them, values will not carry over because the processes are different.
Save/load and logging built on these callbacks are explained in the parameter article.
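As a hypothetical usage sketch (only the method names of the base class above are assumed), a callback that counts Learner training steps could look like this:

```python
class TrainCountLogger(R2D2Callback):
    def __init__(self):
        self.count = 0

    def on_r2d2_learner_train_end(self, learner):
        # runs inside the Learner process, so self.count lives only there
        self.count += 1
        if self.count % 1000 == 0:
            print("learner train steps: {}".format(self.count))
```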
GPU
When I simply run it on the GPU with tensorflow 2.1.0, I get the following error.
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(32, 12), b.shape=(12, 128), m=32, n=128, k=12
Apparently this error occurs when the GPU is used from multiple processes. Referring to the following, I configured the GPU so that it can be used from multiple processes.

```python
import tensorflow as tf

# I want this to be set in every process, so it is written at global scope
for device in tf.config.experimental.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(device, True)
```
Also, inside R2D2Manager I wrote code that automatically determines whether to use the CPU or the GPU.

```python
import tensorflow as tf

def train(self):
    # (abridgement)
    if len(tf.config.experimental.list_physical_devices('GPU')) > 0:
        self.enable_GPU = True
    else:
        self.enable_GPU = False
    # (abridgement)
```
The image processing layers in the NN (neural network) are unchanged from DQN, so I extended the code so that this part can be swapped out.
The NN layers in DQN are as follows.
| | Layer | Overview |
|---|---|---|
| 1 | Input layer | |
| 2 | Input conversion layer | Layer that generalizes the input format |
| 3 | Image processing layer | For image processing |
| 4 | LSTM layer | When using LSTM |
| 5 | Dueling Network layer | When using Dueling Network |
| 6 | Dense layer | Included when using Dueling Network |
| 7 | (Output layer) | Actually included in the Dueling Network layer |
The input conversion layer flattens the input into a one-dimensional output (Flatten), whatever the input format. It is written assuming the following four types of input.
InputType
```python
import enum


class InputType(enum.Enum):
    VALUES = 1    # no image
    GRAY_2ch = 3  # (width, height)
    GRAY_3ch = 4  # (width, height, 1)
    COLOR = 5     # (width, height, ch)
```
Just flatten it.

```python
input_sequence = 4
input_shape = (3,)

c = Input(shape=(input_sequence,) + input_shape)
# output_shape == (None, 4, 3)
c = Flatten()(c)
# output_shape == (None, 12)
```
Just flatten it here as well, but wrap it in TimeDistributed to keep the timesteps.

```python
batch_size = 16
input_sequence = 4
input_shape = (3,)

c = Input(batch_shape=(batch_size, input_sequence,) + input_shape)
# output_shape == (16, 4, 3)
c = TimeDistributed(Flatten())(c)
# output_shape == (16, 4, 3)
```
This is the conversion used in DQN: the channel axis is replaced with input_sequence (the number of inputs).

```python
input_sequence = 4
input_shape = (84, 84)  # (width, height)

c = Input(shape=(input_sequence,) + input_shape)
# output_shape == (None, 4, 84, 84)
c = Permute((2, 3, 1))(c)  # layer that changes the axis order
# output_shape == (None, 84, 84, 4)
c = (image processing layer)(c)
```
If LSTM is enabled, the sequence information can be carried by the timesteps, so a channel axis is added instead.

```python
batch_size = 16
input_sequence = 4
input_shape = (84, 84)  # (width, height)

c = Input(batch_shape=(batch_size, input_sequence,) + input_shape)
# output_shape == (16, 4, 84, 84)
c = Reshape((input_sequence,) + input_shape + (1,))(c)  # add a channel axis
# output_shape == (16, 4, 84, 84, 1)
c = (image processing layer)(c)
```
It is passed to the image processing layer as is. However, the input_sequence information cannot be expressed.

```python
input_sequence = 4
input_shape = (84, 84, 3)  # (width, height, channel)

c = Input(shape=input_shape)
# output_shape == (None, 84, 84, 3)
c = (image processing layer)(c)
```
Nothing special here either.

```python
batch_size = 16
input_sequence = 4
input_shape = (84, 84, 3)  # (width, height, channel)

c = Input(batch_shape=(batch_size, input_sequence,) + input_shape)
# output_shape == (16, 4, 84, 84, 3)
c = (image processing layer)(c)
```
The ImageModel class is defined so that the layer can be changed.
The argument c of create_image_model is passed in the following format.
No LSTM: shape(batch_size, width, height, channel)
With LSTM: shape(batch_size, timesteps, width, height, channel)
The return value should be in the following format:
No LSTM: shape(batch_size, dim)
With LSTM: shape(batch_size, timesteps, dim)
The following is an example of DQN format.
DQNImageModel
```python
class DQNImageModel(ImageModel):
    """native dqn image model
    https://arxiv.org/abs/1312.5602
    """

    def create_image_model(self, c, enable_lstm):
        """
        c: shape (batch_size, width, height, channel)
        return: shape (batch_size, dim)
        """
        if enable_lstm:
            c = TimeDistributed(Conv2D(32, (8, 8), strides=(4, 4), padding="same"), name="c1")(c)
            c = Activation("relu")(c)
            c = TimeDistributed(Conv2D(64, (4, 4), strides=(2, 2), padding="same"), name="c2")(c)
            c = Activation("relu")(c)
            c = TimeDistributed(Conv2D(64, (3, 3), strides=(1, 1), padding="same"), name="c3")(c)
            c = Activation("relu")(c)
            c = TimeDistributed(Flatten())(c)
        else:
            c = Conv2D(32, (8, 8), strides=(4, 4), padding="same", name="c1")(c)
            c = Activation("relu")(c)
            c = Conv2D(64, (4, 4), strides=(2, 2), padding="same", name="c2")(c)
            c = Activation("relu")(c)
            c = Conv2D(64, (3, 3), strides=(1, 1), padding="same", name="c3")(c)
            c = Activation("relu")(c)
            c = Flatten()(c)
        return c
```
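As a sketch of how this layer could be swapped out (the layer sizes are arbitrary; only the create_image_model(c, enable_lstm) contract described above is assumed):

```python
from tensorflow.keras.layers import Activation, Conv2D, Flatten, TimeDistributed

class SmallImageModel(ImageModel):
    def create_image_model(self, c, enable_lstm):
        conv = Conv2D(16, (8, 8), strides=(4, 4), padding="same")
        if enable_lstm:
            # keep the timesteps axis by wrapping every layer in TimeDistributed
            c = TimeDistributed(conv)(c)
            c = Activation("relu")(c)
            c = TimeDistributed(Flatten())(c)
        else:
            c = conv(c)
            c = Activation("relu")(c)
            c = Flatten()(c)
        return c
```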
The previous commentary article is below.
DQN itself uses only the ε-greedy exploration policy, but several other policies are introduced in the article above. I implemented them so that they can be used, although ε-greedy seems to be sufficient. The details are explained in the parameter article.
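For reference, ε-greedy action selection itself is just the following (a generic sketch, not the repository's implementation):

```python
import random
import numpy as np

def select_action(qvals, epsilon):
    # with probability ε take a random action, otherwise the greedy one
    if random.random() < epsilon:
        return random.randrange(len(qvals))
    return int(np.argmax(qvals))
```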
That is the implementation for now. Next, I would like to write an article with examples of how to set each parameter.