I implemented R2D2 previously, but I could not get mini-batch learning to work. After some trial and error, I have now managed to implement it.
It has been a while since the previous article, so I will go over the overall flow again roughly, and also fix the mistakes in the previous implementation.
This article consists of two parts: this commentary and a hyperparameter section. For the hyperparameters, see [Reinforcement learning] R2D2 implementation / explanation revenge: hyperparameter explanation (Keras-RL).
Postscript: R2D3 has also been implemented: [Reinforcement learning] I implemented / explained R2D3 (Keras-RL).
The code created in this article is below (this time it is only on GitHub).
As a review, here is the image of the DQN (Rainbow) implementation again. See the previously posted article for a detailed explanation.
The flow of learning with DQN (Rainbow) is summarized below.
DQN stores experience data (experiences) in memory as follows.
If the Multi-Step learning step is 1, the next state is $t+1$; if it is 3 steps, it is $t+3$.
| | Formula |
|---|---|
| Previous state | observation: t(n-6) ~ t(n-3) |
| Next state | observation: t(n-3) ~ t(n) |
| Action | action: t(n-3) |
| Reward | reward: t(n) |
In addition, the size held inside each variable is as follows.
| | Length to hold | Length to save in memory |
|---|---|---|
| rewards | multisteps | 0 (used for calculation only) |
| calculated reward | 1 | 1 (current state) |
| actions | multisteps + 1 | 1 (previous state) |
| observations | input_sequence + multisteps | input_sequence + multisteps |
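For illustration, the single calculated reward above is just the discounted sum of the rewards buffer. A minimal sketch (the constant names here are mine, not the repository's):

```python
GAMMA = 0.99    # assumed discount rate
multisteps = 3  # assumed multi-step length

def calc_multistep_reward(rewards):
    # rewards holds the last `multisteps` rewards, oldest first
    return sum((GAMMA ** i) * r for i, r in enumerate(rewards))

print(calc_multistep_reward([1.0, 0.0, 1.0]))  # 1.0 + 0.99**2 * 1.0
```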
In the previous article, for Multi-Step learning I referred to the action at $t_n$, which was incorrect. Oops... $t_{n-multisteps}$ is correct, since it must refer to the action taken in the previous state.
The previous article is below.
To put it simply, with Prioritized Experience Replay, experiences are sampled according to their priority, so the sampled experiences are biased, and that bias in turn biases the learning. Importance sampling corrects for this.
Specifically, experiences selected with a high probability are given a low weight in the Q-value update, and experiences selected with a low probability are given a high weight.
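As a rough sketch of that weighting (the usual formulation from the Prioritized Experience Replay paper; the value of beta and the normalization by the maximum weight are common conventions, not taken from this repository):

```python
import numpy as np

def is_weights(probs, memory_size, beta=0.4):
    # w_i = (N * P(i)) ** (-beta), normalized so the largest weight is 1
    w = (memory_size * np.asarray(probs)) ** (-beta)
    return w / w.max()

# experiences sampled with high probability get a small weight
print(is_weights([0.1, 0.01, 0.001], memory_size=1000))
```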
Previously my implementation was subtly wrong and it did not learn well. I had been applying the weight to the updated Q value itself, but it should be applied to the TD error (the variable naming was also poor). Also, since the weight is meant to correct the Q-value update, it is not applied to the priority.
-Previous implementation (pseudo code)
IS
```
def train():
    # Sample experiences from PER according to their priority
    batchs, batch_weight = memory.sample(batch_size)

    # Get the Q values of the previous state from model
    # state0_qvals holds one Q value per action
    state0_qvals = model.predict(state0_batch)

    for batch_i in range(batch_size):
        reward = (the reward of batchs[batch_i])
        action = (the action of batchs[batch_i])
        q0 = state0_qvals[batch_i][action]  # Q value before the update

        # Get the maximum Q value of the current state from model and target_model
        # (the way it is obtained differs between DQN and DDQN)
        maxq = (obtained from model and target_model)

        td_error = reward + (gamma ** reward_multisteps) * maxq
        td_error *= batch_weight
        priority = abs(td_error - q0)

        # Train by changing only the Q value of the chosen action
        state0_qvals[batch_i][action] = td_error

    # train
    model.train_on_batch(state0_batch, state0_qvals)
```
-Implementation after change (pseudo code)
IS
```
def train():
    # Sample experiences from PER according to their priority
    batchs, batch_weight = memory.sample(batch_size)

    # Get the Q values of the previous state from model
    # state0_qvals holds one Q value per action
    state0_qvals = model.predict(state0_batch)

    for batch_i in range(batch_size):
        reward = (the reward of batchs[batch_i])
        action = (the action of batchs[batch_i])
        q0 = state0_qvals[batch_i][action]  # Q value before the update

        # Get the maximum Q value of the current state from model and target_model
        # (the way it is obtained differs between DQN and DDQN)
        maxq = (obtained from model and target_model)

        # * Subtract q0 so that a proper td_error is produced
        # * batch_weight is applied to the Q value update below, not here
        td_error = reward + (gamma ** reward_multisteps) * maxq - q0

        # * The absolute value of td_error becomes the priority as is
        priority = abs(td_error)

        # Train by changing only the Q value of the chosen action
        # * Since td_error is now a difference, apply the weight to it and
        #   update the Q value by that difference
        state0_qvals[batch_i][action] += td_error * batch_weight

    # train
    model.train_on_batch(state0_batch, state0_qvals)
```
Previously I could not implement mini-batch learning because I did not understand Keras's stateful LSTM well. The previous survey article is below.
It turns out that hidden_states holds batch_size worth of states and that they can be specified directly, so learning can now proceed on multiple sequences at the same time.
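For reference, here is a minimal sketch of reading and overwriting the per-batch hidden states of a Keras stateful LSTM (the layer sizes are arbitrary and not the ones used here):

```python
import tensorflow as tf
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

batch_size = 4  # one hidden state slot per batch index

inp = Input(batch_shape=(batch_size, 1, 8))
lstm_layer = LSTM(16, stateful=True)
out = Dense(2)(lstm_layer(inp))
model = Model(inp, out)

# Read the current hidden states: [h, c], each with shape (batch_size, 16)
h, c = [tf.keras.backend.get_value(s) for s in lstm_layer.states]

# Overwrite them, e.g. with states restored from replay memory
lstm_layer.reset_states(states=[h, c])
```

Since each of the batch_size slots keeps its own state, several sequences can be advanced in parallel within a single predict or train call.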
DRQN (R2D2)
For clarity, I will explain using R2D2 with the parallel-processing part removed. The previous article is below.
It is an image diagram in the same style as the DQN one.
It has become quite complicated... I drew this figure because I kept getting confused while implementing it.
The way the Q value is updated and the priority is computed is the same as in DQN, so it is omitted from the figure.
The key points are input_sequence and input_length. Last time I was not aware of this distinction (I effectively assumed input_sequence = 1 and called what is really input_length the input sequence).
input_sequence is the length of the state fed to the model at once, and input_length is the number of such inputs. The Q value is updated, and a priority computed, for each of the input_length positions. (I am a little unsure about this interpretation, but Section 2.3 of the R2D2 paper proposes a new way of computing the priority, and it makes sense if one experience yields multiple TD errors as described above; see the sketch below.)
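For reference, the priority proposed in Section 2.3 of the R2D2 paper mixes the maximum and the mean of the absolute TD errors obtained from one sequence ($\eta = 0.9$ in the paper); a small sketch:

```python
import numpy as np

def sequence_priority(td_errors, eta=0.9):
    # one TD error per input_length position -> one priority per experience
    abs_td = np.abs(td_errors)
    return eta * np.max(abs_td) + (1.0 - eta) * np.mean(abs_td)

print(sequence_priority([0.5, -2.0, 0.1]))
```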
The size held inside each variable is as follows.
| | Length to hold | Length to save in memory |
|---|---|---|
| rewards | multisteps + input_length - 1 | 0 (used for calculation only) |
| calculated rewards | input_length | input_length |
| actions | multisteps + input_length | input_length (from the previous state) |
| hidden states | burnin + multisteps + input_length + 1 | 1 (the oldest state) |
| observations | burnin + input_sequence + multisteps + input_length - 1 | 0 (used to build the summary below) |
| summarized observations | burnin + multisteps + input_length | same length |
The rescaling function was introduced in R2D2 to be used instead of reward clipping (-1 to 1). I used to be bothered by the inverse function, but I forced things so that it is no longer needed.
The formula for deriving the TD error using the rescaling function is as follows ($y_t$ is the TD error).
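For reference, written in this article's multi-step notation, the target from the R2D2 paper has roughly the following form:

$$y_t = h\left( r_t + \gamma^{\text{multisteps}} \, h^{-1}\!\left( \max_a Q(s_{t+\text{multisteps}}, a) \right) \right)$$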
Expand $h()$ in the above formula.
Applying a function and then its inverse returns the original value: $h(h^{-1}(x)) = x$, so that part of the right-hand side cancels out (the $\gamma$ inside is ignored as an acceptable error...).
Then it becomes as follows.
The rescaling function is then applied only to the reward ($r_t$). Looking at the graph of the function, you can see that rewards are compressed nicely (a reward of 100 becomes roughly 10), which makes it a good alternative to clipping.
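For reference, the rescaling function itself, as defined in the R2D2 paper (with $\epsilon = 10^{-3}$), is easy to write down; the quick check below reproduces the "100 becomes roughly 10" behaviour.

```python
import numpy as np

def rescaling(x, epsilon=0.001):
    # h(x) = sign(x) * (sqrt(|x| + 1) - 1) + epsilon * x
    return np.sign(x) * (np.sqrt(np.abs(x) + 1.0) - 1.0) + epsilon * x

print(rescaling(100.0))  # ≈ 9.15
```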
The previous article is below.
Reference: Complete understanding of Python threading and multiprocessing
At first I used a Queue, but the weights data is large and looked like it would become a bottleneck, so I investigated communication between the processes. The results of that survey are in the following article.
Based on that, communication is organized as follows (in the end, the Queue is still used as well).
Exchange of information between processes is implemented with shared memory. I do not lock it, because for each value the writer and the reader are clearly separated.
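As a minimal sketch of that pattern (the variable names are illustrative, not the ones in the repository): small control values sit in shared memory created with lock=False, and each value has exactly one writer.

```python
import multiprocessing as mp
import time

def learner(train_count, is_learning):
    # the learner process is the only writer of train_count
    while is_learning.value:
        train_count.value += 1  # ... one training step ...
        time.sleep(0.1)

if __name__ == "__main__":
    train_count = mp.Value("i", 0, lock=False)     # written by the learner only
    is_learning = mp.Value("b", True, lock=False)  # written by the main process only
    p = mp.Process(target=learner, args=(train_count, is_learning))
    p.start()
    time.sleep(1)
    is_learning.value = False
    p.join()
    print(train_count.value)
```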
Callbacks
Inter-process communication turned out to be quite costly, and there is processing that spans both the Actor and the Learner, so I implemented callbacks for it.
I mainly use them for save/load and logging. The base class of the implemented callback is as follows.
R2D2Callback
```python
import rl.callbacks


class R2D2Callback(rl.callbacks.Callback):
    def __init__(self):
        pass

    #--- train ---
    def on_r2d2_train_begin(self):
        pass

    def on_r2d2_train_end(self):
        pass

    #--- learner ---
    def on_r2d2_learner_begin(self, learner):
        pass

    def on_r2d2_learner_end(self, learner):
        pass

    def on_r2d2_learner_train_begin(self, learner):
        pass

    def on_r2d2_learner_train_end(self, learner):
        pass

    #--- actor ---
    # The following, plus the methods inherited from rl.callbacks.Callback
    def on_r2d2_actor_begin(self, actor_index, runner):
        pass

    def on_r2d2_actor_end(self, actor_index, runner):
        pass
```
As you can see, it inherits from Keras-RL's Callback, which is used as-is by the Agent.
Note that the train, learner, and actor methods are each called from a different process, so if you write processing that spans them, values will not carry over because the processes are different.
Save/load and logging built on these callbacks are explained in the parameter article.
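As a hypothetical usage sketch (only the method names of the base class above are assumed), a callback that counts Learner training steps could look like this:

```python
class TrainCountLogger(R2D2Callback):
    def __init__(self):
        self.count = 0

    def on_r2d2_learner_train_end(self, learner):
        # runs inside the Learner process, so self.count lives only there
        self.count += 1
        if self.count % 1000 == 0:
            print("learner train steps: {}".format(self.count))
```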
GPU
When I simply run it on the GPU with tensorflow 2.1.0, I get the following error.
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(32, 12), b.shape=(12, 128), m=32, n=128, k=12
Apparently this error occurs when the GPU is used from multiple processes. Referring to the following, I configured the GPU so that it can be used from multiple processes.

```python
import tensorflow as tf

# I want this to be set in every process, so it is written at global scope
for device in tf.config.experimental.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(device, True)
```
Also, inside R2D2Manager I wrote code that automatically determines whether to use the CPU or the GPU.

```python
import tensorflow as tf

def train(self):
    # (abridgement)
    if len(tf.config.experimental.list_physical_devices('GPU')) > 0:
        self.enable_GPU = True
    else:
        self.enable_GPU = False
    # (abridgement)
```
The image processing layers in the NN (neural network) are unchanged from DQN, so I extended the code so that this part can be swapped out.
The NN layers in DQN are as follows.
| | Layer | Overview |
|---|---|---|
| 1 | Input layer | |
| 2 | Input conversion layer | Layer that generalizes the input format |
| 3 | Image processing layer | For image processing |
| 4 | LSTM layer | When using LSTM |
| 5 | Dueling Network layer | When using Dueling Network |
| 6 | Dense layer | Included when using Dueling Network |
| 7 | (Output layer) | Actually included in the Dueling Network layer |
The input conversion layer flattens the input into a one-dimensional output (Flatten), whatever the input format. It is written assuming the following four types of input.
InputType
```python
import enum


class InputType(enum.Enum):
    VALUES = 1    # no image
    GRAY_2ch = 3  # (width, height)
    GRAY_3ch = 4  # (width, height, 1)
    COLOR = 5     # (width, height, ch)
```
Just flatten it.

```python
input_sequence = 4
input_shape = (3,)

c = Input(shape=(input_sequence,) + input_shape)
# output_shape == (None, 4, 3)
c = Flatten()(c)
# output_shape == (None, 12)
```
Just flatten it here as well, but wrap it in TimeDistributed to keep the timesteps.

```python
batch_size = 16
input_sequence = 4
input_shape = (3,)

c = Input(batch_shape=(batch_size, input_sequence,) + input_shape)
# output_shape == (16, 4, 3)
c = TimeDistributed(Flatten())(c)
# output_shape == (16, 4, 3)
```
This is the conversion used in DQN: the channel axis is replaced with input_sequence (the number of inputs).

```python
input_sequence = 4
input_shape = (84, 84)  # (width, height)

c = Input(shape=(input_sequence,) + input_shape)
# output_shape == (None, 4, 84, 84)
c = Permute((2, 3, 1))(c)  # layer that changes the axis order
# output_shape == (None, 84, 84, 4)
c = (image processing layer)(c)
```
If LSTM is enabled, the sequence information can be carried by the timesteps, so a channel axis is added instead.

```python
batch_size = 16
input_sequence = 4
input_shape = (84, 84)  # (width, height)

c = Input(batch_shape=(batch_size, input_sequence,) + input_shape)
# output_shape == (16, 4, 84, 84)
c = Reshape((input_sequence,) + input_shape + (1,))(c)  # add a channel axis
# output_shape == (16, 4, 84, 84, 1)
c = (image processing layer)(c)
```
It is passed to the image processing layer as is. However, the input_sequence information cannot be expressed.

```python
input_sequence = 4
input_shape = (84, 84, 3)  # (width, height, channel)

c = Input(shape=input_shape)
# output_shape == (None, 84, 84, 3)
c = (image processing layer)(c)
```
Nothing special here either.

```python
batch_size = 16
input_sequence = 4
input_shape = (84, 84, 3)  # (width, height, channel)

c = Input(batch_shape=(batch_size, input_sequence,) + input_shape)
# output_shape == (16, 4, 84, 84, 3)
c = (image processing layer)(c)
```
The ImageModel class is defined so that the layer can be changed.
The argument c of create_image_model is passed in the following format.
No LSTM: shape(batch_size, width, height, channel)
With LSTM: shape(batch_size, timesteps, width, height, channel)
The return value should be in the following format:
No LSTM: shape(batch_size, dim)
With LSTM: shape(batch_size, timesteps, dim)
The following is an example of DQN format.
DQNImageModel
```python
class DQNImageModel(ImageModel):
    """native dqn image model
    https://arxiv.org/abs/1312.5602
    """

    def create_image_model(self, c, enable_lstm):
        """
        c: shape (batch_size, width, height, channel)
        return: shape (batch_size, dim)
        """
        if enable_lstm:
            c = TimeDistributed(Conv2D(32, (8, 8), strides=(4, 4), padding="same"), name="c1")(c)
            c = Activation("relu")(c)
            c = TimeDistributed(Conv2D(64, (4, 4), strides=(2, 2), padding="same"), name="c2")(c)
            c = Activation("relu")(c)
            c = TimeDistributed(Conv2D(64, (3, 3), strides=(1, 1), padding="same"), name="c3")(c)
            c = Activation("relu")(c)
            c = TimeDistributed(Flatten())(c)
        else:
            c = Conv2D(32, (8, 8), strides=(4, 4), padding="same", name="c1")(c)
            c = Activation("relu")(c)
            c = Conv2D(64, (4, 4), strides=(2, 2), padding="same", name="c2")(c)
            c = Activation("relu")(c)
            c = Conv2D(64, (3, 3), strides=(1, 1), padding="same", name="c3")(c)
            c = Activation("relu")(c)
            c = Flatten()(c)
        return c
```
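As a sketch of how this layer could be swapped out (the layer sizes are arbitrary; only the create_image_model(c, enable_lstm) contract described above is assumed):

```python
from tensorflow.keras.layers import Activation, Conv2D, Flatten, TimeDistributed

class SmallImageModel(ImageModel):
    def create_image_model(self, c, enable_lstm):
        conv = Conv2D(16, (8, 8), strides=(4, 4), padding="same")
        if enable_lstm:
            # keep the timesteps axis by wrapping every layer in TimeDistributed
            c = TimeDistributed(conv)(c)
            c = Activation("relu")(c)
            c = TimeDistributed(Flatten())(c)
        else:
            c = conv(c)
            c = Activation("relu")(c)
            c = Flatten()(c)
        return c
```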
The previous commentary article is below.
DQN itself uses only the ε-greedy exploration policy, but several other policies are introduced in the article above. I implemented them so that they can be used, although ε-greedy seems to be sufficient. The details are explained in the parameter article.
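For reference, ε-greedy action selection itself is just the following (a generic sketch, not the repository's implementation):

```python
import random
import numpy as np

def select_action(qvals, epsilon):
    # with probability ε take a random action, otherwise the greedy one
    if random.random() < epsilon:
        return random.randrange(len(qvals))
    return int(np.argmax(qvals))
```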
That is the implementation for now. Next, I would like to write an article with examples of how to set each parameter.