This article explains the hyperparameters; each parameter is summarized below.
For the algorithm itself, see the companion article: [Reinforcement learning] R2D2 implementation / explanation revenge commentary (Keras-RL).
The code covered in this article is on GitHub.
These parameters are common to Rainbow (DQN) and R2D2.
| Parameter | Overview | Type | Example | Remarks |
|---|---|---|---|---|
| input_shape | Input shape | tuple | (84,84) | env.observation_space.shape |
| input_type | Input format specification | InputType | InputType.GRAY_2ch | Original implementation |
| image_model | Image layer model | ImageModel (original implementation) | DQNImageModel() | |
| nb_actions | Number of actions (number of outputs) | int | 4 | env.action_space.n |
| processor | Class that provides custom Gym functionality | Processor (Keras-rl) | None | |
input_shape The input shape, specified as a tuple. For an image it is (width, height). With Gym it is the shape returned by env.observation_space.shape.
input_type Supplements input_shape above (an original implementation). The following four types are available; choose the one that matches the contents of input_shape.
InputType
class InputType(enum.Enum):
    VALUES = 1    # not an image
    GRAY_2ch = 3  # (width, height)
    GRAY_3ch = 4  # (width, height, 1)
    COLOR = 5     # (width, height, ch)
image_model This is explained in the previous article. At present there are only two options: specify None if the input is not an image, and DQNImageModel() if it is.
nb_actions The number of outputs, specified as an int. It corresponds to the number of actions the agent can choose from; for example, with the three operations left, right, and stop, there are 3 actions. With Gym you can get it from env.action_space.n (Discrete spaces only).
processor A class that customizes the Env provided by Gym (see processor in the Keras-rl documentation).
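For example, with a Gym environment, input_shape and nb_actions come straight from the env itself; a minimal sketch using the classic CartPole-v0:

```python
import gym

env = gym.make("CartPole-v0")

print(env.observation_space.shape)  # (4,) -> use as input_shape (not an image, so InputType.VALUES)
print(env.action_space.n)           # 2   -> use as nb_actions (Discrete action space)
```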
| Parameter | Overview | Type | Example | Remarks |
|---|---|---|---|---|
| batch_size | Batch size | int | 32 | |
| optimizer | Optimization algorithm | Optimizer (Keras) | Adam(lr=0.0001) | Keras implementation |
| metrics | Evaluation functions | array | [] | Keras implementation |
| input_sequence | Number of input frames | int | 4 | |
| dense_units_num | Number of units in the Dense layer | int | 512 | |
| enable_dueling_network | Whether to use Dueling Network | bool | True | |
| dueling_network_type | Algorithm used in the Dueling Network | DuelingNetwork | DuelingNetwork.AVERAGE | |
| lstm_type | How LSTM is used | LstmType (original implementation) | LstmType.NONE | |
| lstm_units_num | Number of units in the LSTM layer | int | 512 | |
| lstm_ful_input_length | Number of inputs per training update | int | 4 | STATEFUL only |
batch_size The batch size used for mini-batch learning (for batches, see here (Keras official)). Increasing the batch size is said to improve learning efficiency and speed. However, unlike supervised learning, reinforcement learning has no fixed training dataset, so a larger batch size raises the cost of each update: one update converges further but takes longer, which reduces the number of opportunities to gather new experience. For that reason I think it is better not to make it too large. Also, the batch size should be a power of two (2^n).
optimizer The Keras Optimizer passed when compiling the NN model. For details, see the optimizer usage page (Keras official).
metrics The Keras evaluation functions (metrics). I have not really used them, so I cannot say much... For details, see the metrics usage page (Keras official).
input_sequence (formerly window_length) The number of observations used as input. With 1, only the latest frame is used; with 4, the latest 4 frames are used. Increasing this value increases the expressiveness of the input, but also the learning cost.
dense_units_num The number of units in the Dense layer. Increasing this value increases the expressiveness of the NN, but also the learning cost.
enable_dueling_network Whether to enable the Dueling Network. The Dueling Network aims to improve learning efficiency by having the NN learn the state value and the action advantages separately.
dueling_network_type The algorithm used to combine the state value and the action advantages in the Dueling Network. You can choose from the following three types; the paper reported that AVERAGE gave the best results (a small sketch of each rule follows the enum below).
DuelingNetwork
class DuelingNetwork(enum.Enum):
    AVERAGE = 0
    MAX = 1
    NAIVE = 2
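For reference, a minimal NumPy sketch of the three aggregation rules (just the arithmetic each option performs, not the actual network layer):

```python
import numpy as np

def dueling_q(v, advantages, mode):
    """Combine the state value v with per-action advantages into Q values."""
    a = np.asarray(advantages, dtype=float)
    if mode == "AVERAGE":   # Q = V + A - mean(A)
        return v + a - a.mean()
    if mode == "MAX":       # Q = V + A - max(A)
        return v + a - a.max()
    if mode == "NAIVE":     # Q = V + A
        return v + a
    raise ValueError(mode)

print(dueling_q(1.0, [0.5, 0.2, -0.1], "AVERAGE"))
```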
LstmType
class LstmType(enum.Enum):
    NONE = 0
    STATELESS = 1
    STATEFUL = 2
| Parameter | Overview | Type | Example | Remarks |
|---|---|---|---|---|
| memory / remote_memory | Memory to use | Memory (original implementation) | ReplayMemory(10000) | See below |
Specifies the type of memory used to store experiences. DQN first stores the experienced data in memory; experiences are then drawn from that memory for learning. There are several memory types that differ in how experiences are drawn, explained below.
ReplayMemory The simple memory used by DQN (previous article). Experiences are drawn uniformly at random.
ReplayMemory(
capacity=10_000
)
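Conceptually, ReplayMemory behaves like the following minimal sketch (the actual class follows the same idea, though the details may differ):

```python
import random
from collections import deque

class SimpleReplayMemory:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are dropped automatically

    def add(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size):
        # uniform random sampling is all ReplayMemory does
        return random.sample(list(self.buffer), batch_size)
```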
PERGreedyMemory A straightforward implementation of [Prioritized Experience Replay](https://qiita.com/pocokhc/items/fc00f8ea9dca8f8c0297#priority-experience-reply%E5%84%AA%E5%85%88%E9%A0%86%E4%BD%8D%E4%BB%98%E3%81%8D%E7%B5%8C%E9%A8%93%E5%86%8D%E7%94%9F). Instead of sampling at random, it always draws the experiences with the largest TD error (the ones that contribute most to learning). However, since there is no randomness, it seems to fall into local optima right away and does not learn well... (so why did I implement it?)
PERGreedyMemory(
capacity=10_000
)
PERProportionalMemory The Proportional Prioritization memory from [Prioritized Experience Replay](https://qiita.com/pocokhc/items/fc00f8ea9dca8f8c0297#priority-experience-reply%E5%84%AA%E5%85%88%E9%A0%86%E4%BD%8D%E4%BB%98%E3%81%8D%E7%B5%8C%E9%A8%93%E5%86%8D%E7%94%9F). Experiences are drawn according to a probability distribution over TD errors rather than uniformly at random (experiences with larger TD error are more likely to be drawn).
It feels much more efficient than ReplayMemory (uniform random selection).
PERProportionalMemory(
capacity=100000,
alpha=0.9,
beta_initial,
beta_steps,
enable_is,
)
The parameters will be described later.
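The core of Proportional Prioritization is that experience i is sampled with probability proportional to |TD error|^alpha; a hedged NumPy sketch (the function name is illustrative, not the library's API):

```python
import numpy as np

def proportional_sample(td_errors, batch_size, alpha=0.9):
    # priority_i = |TD error_i| ** alpha; alpha=0 -> uniform, alpha=1 -> fully proportional
    priorities = np.abs(np.asarray(td_errors, dtype=float)) ** alpha
    probs = priorities / priorities.sum()
    # experiences with larger TD error are drawn more often
    return np.random.choice(len(td_errors), size=batch_size, p=probs)

print(proportional_sample([0.1, 1.5, 0.7, 0.2], batch_size=2))
```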
PERRankBaseMemory The RankBase memory from [Prioritized Experience Replay](https://qiita.com/pocokhc/items/fc00f8ea9dca8f8c0297#priority-experience-reply%E5%84%AA%E5%85%88%E9%A0%86%E4%BD%8D%E4%BB%98%E3%81%8D%E7%B5%8C%E9%A8%93%E5%86%8D%E7%94%9F). Experiences are drawn with probability proportional to their rank by TD error rather than uniformly at random. For example, with three experiences, 1st place is drawn 50% of the time, 2nd place 33%, and 3rd place 17%.
It feels much more efficient than ReplayMemory (uniform random selection), though I do not really understand the practical difference from Proportional. This one should be a little faster...
PERRankBaseMemory(
capacity=100000,
alpha=0.9,
beta_initial,
beta_steps,
enable_is,
)
The parameters will be described later.
| Parameter | Overview | Type | Example | Remarks |
|---|---|---|---|---|
| capacity | Maximum number of experiences stored in memory | int | 1_000_000 | |
| alpha | Probability reflection rate | float | 0.9 | 0.0~1.0 |
| beta_initial | Initial value of the IS reflection rate | float | 0.0 | 0.0~1.0 |
| beta_steps | Number of steps until the IS reflection rate reaches 1.0 | int | 100_000 | Depends on the number of training steps |
| enable_is | Whether to enable IS | bool | True | |
capacity The maximum number of experiences stored in memory.
alpha The reflection rate of the Proportional / RankBase priority (0.0–1.0). 0.0 is completely random (equivalent to ReplayMemory), and 1.0 follows the priority distribution exactly.
Here, a word about Importance Sampling (IS) (https://qiita.com/pocokhc/items/fc00f8ea9dca8f8c0297#%E9%87%8D%E8%A6%81%E5%BA%A6%E3%82%B5%E3%83%B3%E3%83%97%E3%83%AA%E3%83%B3%E3%82%B0is-importance-sampling). When experiences are drawn according to a probability distribution, the number of times each experience is selected becomes biased, and that bias carries over into learning. Importance Sampling corrects this bias.
Specifically, experiences that are selected with high probability get a lower reflection rate when the Q value is updated, and experiences selected with low probability get a higher reflection rate.
Introducing IS is supposed to stabilize learning. In addition, IS is annealed (its influence is increased gradually).
beta_initial The initial value of the IS reflection rate (0.0 means IS is not applied; 1.0 means IS is fully applied).
beta_steps The number of steps it takes for the IS reflection rate to reach 1.0. Set it based on the number of training steps.
enable_is Whether to enable IS.
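A hedged sketch of how the IS weight and the annealed β fit together (the standard PER formulation; the actual implementation may differ in details such as normalization):

```python
import numpy as np

def is_weights(sample_probs, memory_size, step, beta_initial=0.4, beta_steps=100_000):
    # beta is annealed linearly from beta_initial to 1.0 over beta_steps
    beta = beta_initial + (1.0 - beta_initial) * min(1.0, step / beta_steps)
    # w_i = (N * P(i)) ** (-beta); frequently sampled experiences get smaller weights
    weights = (memory_size * np.asarray(sample_probs, dtype=float)) ** (-beta)
    return weights / weights.max()  # normalize so the largest weight is 1
```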
| Parameter | Overview | Type | Example | Remarks |
|---|---|---|---|---|
| memory_warmup_size / remote_memory_warmup_size | Number of experiences to accumulate in memory before learning starts | int | 1000 | |
| target_model_update | Update interval of the Target model | int | 10000 | |
| gamma | Q-learning discount rate | float | 0.99 | 0.0~1.0 |
| enable_double_dqn | Whether to use DoubleDQN | bool | True | |
| enable_rescaling | Whether to use the rescaling function | bool | True | |
| rescaling_epsilon | Constant used in the rescaling function | float | 0.001 | |
| priority_exponent | Ratio used when computing experience priority | float | 0.9 | STATEFUL only |
| burnin_length | Burn-in period | int | 2 | STATEFUL only |
| reward_multisteps | Number of steps for the Multi-Step reward | int | 3 | |
memory_warmup_size / remote_memory_warmup_size In the initial state the memory contains no experience, so learning cannot start. This parameter sets a period during which no learning happens until enough experience has accumulated in memory. A value at least as large as batch_size, and not too small, is probably good (if it is too small and the early experience data is biased, you may fall into a local optimum).
target_model_update The update interval of the Target Network in DQN. DQN computes its update targets with a dedicated Q network called the Target Network. The Target Network is not trained; it is a copy of the current Q network taken at regular intervals. This creates a time lag in the network used for the targets, which makes the updates more stable.
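In Keras terms this is just a periodic weight copy, roughly like the following sketch (the variable names are illustrative):

```python
def maybe_update_target(model, target_model, train_count, target_model_update):
    # Copy the online network into the Target Network every
    # target_model_update training steps (Keras models provide get_weights/set_weights).
    if train_count % target_model_update == 0:
        target_model.set_weights(model.get_weights())
```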
gamma The [Q-learning discount rate](https://qiita.com/pocokhc/items/8ed40be84a144b28180d#q%E5%AD%A6%E7%BF%92%E3%81%AB%E3%81%A4%E3%81%84%E3%81%A6). It specifies how far rewards are propagated back. A value close to 1.0 should be fine.
enable_double_dqn Plain DQN builds its learning target from the maximum Q value, which can be overestimated due to noise and the like; DoubleDQN is a method proposed to counter this. My impression is that learning efficiency improves when DoubleDQN is enabled.
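The difference from plain DQN is only in how the learning target is built; a hedged sketch:

```python
import numpy as np

def dqn_target(reward, next_q_target, gamma):
    # plain DQN: the Target Network both selects and evaluates the next action
    return reward + gamma * np.max(next_q_target)

def double_dqn_target(reward, next_q_online, next_q_target, gamma):
    # DoubleDQN: the online network selects the action, the Target Network evaluates it
    best_action = np.argmax(next_q_online)
    return reward + gamma * next_q_target[best_action]
```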
enable_rescaling Specifies whether to use the rescaling function (https://qiita.com/pocokhc/items/408f0f818140924ad4c4#rescaling-%E9%96%A2%E6%95%B0) on the reward. With the rescaling function the reward is squashed to some extent, which suppresses the learning instability caused by the reward scale.
rescaling_epsilon A constant used in the rescaling function. It seems to be there to keep the function from collapsing to 0, and I think a value close to 0 is fine (0.001 is the value used in the paper).
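For reference, the rescaling function as given in the R2D2 paper, with rescaling_epsilon as the ε term (a sketch, not necessarily the exact code in this repository):

```python
import numpy as np

def rescaling(x, epsilon=0.001):
    # h(x) = sign(x) * (sqrt(|x| + 1) - 1) + epsilon * x
    return np.sign(x) * (np.sqrt(np.abs(x) + 1.0) - 1.0) + epsilon * x
```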
priority_exponent Used in the [Priority (experience priority) calculation](https://qiita.com/pocokhc/items/3b64d747a2f36da559c3#priority%E3%81%AE%E8%A8%88%E7%AE%97%E6%96%B9%E6%B3%95) of R2D2-style stateful LSTM learning (LSTMFUL). LSTMFUL determines the final Priority of a sequence from multiple per-step Priorities: priority = priority_exponent * max(priorities) + (1 - priority_exponent) * mean(priorities). With 0.9, this becomes 0.9 * max + 0.1 * mean. The paper reported that around 0.9 gave good results.
burnin_length The burn-in length used in R2D2-style LSTM learning (LSTMFUL). Roughly speaking, in LSTMFUL the hidden state saved with the experience data (the past state) differs from the state the current network would produce. Burn-in therefore feeds part of the experience sequence through the network without learning, so the hidden state catches up before training starts. Increasing burnin_length makes learning more accurate, but increases the learning cost.
reward_multisteps The number of steps for Multi-Step learning. Instead of the usual 1-step reward, the n-step reward is used; the intuition is that learning takes a bit of the future reward into account. 3 steps is the value used in the paper (a sketch of the n-step return follows).
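As a sketch (assuming the standard n-step return; the learning target then adds γ^n times the bootstrapped Q value of the state n steps ahead):

```python
def multistep_reward(rewards, gamma):
    # n-step discounted return: r_t + gamma*r_{t+1} + ... + gamma^(n-1)*r_{t+n-1}
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# with reward_multisteps=3 and gamma=0.99:
# multistep_reward([1.0, 0.0, 2.0], 0.99)  ->  1.0 + 0.0 + 0.99**2 * 2.0
```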
| Parameter | Overview | Type | Example | Remarks |
|---|---|---|---|---|
| action_interval | Action execution interval | int | 1 | 1 or more |
| action_policy | Policy used when selecting actions | Policy (original implementation) | | See below |
action_interval The update interval of actions. For example, with a value of 4 the action is updated every 4 frames (the same action is repeated until the next update).
action_policy Specifies the policy used to select actions. For details on each policy, see the previous article.
ε-greedy ε-greedy draws a random number in [0.0, 1.0); if it is smaller than ε, a random action is taken, otherwise the action with the largest Q value is chosen.
EpsilonGreedy(
epsilon
)
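In code the selection rule is roughly (a sketch, not the library's class):

```python
import random
import numpy as np

def epsilon_greedy_action(q_values, epsilon):
    # with probability epsilon explore randomly, otherwise take the best-known action
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return int(np.argmax(q_values))
```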
ε-greedy(Annealing) The method used in [DQN](https://qiita.com/pocokhc/items/125479c9ae0df1de4234#%E3%82%A2%E3%82%AF%E3%82%B7%E3%83%A7%E3%83%B3%E3%81%AE%E6%B1%BA%E5%AE%9A). It lowers the ε of ε-greedy as learning progresses, so actions follow the Q values more and more.
AnnealingEpsilonGreedy(
initial_epsilon=1,
final_epsilon=0.1,
exploration_steps=1_000_000
)
initial_epsilon The initial ε.
final_epsilon The ε in the final state.
exploration_steps The number of steps over which ε moves from the initial value to the final value.
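Assuming linear annealing, which is what the parameter names suggest, ε at a given step is roughly:

```python
def annealed_epsilon(step, initial_epsilon=1.0, final_epsilon=0.1, exploration_steps=1_000_000):
    # decrease epsilon linearly from initial_epsilon to final_epsilon, then hold it
    ratio = min(1.0, step / exploration_steps)
    return initial_epsilon + (final_epsilon - initial_epsilon) * ratio
```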
ε-greedy(Actor) The method used in Ape-X. Each actor's ε for ε-greedy is calculated from the number of actors.
EpsilonGreedyActor(
actor_index,
actors_length,
epsilon=0.4,
alpha=7
)
actor_index Specify the index of the actor.
actors_length The total number of actors.
epsilon The base ε.
alpha A constant used in the calculation.
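The Ape-X paper assigns each actor its own ε via ε_i = ε^(1 + α·i/(N−1)); assuming this implementation follows that formula, a sketch:

```python
def actor_epsilon(actor_index, actors_length, epsilon=0.4, alpha=7):
    # epsilon_i = epsilon ** (1 + (i / (N - 1)) * alpha)
    # actor 0 gets the largest epsilon (explores the most), the last actor the smallest
    if actors_length <= 1:
        return epsilon
    return epsilon ** (1 + (actor_index / (actors_length - 1)) * alpha)
```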
Softmax A method that chooses actions according to the probability distribution given by the Softmax of the Q values. In short, the higher the Q value, the more likely the action is to be chosen, and the lower the Q value, the less likely.
SoftmaxPolicy()
There are no arguments.
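A hedged sketch of the selection rule:

```python
import numpy as np

def softmax_action(q_values):
    q = np.asarray(q_values, dtype=float)
    exp_q = np.exp(q - q.max())      # subtract the max for numerical stability
    probs = exp_q / exp_q.sum()      # softmax probability distribution over actions
    return np.random.choice(len(q), p=probs)
```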
UCB1 (Upper Confidence Bound 1) UCB1 selects actions by considering not only the Q value but also how many times each action has been selected. The idea is that rarely selected actions have not been explored much and may hide unknown rewards, so they are worth trying.
UCB1()
There are no arguments. In addition, the learning cost increases because the NN model is held and trained inside.
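Ignoring the NN part mentioned above, the classic UCB1 score is the Q value plus an exploration bonus that shrinks as an action is tried more often; a sketch:

```python
import numpy as np

def ucb1_action(q_values, action_counts):
    counts = np.asarray(action_counts, dtype=float)
    if np.any(counts == 0):
        return int(np.argmax(counts == 0))  # try every action at least once
    # exploration bonus shrinks as an action is selected more often
    bonus = np.sqrt(2.0 * np.log(counts.sum()) / counts)
    return int(np.argmax(np.asarray(q_values, dtype=float) + bonus))
```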
UCB1-Tuned UCB1-Tuned is an improved version of UCB1 that also takes the variance into account. It gives better results than UCB1, but there is no theoretical guarantee.
UCB1_Tuned()
There are no arguments. In addition, the learning cost increases because the NN model is held and trained inside.
UCB-V An algorithm that takes the variance into account even more than UCB1-Tuned.
UCBv()
There are no arguments. In addition, the learning cost increases because the NN model is held and trained inside.
KL-UCB An algorithm that aims for the theoretical optimum of the exploration-exploitation dilemma. However, my implementation may be a little off...
KL_UCB()
There are no arguments. In addition, the learning cost increases because the NN model is held and trained inside.
Thompson Sampling (Beta distribution) Thompson Sampling is an algorithm based on Bayesian inference, and it is also theoretically optimal for the exploration-exploitation dilemma.
The Beta distribution applies when the outcome is binary (0 or 1). In this implementation, a reward greater than 0 is treated as 1, and a reward of 0 or less is treated as 0.
ThompsonSamplingBeta()
There are no arguments. In addition, the learning cost increases because the NN model is held and trained inside.
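Ignoring the NN part mentioned above, the bandit form of the idea keeps success/failure counts per action and samples from a Beta distribution; a sketch:

```python
import numpy as np

def thompson_beta_action(successes, failures):
    # one Beta(successes+1, failures+1) sample per action; act greedily on the samples
    samples = [np.random.beta(s + 1, f + 1) for s, f in zip(successes, failures)]
    return int(np.argmax(samples))
```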
Thompson Sampling (normal distribution) The same Bayesian Thompson Sampling idea, applied under the assumption that the reward follows a normal distribution.
ThompsonSamplingGaussian()
There are no arguments. In addition, the learning cost increases because the NN model is held and trained inside.
| Parameter | Overview | Type | Example | Remarks |
|---|---|---|---|---|
| train_interval | Training interval | int | 1 | 1 or more |
train_interval Training is performed once every train_interval steps; increasing it widens the interval between learning updates.
| Parameter | Overview | Type | Example | Remarks |
|---|---|---|---|---|
| actors | Actor classes to use | Actor (original implementation) | | See below |
| actor_model_sync_interval | Interval at which the NN model is synchronized from the Learner | int | 500 | |
Actor A class (original implementation) that represents an Actor. Inherit from it and define the policy each Actor uses and a fit method that runs the environment.
Here is a definition example.
import gym

from src.r2d2 import Actor
from src.policy import EpsilonGreedy

ENV_NAME = "xxx"

class MyActor(Actor):
    def getPolicy(self, actor_index, actor_num):
        return EpsilonGreedy(0.1)

    def fit(self, index, agent):
        env = gym.make(ENV_NAME)
        agent.fit(env, visualize=False, verbose=0)
        env.close()
getPolicy specifies the action policy that actor uses. In fit, call fit on the agent passed as an argument to run the learning loop.
Be careful when passing the Actor to R2D2: pass the class itself, do not instantiate it.
from src.r2d2 import R2D2
kwargs = {
"actors": [MyActor] #Pass the class itself
(abridgement)
}
manager = R2D2(**kwargs)
If you want to increase the number of Actors, increase the number of elements in the array.
Example of 4 Actors
from src.r2d2 import R2D2
kwargs = {
"actors": [MyActor, MyActor, MyActor, MyActor]
(abridgement)
}
manager = R2D2(**kwargs)
MovieLogger(Rainbow/R2D2)
Callback that outputs a video. It can be used with both Rainbow and R2D2.
from src.callbacks import MovieLogger

# Add it to the callbacks argument of test.
movie = MovieLogger()
agent.test(env, nb_episodes=1, visualize=False, callbacks=[movie])

# Save the result.
movie.save(
    start_frame=0,
    end_frame=0,
    gifname="pendulum.gif",
    mp4name="",
    interval=200,
    fps=30
)
start_frame Specify the start frame.
end_frame Specify the end frame. If it is 0, all frames will be targeted until the end.
gifname This is the path when outputting in gif format. Save as matplotlib animation. If it is "", it will not be output.
mp4name This is the path when outputting in mp4 format. Save as matplotlib animation. If it is "", it will not be output.
interval An interval to pass to FuncAnimation in matplotlib.
fps Fps when saving a video with matplotlib.
・ Output example (animation omitted)
ConvLayerView(Rainbow/R2D2)
A callback that visualizes the Conv layers, the Advantage layer, and the Value layer, introduced in the previous article. It can be used with both Rainbow and R2D2.
from src.callbacks import ConvLayerView
#Specify the agent in the initialization.
conv = ConvLayerView(agent)
#Perform a test.
#Specify ConvLayerView object in callbacks argument
agent.test(env, nb_episodes=1, visualize=False, callbacks=[conv])
#Save the result.
conv.save(
grad_cam_layers=["conv_1", "conv_2", "conv_3"],
add_adv_layer=True,
add_val_layer=True,
start_frame=0,
end_frame=200,
gifname="tmp/pendulum.gif",
interval=200,
fps=10,
)
grad_cam_layers Specify the target Conv layer. The name will be the name specified in ImageModel.
add_adv_layer Whether to add the Advantage layer.
add_val_layer Whether to add the Value layer.
start_frame Specify the start frame.
end_frame Specify the end frame. If it is 0, all frames will be targeted until the end.
gifname This is the path when outputting in gif format. Save as matplotlib animation. If it is "", it will not be output.
mp4name This is the path when outputting in mp4 format. Save as matplotlib animation. If it is "", it will not be output.
interval An interval to pass to FuncAnimation in matplotlib.
fps Fps when saving a video with matplotlib.
Also, ConvLayerView works only when the input is an image (InputType is GRAY_2ch, GRAY_3ch, COLOR).
・ Output example (animation omitted)
Logger2Stage(Rainbow) A callback that records logs at two-stage intervals and can also evaluate a separate test agent during training.
from src.rainbow import Rainbow
from src.callbacks import Logger2Stage
#Create a separate agent and env for testing
kwargs = (abridgement)
test_agent = Rainbow(**kwargs)
test_env = gym.make(ENV_NAME)
#various settings
log = Logger2Stage(
logger_type=LoggerType.STEP,
warmup=1000,
interval1=200,
interval2=20_000,
change_count=5,
savefile="tmp/log.json",
test_agent=test_agent,
test_env=test_env,
test_episodes=10
)
#Add to callbacks when learning
#Logger2Stage outputs the log, so verbose=0
agent.fit(env, nb_steps=1_750_000, visualize=False, verbose=0, callbacks=[log])
#You can get the logs with the getLogs function(You must specify savefile)
history = log.getLogs()
#It's simple, but you can also output a graph(You must specify savefile)
log.drawGraph()
logger_type The log recording format. LoggerType.TIME records by elapsed time, LoggerType.STEP by the number of steps.
warmup No logs are recorded during the initial warmup period. For LoggerType.TIME this is in seconds, for LoggerType.STEP in steps.
interval1 This is the first log acquisition interval. LoggerType.TIME is the number of seconds, and LoggerType.STEP is the number of steps.
interval2 This is the second stage log acquisition interval. LoggerType.TIME is the number of seconds, and LoggerType.STEP is the number of steps.
change_count The number of transitions from the first stage to the second stage. When the first stage gets this number of logs, it moves to the second stage.
savefile This is the file that saves the log.
test_agent Specify if you want to test separately from the learning environment. If None, only the result of the learning environment will be output.
test_env Specify if you want to test separately from the learning environment. If None, only the result of the learning environment will be output.
test_episodes The number of episodes in the test environment.
・ Output example
--- start ---
'Ctrl + C' is stop.
Steps 0, Time: 0.00m, TestReward: 21.12 - 92.80 (ave: 51.73, med: 46.99), Reward: 0.00 - 0.00 (ave: 0.00, med: 0.00)
Steps 200, Time: 0.05m, TestReward: 22.06 - 99.94 (ave: 43.85, med: 31.24), Reward: 108.30 - 108.30 (ave: 108.30, med: 108.30)
Steps 1200, Time: 0.28m, TestReward: 40.99 - 73.88 (ave: 52.41, med: 47.69), Reward: 49.05 - 141.53 (ave: 87.85, med: 90.89)
(abridgement)
Steps 17200, Time: 3.95m, TestReward: 167.68 - 199.49 (ave: 184.34, med: 188.30), Reward: 166.29 - 199.66 (ave: 181.79, med: 177.36)
Steps 18200, Time: 4.19m, TestReward: 165.84 - 199.53 (ave: 186.16, med: 188.50), Reward: 188.00 - 199.50 (ave: 190.64, med: 188.41)
Steps 19200, Time: 4.43m, TestReward: 163.63 - 188.93 (ave: 186.15, med: 188.59), Reward: 165.56 - 188.45 (ave: 183.75, med: 188.23)
done, took 4.626 minutes
Steps 0, Time: 4.63m, TestReward: 188.37 - 199.66 (ave: 190.83, med: 188.68), Reward: 188.34 - 188.83 (ave: 188.63, med: 188.67)
SaveManager(R2D2) R2D2 uses multiprocessing, so its implementation is rather particular. Saving and loading the model is affected the most, so a separate mechanism is provided for it.
from src.r2d2 import R2D2
from src.r2d2_callbacks import SaveManager
#Creating R2D2
kwargs = (abridgement)
manager = R2D2(**kwargs)
#Creating a SaveManager
save_manager = SaveManager(
save_dirpath="tmp",
is_load=False,
save_overwrite=True,
save_memory=True,
checkpoint=True,
checkpoint_interval=2000,
verbose=0
)
#Start learning, add to callbacks argument.
manager.train(
nb_trains=20_000,
callbacks=[save_manager],
)
# Call the following to create an agent for testing.
# Specify save_dirpath/last/learner.dat.
agent = manager.createTestAgent(MyActor, "tmp/last/learner.dat")
#Conduct a test.
agent.test(env, nb_episodes=5, visualize=True)
save_dirpath The directory to save the results. Since a directory for checkpoints will be created under the directory, it is in the directory format.
is_load Whether to load previous learning results
save_overwrite Whether to overwrite the saved result
save_memory Whether to also save the contents of the Replay Memory. If you save it, you can resume learning from exactly the same state as before, but the memory file is large (several GB). It is saved separately as a .mem file, so it can be deleted later.
checkpoint Whether to save the progress
checkpoint_interval The interval for saving progress. The unit is the number of Learner training steps.
verbose If 0, nothing is printed; if 1, progress is printed.
Logger2Stage(R2D2)
It provides the same two functions as the Rainbow version.
Unlike the Rainbow version, the log interval can only be specified in time (seconds).
from src.r2d2 import R2D2
from src.r2d2_callbacks import Logger2Stage
#Creating R2D2
kwargs = (abridgement)
manager = R2D2(**kwargs)
#Create env for testing
test_env = gym.make(ENV_NAME)
# Create Logger2Stage
log = Logger2Stage(
warmup=0,
interval1=10,
interval2=60,
change_count=20,
savedir="tmp",
test_actor=MyActor,
test_env=test_env,
test_episodes=10,
verbose=1,
)
#Start learning, add to callbacks argument.
manager.train(
nb_trains=20_000,
callbacks=[log],
)
#You can get the logs with getLogs.(If savedir is specified)
history = log.getLogs()
#You can also easily display a graph.(If savedir is specified)
log.drawGraph()
warmup The time before the first log is recorded (seconds).
interval1 This is the first log acquisition interval. (Seconds)
interval2 This is the second stage log acquisition interval. (Seconds)
change_count The number of transitions from the first stage to the second stage. When the first stage gets this number of logs, it moves to the second stage.
savedir The directory where logs are saved. The Learner and each Actor run in separate processes, and each process writes to its own file to avoid conflicts.
test_actor Specifies the Actor class to use during testing. If None, no test will be performed.
test_env Specify if you want to test separately from the learning environment. If None, no test will be performed.
test_episodes The number of episodes in the test environment.
・ Output example
--- start ---
'Ctrl + C' is stop.
Learner Start!
Actor0 Start!
Actor1 Start!
actor1 Train 1, Time: 0.24m, Reward : 27.80 - 27.80 (ave: 27.80, med: 27.80), nb_steps: 200
learner Train 1, Time: 0.19m, TestReward: 29.79 - 76.71 (ave: 58.99, med: 57.61)
actor0 Train 575, Time: 0.35m, Reward : 24.88 - 133.09 (ave: 62.14, med: 50.83), nb_steps: 3400
learner Train 651, Time: 0.36m, TestReward: 24.98 - 51.67 (ave: 38.86, med: 38.11)
actor1 Train 651, Time: 0.41m, Reward : 22.15 - 88.59 (ave: 41.14, med: 35.62), nb_steps: 3200
actor0 Train 1249, Time: 0.51m, Reward : 22.97 - 61.41 (ave: 35.24, med: 31.99), nb_steps: 8000
(abridgement)
learner Train 16476, Time: 4.53m, TestReward: 165.56 - 199.57 (ave: 180.52, med: 177.73)
actor1 Train 16880, Time: 4.67m, Reward : 128.88 - 188.45 (ave: 169.13, med: 165.94), nb_steps: 117600
Learning End. Train Count:20001
learner Train 20001, Time: 5.29m, TestReward: 175.72 - 188.17 (ave: 183.21, med: 187.48)
Actor0 End!
Actor1 End!
actor0 Train 20001, Time: 5.34m, Reward : 151.92 - 199.61 (ave: 181.68, med: 187.48), nb_steps: 0
actor1 Train 20001, Time: 5.34m, Reward : 130.39 - 199.26 (ave: 170.83, med: 167.99), nb_steps: 0
done, took 5.350 minutes
from src.rainbow import Rainbow
from src.processor import AtariProcessor
from src.image_model import DQNImageModel
from src.memory import ReplayMemory
from src.policy import AnnealingEpsilonGreedy
nb_steps = 1_750_000
#What AtariProcessor does
#・ Resize image(84,84)
#・ Reward clipping
processor = AtariProcessor(reshape_size=(84, 84), is_clip=True)
kwargs={
"input_shape": processor.image_shape,
"input_type": InputType.GRAY_2ch,
"nb_actions": env.action_space.n,
"optimizer": Adam(lr=0.0001),
"metrics": [],
"image_model": DQNImageModel(),
"input_sequence": 4, #Number of input frames
"dense_units_num": 256, #Number of units in the dense layer
"enable_dueling_network": False,
"lstm_type": LstmType.NONE, #LSTM algorithm to use
# train/action related
"memory_warmup_size": 50_000, #Number of steps for initial memory allocation(Don't learn)
"target_model_update": 10_000, #target network update interval
"action_interval": 4, #Interval to perform action
"train_interval": 4, #Interval to learn
"batch_size": 32, # batch_size
"gamma": 0.99, #Q-learning discount rate
"enable_double_dqn": False,
"enable_rescaling": False, #Whether to enable rescaling
"reward_multisteps": 1, # multistep reward
#Other
"processor": processor,
"action_policy": AnnealingEpsilonGreedy(
initial_epsilon=1.0, #Initial ε
final_epsilon=0.05, # ε in the final state
exploration_steps=1_000_000 #Number of steps from initial to final state
),
"memory": ReplayMemory(capacity=1_000_000),
}
agent = Rainbow(**kwargs)
from src.rainbow import Rainbow
from src.memory import ReplayMemory
from src.policy import SoftmaxPolicy
env = gym.make('CartPole-v0')
kwargs={
"input_shape": env.observation_space.shape,
"input_type": InputType.VALUES,
"nb_actions": env.action_space.n,
"optimizer": Adam(lr=0.0001),
"metrics": [],
"image_model": None,
"input_sequence": 1, #Number of input frames
"dense_units_num": 16, #Number of units in the dense layer
"enable_dueling_network": False,
"lstm_type": LstmType.NONE,
# train/action related
"memory_warmup_size": 10, #Number of steps for initial memory allocation(Don't learn)
"target_model_update": 1, #target network update interval
"action_interval": 1, #Interval to perform action
"train_interval": 1, #Interval to learn
"batch_size": 32, # batch_size
"gamma": 0.99, #Q-learning discount rate
"enable_double_dqn": False,
"enable_rescaling": False,
#Other
"processor": processor,
"action_policy": SoftmaxPolicy(),
"memory": ReplayMemory(capacity=50000)
}
agent = Rainbow(**kwargs)
from src.rainbow import Rainbow
from src.processor import AtariProcessor
from src.image_model import DQNImageModel
from src.memory import PERProportionalMemory
from src.policy import AnnealingEpsilonGreedy
nb_steps = 1_750_000
#What AtariProcessor does
#・ Resize image(84,84)
#・ Reward clipping
processor = AtariProcessor(reshape_size=(84, 84), is_clip=True)
kwargs={
"input_shape": processor.image_shape,
"input_type": InputType.GRAY_2ch,
"nb_actions": env.action_space.n,
"optimizer": Adam(lr=0.0000625, epsilon=0.00015),
"metrics": [],
"image_model": DQNImageModel(),
"input_sequence": 4, #Number of input frames
"dense_units_num": 512, #Number of units in the dense layer
"enable_dueling_network": True,
"dueling_network_type": DuelingNetwork.AVERAGE, #Algorithm used in dueling network
"lstm_type": LstmType.NONE,
# train/action related
"memory_warmup_size": 80000, #Number of steps for initial memory allocation(Don't learn)
"target_model_update": 32000, #target network update interval
"action_interval": 4, #Interval to perform action
"train_interval": 4, #Interval to learn
"batch_size": 32, # batch_size
"gamma": 0.99, #Q-learning discount rate
"enable_double_dqn": True,
"enable_rescaling": False,
"reward_multisteps": 3, # multistep reward
#Other
"processor": processor,
"action_policy": AnnealingEpsilonGreedy(
initial_epsilon=1.0, #Initial ε
final_epsilon=0.05, # ε in the final state
exploration_steps=1_000_000 #Number of steps from initial to final state
),
"memory": PERProportionalMemory(
capacity=1_000_000,
alpha=0.5, #Probability reflection rate of PER
beta_initial=0.4, #Initial value of IS reflection rate
beta_steps=1_000_000, #Number of steps to increase IS reflection rate
enable_is=True, #Whether to enable IS
)
}
agent = Rainbow(**kwargs)
from src.r2d2 import R2D2, Actor
from src.processor import AtariProcessor
from src.image_model import DQNImageModel
from src.memory import PERProportionalMemory
from src.policy import EpsilonGreedyActor
ENV_NAME = "xxxxx"
class MyActor(Actor):
def getPolicy(self, actor_index, actor_num):
return EpsilonGreedyActor(actor_index, actor_num, epsilon=0.4, alpha=7)
def fit(self, index, agent):
env = gym.make(ENV_NAME)
agent.fit(env, visualize=False, verbose=0)
env.close()
#What AtariProcessor does
#・ Resize image(84,84)
#・ Reward clipping
processor = AtariProcessor(reshape_size=(84, 84), is_clip=True)
kwargs={
"input_shape": processor.image_shape,
"input_type": InputType.GRAY_2ch,
"nb_actions": env.action_space.n,
"optimizer": Adam(lr=0.0001, epsilon=0.001),
"metrics": [],
"image_model": DQNImageModel(),
"input_sequence": 4, #Number of input frames
"dense_units_num": 512, #Number of units in the Dense layer
"enable_dueling_network": True, # dueling_network valid flag
"dueling_network_type": DuelingNetwork.AVERAGE, # dueling_network algorithm
"lstm_type": LstmType.STATEFUL, #LSTM algorithm
"lstm_units_num": 512, #Number of LSTM layer units
"lstm_ful_input_length": 40, #Stateful LSTM inputs
# train/action related
"remote_memory_warmup_size": 50_000, #Number of steps for initial memory allocation(Don't learn)
"target_model_update": 10_000, #target network update interval
"action_interval": 4, #Interval to perform action
"batch_size": 64,
"gamma": 0.997, #Q-learning discount rate
"enable_double_dqn": True, #DDQN valid flag
"enable_rescaling": enable_rescaling, #Whether to enable rescaling(priotrity)
"rescaling_epsilon": 0.001, #rescaling constant
"priority_exponent": 0.9, #priority priority
"burnin_length": 40, # burn-in period
"reward_multisteps": 3, # multistep reward
#Other
"processor": processor,
"actors": [MyActor for _ in range(256)],
"remote_memory": PERProportionalMemory(
capacity= 1_000_000,
alpha=0.6, #Probability reflection rate of PER
beta_initial=0.4, #Initial value of IS reflection rate
beta_steps=1_000_000, #Number of steps to increase IS reflection rate
enable_is=True, #Whether to enable IS
),
#actor relationship
"actor_model_sync_interval": 400, #Interval to synchronize model from learner
}
manager = R2D2(**kwargs)
There are too many parameters... Apparently something called R2D3 has been announced, so I would like to implement that next.