This article explains the hyperparameters; each parameter is summarized below.
For the algorithm itself, see the companion article: [Reinforcement learning] R2D2 implementation / explanation revenge commentary (Keras-RL).
The code covered in this article is on GitHub.
These parameters are common to Rainbow (DQN) and R2D2.
| Parameter | Overview | Type | Example | Remarks |
|---|---|---|---|---|
| input_shape | Input shape | tuple | (84,84) | env.observation_space.shape |
| input_type | Input format specification | InputType | InputType.GRAY_2ch | Original implementation |
| image_model | Image layer model | ImageModel (original implementation) | DQNImageModel() | |
| nb_actions | Number of actions (number of outputs) | int | 4 | env.action_space.n |
| processor | Class that provides custom Gym functionality | Processor (Keras-rl) | None | |
input_shape The input shape, specified as a tuple. For an image it is (width, height). With Gym it is the shape returned by env.observation_space.shape.
input_type Supplements input_shape above (an original implementation). The following four types are available; choose the one that matches the contents of input_shape.
InputType
class InputType(enum.Enum):
    VALUES = 1    # not an image
    GRAY_2ch = 3  # (width, height)
    GRAY_3ch = 4  # (width, height, 1)
    COLOR = 5     # (width, height, ch)
image_model This is explained in the previous article. At present there are only two options: specify None if the input is not an image, and DQNImageModel() if it is.
nb_actions The number of outputs, specified as an int. It corresponds to the number of actions the agent can choose from; for example, with the three operations left, right, and stop, there are 3 actions. With Gym you can get it from env.action_space.n (Discrete spaces only).
processor A class that customizes the Env provided by Gym (see processor in the Keras-rl documentation).
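For example, with a Gym environment, input_shape and nb_actions come straight from the env itself; a minimal sketch using the classic CartPole-v0:

```python
import gym

env = gym.make("CartPole-v0")

print(env.observation_space.shape)  # (4,) -> use as input_shape (not an image, so InputType.VALUES)
print(env.action_space.n)           # 2   -> use as nb_actions (Discrete action space)
```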
| Parameter | Overview | Type | Example | Remarks |
|---|---|---|---|---|
| batch_size | Batch size | int | 32 | |
| optimizer | Optimization algorithm | Optimizer (Keras) | Adam(lr=0.0001) | Keras implementation |
| metrics | Evaluation functions | array | [] | Keras implementation |
| input_sequence | Number of input frames | int | 4 | |
| dense_units_num | Number of units in the Dense layer | int | 512 | |
| enable_dueling_network | Whether to use Dueling Network | bool | True | |
| dueling_network_type | Algorithm used in the Dueling Network | DuelingNetwork | DuelingNetwork.AVERAGE | |
| lstm_type | How LSTM is used | LstmType (original implementation) | LstmType.NONE | |
| lstm_units_num | Number of units in the LSTM layer | int | 512 | |
| lstm_ful_input_length | Number of inputs per training update | int | 4 | STATEFUL only |
batch_size The batch size used for mini-batch learning (for batches, see here (Keras official)). Increasing the batch size is said to improve learning efficiency and speed. However, unlike supervised learning, reinforcement learning has no fixed training dataset, so a larger batch size raises the cost of each update: one update converges further but takes longer, which reduces the number of opportunities to gather new experience. For that reason I think it is better not to make it too large. Also, the batch size should be a power of two (2^n).
optimizer The Keras Optimizer passed when compiling the NN model. For details, see the optimizer usage page (Keras official).
metrics The Keras evaluation functions (metrics). I have not really used them, so I cannot say much... For details, see the metrics usage page (Keras official).
input_sequence (formerly window_length) The number of observations used as input. With 1, only the latest frame is used; with 4, the latest 4 frames are used. Increasing this value increases the expressiveness of the input, but also the learning cost.
dense_units_num The number of units in the Dense layer. Increasing this value increases the expressiveness of the NN, but also the learning cost.
enable_dueling_network Whether to enable the Dueling Network. The Dueling Network aims to improve learning efficiency by having the NN learn the state value and the action advantages separately.
dueling_network_type The algorithm used to combine the state value and the action advantages in the Dueling Network. You can choose from the following three types; the paper reported that AVERAGE gave the best results (a small sketch of each rule follows the enum below).
DuelingNetwork
class DuelingNetwork(enum.Enum):
    AVERAGE = 0
    MAX = 1
    NAIVE = 2
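For reference, a minimal NumPy sketch of the three aggregation rules (just the arithmetic each option performs, not the actual network layer):

```python
import numpy as np

def dueling_q(v, advantages, mode):
    """Combine the state value v with per-action advantages into Q values."""
    a = np.asarray(advantages, dtype=float)
    if mode == "AVERAGE":   # Q = V + A - mean(A)
        return v + a - a.mean()
    if mode == "MAX":       # Q = V + A - max(A)
        return v + a - a.max()
    if mode == "NAIVE":     # Q = V + A
        return v + a
    raise ValueError(mode)

print(dueling_q(1.0, [0.5, 0.2, -0.1], "AVERAGE"))
```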
LstmType
class LstmType(enum.Enum):
    NONE = 0
    STATELESS = 1
    STATEFUL = 2
| Parameter | Overview | Type | Example | Remarks |
|---|---|---|---|---|
| memory / remote_memory | Memory to use | Memory (original implementation) | ReplayMemory(10000) | See below |
Specifies the type of memory used to store experiences. DQN first stores the experienced data in memory; experiences are then drawn from that memory for learning. There are several memory types that differ in how experiences are drawn, explained below.
ReplayMemory The simple memory used by DQN (previous article). Experiences are drawn uniformly at random.
ReplayMemory(
capacity=10_000
)
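Conceptually, ReplayMemory behaves like the following minimal sketch (the actual class follows the same idea, though the details may differ):

```python
import random
from collections import deque

class SimpleReplayMemory:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are dropped automatically

    def add(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size):
        # uniform random sampling is all ReplayMemory does
        return random.sample(list(self.buffer), batch_size)
```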
PERGreedyMemory A straightforward implementation of [Prioritized Experience Replay](https://qiita.com/pocokhc/items/fc00f8ea9dca8f8c0297#priority-experience-reply%E5%84%AA%E5%85%88%E9%A0%86%E4%BD%8D%E4%BB%98%E3%81%8D%E7%B5%8C%E9%A8%93%E5%86%8D%E7%94%9F). Instead of sampling at random, it always draws the experiences with the largest TD error (the ones that contribute most to learning). However, since there is no randomness, it seems to fall into local optima right away and does not learn well... (so why did I implement it?)
PERGreedyMemory(
capacity=10_000
)
PERProportionalMemory The Proportional Prioritization memory from [Prioritized Experience Replay](https://qiita.com/pocokhc/items/fc00f8ea9dca8f8c0297#priority-experience-reply%E5%84%AA%E5%85%88%E9%A0%86%E4%BD%8D%E4%BB%98%E3%81%8D%E7%B5%8C%E9%A8%93%E5%86%8D%E7%94%9F). Experiences are drawn according to a probability distribution over TD errors rather than uniformly at random (experiences with larger TD error are more likely to be drawn).
It feels much more efficient than ReplayMemory (uniform random selection).
PERProportionalMemory(
capacity=100000,
alpha=0.9,
beta_initial,
beta_steps,
enable_is,
)
The parameters will be described later.
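The core of Proportional Prioritization is that experience i is sampled with probability proportional to |TD error|^alpha; a hedged NumPy sketch (the function name is illustrative, not the library's API):

```python
import numpy as np

def proportional_sample(td_errors, batch_size, alpha=0.9):
    # priority_i = |TD error_i| ** alpha; alpha=0 -> uniform, alpha=1 -> fully proportional
    priorities = np.abs(np.asarray(td_errors, dtype=float)) ** alpha
    probs = priorities / priorities.sum()
    # experiences with larger TD error are drawn more often
    return np.random.choice(len(td_errors), size=batch_size, p=probs)

print(proportional_sample([0.1, 1.5, 0.7, 0.2], batch_size=2))
```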
PERRankBaseMemory The RankBase memory from [Prioritized Experience Replay](https://qiita.com/pocokhc/items/fc00f8ea9dca8f8c0297#priority-experience-reply%E5%84%AA%E5%85%88%E9%A0%86%E4%BD%8D%E4%BB%98%E3%81%8D%E7%B5%8C%E9%A8%93%E5%86%8D%E7%94%9F). Experiences are drawn with probability proportional to their rank by TD error rather than uniformly at random. For example, with three experiences, 1st place is drawn 50% of the time, 2nd place 33%, and 3rd place 17%.
It feels much more efficient than ReplayMemory (uniform random selection), though I do not really understand the practical difference from Proportional. This one should be a little faster...
PERRankBaseMemory(
capacity=100000,
alpha=0.9,
beta_initial,
beta_steps,
enable_is,
)
The parameters will be described later.
| Parameter | Overview | Type | Example | Remarks |
|---|---|---|---|---|
| capacity | Maximum number of experiences stored in memory | int | 1_000_000 | |
| alpha | Probability reflection rate | float | 0.9 | 0.0~1.0 |
| beta_initial | Initial value of the IS reflection rate | float | 0.0 | 0.0~1.0 |
| beta_steps | Number of steps until the IS reflection rate reaches 1.0 | int | 100_000 | Depends on the number of training steps |
| enable_is | Whether to enable IS | bool | True | |
capacity The maximum number of experiences stored in memory.
alpha The reflection rate of the Proportional / RankBase priority (0.0–1.0). 0.0 is completely random (equivalent to ReplayMemory), and 1.0 follows the priority distribution exactly.
Here, a word about Importance Sampling (IS) (https://qiita.com/pocokhc/items/fc00f8ea9dca8f8c0297#%E9%87%8D%E8%A6%81%E5%BA%A6%E3%82%B5%E3%83%B3%E3%83%97%E3%83%AA%E3%83%B3%E3%82%B0is-importance-sampling). When experiences are drawn according to a probability distribution, the number of times each experience is selected becomes biased, and that bias carries over into learning. Importance Sampling corrects this bias.
Specifically, experiences that are selected with high probability get a lower reflection rate when the Q value is updated, and experiences selected with low probability get a higher reflection rate.
Introducing IS is supposed to stabilize learning. In addition, IS is annealed (its influence is increased gradually).
beta_initial The initial value of the IS reflection rate (0.0 means IS is not applied; 1.0 means IS is fully applied).
beta_steps The number of steps it takes for the IS reflection rate to reach 1.0. Set it based on the number of training steps.
enable_is Whether to enable IS.
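A hedged sketch of how the IS weight and the annealed β fit together (the standard PER formulation; the actual implementation may differ in details such as normalization):

```python
import numpy as np

def is_weights(sample_probs, memory_size, step, beta_initial=0.4, beta_steps=100_000):
    # beta is annealed linearly from beta_initial to 1.0 over beta_steps
    beta = beta_initial + (1.0 - beta_initial) * min(1.0, step / beta_steps)
    # w_i = (N * P(i)) ** (-beta); frequently sampled experiences get smaller weights
    weights = (memory_size * np.asarray(sample_probs, dtype=float)) ** (-beta)
    return weights / weights.max()  # normalize so the largest weight is 1
```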
| Parameter | Overview | Type | Example | Remarks |
|---|---|---|---|---|
| memory_warmup_size / remote_memory_warmup_size | Number of experiences to accumulate in memory before learning starts | int | 1000 | |
| target_model_update | Update interval of the Target model | int | 10000 | |
| gamma | Q-learning discount rate | float | 0.99 | 0.0~1.0 |
| enable_double_dqn | Whether to use DoubleDQN | bool | True | |
| enable_rescaling | Whether to use the rescaling function | bool | True | |
| rescaling_epsilon | Constant used in the rescaling function | float | 0.001 | |
| priority_exponent | Ratio used when computing experience priority | float | 0.9 | STATEFUL only |
| burnin_length | Burn-in period | int | 2 | STATEFUL only |
| reward_multisteps | Number of steps for the Multi-Step reward | int | 3 | |
memory_warmup_size / remote_memory_warmup_size In the initial state the memory contains no experience, so learning cannot start. This parameter sets a period during which no learning happens until enough experience has accumulated in memory. A value at least as large as batch_size, and not too small, is probably good (if it is too small and the early experience data is biased, you may fall into a local optimum).
target_model_update The update interval of the Target Network in DQN. DQN computes its update targets with a dedicated Q network called the Target Network. The Target Network is not trained; it is a copy of the current Q network taken at regular intervals. This creates a time lag in the network used for the targets, which makes the updates more stable.
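In Keras terms this is just a periodic weight copy, roughly like the following sketch (the variable names are illustrative):

```python
def maybe_update_target(model, target_model, train_count, target_model_update):
    # Copy the online network into the Target Network every
    # target_model_update training steps (Keras models provide get_weights/set_weights).
    if train_count % target_model_update == 0:
        target_model.set_weights(model.get_weights())
```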
gamma The [Q-learning discount rate](https://qiita.com/pocokhc/items/8ed40be84a144b28180d#q%E5%AD%A6%E7%BF%92%E3%81%AB%E3%81%A4%E3%81%84%E3%81%A6). It specifies how far rewards are propagated back. A value close to 1.0 should be fine.
enable_double_dqn Plain DQN builds its learning target from the maximum Q value, which can be overestimated due to noise and the like; DoubleDQN is a method proposed to counter this. My impression is that learning efficiency improves when DoubleDQN is enabled.
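The difference from plain DQN is only in how the learning target is built; a hedged sketch:

```python
import numpy as np

def dqn_target(reward, next_q_target, gamma):
    # plain DQN: the Target Network both selects and evaluates the next action
    return reward + gamma * np.max(next_q_target)

def double_dqn_target(reward, next_q_online, next_q_target, gamma):
    # DoubleDQN: the online network selects the action, the Target Network evaluates it
    best_action = np.argmax(next_q_online)
    return reward + gamma * next_q_target[best_action]
```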
enable_rescaling Specifies whether to use the rescaling function (https://qiita.com/pocokhc/items/408f0f818140924ad4c4#rescaling-%E9%96%A2%E6%95%B0) on the reward. With the rescaling function the reward is squashed to some extent, which suppresses the learning instability caused by the reward scale.
rescaling_epsilon A constant used in the rescaling function. It seems to be there to keep the function from collapsing to 0, and I think a value close to 0 is fine (0.001 is the value used in the paper).
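For reference, the rescaling function as given in the R2D2 paper, with rescaling_epsilon as the ε term (a sketch, not necessarily the exact code in this repository):

```python
import numpy as np

def rescaling(x, epsilon=0.001):
    # h(x) = sign(x) * (sqrt(|x| + 1) - 1) + epsilon * x
    return np.sign(x) * (np.sqrt(np.abs(x) + 1.0) - 1.0) + epsilon * x
```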
priority_exponent Used in the [Priority (experience priority) calculation](https://qiita.com/pocokhc/items/3b64d747a2f36da559c3#priority%E3%81%AE%E8%A8%88%E7%AE%97%E6%96%B9%E6%B3%95) of R2D2-style stateful LSTM learning (LSTMFUL). LSTMFUL determines the final Priority of a sequence from multiple per-step Priorities: priority = priority_exponent * max(priorities) + (1 - priority_exponent) * mean(priorities). With 0.9, this becomes 0.9 * max + 0.1 * mean. The paper reported that around 0.9 gave good results.
burnin_length The burn-in length used in R2D2-style LSTM learning (LSTMFUL). Roughly speaking, in LSTMFUL the hidden state saved with the experience data (the past state) differs from the state the current network would produce. Burn-in therefore feeds part of the experience sequence through the network without learning, so the hidden state catches up before training starts. Increasing burnin_length makes learning more accurate, but increases the learning cost.
reward_multisteps The number of steps for Multi-Step learning. Instead of the usual 1-step reward, the n-step reward is used; the intuition is that learning takes a bit of the future reward into account. 3 steps is the value used in the paper (a sketch of the n-step return follows).
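As a sketch (assuming the standard n-step return; the learning target then adds γ^n times the bootstrapped Q value of the state n steps ahead):

```python
def multistep_reward(rewards, gamma):
    # n-step discounted return: r_t + gamma*r_{t+1} + ... + gamma^(n-1)*r_{t+n-1}
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# with reward_multisteps=3 and gamma=0.99:
# multistep_reward([1.0, 0.0, 2.0], 0.99)  ->  1.0 + 0.0 + 0.99**2 * 2.0
```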
| Parameter | Overview | Type | Example | Remarks |
|---|---|---|---|---|
| action_interval | Action execution interval | int | 1 | 1 or more |
| action_policy | Policy used when selecting actions | Policy (original implementation) | | See below |
action_interval The update interval of actions. For example, with a value of 4 the action is updated every 4 frames (the same action is repeated until the next update).
action_policy Specifies the policy used to select actions. For details on each policy, see the previous article.
ε-greedy ε-greedy draws a random number in [0.0, 1.0); if it is smaller than ε, a random action is taken, otherwise the action with the largest Q value is chosen.
EpsilonGreedy(
epsilon
)
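In code the selection rule is roughly (a sketch, not the library's class):

```python
import random
import numpy as np

def epsilon_greedy_action(q_values, epsilon):
    # with probability epsilon explore randomly, otherwise take the best-known action
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return int(np.argmax(q_values))
```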
ε-greedy(Annealing) The method used in [DQN](https://qiita.com/pocokhc/items/125479c9ae0df1de4234#%E3%82%A2%E3%82%AF%E3%82%B7%E3%83%A7%E3%83%B3%E3%81%AE%E6%B1%BA%E5%AE%9A). It lowers the ε of ε-greedy as learning progresses, so actions follow the Q values more and more.
AnnealingEpsilonGreedy(
initial_epsilon=1,
final_epsilon=0.1,
exploration_steps=1_000_000
)
initial_epsilon The initial ε.
final_epsilon The ε in the final state.
exploration_steps The number of steps over which ε moves from the initial value to the final value.
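Assuming linear annealing, which is what the parameter names suggest, ε at a given step is roughly:

```python
def annealed_epsilon(step, initial_epsilon=1.0, final_epsilon=0.1, exploration_steps=1_000_000):
    # decrease epsilon linearly from initial_epsilon to final_epsilon, then hold it
    ratio = min(1.0, step / exploration_steps)
    return initial_epsilon + (final_epsilon - initial_epsilon) * ratio
```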
ε-greedy(Actor) The method used in Ape-X. Each actor's ε for ε-greedy is calculated from the number of actors.
EpsilonGreedyActor(
actor_index,
actors_length,
epsilon=0.4,
alpha=7
)
actor_index Specify the index of the actor.
actors_length The total number of actors.
epsilon The base ε.
alpha A constant used in the calculation.
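The Ape-X paper assigns each actor its own ε via ε_i = ε^(1 + α·i/(N−1)); assuming this implementation follows that formula, a sketch:

```python
def actor_epsilon(actor_index, actors_length, epsilon=0.4, alpha=7):
    # epsilon_i = epsilon ** (1 + (i / (N - 1)) * alpha)
    # actor 0 gets the largest epsilon (explores the most), the last actor the smallest
    if actors_length <= 1:
        return epsilon
    return epsilon ** (1 + (actor_index / (actors_length - 1)) * alpha)
```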
Softmax A method that chooses actions according to the probability distribution given by the Softmax of the Q values. In short, the higher the Q value, the more likely the action is to be chosen, and the lower the Q value, the less likely.
SoftmaxPolicy()
There are no arguments.
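A hedged sketch of the selection rule:

```python
import numpy as np

def softmax_action(q_values):
    q = np.asarray(q_values, dtype=float)
    exp_q = np.exp(q - q.max())      # subtract the max for numerical stability
    probs = exp_q / exp_q.sum()      # softmax probability distribution over actions
    return np.random.choice(len(q), p=probs)
```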
UCB1 (Upper Confidence Bound 1) UCB1 selects actions by considering not only the Q value but also how many times each action has been selected. The idea is that rarely selected actions have not been explored much and may hide unknown rewards, so they are worth trying.
UCB1()
There are no arguments. In addition, the learning cost increases because the NN model is held and trained inside.
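Ignoring the NN part mentioned above, the classic UCB1 score is the Q value plus an exploration bonus that shrinks as an action is tried more often; a sketch:

```python
import numpy as np

def ucb1_action(q_values, action_counts):
    counts = np.asarray(action_counts, dtype=float)
    if np.any(counts == 0):
        return int(np.argmax(counts == 0))  # try every action at least once
    # exploration bonus shrinks as an action is selected more often
    bonus = np.sqrt(2.0 * np.log(counts.sum()) / counts)
    return int(np.argmax(np.asarray(q_values, dtype=float) + bonus))
```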
UCB1-Tuned UCB1-Tuned is an improved version of UCB1 that also takes the variance into account. It gives better results than UCB1, but there is no theoretical guarantee.
UCB1_Tuned()
There are no arguments. In addition, the learning cost increases because the NN model is held and trained inside.
UCB-V An algorithm that takes the variance into account even more than UCB1-Tuned.
UCBv()
There are no arguments. In addition, the learning cost increases because the NN model is held and trained inside.
KL-UCB An algorithm that aims for the theoretical optimum of the exploration-exploitation dilemma. However, my implementation may be a little off...
KL_UCB()
There are no arguments. In addition, the learning cost increases because the NN model is held and trained inside.
Thompson Sampling (Beta distribution) Thompson Sampling is an algorithm based on Bayesian inference, and it is also theoretically optimal for the exploration-exploitation dilemma.
The Beta distribution applies when the outcome is binary (0 or 1). In this implementation, a reward greater than 0 is treated as 1, and a reward of 0 or less is treated as 0.
ThompsonSamplingBeta()
There are no arguments. In addition, the learning cost increases because the NN model is held and trained inside.
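Ignoring the NN part mentioned above, the bandit form of the idea keeps success/failure counts per action and samples from a Beta distribution; a sketch:

```python
import numpy as np

def thompson_beta_action(successes, failures):
    # one Beta(successes+1, failures+1) sample per action; act greedily on the samples
    samples = [np.random.beta(s + 1, f + 1) for s, f in zip(successes, failures)]
    return int(np.argmax(samples))
```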
Thompson Sampling (normal distribution) The same Bayesian Thompson Sampling idea, applied under the assumption that the reward follows a normal distribution.
ThompsonSamplingGaussian()
There are no arguments. In addition, the learning cost increases because the NN model is held and trained inside.
| Parameter | Overview | Type | Example | Remarks |
|---|---|---|---|---|
| train_interval | Training interval | int | 1 | 1 or more |
train_interval Training is performed once every train_interval steps; increasing it widens the interval between learning updates.
| Parameter | Overview | Type | Example | Remarks |
|---|---|---|---|---|
| actors | Actor classes to use | Actor (original implementation) | | See below |
| actor_model_sync_interval | Interval at which the NN model is synchronized from the Learner | int | 500 | |
Actor A class (original implementation) that represents an Actor. Inherit from it and define the policy each Actor uses and a fit method that runs the environment.
Here is a definition example.
import gym

from src.r2d2 import Actor
from src.policy import EpsilonGreedy

ENV_NAME = "xxx"

class MyActor(Actor):
    def getPolicy(self, actor_index, actor_num):
        return EpsilonGreedy(0.1)

    def fit(self, index, agent):
        env = gym.make(ENV_NAME)
        agent.fit(env, visualize=False, verbose=0)
        env.close()
getPolicy specifies the action policy that actor uses. In fit, call fit on the agent passed as an argument to run the learning loop.
Be careful when passing the Actor to R2D2: pass the class itself, do not instantiate it.
from src.r2d2 import R2D2
kwargs = {
"actors": [MyActor] #Pass the class itself
(abridgement)
}
manager = R2D2(**kwargs)
If you want to increase the number of Actors, increase the number of elements in the array.
Example of 4 Actors
from src.r2d2 import R2D2
kwargs = {
"actors": [MyActor, MyActor, MyActor, MyActor]
(abridgement)
}
manager = R2D2(**kwargs)
MovieLogger(Rainbow/R2D2)
Callback that outputs a video. It can be used with both Rainbow and R2D2.
from src.callbacks import MovieLogger

# Add it to the callbacks argument of test.
movie = MovieLogger()
agent.test(env, nb_episodes=1, visualize=False, callbacks=[movie])

# Save the result.
movie.save(
    start_frame=0,
    end_frame=0,
    gifname="pendulum.gif",
    mp4name="",
    interval=200,
    fps=30
)
start_frame Specify the start frame.
end_frame Specify the end frame. If it is 0, all frames will be targeted until the end.
gifname This is the path when outputting in gif format. Save as matplotlib animation. If it is "", it will not be output.
mp4name This is the path when outputting in mp4 format. Save as matplotlib animation. If it is "", it will not be output.
interval An interval to pass to FuncAnimation in matplotlib.
fps Fps when saving a video with matplotlib.
・ Output example (animation omitted)
ConvLayerView(Rainbow/R2D2)
A callback that visualizes the Conv layers, the Advantage layer, and the Value layer, introduced in the previous article. It can be used with both Rainbow and R2D2.
from src.callbacks import ConvLayerView
#Specify the agent in the initialization.
conv = ConvLayerView(agent)
#Perform a test.
#Specify ConvLayerView object in callbacks argument
agent.test(env, nb_episodes=1, visualize=False, callbacks=[conv])
#Save the result.
conv.save(
grad_cam_layers=["conv_1", "conv_2", "conv_3"],
add_adv_layer=True,
add_val_layer=True,
start_frame=0,
end_frame=200,
gifname="tmp/pendulum.gif",
interval=200,
fps=10,
)
grad_cam_layers Specify the target Conv layer. The name will be the name specified in ImageModel.
add_adv_layer Whether to add the Advantage layer.
add_val_layer Whether to add the Value layer.
start_frame Specify the start frame.
end_frame Specify the end frame. If it is 0, all frames will be targeted until the end.
gifname This is the path when outputting in gif format. Save as matplotlib animation. If it is "", it will not be output.
mp4name This is the path when outputting in mp4 format. Save as matplotlib animation. If it is "", it will not be output.
interval An interval to pass to FuncAnimation in matplotlib.
fps Fps when saving a video with matplotlib.
Also, ConvLayerView works only when the input is an image (InputType is GRAY_2ch, GRAY_3ch, COLOR).
・ Output example (animation omitted)
Logger2Stage(Rainbow) A callback that records logs at two-stage intervals and can also evaluate a separate test agent during training.
from src.rainbow import Rainbow
from src.callbacks import Logger2Stage
#Create a separate agent and env for testing
kwargs = (abridgement)
test_agent = Rainbow(**kwargs)
test_env = gym.make(ENV_NAME)
#various settings
log = Logger2Stage(
logger_type=LoggerType.STEP,
warmup=1000,
interval1=200,
interval2=20_000,
change_count=5,
savefile="tmp/log.json",
test_agent=test_agent,
test_env=test_env,
test_episodes=10
)
#Add to callbacks when learning
#Logger2Stage outputs the log, so verbose=0
agent.fit(env, nb_steps=1_750_000, visualize=False, verbose=0, callbacks=[log])
#You can get the logs with the getLogs function(You must specify savefile)
history = log.getLogs()
#It's simple, but you can also output a graph(You must specify savefile)
log.drawGraph()
logger_type The log recording format. LoggerType.TIME records by elapsed time, LoggerType.STEP by the number of steps.
warmup No logs are recorded during the initial warmup period. For LoggerType.TIME this is in seconds, for LoggerType.STEP in steps.
interval1 This is the first log acquisition interval. LoggerType.TIME is the number of seconds, and LoggerType.STEP is the number of steps.
interval2 This is the second stage log acquisition interval. LoggerType.TIME is the number of seconds, and LoggerType.STEP is the number of steps.
change_count The number of transitions from the first stage to the second stage. When the first stage gets this number of logs, it moves to the second stage.
savefile This is the file that saves the log.
test_agent Specify if you want to test separately from the learning environment. If None, only the result of the learning environment will be output.
test_env Specify if you want to test separately from the learning environment. If None, only the result of the learning environment will be output.
test_episodes The number of episodes in the test environment.
・ Output example
--- start ---
'Ctrl + C' is stop.
Steps 0, Time: 0.00m, TestReward: 21.12 - 92.80 (ave: 51.73, med: 46.99), Reward: 0.00 - 0.00 (ave: 0.00, med: 0.00)
Steps 200, Time: 0.05m, TestReward: 22.06 - 99.94 (ave: 43.85, med: 31.24), Reward: 108.30 - 108.30 (ave: 108.30, med: 108.30)
Steps 1200, Time: 0.28m, TestReward: 40.99 - 73.88 (ave: 52.41, med: 47.69), Reward: 49.05 - 141.53 (ave: 87.85, med: 90.89)
(abridgement)
Steps 17200, Time: 3.95m, TestReward: 167.68 - 199.49 (ave: 184.34, med: 188.30), Reward: 166.29 - 199.66 (ave: 181.79, med: 177.36)
Steps 18200, Time: 4.19m, TestReward: 165.84 - 199.53 (ave: 186.16, med: 188.50), Reward: 188.00 - 199.50 (ave: 190.64, med: 188.41)
Steps 19200, Time: 4.43m, TestReward: 163.63 - 188.93 (ave: 186.15, med: 188.59), Reward: 165.56 - 188.45 (ave: 183.75, med: 188.23)
done, took 4.626 minutes
Steps 0, Time: 4.63m, TestReward: 188.37 - 199.66 (ave: 190.83, med: 188.68), Reward: 188.34 - 188.83 (ave: 188.63, med: 188.67)
SaveManager(R2D2) R2D2 uses multiprocessing, so its implementation is rather particular. Saving and loading the model is affected the most, so a separate mechanism is provided for it.
from src.r2d2 import R2D2
from src.r2d2_callbacks import SaveManager
#Creating R2D2
kwargs = (abridgement)
manager = R2D2(**kwargs)
#Creating a SaveManager
save_manager = SaveManager(
save_dirpath="tmp",
is_load=False,
save_overwrite=True,
save_memory=True,
checkpoint=True,
checkpoint_interval=2000,
verbose=0
)
#Start learning, add to callbacks argument.
manager.train(
nb_trains=20_000,
callbacks=[save_manager],
)
# Call the following to create an agent for testing.
# Specify save_dirpath/last/learner.dat.
agent = manager.createTestAgent(MyActor, "tmp/last/learner.dat")
#Conduct a test.
agent.test(env, nb_episodes=5, visualize=True)
save_dirpath The directory to save the results. Since a directory for checkpoints will be created under the directory, it is in the directory format.
is_load Whether to load previous learning results
save_overwrite Whether to overwrite the saved result
save_memory Whether to also save the contents of the Replay Memory. If you save it, you can resume learning from exactly the same state as before, but the memory file is large (several GB). It is saved separately as a .mem file, so it can be deleted later.
checkpoint Whether to save the progress
checkpoint_interval The interval for saving progress. The unit is the number of Learner training steps.
verbose If 0, nothing is printed; if 1, progress is printed.
Logger2Stage(R2D2)
It provides the same two functions as the Rainbow version.
Unlike the Rainbow version, the log interval can only be specified in time (seconds).
from src.r2d2 import R2D2
from src.r2d2_callbacks import Logger2Stage
#Creating R2D2
kwargs = (abridgement)
manager = R2D2(**kwargs)
#Create env for testing
test_env = gym.make(ENV_NAME)
# Create Logger2Stage
log = Logger2Stage(
warmup=0,
interval1=10,
interval2=60,
change_count=20,
savedir="tmp",
test_actor=MyActor,
test_env=test_env,
test_episodes=10,
verbose=1,
)
#Start learning, add to callbacks argument.
manager.train(
nb_trains=20_000,
callbacks=[log],
)
#You can get the logs with getLogs.(If savedir is specified)
history = log.getLogs()
#You can also easily display a graph.(If savedir is specified)
log.drawGraph()
warmup The time before the first log is recorded (seconds).
interval1 This is the first log acquisition interval. (Seconds)
interval2 This is the second stage log acquisition interval. (Seconds)
change_count The number of transitions from the first stage to the second stage. When the first stage gets this number of logs, it moves to the second stage.
savedir The directory where logs are saved. The Learner and each Actor run in separate processes, and each process writes to its own file to avoid conflicts.
test_actor Specifies the Actor class to use during testing. If None, no test will be performed.
test_env Specify if you want to test separately from the learning environment. If None, no test will be performed.
test_episodes The number of episodes in the test environment.
・ Output example
--- start ---
'Ctrl + C' is stop.
Learner Start!
Actor0 Start!
Actor1 Start!
actor1 Train 1, Time: 0.24m, Reward : 27.80 - 27.80 (ave: 27.80, med: 27.80), nb_steps: 200
learner Train 1, Time: 0.19m, TestReward: 29.79 - 76.71 (ave: 58.99, med: 57.61)
actor0 Train 575, Time: 0.35m, Reward : 24.88 - 133.09 (ave: 62.14, med: 50.83), nb_steps: 3400
learner Train 651, Time: 0.36m, TestReward: 24.98 - 51.67 (ave: 38.86, med: 38.11)
actor1 Train 651, Time: 0.41m, Reward : 22.15 - 88.59 (ave: 41.14, med: 35.62), nb_steps: 3200
actor0 Train 1249, Time: 0.51m, Reward : 22.97 - 61.41 (ave: 35.24, med: 31.99), nb_steps: 8000
(abridgement)
learner Train 16476, Time: 4.53m, TestReward: 165.56 - 199.57 (ave: 180.52, med: 177.73)
actor1 Train 16880, Time: 4.67m, Reward : 128.88 - 188.45 (ave: 169.13, med: 165.94), nb_steps: 117600
Learning End. Train Count:20001
learner Train 20001, Time: 5.29m, TestReward: 175.72 - 188.17 (ave: 183.21, med: 187.48)
Actor0 End!
Actor1 End!
actor0 Train 20001, Time: 5.34m, Reward : 151.92 - 199.61 (ave: 181.68, med: 187.48), nb_steps: 0
actor1 Train 20001, Time: 5.34m, Reward : 130.39 - 199.26 (ave: 170.83, med: 167.99), nb_steps: 0
done, took 5.350 minutes
from src.rainbow import Rainbow
from src.processor import AtariProcessor
from src.image_model import DQNImageModel
from src.memory import ReplayMemory
from src.policy import AnnealingEpsilonGreedy
nb_steps = 1_750_000
#What AtariProcessor does
#・ Resize image(84,84)
#・ Reward clipping
processor = AtariProcessor(reshape_size=(84, 84), is_clip=True)
kwargs={
"input_shape": processor.image_shape,
"input_type": InputType.GRAY_2ch,
"nb_actions": env.action_space.n,
"optimizer": Adam(lr=0.0001),
"metrics": [],
"image_model": DQNImageModel(),
"input_sequence": 4, #Number of input frames
"dense_units_num": 256, #Number of units in the dense layer
"enable_dueling_network": False,
"lstm_type": LstmType.NONE, #LSTM algorithm to use
# train/action related
"memory_warmup_size": 50_000, #Number of steps for initial memory allocation(Don't learn)
"target_model_update": 10_000, #target network update interval
"action_interval": 4, #Interval to perform action
"train_interval": 4, #Interval to learn
"batch_size": 32, # batch_size
"gamma": 0.99, #Q-learning discount rate
"enable_double_dqn": False,
"enable_rescaling": False, #Whether to enable rescaling
"reward_multisteps": 1, # multistep reward
#Other
"processor": processor,
"action_policy": AnnealingEpsilonGreedy(
initial_epsilon=1.0, #Initial ε
final_epsilon=0.05, # ε in the final state
exploration_steps=1_000_000 #Number of steps from initial to final state
),
"memory": ReplayMemory(capacity=1_000_000),
}
agent = Rainbow(**kwargs)
from src.rainbow import Rainbow
from src.memory import ReplayMemory
from src.policy import SoftmaxPolicy
env = gym.make('CartPole-v0')
kwargs={
"input_shape": env.observation_space.shape,
"input_type": InputType.VALUES,
"nb_actions": env.action_space.n,
"optimizer": Adam(lr=0.0001),
"metrics": [],
"image_model": None,
"input_sequence": 1, #Number of input frames
"dense_units_num": 16, #Number of units in the dense layer
"enable_dueling_network": False,
"lstm_type": LstmType.NONE,
# train/action related
"memory_warmup_size": 10, #Number of steps for initial memory allocation(Don't learn)
"target_model_update": 1, #target network update interval
"action_interval": 1, #Interval to perform action
"train_interval": 1, #Interval to learn
"batch_size": 32, # batch_size
"gamma": 0.99, #Q-learning discount rate
"enable_double_dqn": False,
"enable_rescaling": False,
#Other
"processor": processor,
"action_policy": SoftmaxPolicy(),
"memory": ReplayMemory(capacity=50000)
}
agent = Rainbow(**kwargs)
from src.rainbow import Rainbow
from src.processor import AtariProcessor
from src.image_model import DQNImageModel
from src.memory import PERProportionalMemory
from src.policy import AnnealingEpsilonGreedy
nb_steps = 1_750_000
#What AtariProcessor does
#・ Resize image(84,84)
#・ Reward clipping
processor = AtariProcessor(reshape_size=(84, 84), is_clip=True)
kwargs={
"input_shape": processor.image_shape,
"input_type": InputType.GRAY_2ch,
"nb_actions": env.action_space.n,
"optimizer": Adam(lr=0.0000625, epsilon=0.00015),
"metrics": [],
"image_model": DQNImageModel(),
"input_sequence": 4, #Number of input frames
"dense_units_num": 512, #Number of units in the dense layer
"enable_dueling_network": True,
"dueling_network_type": DuelingNetwork.AVERAGE, #Algorithm used in dueling network
"lstm_type": LstmType.NONE,
# train/action related
"memory_warmup_size": 80000, #Number of steps for initial memory allocation(Don't learn)
"target_model_update": 32000, #target network update interval
"action_interval": 4, #Interval to perform action
"train_interval": 4, #Interval to learn
"batch_size": 32, # batch_size
"gamma": 0.99, #Q-learning discount rate
"enable_double_dqn": True,
"enable_rescaling": False,
"reward_multisteps": 3, # multistep reward
#Other
"processor": processor,
"action_policy": AnnealingEpsilonGreedy(
initial_epsilon=1.0, #Initial ε
final_epsilon=0.05, # ε in the final state
exploration_steps=1_000_000 #Number of steps from initial to final state
),
"memory": PERProportionalMemory(
capacity=1_000_000,
alpha=0.5, #Probability reflection rate of PER
beta_initial=0.4, #Initial value of IS reflection rate
beta_steps=1_000_000, #Number of steps to increase IS reflection rate
enable_is=True, #Whether to enable IS
)
}
agent = Rainbow(**kwargs)
from src.r2d2 import R2D2, Actor
from src.processor import AtariProcessor
from src.image_model import DQNImageModel
from src.memory import PERProportionalMemory
from src.policy import EpsilonGreedyActor
ENV_NAME = "xxxxx"
class MyActor(Actor):
def getPolicy(self, actor_index, actor_num):
return EpsilonGreedyActor(actor_index, actor_num, epsilon=0.4, alpha=7)
def fit(self, index, agent):
env = gym.make(ENV_NAME)
agent.fit(env, visualize=False, verbose=0)
env.close()
#What AtariProcessor does
#・ Resize image(84,84)
#・ Reward clipping
processor = AtariProcessor(reshape_size=(84, 84), is_clip=True)
kwargs={
"input_shape": processor.image_shape,
"input_type": InputType.GRAY_2ch,
"nb_actions": env.action_space.n,
"optimizer": Adam(lr=0.0001, epsilon=0.001),
"metrics": [],
"image_model": DQNImageModel(),
"input_sequence": 4, #Number of input frames
"dense_units_num": 512, #Number of units in the Dense layer
"enable_dueling_network": True, # dueling_network valid flag
"dueling_network_type": DuelingNetwork.AVERAGE, # dueling_network algorithm
"lstm_type": LstmType.STATEFUL, #LSTM algorithm
"lstm_units_num": 512, #Number of LSTM layer units
"lstm_ful_input_length": 40, #Stateful LSTM inputs
# train/action related
"remote_memory_warmup_size": 50_000, #Number of steps for initial memory allocation(Don't learn)
"target_model_update": 10_000, #target network update interval
"action_interval": 4, #Interval to perform action
"batch_size": 64,
"gamma": 0.997, #Q-learning discount rate
"enable_double_dqn": True, #DDQN valid flag
"enable_rescaling": enable_rescaling, #Whether to enable rescaling(priotrity)
"rescaling_epsilon": 0.001, #rescaling constant
"priority_exponent": 0.9, #priority priority
"burnin_length": 40, # burn-in period
"reward_multisteps": 3, # multistep reward
#Other
"processor": processor,
"actors": [MyActor for _ in range(256)],
"remote_memory": PERProportionalMemory(
capacity= 1_000_000,
alpha=0.6, #Probability reflection rate of PER
beta_initial=0.4, #Initial value of IS reflection rate
beta_steps=1_000_000, #Number of steps to increase IS reflection rate
enable_is=True, #Whether to enable IS
),
#actor relationship
"actor_model_sync_interval": 400, #Interval to synchronize model from learner
}
manager = R2D2(**kwargs)
There are too many parameters... Apparently something called R2D3 has been announced, so I would like to implement that next.