I have posted an article on Zenn (https://zenn.dev/ymd_h/articles/03edcaa47a3b1c) about the technical details of the internal implementation. If you are interested, please have a look.
As I mentioned in an earlier introductory article, I have developed and released cpprb, a replay buffer library for experience replay in reinforcement learning.
Although experience replay is widely used in reinforcement learning, many people spend time reinventing the wheel by copying and rewriting code published on the internet. I don't think that is a good situation, which is why I keep developing and publishing this library. Moreover, a replay buffer is surprisingly tricky to get right: inefficient implementations and falling into the same well-known bugs are a headache for researchers and developers whose real interest is deep learning, so I believe it is important to have a solid library that can be used right away.
(Of course, there are excellent libraries that cover the whole of reinforcement learning, such as RLlib, but it is quite hard for researchers of reinforcement learning algorithms to build their own algorithms on top of them, so ease of use matters. DeepMind's Reverb, released the other day, is a direct competitor, but it targets larger-scale setups, whereas cpprb currently assumes use on a single computer.)
Since it would stray from the topic, I will not go into detail here, but I am developing it with a focus on being independent of any deep learning framework, flexible, and efficient. If you are interested, please try it and give feedback, for example by starring the repository or opening an issue.
Ape-X has been proposed as a distributed learning method for shortening the training time of reinforcement learning. Roughly speaking, it separates environment exploration from network training and runs multiple explorations in parallel.
There was also a great article on Qiita with detailed explanations and implementations.
Even in those implementations, however, the replay buffer and the interprocess data transfer are written from scratch. When it comes to reuse, say "I use PyTorch rather than TensorFlow...", copying and rewriting all the related parts is hard.
The Ape-X support described here requires cpprb v9.4.2 or later.
3.1 Linux/Windows
You can install the binaries directly from PyPI.
pip install cpprb
3.2 macOS
Unfortunately, clang, which is used by default, cannot compile cpprb, so you need gcc installed via Homebrew or MacPorts, and the package has to be compiled by hand at installation time.
Replace /path/to/g++ with the path of the installed g++.
CC=/path/to/g++ CXX=/path/to/g++ pip install cpprb
Reference: Installation procedure on the official website
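Once installed, you can confirm that the version satisfies the v9.4.2 requirement mentioned above. The snippet below is just a minimal check using the standard library (importlib.metadata is available from Python 3.8).

from importlib.metadata import version

# Should print 9.4.2 or later for the Ape-X support described below
print(version("cpprb"))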
Below is a sample Ape-X implementation using cpprb. It is the code for the skeleton of Ape-X and does not include the deep learning model (network) part. You should be able to use it in practice by replacing the mock MyModel part with a real model and adding, as needed, visualization such as TensorBoard or model saving. (I think that part is not difficult for those who research and develop reinforcement learning on a daily basis.)
apex.py
from multiprocessing import Process, Event, SimpleQueue
import time

import gym
import numpy as np
from tqdm import tqdm

from cpprb import ReplayBuffer, MPPrioritizedReplayBuffer


class MyModel:
    def __init__(self):
        self._weights = 0

    def get_action(self, obs):
        # Implement action selection
        return 0

    def abs_TD_error(self, sample):
        # Implement absolute TD error
        return np.zeros(sample["obs"].shape[0])

    @property
    def weights(self):
        return self._weights

    @weights.setter
    def weights(self, w):
        self._weights = w

    def train(self, sample):
        # Implement model update
        pass


def explorer(global_rb, env_dict, is_training_done, queue):
    # Collects transitions locally and pushes them to the global buffer.
    local_buffer_size = int(1e+2)
    local_rb = ReplayBuffer(local_buffer_size, env_dict)

    model = MyModel()
    env = gym.make("CartPole-v1")

    obs = env.reset()
    while not is_training_done.is_set():
        # Receive the latest weights from the learner, if any
        if not queue.empty():
            w = queue.get()
            model.weights = w

        action = model.get_action(obs)
        next_obs, reward, done, _ = env.step(action)
        local_rb.add(obs=obs, act=action, rew=reward,
                     next_obs=next_obs, done=done)

        if done:
            local_rb.on_episode_end()
            obs = env.reset()
        else:
            obs = next_obs

        if local_rb.get_stored_size() == local_buffer_size:
            # Flush the local buffer into the shared global buffer,
            # using absolute TD errors as initial priorities
            local_sample = local_rb.get_all_transitions()
            local_rb.clear()

            absTD = model.abs_TD_error(local_sample)
            global_rb.add(**local_sample, priorities=absTD)


def learner(global_rb, queues):
    # Samples from the global buffer, trains the model, updates priorities,
    # and periodically broadcasts the weights to the explorers.
    batch_size = 64
    n_warmup = 100
    n_training_step = int(1e+4)
    explorer_update_freq = 100

    model = MyModel()

    # Wait until enough transitions have been collected
    while global_rb.get_stored_size() < n_warmup:
        time.sleep(1)

    for step in tqdm(range(n_training_step)):
        sample = global_rb.sample(batch_size)

        model.train(sample)
        absTD = model.abs_TD_error(sample)
        global_rb.update_priorities(sample["indexes"], absTD)

        if step % explorer_update_freq == 0:
            # Send the latest weights to all explorers
            w = model.weights
            for q in queues:
                q.put(w)


if __name__ == "__main__":
    buffer_size = int(1e+6)
    env_dict = {"obs": {"shape": 4},
                "act": {},
                "rew": {},
                "next_obs": {"shape": 4},
                "done": {}}
    n_explorer = 4

    global_rb = MPPrioritizedReplayBuffer(buffer_size, env_dict)

    is_training_done = Event()
    is_training_done.clear()

    qs = [SimpleQueue() for _ in range(n_explorer)]
    ps = [Process(target=explorer,
                  args=[global_rb, env_dict, is_training_done, q])
          for q in qs]

    for p in ps:
        p.start()

    learner(global_rb, qs)
    is_training_done.set()

    for p in ps:
        p.join()

    print(global_rb.get_stored_size())
As you can see, MPPrioritizedReplayBuffer (Multi-Process supported Prioritized Replay Buffer), which is used as the global buffer, can be accessed from multiple processes without any special care. Since the internal data is stored in shared memory, sharing data between processes is faster than going through proxies (multiprocessing.managers.SyncManager etc.) or queues (multiprocessing.Queue etc.).
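To illustrate what "without any special care" means, here is a minimal, stripped-down sketch (using the same env_dict layout and the same add/sample calls as in apex.py above, with dummy data): the buffer object itself is handed to the child process, and both sides simply call its methods.

from multiprocessing import Process

import numpy as np

from cpprb import MPPrioritizedReplayBuffer


def writer(rb):
    # The child process writes a batch of dummy transitions straight into
    # the shared buffer; no proxy, queue, or manual lock is involved.
    n = 100
    rb.add(obs=np.zeros((n, 4)), act=np.zeros(n), rew=np.zeros(n),
           next_obs=np.zeros((n, 4)), done=np.zeros(n),
           priorities=np.ones(n))


if __name__ == "__main__":
    rb = MPPrioritizedReplayBuffer(1000,
                                   {"obs": {"shape": 4}, "act": {}, "rew": {},
                                    "next_obs": {"shape": 4}, "done": {}})

    p = Process(target=writer, args=[rb])
    p.start()
    p.join()

    sample = rb.sample(32)      # Sampled in the parent process
    print(sample["obs"].shape)  # (32, 4)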
Also, the locks that prevent data inconsistency are taken inside each method, so as long as you stick to the basic configuration of multiple explorers plus a single learner, there is no need to lock anything manually. Moreover, it does not lock the whole buffer; it locks only the minimal critical sections needed to keep the data consistent, which is far more efficient than naively locking the entire global buffer. (The difference is especially large when the deep learning network is small or the environment, such as a simulator, is computationally light. In a rough, non-rigorous test during development, the explorers were about 3-4 times faster and the learner about 1.2-2 times faster compared with a simple implementation that locks the entire global buffer.)
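For reference, the "lock the entire buffer" baseline compared against above would look roughly like the conceptual sketch below. This is not cpprb code; NaiveLockedBuffer and inner_buffer are hypothetical names, with inner_buffer standing for any single-process replay buffer. Every call serializes on one coarse lock, which is exactly what the fine-grained internal locking avoids.

from multiprocessing import Lock


class NaiveLockedBuffer:
    # Conceptual baseline only, not part of cpprb: one coarse lock guards
    # every operation on a wrapped single-process buffer.
    def __init__(self, inner_buffer):
        self._rb = inner_buffer
        self._lock = Lock()

    def add(self, **transition):
        with self._lock:            # writers block readers and each other
            self._rb.add(**transition)

    def sample(self, batch_size):
        with self._lock:            # the learner blocks all explorers
            return self._rb.sample(batch_size)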
A friend of mine is using cpprb to develop tf2rl, a reinforcement learning library for TensorFlow 2.x. If you are interested, please check that out as well.
An introductory article by its author → tf2rl: TensorFlow2 Reinforcement Learning